caclassfit: Software Alchemy for Machine Learning


caclassfit, caclasspred, vote, re_code    R Documentation

Software Alchemy for Machine Learning

Description

Parallelization of machine learning algorithms.

Usage

caclassfit(cls,fitcmd) 
caclasspred(fitobjs,newdata,yidx=NULL,...)
vote(preds)
re_code(x)

Arguments

cls

A cluster run under the parallel package.

fitcmd

A string containing a model-fitting command to be run on each cluster node. This will typically include specification of the distributed data set.

fitobjs

An R list of objects returned by the fitcmd calls.

newdata

Data to be predicted from the fit computed by caclassfit.

yidx

If provided, the column index of the true class values in newdata, typically used in a cross-validation setting.

...

Arguments to be passed to the underlying prediction function for the given method, e.g. predict.rpart.

preds

A vector of predicted classes, from which the "winner" will be selected by voting.

x

A vector of integers, in this context class codes.

Details

These functions should work with almost any classification software that provides a “fit” function and a predict method.

The method assumes i.i.d. data. If your data set is stored in some sorted order, it must be randomized first, say by using the scramble option in distribsplit or by calling readnscramble, depending on whether your data is already in memory or still in a file.
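
For data already in memory, randomizing amounts to permuting the rows before the data are split across nodes. The following is a minimal base R sketch of that idea only; it does not use partools, and the data frame d and the seed are made up for illustration.

```r
# Toy data frame stored in sorted order by class (hypothetical data)
d <- data.frame(x = 1:6, y = rep(1:3, each = 2))

set.seed(1)                       # only to make the shuffle reproducible
shuffled <- d[sample(nrow(d)), ]  # random permutation of the rows
rownames(shuffled) <- NULL        # drop the old row names
```

After the shuffle, consecutive blocks of rows no longer correspond to a single class, so each node would receive a mix of classes.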

Class labels are assumed to be the consecutive integers 1,2,... If yours are not, recode them with re_code.
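
To illustrate what recoding to 1,2,... means, the base R sketch below maps arbitrary integer codes to consecutive integers. It is not the package's re_code implementation; the helper recode_sketch and the sample codes are hypothetical.

```r
# Hypothetical helper: map arbitrary class codes to 1,2,...,k,
# numbering the distinct codes in increasing order
recode_sketch <- function(x) match(x, sort(unique(x)))

codes <- c(102, 140, 102, 100, 141)   # made-up class codes
newcodes <- recode_sketch(codes)      # 2 3 2 1 4
```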

Value

The caclassfit function returns an R list of objects as in fitobjs above.
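
To show how a command string run on each node can yield a list with one result per node, here is a minimal sketch using only the parallel package. It is an assumption about the general mechanism, not a description of caclassfit's internals; the variable x, the toy chunks, and the command "mean(x)" are hypothetical stand-ins for a distributed data set and a model-fitting command.

```r
library(parallel)

cls <- makeCluster(2)

# Give each node its own chunk of data under the same name, x
# (hypothetical toy data, standing in for a distributed data set)
chunks <- list(1:5, 6:10)
invisible(clusterApply(cls, chunks, function(chunk) {
  assign("x", chunk, envir = .GlobalEnv)
  NULL
}))

# Parse and evaluate the same command string on every node;
# the result is a list with one element per node
cmd <- "mean(x)"
res <- clusterCall(cls, function(s) eval(parse(text = s), envir = .GlobalEnv), cmd)

stopCluster(cls)
```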

The caclasspred function returns an R list with these components:

  • predmat, a matrix of predicted classes for newdata, one row per cluster node

  • preds, the final predicted classes, after using vote to resolve possible differences in predictions among nodes

  • consensus, the proportion of cases for which all nodes gave the same predictions (higher values indicating more stability)

  • acc, if yidx is non-NULL, the proportion of cases in which preds is correct

  • confusion, if yidx is non-NULL, the confusion matrix
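
The components above can be illustrated with a small base R sketch. It is not the package's code: the matrix, the true classes, and the helper vote_sketch are made up, and the tie-breaking rule (smallest class code wins) is an assumption chosen for simplicity. The orientation follows the description of predmat, with one row per node and one column per new case.

```r
# Hypothetical predmat: 3 nodes (rows) by 4 new cases (columns)
predmat <- rbind(c(1, 2, 3, 1),
                 c(1, 2, 1, 1),
                 c(1, 3, 1, 2))
truth <- c(1, 2, 1, 2)   # made-up true classes

# Majority vote within one column; ties go to the smallest class code
vote_sketch <- function(p) {
  tab <- table(p)
  as.integer(names(tab)[which.max(tab)])
}

preds <- apply(predmat, 2, vote_sketch)                 # 1 2 1 1
consensus <- mean(apply(predmat, 2,
                        function(p) all(p == p[1])))    # 0.25
acc <- mean(preds == truth)                             # 0.75
confusion <- table(truth, preds)
```

Here only the first case is predicted identically by all three nodes, so consensus is 0.25, while the voted predictions match the true class in three of the four cases.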

Author(s)

Norm Matloff

Examples

## Not run: 
# set up 'parallel' cluster
cls <- makeCluster(2)
setclsinfo(cls)
# data prep
data(prgeng)
prgeng$occ <- re_code(prgeng$occ)
prgeng$bs <- as.integer(prgeng$educ == 13)
prgeng$ms <- as.integer(prgeng$educ == 14)
prgeng$phd <- as.integer(prgeng$educ == 15)
prgeng$sex <- prgeng$sex - 1
pe <- prgeng[,c(1,7,8,9,12,13,14,5)]
pe$occ <- as.factor(pe$occ)   # needed for rpart!
# go
distribsplit(cls,'pe')
library(rpart)
clusterEvalQ(cls,library(rpart))
fit <- caclassfit(cls,"rpart(occ ~ .,data=pe)")
predout <- caclasspred(fit,pe,8,type='class')
predout$acc  # 0.36 

stopCluster(cls)

## End(Not run)

matloff/partools documentation built on Oct. 20, 2022, 2:52 p.m.