caclassfit: Software Alchemy for Machine Learning
In partools: Tools for the 'Parallel' Package


View source: R/Class.R

Parallelization of machine learning algorithms.

caclassfit(cls,fitcmd) 
caclasspred(fitobjs,newdata,yidx=NULL,...)
vote(preds)
re_code(x)

`cls`	A cluster run under the parallel package.
`fitcmd`	A string containing a model-fitting command to be run on each cluster node. This will typically include specification of the distributed data set.
`fitobjs`	An R list of objects returned by the `fitcmd` calls.
`newdata`	Data to be predicted from the fit computed by `caclassfit`.
`yidx`	If provided, index of the true class values in `newdata`, typically in a cross-validation setting.
`...`	Arguments to be passed to the underlying prediction function for the given method, e.g. `predict.rpart`.
`preds`	A vector of predicted classes, from which the "winner" will be selected by voting.
`x`	A vector of integers, in this context class codes.

This should work for almost any classification method that provides a model-fitting function and a predict method.
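The fit/predict/vote pattern can be illustrated serially, without a cluster. The following is only a minimal sketch under stated assumptions, not the package's implementation: it uses the built-in iris data, splits the rows into three chunks that stand in for cluster nodes, and uses a hypothetical helper `majority` for the voting step.

```r
library(rpart)

set.seed(1)
# Toy data: iris in random row order, class recoded as 1,2,3 and stored as a factor
d <- iris[sample(nrow(iris)), ]
d$Species <- factor(as.integer(d$Species))

# Split the rows into 3 chunks, standing in for 3 cluster nodes
chunk_id <- rep(1:3, length.out = nrow(d))
chunks <- split(d, chunk_id)

# "fit" step: the same model-fitting call on each chunk
fits <- lapply(chunks, function(ch) rpart(Species ~ ., data = ch))

# "predict" step: one row of predicted class codes per chunk
predmat <- t(sapply(fits, function(f)
  as.integer(as.character(predict(f, newdata = d, type = "class")))))

# Combine by majority vote down each column (ties go to the smallest code);
# 'majority' is a hypothetical helper, not a partools function
majority <- function(v) as.integer(names(which.max(table(v))))
final <- apply(predmat, 2, majority)

mean(final == as.integer(as.character(d$Species)))  # accuracy on the training data
```

Any other method with a fitting function and a predict method could be substituted for `rpart` in the two `lapply`/`sapply` steps.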

The method assumes i.i.d. data. If your data set is stored in some sorted order, it must be randomized first, say by using the scramble option in distribsplit or by calling readnscramble, depending on whether your data is already in memory or still in a file.
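For intuition about what randomizing the order means, here is a minimal base-R sketch that shuffles the rows of a small, hypothetical data frame (`sorted_df`) that was stored sorted by class. It illustrates the general idea only and is not the code used by distribsplit or readnscramble.

```r
set.seed(42)  # only to make the sketch reproducible

# Hypothetical data frame, stored sorted by class
sorted_df <- data.frame(x = 1:6, class = rep(1:3, each = 2))

# Put the rows in random order, then reset the row names
shuffled_df <- sorted_df[sample(nrow(sorted_df)), ]
rownames(shuffled_df) <- NULL
```

Without such a step, a chunk of consecutive rows from `sorted_df` would contain only one or two of the three classes.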

Class labels are assumed to be coded 1,2,... If they are not, recode them with re_code.
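As an illustration of the coding this assumes, the sketch below maps arbitrary integer class codes to 1,2,... in base R. The vector `labels` is hypothetical, and the order in which re_code itself assigns the new codes may differ from the sorted order used here.

```r
labels <- c(30, 10, 20, 10, 30)                # hypothetical raw class codes
codes <- match(labels, sort(unique(labels)))   # 3 1 2 1 3
```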

The caclassfit function returns an R list of objects as in fitobjs above.

The caclasspred function returns an R list with these components:

`predmat`	A matrix of predicted classes for `newdata`, one row per cluster node.
`preds`	The final predicted classes, after using `vote` to resolve possible differences in predictions among nodes.
`consensus`	The proportion of cases for which all nodes gave the same prediction (higher values indicating more stability).
`acc`	If `yidx` is non-NULL, the proportion of cases in which `preds` is correct.
`confusion`	If `yidx` is non-NULL, the confusion matrix.
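To make these components concrete, the following base-R sketch computes analogous quantities from a small, made-up prediction matrix. The data, the `truth` vector, and the helper `majority` are all hypothetical; the package's own computation (for example, how `vote` breaks ties) may differ.

```r
# Hypothetical predictions from 3 nodes (rows) for 5 new cases (columns)
predmat <- rbind(c(1, 2, 2, 3, 1),
                 c(1, 2, 3, 3, 1),
                 c(1, 1, 2, 3, 2))
truth <- c(1, 2, 2, 3, 2)   # hypothetical true classes

# Majority vote per case; ties would go to the smallest code
majority <- function(v) as.integer(names(which.max(table(v))))
preds <- apply(predmat, 2, majority)    # 1 2 2 3 1

# Proportion of cases on which all nodes agree
consensus <- mean(apply(predmat, 2, function(v) length(unique(v)) == 1))  # 0.4

acc <- mean(preds == truth)             # 0.8
confusion <- table(truth, preds)        # true classes by predicted classes
```

Here the nodes agree unanimously on cases 1 and 4 only, and the voted prediction is wrong only for case 5.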

Norm Matloff

## Not run: 
library(partools)
# set up 'parallel' cluster with 2 worker nodes
cls <- makeCluster(2)
setclsinfo(cls)
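# data prep
data(prgeng)
# recode the occupation classes as 1,2,...
prgeng$occ <- re_code(prgeng$occ)
# indicator variables derived from the education code
prgeng$bs <- as.integer(prgeng$educ == 13)
prgeng$ms <- as.integer(prgeng$educ == 14)
prgeng$phd <- as.integer(prgeng$educ == 15)
# shift the sex code down by 1
prgeng$sex <- prgeng$sex - 1
# keep the predictors plus occ; occ becomes column 8, the yidx used below
pe <- prgeng[,c(1,7,8,9,12,13,14,5)]
pe$occ <- as.factor(pe$occ)   # needed for rpart!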
# go
distribsplit(cls,'pe')
library(rpart)
clusterEvalQ(cls,library(rpart))
fit <- caclassfit(cls,"rpart(occ ~ .,data=pe)")
predout <- caclasspred(fit,pe,8,type='class')
predout$acc  # 0.36 

stopCluster(cls)

## End(Not run)