ca: Software Alchemy: Turning Complex Statistical Computations into Embarrassingly-Parallel Ones

Description Usage Arguments Details Value Author(s) References Examples

View source: R/CA.R

Description

Easy parallelization of most statistical computations.

Usage

ca(cls,z,ovf,estf,estcovf=NULL,findmean=TRUE,scramble=FALSE)
cabase(cls,ovf,estf,estcovf=NULL,findmean=TRUE,cacall=FALSE,z=NULL,scramble=FALSE)
calm(cls,lmargs) 
caglm(cls,glmargs) 
caprcomp(cls,prcompargs, p)
cakm(cls,mtdf,ncenters,p)
cameans(cls,cols,na.rm=FALSE) 
caquantile(cls,vec, probs = c(0.25, 0.5, 0.75),na.rm=FALSE) 
caagg(cls,ynames,xnames,dataname,FUN)
caknn(cls, yname, k, xname='')

Arguments

cls

A cluster run under the parallel package.

z

A data frame, matrix or vector, one observation per row/element.

ovf

Overall statistical function, say glm.

estf

Function to extract the point estimate (typically vector-valued) from the output of ovf.

estcovf

If provided, function to extract the estimated covariance matrix of the output of estf, e.g. vcov in the lm case.

findmean

If TRUE, output the average of the estimates from the chunks; otherwise, output only the estimates themselves.

lmargs

Quoted string representing arguments to lm, e.g. R formula, data specification.

glmargs

Quoted string representing arguments to glm, e.g. R formula, data specification, and family argument.

prcompargs

Quoted string representing arguments to prcomp.

p

Number of columns in the data.

na.rm

If TRUE, remove NA values from the analysis.

mtdf

Quoted name of a distributed matrix or data frame.

ncenters

Number of clusters to find.

cacall

If TRUE, indicates that cabase was called by ca.

scramble

If this and cacall are TRUE, randomize the data before distributing.

cols

A quoted string that evaluates to a data frame or matrix.

vec

A quoted string that evaluates to a vector.

yname

A quoted variable name, for the Y vector.

k

Number of nearest neighbors.

xname

A quoted variable name, for the X matrix/data frame. If empty, it is assumed that preprocessx has already been run on the nodes; if nonempty, that function is run on this X data.

ynames

A vector of quoted variable names.

xnames

A vector of quoted variable names.

dataname

Quoted name of a data frame or matrix.

probs

As in the argument of the same name in quantile. Should not include 0.00 or 1.00, as asymptotic normality does not hold at the extremes.

FUN

Quoted name of a function.

Details

Implements the “Software Alchemy” (SA) method for parallelizing statistical computations (N. Matloff, Parallel Computation for Data Science, Chapman and Hall, 2015, with further details in N. Matloff, "Software Alchemy: Turning Complex Statistical Computations into Embarrassingly-Parallel Ones," Journal of Statistical Software, 2016). This can yield substantial speedups in computation, and can also help work around limits on physical memory.

The method breaks the data into chunks and applies the given estimator to each one. The chunk estimates are then averaged and, optionally, an estimated covariance matrix of the averaged estimate is computed.
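
For intuition, here is a minimal non-parallel sketch of the idea, assuming a hypothetical data frame d with response column y; the package itself performs the chunking, averaging and covariance estimation on the cluster:

# split the rows of d into two chunks, fit the model on each chunk,
# then average the chunk estimates to get the SA estimate
chunks <- split(d, rep(1:2, length.out = nrow(d)))
ests <- lapply(chunks, function(ch) coef(lm(y ~ ., data = ch)))
Reduce("+", ests) / length(ests)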

Except for ca, it is assumed that the chunking has already been done, say via distribsplit or readnscramble.

Note that in cabase the data object is not specified explicitly in the argument list; instead, it is referenced inside the function ovf.
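
For example, if each worker already holds its chunk in a data frame named mychunk (a hypothetical name here, distributed beforehand, say via distribsplit), a suitable ovf could simply refer to that name:

# sketch only: 'mychunk' and its 0/1 response column y are assumed to exist on each node
ovf <- function(dummy=NULL) glm(y ~ ., family=binomial, data=mychunk)
cabase(cls, ovf, estf=coef, estcovf=vcov)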

Key point: The SA estimator is statistically equivalent to the original, nonparallel one, in the sense that the two have the SAME asymptotic statistical accuracy. Neither the non-SA nor the SA estimator is "better" than the other, and in practice they will usually be quite close to each other. Since SA would be used only with large data sets in any case (otherwise, parallel computation would not be needed for speed), the reliance on asymptotics should not be an issue. In other words, with SA we achieve the same statistical accuracy while possibly attaining much faster computation.

It is vital to keep in mind that the memory space issue can be just as important as run time. Even if the problem is run on many cores, the run may fail if the total memory space needed exceeds that of the machine.

Wrapper functions apply SA to the corresponding R function (or to a function elsewhere in this package): calm wraps lm, caglm wraps glm, caprcomp wraps prcomp, cakm wraps k-means clustering, cameans and caquantile compute column means and quantiles, caagg is an SA analog of aggregate, and caknn performs k-nearest-neighbor prediction.

A note on NA values: Some R functions such as lm, glm and prcomp have an na.action argument. The default is na.omit, which means that cases with at least one NA value will be discarded. (This is also settable via options().) However, na.omit seems to have no effect in prcomp unless that function's formula option is used. When in doubt, apply the function na.omit directly; e.g. na.omit(d) for a data frame d returns a data frame consisting of only the intact rows of d.
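
For instance, continuing with a data frame d containing some NA entries, one might clean the data before distributing it (dclean is a hypothetical name):

dclean <- na.omit(d)          # keep only the fully intact rows of d
distribsplit(cls, "dclean")   # then form the distributed data frame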

The method assumes that the base estimator is asymptotically normal, and it assumes i.i.d. data. If your data set has been stored in some sorted order, it must be randomized first, say by using the scramble option in distribsplit or by calling readnscramble, depending on whether your data is already in memory or still in a file.
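
For instance, for a hypothetical in-memory data frame dat that had been stored in sorted order:

distribsplit(cls, "dat", scramble=TRUE)  # randomize the rows while distributing
# if the data were still in a file, readnscramble would be used instead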

Value

R list with these components:

tht: the overall estimate, i.e. the average of the chunk estimates (present if findmean is TRUE).

thts: the estimates obtained from the individual chunks, one per chunk.

thtcov: the estimated covariance matrix of tht (present if estcovf is supplied).

The wrapper functions return analogous list elements; for instance, calm and caglm return tht and thtcov, caprcomp returns sdev (and thts), and cakm returns centers, size and thts (see the Examples below).

The wrappers that return thts are useful for algorithms that may expose some instability in the original (i.e. non-SA) algorithm. With prcomp, for instance, the eigenvectors corresponding to the smaller eigenvalues may have high variances in the nonparallel version, which will show up as large chunk-to-chunk differences in thts under SA. Note that this reflects a fundamental problem of the algorithm on the given data set, not one caused by Software Alchemy; on the contrary, an important advantage of the SA approach is that it exposes such problems.

Author(s)

Norm Matloff

References

Matloff N (2016). "Software Alchemy: Turning Complex Statistical Computations into Embarrassingly-Parallel Ones." Journal of Statistical Software, 71(4), 1-15.

Examples

# set up 'parallel' cluster
library(parallel)
library(partools)
cls <- makeCluster(2)
setclsinfo(cls)

# generate simulated test data, as distributed data frame
n <- 10000
p <- 2
tmp <- matrix(rnorm((p+1)*n),nrow=n)
u <- tmp[,1:p]  # "X" values
# add a "Y" col
u <- cbind(u,u %*% rep(1,p) + tmp[,p+1])
# now in u, cols 1,2 are the "X" variables, and col 3 is "Y", 
# with regression coefficients (0,1,1), tmp[,p+1] being the error term
distribsplit(cls,"u")  # form distributed d.f.
# apply the function
#### calm(cls,"u[,3] ~ u[,1]+u[,2]")$tht
calm(cls,"V3 ~ .,data=u")$tht
# check; results should be approximately the same
lm(u[,3] ~ u[,1]+u[,2])
# without the wrapper: within ovf, the chunk of the data distributed by
# ca() is referred to by the name z168
ovf <- function(dummy=NULL) lm(V3 ~ .,data=z168)
ca(cls,u,ovf,estf=coef,estcovf=vcov)$tht

## Not run: 
# Census data on programmers and engineers; include a quadratic term for
# age, due to nonmonotone relation to income
data(prgeng) 
distribsplit(cls,"prgeng") 
caout <- calm(cls,"wageinc ~ age+I(age^2)+sex+wkswrkd,data=prgeng")
caout$tht
# compare to nonparallel
lm(wageinc ~ age+I(age^2)+sex+wkswrkd,data=prgeng)
# get standard errors of the beta-hats
sqrt(diag(caout$thtcov))

# find mean age for all combinations of the cit and sex variables
caagg(cls,"age",c("cit","sex"),"prgeng","mean") 
# compare to nonparallel
aggregate(age ~ cit+sex,data=prgeng,mean)  

data(newadult) 
distribsplit(cls,"newadult") 
caglm(cls," gt50 ~ ., family = binomial,data=newadult")$tht 

caprcomp(cls,'newadult,scale=TRUE',5)$sdev
prcomp(newadult,scale=TRUE)$sdev

cameans(cls,"prgeng")
cameans(cls,"prgeng[,c('age','wageinc')]")
caquantile(cls,'prgeng$age')

pe <- prgeng[,c(1,3,8)] 
distribsplit(cls,"pe") 
z1 <- cakm(cls,'pe',3,3); z1$size; z1$centers 
# inspect the per-chunk estimates; large chunk-to-chunk differences
# indicate instability of the underlying k-means algorithm itself
z1$thts

pe <- prgeng 
pe$ms <- as.integer(pe$educ == 14) 
pe$phd <- as.integer(pe$educ == 16) 
pe <- pe[,c(1,7,8,9,12,13)] 
distribsplit(cls,'pe',scramble=TRUE)
kout <- caknn(cls,'pe[,3]',50,'pe[,-3]') 

## End(Not run)

stopCluster(cls)
