Easy parallelization of most statistical computations.
ca(cls,z,ovf,estf,estcovf=NULL,findmean=TRUE,scramble=FALSE)
cabase(cls,ovf,estf,estcovf=NULL,findmean=TRUE,cacall=FALSE,z=NULL,scramble=FALSE)
calm(cls,lmargs)
caglm(cls,glmargs)
caprcomp(cls,prcompargs, p)
cakm(cls,mtdf,ncenters,p)
cameans(cls,cols,na.rm=FALSE)
caquantile(cls,vec, probs = c(0.25, 0.5, 0.75),na.rm=FALSE)
caagg(cls,ynames,xnames,dataname,FUN)
caknn(cls, yname, k, xname='')
carq(cls,rqargs)

cls 
A cluster run under the parallel package. 
z 
A data frame, matrix or vector, one observation per row/element. 
ovf 
Overall statistical function to apply to each chunk, say one that calls lm, as in the Examples. 
estf 
Function to extract the point estimate (typically vector-valued) from the output of ovf. 
estcovf 
If provided, function to extract the estimated covariance matrix of the output of estf. 
findmean 
If TRUE, output the average of the estimates from the chunks; otherwise, output only the estimates themselves. 
lmargs 
Quoted string representing arguments to lm. 
glmargs 
Quoted string representing arguments to glm. 
rqargs 
Quoted string representing arguments to rq. 
prcompargs 
Quoted string representing arguments to prcomp. 
p 
Number of columns in data 
na.rm 
If TRUE, remove NA values from the analysis. 
mtdf 
Quoted name of a distributed matrix or data frame. 
ncenters 
Number of clusters to find. 
cacall 
If TRUE, indicates that cabase was called from ca. 
scramble 
If this and 
cols 
A quoted string that evaluates to a data frame or matrix. 
vec 
A quoted string that evaluates to a vector. 
yname 
A quoted variable name, for the Y vector. 
k 
Number of nearest neighbors. 
xname 
A quoted variable name, for the X matrix/data frame. If
empty, it is assumed that 
ynames 
A vector of quoted variable names. 
xnames 
A vector of quoted variable names. 
dataname 
Quoted name of a data frame or matrix. 
probs 
As in the argument with the same name in quantile. 
FUN 
Quoted name of a function. 
Implements the “Software Alchemy” (SA) method for parallelizing statistical computations (N. Matloff, Parallel Computation for Data Science, Chapman and Hall, 2015, with further details in N. Matloff, Software Alchemy: Turning Complex Statistical Computations into Embarrassingly-Parallel Ones, Journal of Statistical Software, 2016). This can result in substantial speedups in computation, and can also address limits on physical memory.
The method involves breaking the data into chunks, and then applying the given estimator to each one. The results are averaged, and an estimated covariance matrix computed (optional).
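The chunk-and-average idea can be sketched in base R without a cluster. This is only an illustration under stated assumptions, not the package's implementation: the chunk count r, the simulated data, and the use of lapply in place of cluster nodes are all hypothetical, and the covariance line simply applies the standard formula for the average of r independent estimates. The names thts, tht and thtcov mirror the components described under Value.

```r
# Sketch of chunk-and-average with base R only (no cluster required).
set.seed(1)
n <- 10000
x1 <- rnorm(n)
x2 <- rnorm(n)
d <- data.frame(x1 = x1, x2 = x2, y = x1 + x2 + rnorm(n))

r <- 4                                        # number of chunks (assumed)
chunk_id <- rep(seq_len(r), length.out = n)   # data are i.i.d. here, so no scrambling
chunks <- split(d, chunk_id)

# apply the same estimator to every chunk
fits <- lapply(chunks, function(ch) lm(y ~ x1 + x2, data = ch))

# one row of estimates per chunk
thts <- t(sapply(fits, coef))
# chunk-averaged estimate
tht <- colMeans(thts)
# covariance of an average of r independent estimates:
# sum of the chunk covariance matrices, divided by r^2
thtcov <- Reduce(`+`, lapply(fits, vcov)) / r^2

# compare with the estimate from the full data set
full <- coef(lm(y ~ x1 + x2, data = d))
print(rbind(chunk_averaged = tht, full_data = full))
```

With data of this size the two rows printed at the end should agree closely, which is the practical meaning of the averaging step.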
Except for ca, it is assumed that the chunking has already been done, say via distribsplit or readnscramble.
Note that in cabase, the data object is not specified explicitly in the argument list. This is done through the function ovf.
Key point: The SA estimator is statistically equivalent to the original, nonparallel one, in the sense that they have the SAME asymptotic statistical accuracy. Neither the non-SA nor the SA estimator is "better" than the other, and usually they will be quite close to each other anyway. Since we would use SA only with large data sets (otherwise, parallel computation would not be needed for speed), the asymptotic aspect should not be an issue. In other words, with SA we achieve the same statistical accuracy while possibly attaining much faster computation.
It is vital to keep in mind that the memory space issue can be just as important as run time. Even if the problem is run on many cores, the run may fail if the total memory space needed exceeds that of the machine.
Wrapper functions, applying SA to the corresponding R function (or function elsewhere in this package):
calm: Wrapper for lm.
caglm: Wrapper for glm.
caprcomp: Wrapper for prcomp.
cakm: Wrapper for kmeans.
cameans: Wrapper for colMeans.
caquantile: Wrapper for quantile.
caagg: Like distribagg, but finds the average value of FUN across the cluster nodes.
A note on NA values: Some R functions such as lm, glm and prcomp have an na.action argument. The default is na.omit, which means that cases with at least one NA value will be discarded. (This is also settable via options().) However, na.omit seems to have no effect in prcomp unless that function's formula option is used. When in doubt, apply the function na.omit directly; e.g. na.omit(d) for a data frame d returns a data frame consisting of only the intact rows of d.
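As a small illustration of applying na.omit directly, the following uses a made-up data frame d (the values are hypothetical) and keeps only the rows with no NA in any column.

```r
# toy data frame with NA values in two different rows (hypothetical values)
d <- data.frame(x = c(1, NA, 3, 4), y = c(2, 5, NA, 8))

# keep only the intact rows
intact <- na.omit(d)
print(intact)
print(nrow(intact))
```

The cleaned data frame can then be distributed and analyzed in place of d.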
The method assumes that the base estimator is asymptotically normal, and assumes i.i.d. data. If your data set had been stored in some sorted order, it must be randomized first, say using the scramble option in distribsplit or by calling readnscramble, depending on whether your data is already in memory or still in a file.
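To see why sorted data is a problem, here is a base R sketch with a toy data frame. It assumes, purely for illustration, that chunks are formed from contiguous blocks of rows; the data, the chunk count r, and the use of sample() are hypothetical stand-ins for what the scramble option or readnscramble would do.

```r
set.seed(2)
# toy data stored in sorted order by group (hypothetical)
d <- data.frame(id = 1:12, grp = rep(c("a", "b", "c"), each = 4))
r <- 3
block <- rep(seq_len(r), each = nrow(d) / r)   # contiguous chunks of rows

# without scrambling, each chunk sees a single group
sorted_chunks <- split(d, block)
print(sapply(sorted_chunks, function(ch) length(unique(ch$grp))))

# random row permutation before chunking mixes the groups across chunks
shuffled <- d[sample(nrow(d)), ]
shuffled_chunks <- split(shuffled, block)
print(lapply(shuffled_chunks, function(ch) table(ch$grp)))
```

In the unscrambled case the chunks are not samples from the same population, so averaging their estimates would not be justified.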
R list with these components:
thts, the results of applying the requested estimator to the chunks; the estimator from chunk i is in row i
tht, the chunk-averaged overall estimator, if requested
thtcov, the estimated covariance matrix of tht, if available
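Since the method assumes asymptotic normality, tht and thtcov can be combined into approximate confidence intervals. The sketch below uses a hand-built list with hypothetical numbers and names, shaped like the components described above; it is not output from an actual run.

```r
# hypothetical values shaped like the tht and thtcov components
caout <- list(
  tht = c(b0 = 0.02, b1 = 1.01, b2 = 0.98),
  thtcov = diag(c(1e-4, 1e-4, 1e-4))
)

# standard errors from the diagonal of the covariance matrix
se <- sqrt(diag(caout$thtcov))

# approximate 95% intervals based on the normal approximation
zq <- qnorm(0.975)
ci <- cbind(lower = caout$tht - zq * se, upper = caout$tht + zq * se)
print(ci)
```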
The wrapper functions return the following list elements:
calm, caglm: estimated regression coefficients and their estimated covariance matrix
caprcomp: sdev (square roots of the eigenvalues) and rotation, as with prcomp; thts is returned as well.
cakm: centers and size, as with kmeans; thts is returned as well.
The wrappers that return thts are useful for algorithms that may expose some instability in the original (i.e. non-SA) algorithm. With prcomp, for instance, the eigenvectors corresponding to the smaller eigenvalues may have high variances in the nonparallel version, which will be reflected in large differences from chunk to chunk in SA, visible in thts. Note that this reflects a fundamental problem with the algorithm on the given data set, not due to Software Alchemy; on the contrary, an important advantage of the SA approach is to expose such problems.
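The kind of chunk-to-chunk instability described here can be reproduced in base R. The sketch below is an illustration only: the simulated data, the chunk count r, and the sign-alignment rule are assumptions, and it calls prcomp on each chunk directly rather than going through caprcomp.

```r
set.seed(3)
n <- 8000
# the two smallest variances are nearly equal, so the last two
# principal directions are poorly determined (simulated, hypothetical data)
x <- cbind(rnorm(n, sd = 3), rnorm(n, sd = 1), rnorm(n, sd = 0.98))
r <- 8
chunk_id <- rep(seq_len(r), length.out = n)

rot <- lapply(seq_len(r), function(i) {
  rt <- prcomp(x[chunk_id == i, ])$rotation
  # eigenvectors are defined only up to sign; flip each column so that
  # its largest-magnitude entry is positive
  sgn <- apply(rt, 2, function(v) sign(v[which.max(abs(v))]))
  sweep(rt, 2, sgn, `*`)
})

# spread, across chunks, of each entry of each eigenvector
arr <- simplify2array(rot)            # 3 x 3 x r array
spread <- apply(arr, c(1, 2), sd)
print(round(spread, 3))
```

The first column (largest eigenvalue) should show little spread, while the last columns vary considerably from chunk to chunk.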
Norm Matloff
Matloff N (2016). "Software Alchemy: Turning Complex Statistical Computations into Embarrassingly-Parallel Ones." Journal of Statistical Software, 71(4), 1-15.
# set up 'parallel' cluster
cls <- makeCluster(2)
setclsinfo(cls)
# generate simulated test data, as distributed data frame
n <- 10000
p <- 2
tmp <- matrix(rnorm((p+1)*n),nrow=n)
u <- tmp[,1:p] # "X" values
# add a "Y" col
u <- cbind(u,u %*% rep(1,p) + tmp[,p+1])
# now in u, cols 1,2 are the "X" variables, and col 3 is "Y",
# with regress coefs (0,1,1), with tmp[,p+1] being the error term
distribsplit(cls,"u") # form distributed d.f.
# apply the function
#### calm(cls,"u[,3] ~ u[,1]+u[,2]")$tht
calm(cls,"V3 ~ .,data=u")$tht
# check; results should be approximately the same
lm(u[,3] ~ u[,1]+u[,2])
# without the wrapper
ovf <- function(dummy=NULL) lm(V3 ~ .,data=z168)
ca(cls,u,ovf,estf=coef,estcovf=vcov)$tht
## Not run:
# Census data on programmers and engineers; include a quadratic term for
# age, due to nonmonotone relation to income
data(prgeng)
distribsplit(cls,"prgeng")
caout < calm(cls,"wageinc ~ age+I(age^2)+sex+wkswrkd,data=prgeng")
caout$tht
# compare to nonparallel
lm(wageinc ~ age+I(age^2)+sex+wkswrkd,data=prgeng)
# get standard errors of the betahats
sqrt(diag(caout$thtcov))
# find mean age for all combinations of the cit and sex variables
caagg(cls,"age",c("cit","sex"),"prgeng","mean")
# compare to nonparallel
aggregate(age ~ cit+sex,data=prgeng,mean)
data(newadult)
distribsplit(cls,"newadult")
caglm(cls," gt50 ~ ., family = binomial,data=newadult")$tht
caprcomp(cls,'newadult,scale=TRUE',5)$sdev
prcomp(newadult,scale=TRUE)$sdev
cameans(cls,"prgeng")
cameans(cls,"prgeng[,c('age','wageinc')]")
caquantile(cls,'prgeng$age')
pe <- prgeng[,c(1,3,8)]
distribsplit(cls,"pe")
z1 <- cakm(cls,'pe',3,3); z1$size; z1$centers
# check whether the algorithm is unstable
z1$thts # looks unstable
pe <- prgeng
pe$ms <- as.integer(pe$educ == 14)
pe$phd <- as.integer(pe$educ == 16)
pe <- pe[,c(1,7,8,9,12,13)]
distribsplit(cls,'pe',scramble=TRUE)
kout <- caknn(cls,'pe[,3]',50,'pe[,-3]')
## End(Not run)
stopCluster(cls)
