formrowchunks: Utilities for 'parallel' cluster code.
In matloff/partools: Tools for the 'Parallel' Package

View source: R/SnowUtils.R

formrowchunks,addlists,matrixtolist,setclsinfo,getpte,distribsplit,distribcat,distribagg,distribrange,distribcounts,distribgetrows,docmd,doclscmd,geteltis,distribmeans,distribisdt,ipstrcat,dwhich.min,dwhich.max

R Documentation

Utilities for parallel cluster code.

Description

Miscellaneous code snippets for use with the parallel package, including “Snowdoop.”

Usage

formrowchunks(cls,m,mchunkname,scramble=FALSE) 
matrixtolist(rc,m) 
addlists(lst1,lst2,add)
setclsinfo(cls)
getpte()
exportlibpaths(cls)
distribsplit(cls,dfname,scramble=FALSE)
distribcat(cls,dfname) 
distribagg(cls,ynames,xnames,dataname,FUN,FUNdim=1,FUN1=FUN) 
distribrange(cls,vec,na.rm=FALSE) 
distribcounts(cls,xnames,dataname)
distribmeans(cls,ynames,xnames,dataname,saveni=FALSE)
dwhich.min(cls,vecname)
dwhich.max(cls,vecname)
distribgetrows(cls,cmd)
distribisdt(cls,dataname) 
docmd(toexec) 
doclscmd(cls,toexec) 
geteltis(lst,i)
ipstrcat(str1 = stop("str1 not supplied"), ..., outersep = "", innersep = "")

Arguments

`cls`	A cluster for the parallel package.
`scramble`	If TRUE, randomize the row order in the resulting data frame.
`rc`	Set to 1 for rows, other for columns.
`m`	A matrix or data frame.
`mchunkname`	Quoted name to be given to the created chunks.
`lst1`	An R list.
`lst2`	An R list.
`add`	“Addition” function, which could be summation, concatenation and so on.
`dfname`	Quoted name of a data frame, either centralized or distributed.
`ynames`	Vector of quoted names of variables on which `FUN` is to be applied.
`vecname`	Quoted name of a vector.
`...`	One of more vectors of character strings, where the vectors are typically of length 1.
`xnames`	Vector of quoted names of variables that define the grouping.
`dataname`	Quoted name of a distributed data frame or data.table.
`saveni`	If TRUE, save the chunk sizes.
`FUN`	Quoted name of a single-argument function to be used in aggregating within cluster nodes. If `dataname` is the name of a data.table, `FUN` must be a vector of function names, of length equal to that of `ynames`.
`FUNdim`	Number of elements in the return value of `FUN`. Must be 1 for data.tables.
`FUN1`	Quoted name of function to be used in aggregation between cluster nodes.
`vec`	Quoted expression that evaluates to a vector.
`na.rm`	Remove NA values.
`cmd`	An R command.
`toexec`	Quoted string containing command to be executed.
`lst`	An R list of vectors.
`i`	A column index
`str1`	A character string.
`outersep`	Separator, e.g. a comma, between strings specified in ...
`innersep`	Separator, e.g. a comma, within string vectors specified in ...

Details

The setclsinfo function does initialization needed for use of the tools in the package.

formrowchunks splits m into chunks of rows and puts each chunk into a global variable called mchunkname in the global space of the worker.

A call to matrixtolist extracts the rows or columns of a matrix or data frame and forms an R list from them.

The function addlists does the following: Say we have two lists, with numeric values. We wish to form a new list, with all the keys (names) from the two input lists appearing in the new list. In the case of a key in common to the two lists, the value in the new list will be the sum of the two individual values for that key. (Here “sum” means the result of applying add.) For a key appearing in one list and not the other, the value in the new list will be the value in the input list.

The function exportlibpaths, invoked from the manager, exports the manager's R search path to the workers.

The function distribsplit splits a data frame dfname into approximately equal-sized chunks of rows, placing the chunks on the cluster nodes, as global variables of the same name. The opposite action is taken by distribcat, coalsecing variables of the given name in the cluster nodes into one grand data frame as the calling (i.e. manager) node.

The package's distribagg function is a distributed (and somewhat restricted) form of aggregate. The latter is called to each distributed chunk with the function FUN. The manager collects the results and calls FUN1.

The special cases of aggregating counts and means is handled by the wrappers distribcounts and distribmeans. In each case, cells are defined by xnames, and aggregation done first within workers and then across workers.

The distribrange function is a distributed form of range.

The dwhich.min and dwhich.max functions are distributed analogs of R's which.min and which.max.

The distribgetrows function is useful in a variety of situations. It can be used, for instance, as a distributed form of select. In the latter case, the specified rows will be selected at each cluster node, then rbind-ed together at the caller.

The docmd function executes the quoted command, useful for building up complex command for remote execution. The doclscmd function does that directly.

An R formula will be constructed from the arguments ynames and xnames, with the latter put on the left side of the ~ sign, with cbind for combining, and the latter put on the right side, with + signs as delimiters.

The geteltis function extracts from an R list of vectors element i from each.

Value

In the case of addlists, the return value is the new list.

The distribcat function returns the concatenated data frame; distribgetrows works similarly.

The distribagg function returns a data frame, the same as would a call to aggregate, though possibly in different row order; distribcounts works similarly.

The dwhich.min and dwhich.max functions each return a two-tuple, consisting of the node number and row number which node at which the min or max occurs.

Author(s)

Norm Matloff

Examples

# examples of addlists()
l1 <- list(a=2, b=5, c=1)
l2 <- list(a=8, c=12, d=28)
addlists(l1,l2,sum)  # list with a=10, b=5, c=13, d=28
z1 <- list(x = c(5,12,13), y = c(3,4,5))
z2 <- list(y = c(8,88))
addlists(z1,z2,c)  # list with x=(5,12,13), y=(3,4,5,8,88)

# need 'parallel' cluster for the remaining examples
cls <- makeCluster(2)
setclsinfo(cls)

# check it
clusterEvalQ(cls,partoolsenv$myid)  # returns 1, 2
clusterEvalQ(cls,partoolsenv$ncls)  # returns 2, 2

# formrowchunks example; see up a matrix to be distributed first
m <- rbind(1:2,3:4,5:6)
# apply the function
formrowchunks(cls,m,"mc")
# check results
clusterEvalQ(cls,mc)  # list of a 1x2 and a 2x2 matrix

matrixtolist(1,m)  # 3-component list, first is (1,2)

# test of of distribagg(): 
# form and distribute test data
x <- sample(1:3,10,replace=TRUE)
y <- sample(0:1,10,replace=TRUE)
u <- runif(10)
v <- runif(10)
d <- data.frame(x,y,u,v)
distribsplit(cls,"d")
# check that it's there at the cluster nodes, in distributed form
clusterEvalQ(cls,d) 
d
# try the aggregation function
distribagg(cls,c("u","v"), c("x","y"),"d","max")
# check result
aggregate(cbind(u,v) ~ x+y,d,max)

# real data
mtc <- mtcars
distribsplit(cls,"mtc")

distribagg(cls,c("mpg","disp","hp"),c("cyl","gear"),"mtc","max")
# check
aggregate(cbind(mpg,disp,hp) ~ cyl+gear,data=mtcars,FUN=max)

distribcounts(cls,c("cyl","gear"),"mtc")
# check
table(mtc$cyl,mtc$gear)

# find mean mpg, hp for each cyl/gear combination
distribmeans(cls,c('mpg','hp'),c('cyl','gear'),'mtc')

# extract and collect all the mtc rows in which the number of cylinders is 8
distribgetrows(cls,'mtc[mtc$cyl == 8,]')
# check
mtc[mtc$cyl == 8,]

# same for data.tables
mtc <- as.data.table(mtc)
setkey(mtc,cyl)
distribsplit(cls,'mtc')
distribcounts(cls,c("cyl","gear"),"mtc")
distribmeans(cls,c('mpg','hp'),c('cyl','gear'),'mtc')

dwhich.min(cls,'mtc$mpg')  # smallest is at node 1, row 15
dwhich.max(cls,'mtc$mpg')  # largest is at node 2, row 4

stopCluster(cls)

matloff/partools documentation built on Oct. 20, 2022, 2:52 p.m.