chunkApply: Apply Functions Over Chunks of a List, Vector, or Matrix

chunkApply    R Documentation

Description

Perform equivalents of apply, lapply, and mapply, but over parallelized chunks of data. This is most useful if accessing the data is potentially time-consuming, such as for file-based matter objects. Operating on chunks reduces the number of I/O operations.

Usage

## Operate on elements/rows/columns
chunkApply(X, MARGIN, FUN, ...,
    simplify = FALSE, outpath = NULL,
    verbose = NA, BPPARAM = bpparam())

chunkLapply(X, FUN, ...,
    simplify = FALSE, outpath = NULL,
    verbose = NA, BPPARAM = bpparam())

chunkMapply(FUN, ...,
    simplify = FALSE, outpath = NULL,
    verbose = NA, BPPARAM = bpparam())


## Operate on complete chunks
chunk_rowapply(X, FUN, ...,
    simplify = "c", depends = NULL, permute = FALSE,
    RNG = FALSE, verbose = NA, chunkopts = list(),
    BPPARAM = bpparam())

chunk_colapply(X, FUN, ...,
    simplify = "c", depends = NULL, permute = FALSE,
    RNG = FALSE, verbose = NA, chunkopts = list(),
    BPPARAM = bpparam())

chunk_lapply(X, FUN, ...,
    simplify = "c", depends = NULL, permute = FALSE,
    RNG = FALSE, verbose = NA, chunkopts = list(),
    BPPARAM = bpparam())

chunk_mapply(FUN, ..., MoreArgs = NULL,
    simplify = "c", depends = NULL, permute = FALSE,
    RNG = FALSE, verbose = NA, chunkopts = list(),
    BPPARAM = bpparam())

Arguments

X

A matrix for chunkApply(), a list or vector for chunkLapply(), or lists for chunkMapply(). These may be of any class that implements suitable methods for [, [[, dim(), and length().

MARGIN

If the object is matrix-like, which dimension to iterate over. Must be 1 or 2, where 1 indicates rows and 2 indicates columns. The dimension names can also be used if X has dimnames set.

FUN

The function to be applied.

MoreArgs

A list of other arguments to FUN.

...

Additional arguments to be passed to FUN.

simplify

Should the result be simplified into a vector, matrix, or higher dimensional array?

outpath

If non-NULL, a file path where the results should be written as they are processed. If specified, FUN must return a 'raw', 'logical', 'integer', or 'numeric' vector. The result will be returned as a matter object.

verbose

Should user messages be printed with the current chunk being processed? If NA (the default), this is taken from getOption("matter.default.verbose").

chunkopts

An (optional) list of chunk options including nchunks, chunksize, and serialize. See "Details".

depends

A list with length equal to the extent of X. Each element of depends should give a vector of indices which correspond to other elements of X on which each computation depends. These elements are passed to FUN. For time efficiency, no attempt is made to verify these indices are valid.

permute

Should the order of items be randomized? This may be useful for iterating over random subsets. No attempt is made to re-order the results.

RNG

Should the local random seed (as set by set.seed) be forwarded to the worker processes? If RNGkind is set to "L'Ecuyer-CMRG", then the random seed will be set to appropriate substreams for each chunk or for each element/row/column. Note that forwarding the local random seed incurs additional overhead.

BPPARAM

An optional instance of BiocParallelParam. See documentation for bplapply.

Details

For chunkApply(), chunkLapply(), and chunkMapply():

For vectors and lists, the vector is broken into some number of chunks according to chunkopts. The individual elements of each chunk are then passed to FUN.

For matrices, the matrix is chunked along rows or columns, based on the number of chunks. The individual rows or columns of each chunk are then passed to FUN.

In this way, the first argument passed to FUN is the same as it would be when using the base apply, lapply, and mapply functions.
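The element-wise behavior described above can be sketched in base R. This is only an illustration of the idea, not matter's implementation: the chunking logic and the names nchunks, row_chunks, chunked, and direct are assumptions made for the example.

```r
# Base-R illustration only; this is not how matter implements chunkApply().
set.seed(1)
x <- matrix(rnorm(20 * 5), nrow = 20, ncol = 5)

# Split the row indices into 4 contiguous chunks (assumed chunking scheme).
nchunks <- 4
chunk_id <- cut(seq_len(nrow(x)), breaks = nchunks, labels = FALSE)
row_chunks <- split(seq_len(nrow(x)), chunk_id)

FUN <- mean

# Access the data once per chunk, then pass each row of the chunk to FUN.
chunked <- lapply(row_chunks, function(i) {
    chunk <- x[i, , drop = FALSE]
    lapply(seq_len(nrow(chunk)), function(j) FUN(chunk[j, ]))
})
chunked <- unlist(chunked, recursive = FALSE, use.names = FALSE)

# Reference result: apply FUN to every row without chunking.
direct <- lapply(seq_len(nrow(x)), function(j) FUN(x[j, ]))

identical(chunked, direct)
```

The data are subset once per chunk rather than once per row, which is where the saving comes from when subsetting is expensive.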

For chunk_rowapply(), chunk_colapply(), chunk_lapply(), and chunk_mapply():

In this situation, the entire chunk is passed to FUN, and FUN is responsible for knowing how to handle a sub-vector or sub-matrix of the original object. This may be useful if FUN is already a function that could be applied to the whole object such as rowSums or colSums.

When this is the case, it may be useful to provide a custom simplify function.
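As a rough base-R sketch of the whole-chunk style (again an illustration, not matter's code), the function below receives a sub-matrix rather than a single row. The chunk layout and the combining steps are assumptions chosen for the example.

```r
# Base-R illustration only; chunk layout and combiners are assumptions.
set.seed(1)
x <- matrix(rnorm(20 * 5), nrow = 20, ncol = 5)
row_chunks <- split(seq_len(nrow(x)), rep(1:4, each = 5))

# FUN sees the whole sub-matrix, so a vectorized function such as rowSums
# can be used directly. Concatenating the pieces restores the full result.
per_chunk <- lapply(row_chunks, function(i) rowSums(x[i, , drop = FALSE]))
combined <- do.call(c, unname(per_chunk))
all.equal(combined, rowSums(x))

# Some results need a different combining step. Column sums computed on
# row-chunks are partial sums, so they must be added, not concatenated.
partial <- lapply(row_chunks, function(i) colSums(x[i, , drop = FALSE]))
total <- Reduce(`+`, partial)
all.equal(total, colSums(x))
```

The second case shows why a custom simplify function can matter: the default concatenation would return 20 partial sums instead of 5 totals.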

For convenience to the programmer, several attributes are made available when operating on a chunk.

  • "chunkid": The index of the chunk currently being processed by FUN.

  • "chunklen": The number of elements in the chunk that should be processed.

  • "index": The indices of the elements of the chunk, as elements/rows/columns in the original matrix/vector.

  • "depends" (optional): If depends is given, then this is a list of indices within the chunk. The length of the list is equal to the number of elements/rows/columns in the chunk. Each list element is either NULL or a vector of indices giving the elements/rows/columns of the chunk that should be processed for that index. The indices that should be processed will be non-NULL, and indices that should be ignored will be NULL.
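To make the attributes concrete, the following base-R sketch attaches "chunkid", "chunklen", and "index" to each chunk by hand and reads them inside a function. It emulates the description above; it does not call matter, and the function summarize_chunk is hypothetical.

```r
# Base-R emulation of the chunk attributes; not matter's implementation.
x <- matrix(seq_len(12), nrow = 6)
row_chunks <- split(seq_len(nrow(x)), rep(1:3, each = 2))

# Hypothetical FUN that uses the attributes to label its results.
summarize_chunk <- function(chunk) {
    data.frame(
        row   = attr(chunk, "index"),     # position in the original matrix
        chunk = attr(chunk, "chunkid"),   # which chunk produced the value
        total = rowSums(chunk)
    )
}

pieces <- lapply(seq_along(row_chunks), function(k) {
    chunk <- x[row_chunks[[k]], , drop = FALSE]
    attr(chunk, "chunkid") <- k
    attr(chunk, "chunklen") <- nrow(chunk)
    attr(chunk, "index") <- row_chunks[[k]]
    summarize_chunk(chunk)
})
result <- do.call(rbind, pieces)
result
```

The "index" attribute is what lets a function working on a sub-matrix report results in terms of the original row numbers.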

The depends argument can be used to iterate over dependent elements of a vector, or dependent rows/columns of a matrix. This can be useful if the calculation for a particular row/column/element depends on the values of others.

When depends is provided, multiple rows/columns/elements will be passed to FUN. Each element of the depends list should be a vector giving the indices that should be passed to FUN.

For example, this can be used to implement a rolling apply function.
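A list of the shape expected by depends might be built as follows for a three-point rolling mean. The window width and the variable names are assumptions for the example, and the final step computes the rolling mean in base R only to show what the indices mean; it does not show how matter passes the dependent elements to FUN.

```r
# Sketch: a depends-style list for a centered window of width 3.
x <- c(2, 4, 6, 8, 10, 12)
n <- length(x)

# Element i depends on its neighbors i - 1 and i + 1, clipped at the ends.
depends <- lapply(seq_len(n), function(i) max(1L, i - 1L):min(n, i + 1L))

# Base-R reference computation using the same indices.
rolling <- vapply(depends, function(idx) mean(x[idx]), numeric(1))
rolling
```

The list has one entry per element of x, matching the requirement that depends have length equal to the extent of X.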

Several options are supported by chunkopts to override the global options:

  • nchunks: The number of chunks to use. If omitted, this is taken from getOption("matter.default.nchunks"). For I/O-bound operations, using fewer chunks will often be faster, but will use more memory.

  • chunksize: The approximate chunk size in bytes. If omitted, this is taken from getOption("matter.default.chunksize"). For I/O-bound operations, using larger chunks will often be faster, but will use more memory. If set to NA_real_, then the chunk size is determined by the number of chunks.

  • serialize: Whether data in virtual memory should be realized on the manager and serialized to the workers (TRUE), passed to the workers in virtual memory as-is (FALSE), or if matter should decide the behavior based on the cluster configuration (NA). If omitted, this is taken from getOption("matter.default.serialize"). If all workers have access to the same virtual memory resources (whether file storage or shared memory), then it can be significantly faster to avoid serializing the data.

Value

Typically, a list if simplify=FALSE. Otherwise, the results may be coerced to a vector or array.

Author(s)

Kylie A. Bemis

See Also

apply, lapply, mapply, RNGkind, RNGStreams, SnowfastParam

Examples

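library(matter)
library(BiocParallel)  # provides register() and SerialParam()

# Use a serial backend so the example does not start worker processes
register(SerialParam())

set.seed(1)
x <- matrix(rnorm(1000^2), nrow=1000, ncol=1000)

# Mean of each row, processing the rows in 10 chunks
out <- chunkApply(x, 1L, mean, chunkopts=list(nchunks=10))
head(out)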

kuwisdelu/matter documentation built on Oct. 19, 2024, 10:31 a.m.