chunkApply: Apply Functions Over Chunks of a List, Vector, or Matrix

Description

Perform equivalents of apply, lapply, and mapply, but over parallelized chunks of data. This is most useful if accessing the data is potentially time-consuming, such as for file-based matter objects. Operating on chunks reduces the number of I/O operations.

Usage

## Operate on elements/rows/columns
chunkApply(X, MARGIN, FUN, ...,
    simplify = FALSE, outpath = NULL,
    verbose = NA, BPPARAM = bpparam())

chunkLapply(X, FUN, ...,
    simplify = FALSE, outpath = NULL,
    verbose = NA, BPPARAM = bpparam())

chunkMapply(FUN, ...,
    simplify = FALSE, outpath = NULL,
    verbose = NA, BPPARAM = bpparam())


## Operate on complete chunks
chunk_rowapply(X, FUN, ...,
    simplify = "c", nchunks = NA, depends = NULL,
    seeds = NULL, verbose = NA, BPPARAM = bpparam())

chunk_colapply(X, FUN, ...,
    simplify = "c", nchunks = NA, depends = NULL,
    seeds = NULL, verbose = NA, BPPARAM = bpparam())

chunk_lapply(X, FUN, ...,
    simplify = "c", nchunks = NA, depends = NULL,
    seeds = NULL, verbose = NA, BPPARAM = bpparam())

chunk_mapply(FUN, ..., MoreArgs = NULL,
    simplify = "c", nchunks = NA, depends = NULL,
    seeds = NULL, verbose = NA, BPPARAM = bpparam())

Arguments

X

A matrix for chunkApply(), a list or vector for chunkLapply(), or lists for chunkMapply(). These may be any class that implements suitable methods for [, [[, dim, and length().

MARGIN

If the object is matrix-like, which dimension to iterate over. Must be 1 or 2, where 1 indicates rows and 2 indicates columns. The dimension names can also be used if X has dimnames set.

FUN

The function to be applied.

MoreArgs

A list of other arguments to FUN.

...

Additional arguments to be passed to FUN.

simplify

Should the result be simplified into a vector, matrix, or higher dimensional array?

nchunks

The number of chunks to use. If NA (the default), this is taken from getOption("matter.default.nchunks"). For IO-bound operations, using fewer chunks is often faster but uses more memory.

depends

A list with length equal to the extent of X. Each element of depends should give a vector of indices which correspond to other elements of X on which each computation depends. These elements are passed to FUN. For time efficiency, no attempt is made to verify these indices are valid.

seeds

A list of RNG seeds, such as those returned by RNGStreams. If specified, there must be as many seeds as there are chunks. Seeds are set per chunk. RNGkind must be set to "L'Ecuyer-CMRG" to ensure parallel-safe RNG; otherwise, results may not be as expected.

outpath

If non-NULL, a file path where the results should be written as they are processed. If specified, FUN must return a 'raw', 'logical', 'integer', or 'numeric' vector. The result will be returned as a matter object.

verbose

Should user messages be printed with the current chunk being processed? If NA (the default), this is taken from getOption("matter.default.verbose").

BPPARAM

An optional instance of BiocParallelParam. See documentation for bplapply.
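For the seeds argument above, one L'Ecuyer-CMRG seed is needed per chunk. The following is a minimal sketch of producing such a list with base R and the parallel package only. Whether a plain list built this way is accepted in place of the output of RNGStreams is an assumption; the variable names are illustrative.

```r
# Sketch only: builds one L'Ecuyer-CMRG stream seed per chunk using base R.
# Assumption: a plain list of such seeds is an acceptable value for `seeds`.
RNGkind("L'Ecuyer-CMRG")
set.seed(1)

nchunks <- 4L
seeds <- vector("list", nchunks)

# The first stream is the current state of the generator
seeds[[1]] <- .Random.seed

# Each later stream is derived from the previous one
for (i in seq_len(nchunks)[-1]) {
    seeds[[i]] <- parallel::nextRNGStream(seeds[[i - 1]])
}

length(seeds)
```

The length of the list must match the value passed as nchunks.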

Details

For chunkApply(), chunkLapply(), and chunkMapply():

For vectors and lists, the object is broken into chunks according to nchunks. The individual elements of each chunk are then passed to FUN.

For matrices, the matrix is chunked along rows or columns, based on the number of chunks. The individual rows or columns of the chunk are then passed to FUN.

In this way, FUN receives its first argument just as it would from the base apply, lapply, and mapply functions.
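As a rough illustration of the row case, the base-R sketch below splits the row indices into contiguous chunks, subsets each chunk once, and then hands individual rows to FUN. The helper chunk_rows_sketch and its use of cut() to form the chunks are assumptions made for this example; this is not matter's implementation.

```r
# Hypothetical helper: a base-R sketch of element-wise chunked apply over rows.
# It only illustrates the idea and is not how matter is implemented.
chunk_rows_sketch <- function(X, FUN, ..., nchunks = 4L) {
    # Assign each row index to one of `nchunks` contiguous groups
    groups <- cut(seq_len(nrow(X)), breaks = nchunks, labels = FALSE)
    index <- split(seq_len(nrow(X)), groups)
    out <- lapply(index, function(i) {
        # One subsetting operation (one "read") per chunk
        chunk <- X[i, , drop = FALSE]
        # FUN still sees one row at a time, as with apply(X, 1, FUN)
        lapply(seq_len(nrow(chunk)), function(j) FUN(chunk[j, ], ...))
    })
    # Flatten the per-chunk lists into one list with an entry per row
    unlist(out, recursive = FALSE, use.names = FALSE)
}

x <- matrix(as.numeric(1:20), nrow = 10, ncol = 2)
res <- chunk_rows_sketch(x, mean, nchunks = 3L)

# Compare with the unchunked base-R result
all.equal(unlist(res), apply(x, 1, mean))
```

The data are touched once per chunk rather than once per row, which is where the reduction in I/O comes from for file-based objects.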

For chunk_rowapply(), chunk_colapply(), chunk_lapply(), and chunk_mapply():

In this situation, the entire chunk is passed to FUN, and FUN is responsible for knowing how to handle a sub-vector or sub-matrix of the original object. This may be useful if FUN is already a function that could be applied to the whole object such as rowSums or colSums.

When this is the case, it may be useful to provide a custom simplify function.
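A minimal sketch of the whole-chunk functions, using the signatures from the Usage section. It assumes that the default simplify = "c" concatenates the per-chunk vectors in chunk order; the matrix size and nchunks value are arbitrary.

```r
library(matter)
library(BiocParallel)

# Run serially so the example does not depend on a parallel backend
register(SerialParam())

set.seed(1)
x <- matrix(rnorm(100 * 20), nrow = 100, ncol = 20)

# FUN receives a block of rows, so rowSums() can be applied to it directly
rs <- chunk_rowapply(x, rowSums, nchunks = 5)

# Likewise, FUN receives a block of columns here
cs <- chunk_colapply(x, colSums, nchunks = 5)

# Compare against the unchunked results
all.equal(as.vector(rs), rowSums(x))
all.equal(as.vector(cs), colSums(x))
```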

For convenience to the programmer, several attributes are made available when operating on a chunk.

  • "chunkid": The index of the chunk currently being processed by FUN.

  • "index": The indices of the elements of the chunk, as elements/rows/columns in the original matrix/vector.

  • "depends" (optional): If depends is given, then this is a list of indices within the chunk. The length of the list is equal to the number of elements/rows/columns in the chunk. Each list element is either NULL or a vector of indices giving the elements/rows/columns of the chunk that should be processed for that index. The indices that should be processed will be non-NULL, and indices that should be ignored will be NULL.
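The attributes listed above can be read inside FUN with attr(). The sketch below assumes they are attached to the chunk object that FUN receives as its first argument and that the default simplify = "c" concatenates results in chunk order; the small integer matrix is only for illustration.

```r
library(matter)
library(BiocParallel)

# Run serially so the example does not depend on a parallel backend
register(SerialParam())

x <- matrix(1:40, nrow = 8, ncol = 5)

# Return the original row indices carried by each chunk
idx <- chunk_rowapply(x, function(chunk) attr(chunk, "index"), nchunks = 4)

# Return the chunk id once per row, to show which chunk held each row
ids <- chunk_rowapply(x, function(chunk) {
    rep(attr(chunk, "chunkid"), nrow(chunk))
}, nchunks = 4)

# Side by side: original row index and the chunk that processed it
cbind(row = idx, chunk = ids)
```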

The depends argument can be used to iterate over dependent elements of a vector, or dependent rows/columns of a matrix. This can be useful if the calculation for a particular row/column/element depends on the values of others.

When depends is provided, multiple rows/columns/elements will be passed to FUN. Each element of the depends list should be a vector giving the indices that should be passed to FUN.

For example, this can be used to implement a rolling apply function.
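As a sketch of the rolling case, a depends list for a centered window could be built as below. The halfwidth parameter is hypothetical, and the vapply() line is a base-R reference for what each window computes; it is not a call into matter.

```r
x <- c(2, 4, 6, 8, 10)
n <- length(x)
halfwidth <- 1L  # hypothetical window parameter for this sketch

# One entry per element of x: the indices of the window centered on it,
# clipped at the ends of the vector
depends <- lapply(seq_len(n), function(i) {
    seq(max(1L, i - halfwidth), min(n, i + halfwidth))
})

# Base-R reference: the rolling mean computed from those windows
rolling <- vapply(depends, function(idx) mean(x[idx]), numeric(1))
rolling
```

A list of this shape has length equal to the extent of x, which is what the depends argument requires.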

Value

Typically, a list if simplify=FALSE. Otherwise, the results may be coerced to a vector or array.

Author(s)

Kylie A. Bemis

See Also

apply, lapply, mapply, RNGkind, RNGStreams

Examples

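library(matter)
library(BiocParallel)

# Run serially; register() and SerialParam() come from BiocParallel
register(SerialParam())

set.seed(1)
x <- matrix(rnorm(1000^2), nrow=1000, ncol=1000)

# Mean of each row, processed in 10 chunks
out <- chunkApply(x, 1L, mean, nchunks=10)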

kuwisdelu/matter documentation built on May 1, 2024, 5:17 a.m.