big_parallelize: Split-parApply-Combine
In bigstatsr: Statistical Tools for Filebacked Big Matrices

big_parallelize

R Documentation

Split-parApply-Combine

Description

A Split-Apply-Combine strategy to parallelize the evaluation of a function.

Usage

big_parallelize(
  X,
  p.FUN,
  p.combine = NULL,
  ind = cols_along(X),
  ncores = nb_cores(),
  ...
)

Arguments

`X`	An object of class FBM.
`p.FUN`	The function to be applied to each subset matrix. It must take a Filebacked Big Matrix as first argument and an argument `ind`, a vector of indices used to split the data. For example, if you want to apply a function to `X[ind.row, ind.col]`, you may use `X[ind.row, ind.col[ind]]` in `p.FUN`.
`p.combine`	Function to combine the results with `do.call`. This function should accept multiple arguments (`...`). For example, you can use `c`, `cbind`, `rbind`. This package also provides the function `plus` to add multiple arguments together. The default is `NULL`, in which case the results are not combined and are returned as a list, each element being the result of a block.
`ind`	Initial vector of subsetting indices. Default is the vector of all column indices.
`ncores`	Number of cores used. The default shown in Usage is `nb_cores()`; set `ncores = 1` to avoid parallelism.
`...`	Extra arguments to be passed to `p.FUN`.

Details

This function splits the indices into parts, applies a given function to each part, and finally combines the results.
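
The three steps described above can be illustrated with a small base-R sketch. This is not the bigstatsr implementation: it runs sequentially on an ordinary matrix, and `split_apply_combine`, `nparts`, and `col_sums_sub` are hypothetical names introduced only for illustration.

```r
# Base-R sketch of split-apply-combine over column indices.
# Assumption: sequential `lapply` stands in for the parallel workers,
# and an ordinary matrix stands in for an FBM.
split_apply_combine <- function(X, FUN, combine = NULL,
                                ind = seq_len(ncol(X)), nparts = 2, ...) {
  # 1. Split: assign each index to one of `nparts` contiguous blocks.
  n <- length(ind)
  block_id <- rep(seq_len(nparts), each = ceiling(n / nparts), length.out = n)
  parts <- split(ind, block_id)

  # 2. Apply: call FUN once per block, passing the block's indices as `ind`.
  res <- unname(lapply(parts, function(part) FUN(X, ind = part, ...)))

  # 3. Combine: return the list as is, or merge it with do.call().
  if (is.null(combine)) res else do.call(combine, res)
}

X <- matrix(1:20, nrow = 4)
col_sums_sub <- function(X, ind) colSums(X[, ind, drop = FALSE])

res <- split_apply_combine(X, col_sums_sub, combine = "c", nparts = 2)
all.equal(res, colSums(X))
```

Because each block only sees its own `ind`, the combined result matches the unsplit computation as long as the combine function preserves block order.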

Value

Returns a list of ncores elements, each element being the result of one of the cores, computed on a block. The elements of this list are then combined with do.call(p.combine, .) if p.combine is given.
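
To make the role of the combining function concrete, the following base-R sketch mimics what happens to a list of per-block results. The `blocks` data and the `add_all` helper are made up for illustration; `add_all` is only a stand-in for a variadic adding function and is not the package's `plus`.

```r
# Pretend these are the results returned by two blocks.
blocks <- list(
  data.frame(sum = c(1, 2), var = c(0.5, 0.1)),
  data.frame(sum = 3, var = 0.2)
)

# No combining function: the caller receives the list itself.
as_list <- blocks

# With rbind as the combining function: one data frame, blocks stacked in order.
stacked <- do.call(rbind, blocks)

# A combining function must accept any number of arguments (`...`).
# Hypothetical helper that adds all of its arguments elementwise.
add_all <- function(...) Reduce(`+`, list(...))
total <- do.call(add_all, list(c(1, 2), c(10, 20), c(100, 200)))
```

Here `stacked` has one row per column processed across both blocks, and `total` is the elementwise sum of the three vectors.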

Examples

## Not run:  # CRAN is very slow when running code in parallel.
  X <- big_attachExtdata()

  ### Computation on all the matrix
  true <- big_colstats(X)

  big_colstats_sub <- function(X, ind) {
    big_colstats(X, ind.col = ind)
  }
  # 1. the computation is split along all the columns
  # 2. for each part the computation is done, using `big_colstats`
  # 3. the results (data.frames) are combined via `rbind`.
  test <- big_parallelize(X, p.FUN = big_colstats_sub,
                          p.combine = 'rbind', ncores = 2)
  all.equal(test, true)

  ### Computation on a part of the matrix
  n <- nrow(X)
  m <- ncol(X)
  rows <- sort(sample(n, n/2)) # sort to provide some locality in accesses
  cols <- sort(sample(m, m/2)) # idem

  true2 <- big_colstats(X, ind.row = rows, ind.col = cols)

  big_colstats_sub2 <- function(X, ind, rows, cols) {
    big_colstats(X, ind.row = rows, ind.col = cols[ind])
  }
  # This doesn't work because, by default, the computation is spread
  # along all columns. We must explicitly specify the `ind` parameter.
  tryCatch(big_parallelize(X, p.FUN = big_colstats_sub2,
                           p.combine = 'rbind', ncores = 2,
                           rows = rows, cols = cols),
           error = function(e) message(e))

  # This now works, using `ind = seq_along(cols)`.
  test2 <- big_parallelize(X, p.FUN = big_colstats_sub2,
                           p.combine = 'rbind', ncores = 2,
                           ind = seq_along(cols),
                           rows = rows, cols = cols)
  all.equal(test2, true2)


## End(Not run)

bigstatsr documentation built on Sept. 11, 2024, 7:08 p.m.