knitr::opts_chunk$set(echo = TRUE) options(max.print = 20)
Function big_apply()
enables to apply standard R functions to an FBM, by relying on split-apply-combine strategy where only a subset of the matrix is accessed at once. This allows for a compromise between the amount of maximum memory needed, and the speed of the operation.
In this tutorial, I showcase two examples, the imputation of a 'double' FBM and the multiplication of two FBMs.
Some data to work with:
library(bigstatsr) X <- FBM(2000, 2000, init = rnorm(2000^2)) X[1, ] <- NA
To impute by the mean of each column, you can use the following code:
big_apply(X, function(X, ind) { # have an idea of progress print(ind[1]) # access a subset of columns as a standard R matrix X.sub <- X[, ind, drop = FALSE] # get the location (i, j) of missing values ind_na <- which(is.na(X.sub), arr.ind = TRUE) # compute the corresponding mean for each column j means <- colMeans(X.sub, na.rm = TRUE)[ind_na[, 2]] # update j (relative to subset) to global 'ind' ind_na[, 2] <- ind[ind_na[, 2]] # fill positions with corresponding means X[ind_na] <- means # here we don't want to return anything, so `NULL` NULL }, a.combine = 'c', block.size = 500)
What is going on:
The function split indices (cols_along(X)
by default, can be changed using parameter ind
) to have maximum size of block.size
.
Then, this is passed as ind
into the function and can be used to access a subset of columns of the FBM with X[, ind, drop = FALSE]
(which is then a standard R matrix).
Then you can compute the colmeans by applying a standard R function on this subset, using e.g. colMeans(X.sub, na.rm = TRUE)
.
Then, the missing values can be filled with the corresponding column means and assigned back to the FBM (be aware of the relative column indices ind
versus the global ones cols_along(X)
).
Then, we return NULL
because we are not interested in returning anything here. We also use a.combine = 'c'
to combine all NULL
corresponding to each block to return only one NULL
instead of a list of NULL
.
If you would like to return e.g. the column means that you used for imputation, you could e.g. replace NULL
by the result of colMeans(X.sub, na.rm = TRUE)
, and keep a.combine = 'c'
.
# Verification X[1, ] big_scale()(X)$center
Imagine you have two large FBMs
# here they are not that large but it is just for the example N <- 5000 M1 <- 1000 M2 <- 2000 library(bigstatsr) X1 <- FBM(N, M1, init = 1) X2 <- FBM(N, M2, init = 2)
How to compute the cross-product $X_1^T X_2$?
The first thing is to ask whether you really want to do this, and questioned the dimension of the resulting matrix (here M1
$\times$ M2
).
There are many solutions to this problem, that depends mainly of the size of your matrices.
The first simple solution when X2
is small is to access it as a standard R matrix and to use.
cprod1 <- big_cprodMat(X1, X2[]) dim(cprod1)
If X1
is small and the resulting product is small, then you can also use
cprod2 <- t(big_cprodMat(X2, X1[])) all.equal(cprod2, cprod1)
If the matrices are larger, especially X2
, then you can compute the cross-product only for a subset of columns of X2
, which gives you only a subset of columns of the result, which you can then combine.
This can be implemented using
cprod3 <- big_apply(X2, function(X, ind) { print(ind[1]) big_cprodMat(X1, X[, ind, drop = FALSE]) }, a.combine = "cbind", block.size = 500) all.equal(cprod3, cprod1)
To use parallelism, you can e.g. use bigparallelr::set_blas_ncores()
to allow for parallel matrix operations if R is linked to some parallel matrix library such as MKL or OpenBLAS.
You can also use the parallelism from big_apply()
; this won't work as is:
big_apply(X2, function(X, ind) { print(ind[1]) big_cprodMat(X1, X[, ind, drop = FALSE]) }, a.combine = "cbind", block.size = 500, ncores = nb_cores())
You need two things here, which is basically telling the parallel clusters what are big_cprodMat
and X1
; you can do
cprod4 <- big_apply(X2, function(X, ind, X1) { print(ind[1]) bigstatsr::big_cprodMat(X1, X[, ind, drop = FALSE]) }, a.combine = "cbind", block.size = 500, ncores = nb_cores(), X1 = X1) all.equal(cprod4, cprod1)
Basically, you need to tell explicitly in which package to find the function, and to pass the other variables as parameters in the function and in big_apply()
. For more information on these matters, please have a look at this tutorial on parallelism with R.
Also note that each core will use at most a block size of 500 here, so at most 500 x ncores, so you might need to reduce the block size a bit when using parallelism (this is done by default when not specifying block.size
).
Another strategy, to save a bit of memory, is to preallocate a resulting FBM and fill it:
cprod5 <- FBM(ncol(X1), ncol(X2)) big_apply(X2, function(X, ind, X1, res) { res[, ind] <- bigstatsr::big_cprodMat(X1, X[, ind, drop = FALSE]) NULL }, a.combine = "c", block.size = 500, ncores = nb_cores(), X1 = X1, res = cprod5) all.equal(cprod5[], cprod1)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.