Introduction

The aim of this document is to measure the performance of the HDF5Array package for normalization and PCA of on-disk single cell data, two computationally intensive operations at the core of single cell analysis.

The benchmarks presented in the document were specifically designed to observe the impact of two critical parameters on performance:

  1. data representation (i.e. sparse vs dense);
  2. size of the blocks used for block-processed operations.

Hopefully these benchmarks will also facilitate comparing the performance of single cell analysis workflows based on HDF5Array with workflows based on other tools like Seurat or Scanpy.

Install and load the required packages

Let's install and load HDF5Array as well as the other packages used in this vignette:

if (!require("BiocManager", quietly=TRUE))
    install.packages("BiocManager")

pkgs <- c("HDF5Array", "ExperimentHub", "DelayedMatrixStats", "RSpectra")
BiocManager::install(pkgs)

Load the packages:

library(HDF5Array)
library(ExperimentHub)
library(DelayedMatrixStats)
library(RSpectra)
## Needed for the make_timings_table() function.
path <- system.file(package="HDF5Array",
                    "scripts", "make_timings_table.R", mustWork=TRUE)
source(path, verbose=FALSE)

The test datasets

Sparse vs dense representation

The datasets that we will use for our benchmarks are subsets of the 1.3 Million Brain Cell Dataset from 10x Genomics.

The 1.3 Million Brain Cell Dataset is a 27,998 x 1,306,127 matrix of counts with one gene per row and one cell per column. It's available via the ExperimentHub package in two forms, one that uses a sparse representation and one that uses a dense representation:

hub <- ExperimentHub()
hub["EH1039"]$description  # sparse representation
hub["EH1040"]$description  # dense representation

The two datasets are big HDF5 files stored at a remote location. Let's download them to the local ExperimentHub cache if they are not there yet:

## Note that this will be quick if the HDF5 files are already in the
## local ExperimentHub cache. Otherwise, it will take a while!
full_sparse_h5 <- hub[["EH1039"]]
full_dense_h5  <- hub[["EH1040"]]

TENxMatrix vs HDF5Matrix objects

We use the TENxMatrix() and HDF5Array() constructors to bring the sparse and dense datasets into R as DelayedArray derivatives. Note that this does not load the matrix data into memory.

Bring the sparse dataset into R:

## Use 'h5ls(full_sparse_h5)' to find out the group.
full_sparse <- TENxMatrix(full_sparse_h5, group="mm10")

Note that full_sparse is a 27,998 x 1,306,127 TENxMatrix object:

class(full_sparse)
dim(full_sparse)

See ?TENxMatrix in the HDF5Array package for more information about TENxMatrix objects.

Bring the dense dataset into R:

## Use 'h5ls(full_dense_h5)' to find out the name of the dataset.
full_dense <- HDF5Array(full_dense_h5, name="counts")

Note that full_dense is a 27,998 x 1,306,127 HDF5Matrix object that contains the same data as full_sparse:

class(full_dense)
dim(full_dense)

See ?HDF5Matrix in the HDF5Array package for more information about HDF5Matrix objects.

Finally note that the dense HDF5 file does not contain the dimnames of the matrix, so we manually add them to full_dense by copying them from full_sparse:

dimnames(full_dense) <- dimnames(full_sparse)

Create the test datasets

For our benchmarks below, we create subsets of the 1.3 Million Brain Cell Dataset of increasing sizes: subsets with 12,500 cells, 25,000 cells, 50,000 cells, 100,000 cells, and 200,000 cells. Note that subsetting a TENxMatrix or HDF5Matrix object with [ is a delayed operation, so it has virtually no cost:

sparse1 <- full_sparse[ , 1:12500]
dense1  <- full_dense[ , 1:12500]

sparse2 <- full_sparse[ , 1:25000]
dense2  <- full_dense[ , 1:25000]

sparse3 <- full_sparse[ , 1:50000]
dense3  <- full_dense[ , 1:50000]

sparse4 <- full_sparse[ , 1:100000]
dense4  <- full_dense[ , 1:100000]

sparse5 <- full_sparse[ , 1:200000]
dense5  <- full_dense[ , 1:200000]

Block-processed normalization and PCA

Code used for normalization and PCA

We'll use the following code for normalization:

## Also does variable gene selection by keeping the 1000 most variable
## genes by default.
simple_normalize <- function(mat, num_variable_genes=1000)
{
    stopifnot(length(dim(mat)) == 2, !is.null(rownames(mat)))
    mat <- mat[rowSums(mat) > 0, ]           # drop genes with zero counts
    mat <- t(t(mat) * 10000 / colSums(mat))  # scale each cell to 10,000 counts
    row_vars <- rowVars(mat)
    rv_order <- order(row_vars, decreasing=TRUE)
    variable_idx <- head(rv_order, n=num_variable_genes)
    mat <- log1p(mat[variable_idx, ])        # keep most variable genes, log-transform
    mat / rowSds(mat)                        # scale each gene to unit standard deviation
}

and the following code for PCA, which uses RSpectra::svds() to compute the top k singular vectors of the implicitly row-centered matrix:

simple_PCA <- function(mat, k=25)
{
    stopifnot(length(dim(mat)) == 2)
    row_means <- rowMeans(mat)
    ## Multiply the implicitly row-centered matrix (or its transpose) by a
    ## vector, without ever materializing the centered matrix.
    Ax <- function(x, args)
        (as.numeric(mat %*% x) - row_means * sum(x))
    Atx <- function(x, args)
        (as.numeric(x %*% mat) - as.vector(row_means %*% x))
    RSpectra::svds(Ax, Atrans=Atx, k=k, dim=dim(mat))
}

Block processing and block size

Note that the implementations of simple_normalize() and simple_PCA() are expected to work on any matrix-like object, regardless of its exact type/representation, e.g. an ordinary matrix, a SparseMatrix object from the SparseArray package, a dgCMatrix object from the Matrix package, a DelayedMatrix object or derivative (TENxMatrix, HDF5Matrix, TileDBMatrix), etc.
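
Just to illustrate this, here is a minimal sketch (toy simulated data, not one of the benchmark datasets) that runs simple_normalize() and simple_PCA() on a small ordinary matrix of counts:

## Toy example: simple_normalize() and simple_PCA() on a small ordinary
## matrix of simulated counts (200 genes x 300 cells).
set.seed(123)
toy_counts <- matrix(rpois(200 * 300, lambda=5), nrow=200, ncol=300,
                     dimnames=list(paste0("gene", 1:200), paste0("cell", 1:300)))
toy_normalized <- simple_normalize(toy_counts, num_variable_genes=100)
dim(toy_normalized)   # 100 x 300 (the 100 most variable genes were kept)
toy_pca <- simple_PCA(toy_normalized, k=10)
length(toy_pca$d)     # the first 10 singular values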

However, when the input is a DelayedMatrix object or derivative, it's important to be aware that:

  1. the operations are block-processed, that is, performed on one block of data at a time, so that only one block needs to be loaded in memory at any given time;
  2. the data representation (sparse vs dense) and the size of the blocks can both have a significant impact on performance.

For our benchmarks below, we'll use the following block sizes:

  1. normalization of the sparse datasets: 250 Mb;
  2. normalization of the dense datasets: 40 Mb;
  3. PCA on the normalized sparse datasets: 40 Mb;
  4. PCA on the normalized dense datasets: 100 Mb (the default).
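
To make the notion of block size more concrete, here is a small sketch (not part of the benchmark code itself) that uses the getAutoBlockSize() and defaultAutoGrid() utilities from DelayedArray to show the grid of blocks that block processing would use on dense1 with the current automatic block size:

## defaultAutoGrid() partitions a matrix into blocks of at most
## getAutoBlockSize() bytes; this is the default grid used by
## block-processed operations.
DelayedArray::getAutoBlockSize()    # current automatic block size, in bytes
blocks <- DelayedArray::defaultAutoGrid(dense1)
blocks                              # the grid of blocks
dim(blocks)                         # number of blocks along each dimension
lengths(blocks)[1:3]                # number of matrix elements in the first 3 blocks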

Monitoring memory usage

While manually running our benchmarks below on a Linux or macOS system, we will also monitor memory usage from a terminal with:

(while true; do ps u -p <PID>; sleep 1; done) >ps.log 2>&1 &

where <PID> is the process id of our R session. This will allow us to measure the maximum amount of memory used by the calls to simple_normalize() or simple_PCA().
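
For convenience, here is a small sketch of a helper (hypothetical, not part of the vignette code) that extracts the maximum resident set size (RSS) recorded in ps.log:

## Extract the maximum resident set size (RSS, 6th column of 'ps u' output,
## reported in kB) from the ps.log file produced by the command above.
max_rss_from_ps_log <- function(path="ps.log")
{
    fields <- strsplit(readLines(path), "[ \t]+")
    rss <- suppressWarnings(
        vapply(fields, function(x) as.numeric(x[6]), numeric(1))
    )
    max(rss, na.rm=TRUE) / 1024   # maximum memory usage, in Mb
}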

Normalization benchmarks

In this section we run simple_normalize() on the two smaller test datasets only (27,998 x 12,500 and 27,998 x 25,000, sparse and dense), and we report timings and memory usage.

See the Timings obtained on various systems section at the end of this document for simple_normalize() and simple_PCA() timings obtained on various systems, on all our test datasets, and using three different block sizes: 40 Mb, 100 Mb, and 250 Mb.

Normalizing the sparse datasets

Set block size to 250 Mb:

DelayedArray::setAutoBlockSize(2.5e8)

27,998 x 12,500 sparse dataset

dim(sparse1)
system.time(sparse1n <- simple_normalize(sparse1))
gc()
dim(sparse1n)

Note that sparse1n is a DelayedMatrix object that carries delayed operations. These can be displayed with showtree():

class(sparse1n)
showtree(sparse1n)

Let's write sparse1n to a temporary HDF5 file so we can do PCA on it later:

sparse1n_path <- tempfile()
sparse1n <- writeTENxMatrix(sparse1n, sparse1n_path, group="matrix", level=0)

A note about delayed operations and on-disk realization

Writing a DelayedMatrix object to an HDF5 file with writeTENxMatrix() has a significant cost, but, on the other hand, it has the advantage of triggering on-disk realization of the object. This means that all the delayed operations carried by the object get realized on-the-fly before the matrix data lands on the disk:

class(sparse1n)
showtree(sparse1n)  # "pristine" object (i.e. no more delayed operations)

This will make this new TENxMatrix object more efficient for whatever block-processed operations will come next.

27,998 x 25,000 sparse dataset

dim(sparse2)
system.time(sparse2n <- simple_normalize(sparse2))
gc()
dim(sparse2n)

Let's write sparse2n to a temporary HDF5 file so we can do PCA on it later:

sparse2n_path <- tempfile()
writeTENxMatrix(sparse2n, sparse2n_path, group="matrix", level=0)

About memory usage

With this block size (250 Mb), memory usage (as reported by the Unix command ps u -p <PID>, see Monitoring memory usage above in this document) remained < 3.7 Gb at all times.

Normalizing the dense datasets

Set block size to 40 Mb:

DelayedArray::setAutoBlockSize(4e7)

27,998 x 12,500 dense dataset

dim(dense1)
system.time(dense1n <- simple_normalize(dense1))
gc()
dim(dense1n)

Note that, like sparse1n and sparse2n above, dense1n is a DelayedMatrix object that carries delayed operations. These can be displayed with showtree():

class(dense1n)
showtree(dense1n)

Let's write dense1n to a temporary HDF5 file so we can do PCA on it later:

dense1n_path <- tempfile()
dense1n <- writeHDF5Array(dense1n, dense1n_path, name="normalized_counts", level=0)

Note that, like with writeTENxMatrix(), writing a DelayedMatrix object to an HDF5 file with writeHDF5Array() has a significant cost but also has the advantage of triggering on-disk realization of the object:

class(dense1n)
showtree(dense1n)  # "pristine" object (i.e. no more delayed operations)

This will make this new HDF5Array object more efficient for whatever block-processed operations will come next.

27,998 x 25,000 dense dataset

dim(dense2)
system.time(dense2n <- simple_normalize(dense2))
gc()
dim(dense2n)

Let's write dense2n to a temporary HDF5 file so we can do PCA on it later:

dense2n_path <- tempfile()
writeHDF5Array(dense2n, dense2n_path, name="normalized_counts", level=0)

About memory usage

With this block size (40 Mb), memory usage (as reported by the Unix command ps u -p <PID>, see Monitoring memory usage above in this document) remained < 2.8 Gb at all times.

PCA benchmarks

In this section we run simple_PCA() on the two normalized datasets obtained in the previous section (1000 x 12,500 and 1000 x 25,000, sparse and dense), and we report timings and memory usage.

See the Timings obtained on various systems section at the end of this document for simple_normalize() and simple_PCA() timings obtained on various systems, on all our test datasets, and using three different block sizes: 40 Mb, 100 Mb, and 250 Mb.

PCA on the normalized sparse datasets

Set block size to 40 Mb:

DelayedArray::setAutoBlockSize(4e7)

1000 x 12,500 sparse dataset

sparse1n <- TENxMatrix(sparse1n_path)
dim(sparse1n)
system.time(pca1s <- simple_PCA(sparse1n))
gc()

1000 x 25,000 sparse dataset

sparse2n <- TENxMatrix(sparse2n_path)
dim(sparse2n)
system.time(pca2s <- simple_PCA(sparse2n))
gc()

About memory usage

With this block size (40 Mb), memory usage (as reported by the Unix command ps u -p <PID>, see Monitoring memory usage above in this document) remained < 2.4 Gb at all times.

PCA on the normalized dense datasets

Set block size to 100 Mb (the default):

DelayedArray::setAutoBlockSize()

1000 x 12,500 dense dataset

dense1n <- HDF5Array(dense1n_path, name="normalized_counts")
dim(dense1n)
system.time(pca1d <- simple_PCA(dense1n))
gc()

1000 x 25,000 dense dataset

dense2n <- HDF5Array(dense2n_path, name="normalized_counts")
dim(dense2n)
system.time(pca2d <- simple_PCA(dense2n))
gc()

About memory usage

With this block size (100 Mb), memory usage (as reported by the Unix command ps u -p <PID>, see Monitoring memory usage above in this document) remained < 2.7 Gb at all times.

Sanity checks

stopifnot(all.equal(pca1s, pca1d))
stopifnot(all.equal(pca2s, pca2d))

Timings obtained on various systems

Here we report normalization & PCA timings obtained on various systems. For each system, the results are summarized in a table that shows the simple_normalize() and simple_PCA() timings obtained on all our test datasets and using three different block sizes: 40 Mb, 100 Mb, and 250 Mb. The times with a light green background correspond to the best time among the three block sizes for a given operation.

DELL XPS 15 laptop (model 9520)

make_timings_table("xps15")

Note: "max. mem. used" columns to be populated soon.

Supermicro SuperServer 1029GQ-TRT

make_timings_table("rex3")

Note: "max. mem. used" columns to be populated soon.

Mac Pro (Apple M2 Ultra)

make_timings_table("kjohnson3")

Note: "max. mem. used" columns to be populated soon.

Final remarks

The Supermicro SuperServer 1029GQ-TRT machine is significantly slower than the other machines. This is most likely due to its slower disk.

For PCA, choosing the sparse representation (TENxMatrix) and using small block sizes (40 Mb) is a clear winner.

For normalization, there's no clear winner between the sparse and dense representations. More precisely, for this operation, the sparse representation tends to give the best times when using bigger blocks (250 Mb), whereas the dense representation tends to give the best times when using smaller blocks (40 Mb). However, based on the above benchmarks, "sparse with big blocks" and "dense with small blocks" are roughly tied in terms of speed. Maybe extending the benchmarks to include more extreme block sizes (e.g. 20 Mb and 500 Mb) could help break the tie.

Normalization and PCA are roughly linear in time, regardless of representation (sparse or dense) or block size.

[Needs confirmation] Normalization and PCA both run in almost constant memory, regardless of representation (sparse or dense).

Session information

sessionInfo()

