sparseMatrixStats

The goal of sparseMatrixStats is to make the API of matrixStats available for sparse matrices.

You can install the release version of sparseMatrixStats from Bioconductor:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("sparseMatrixStats")

Alternatively, you can get the development version of the package from GitHub with:

# install.packages("devtools")
devtools::install_github("const-ae/sparseMatrixStats")

library(sparseMatrixStats)

mat <- matrix(0, nrow=10, ncol=6)
mat[sample(seq_len(60), 4)] <- 1:4
# Convert dense matrix to sparse matrix
sparse_mat <- as(mat, "dgCMatrix")
sparse_mat
#> 10 x 6 sparse Matrix of class "dgCMatrix"
#>                  
#>  [1,] 4 . . . . .
#>  [2,] . . . . . .
#>  [3,] . . . . . .
#>  [4,] 2 . . . . .
#>  [5,] . . . . . .
#>  [6,] . . . . . .
#>  [7,] . . . . . 1
#>  [8,] . . . . . .
#>  [9,] . . . 3 . .
#> [10,] . . . . . .

The package provides an interface for quickly performing common operations on the rows or columns of a sparse matrix. For example, to calculate the variance of each column:

apply(mat, 2, var)
#> [1] 1.822222 0.000000 0.000000 0.900000 0.000000 0.100000
matrixStats::colVars(mat)
#> [1] 1.822222 0.000000 0.000000 0.900000 0.000000 0.100000
sparseMatrixStats::colVars(sparse_mat)
#> [1] 1.822222 0.000000 0.000000 0.900000 0.000000 0.100000
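
The same pattern carries over to the other summary functions. As a minimal sketch, assuming a small hand-built matrix (the object names `toy` and `dense_toy` are only illustrative), the column sums, means, and variances computed on the sparse matrix can be checked against base R on the dense copy:

```r
library(Matrix)
library(sparseMatrixStats)

# Hypothetical toy data: a 4 x 3 sparse matrix with three non-zero entries
toy <- sparseMatrix(i = c(1, 3, 4), j = c(1, 1, 3), x = c(2, 4, 5), dims = c(4, 3))
dense_toy <- as.matrix(toy)

# unname() guards against differences in how names are attached to the results
col_sums <- unname(colSums2(toy))
col_means <- unname(colMeans2(toy))
col_vars <- unname(colVars(toy))

# Each comparison should report TRUE
all.equal(col_sums, unname(colSums(dense_toy)))
all.equal(col_means, unname(colMeans(dense_toy)))
all.equal(col_vars, unname(apply(dense_toy, 2, var)))
```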

On this small example data, all methods are basically equally fast, but if we have a much larger dataset, the optimizations for the sparse data start to show.

I generate a dataset with 10,000 rows and 50 columns that is 99% empty:

big_mat <- matrix(0, nrow=1e4, ncol=50)
big_mat[sample(seq_len(1e4 * 50), 5000)] <- rnorm(5000)
# Convert dense matrix to sparse matrix
big_sparse_mat <- as(big_mat, "dgCMatrix")

I use the bench package to benchmark the performance difference:

bench::mark(
  sparseMatrixStats=sparseMatrixStats::colVars(big_sparse_mat),
  matrixStats=matrixStats::colVars(big_mat),
  apply=apply(big_mat, 2, var)
)
#> # A tibble: 3 x 6
#>   expression             min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>        <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 sparseMatrixStats  36.15µs  40.09µs   24419.     2.93KB    14.7 
#> 2 matrixStats         1.42ms   1.45ms     677.    156.8KB     2.03
#> 3 apply               8.89ms  10.56ms      94.6    9.54MB    53.0

As you can see, sparseMatrixStats is ca. 35 times faster than matrixStats, which in turn is 7 times faster than the apply() version.
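
One plausible reason for a speed-up of this kind is that a sparse routine only has to visit the stored non-zero values. The sketch below illustrates that idea in plain R. It is not the package's implementation, and the helper name `sparse_col_vars()` is hypothetical. It relies only on the `p` and `x` slots of a dgCMatrix and treats every value that is not stored as zero:

```r
library(Matrix)

# Hypothetical helper: column variances of a dgCMatrix using only stored values
sparse_col_vars <- function(m) {
  stopifnot(inherits(m, "dgCMatrix"))
  n <- nrow(m)
  vapply(seq_len(ncol(m)), function(j) {
    # m@p holds 0-based offsets into m@x that mark where each column starts
    idx <- seq_len(m@p[j + 1] - m@p[j]) + m@p[j]
    vals <- m@x[idx]
    mu <- sum(vals) / n
    # Stored values contribute (x - mu)^2; each implicit zero contributes mu^2
    (sum((vals - mu)^2) + (n - length(vals)) * mu^2) / (n - 1)
  }, numeric(1))
}

# Same non-zero pattern as the 10 x 6 example matrix printed earlier
m <- sparseMatrix(i = c(1, 4, 7, 9), j = c(1, 1, 6, 4), x = c(4, 2, 1, 3), dims = c(10, 6))
sparse_col_vars(m)
```

The work per column is proportional to the number of stored values rather than to the number of rows, which matters when 99% of the entries are zero.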

API

The package now supports all functions from the matrixStats API for column sparse matrices (dgCMatrix), with the few exceptions noted in the table below. Thanks to the MatrixGenerics package, it can easily be used alongside matrixStats and DelayedMatrixStats. Note that the rowXXX() functions are called by transposing the input and calling the corresponding colXXX() function. Special optimized implementations are available for rowSums2(), rowMeans2(), and rowVars().
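
The following sketch illustrates both points. It assumes that MatrixGenerics, matrixStats, and sparseMatrixStats are all installed, and the helper name `col_cv()` is made up for this example. Calling the generics lets the same function run on dense and sparse input, and the row-wise result agrees with the column-wise function applied to the transposed matrix:

```r
library(Matrix)
library(MatrixGenerics)
library(sparseMatrixStats)

# Hypothetical helper: coefficient of variation per column.
# The generics are expected to dispatch to matrixStats for base matrices
# and to sparseMatrixStats for dgCMatrix objects.
col_cv <- function(x) {
  sqrt(colVars(x)) / colMeans2(x)
}

dense <- matrix(c(1, 0, 3, 0, 2, 2, 0, 0, 5, 1, 0, 4), nrow = 4)
sparse <- as(dense, "CsparseMatrix")

cv_dense <- unname(col_cv(dense))
cv_sparse <- unname(col_cv(sparse))
all.equal(cv_dense, cv_sparse)

# Row-wise functions give the same answer as column-wise functions on t(x)
row_vars <- unname(rowVars(sparse))
col_vars_t <- unname(colVars(t(sparse)))
all.equal(row_vars, col_vars_t)
```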

| Method | :------------------- | colAlls() | ✔ | colAnyMissings() | ✔ | colAnyNAs() | ✔ | colAnys() | ✔ | colAvgsPerRowSet() | ✔ | colCollapse() | ✔ | colCounts() | ✔ | colCummaxs() | ✔ | colCummins() | ✔ | colCumprods() | ✔ | colCumsums() | ✔ | colDiffs() | ✔ | colIQRDiffs() | ✔ | colIQRs() | ✔ | colLogSumExps() | ✔ | colMadDiffs() | ✔ | colMads() | ✔ | colMaxs() | ✔ | colMeans2() | ✔ | colMedians() | ✔ | colMins() | ✔ | colOrderStats() | ✔ | colProds() | ✔ | colQuantiles() | ✔ | colRanges() | ✔ | colRanks() | ✔ | colSdDiffs() | ✔ | colSds() | ✔ | colsum() | ✔ | colSums2() | ✔ | colTabulates() | ✔ | colVarDiffs() | ✔ | colVars() | ✔ | colWeightedMads() | ✔ | colWeightedMeans() | ✔ | colWeightedMedians() | ✔ | colWeightedSds() | ✔ | colWeightedVars() | ✔ | rowAlls() | ✔ | rowAnyMissings() | ✔ | rowAnyNAs() | ✔ | rowAnys() | ✔ | rowAvgsPerColSet() | ✔ | rowCollapse() | ✔ | rowCounts() | ✔ | rowCummaxs() | ✔ | rowCummins() | ✔ | rowCumprods() | ✔ | rowCumsums() | ✔ | rowDiffs() | ✔ | rowIQRDiffs() | ✔ | rowIQRs() | ✔ | rowLogSumExps() | ✔ | rowMadDiffs() | ✔ | rowMads() | ✔ | rowMaxs() | ✔ | rowMeans2() | ✔ | rowMedians() | ✔ | rowMins() | ✔ | rowOrderStats() | ✔ | rowProds() | ✔ | rowQuantiles() | ✔ | rowRanges() | ✔ | rowRanks() | ✔ | rowSdDiffs() | ✔ | rowSds() | ✔ | rowsum() | ✔ | rowSums2() | ✔ | rowTabulates() | ✔ | rowVarDiffs() | ✔ | rowVars() | ✔ | rowWeightedMads() | ✔ | rowWeightedMeans() | ✔ | rowWeightedMedians() | ✔ | rowWeightedSds() | ✔ | rowWeightedVars() | ✔ | matrixStats | sparseMatrixStats | Notes | | :---------- | :---------------- | :--------------------------------------------------------------------------------------- | | ✔ | | | ❌ | Not implemented because it is deprecated in favor of colAnyNAs() | | ✔ | | | ✔ | | | ✔ | | | ✔ | | | ✔ | | | ✔ | | | ✔ | | | ✔ | | | ✔ | | | ✔ | | | ✔ | | | ✔ | | | ✔ | | | ✔ | | | ✔ | | | ✔ | | | ✔ | | | ✔ | | | ✔ | | | ✔ | | | ✔ | | | ✔ | | | ✔ | | | ✔ | | | ✔ | | | ✔ | | | ❌ | Base R function | | ✔ | | | ✔ | | 
| ✔ | | | ✔ | | | ✔ | Sparse version behaves slightly differently, because it always uses interpolate=FALSE. | | ✔ | | | ✔ | Only equivalent if interpolate=FALSE | | ✔ | | | ✔ | | | ✔ | | | ❌ | Not implemented because it is deprecated in favor of rowAnyNAs() | | ✔ | | | ✔ | | | ✔ | | | ✔ | | | ✔ | | | ✔ | | | ✔ | | | ✔ | | | ✔ | | | ✔ | | | ✔ | | | ✔ | | | ✔ | | | ✔ | | | ✔ | | | ✔ | | | ✔ | | | ✔ | | | ✔ | | | ✔ | | | ✔ | | | ✔ | | | ✔ | | | ✔ | | | ✔ | | | ✔ | | | ❌ | Base R function | | ✔ | | | ✔ | | | ✔ | | | ✔ | | | ✔ | Sparse version behaves slightly differently, because it always uses interpolate=FALSE. | | ✔ | | | ✔ | Only equivalent if interpolate=FALSE | | ✔ | | | ✔ | |
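
As the notes on colAnyMissings() and rowAnyMissings() indicate, code that relied on those functions can call colAnyNAs() or rowAnyNAs() instead. A minimal sketch, using a small made-up matrix (`with_na` is an illustrative name) that stores one explicit NA:

```r
library(Matrix)
library(sparseMatrixStats)

# Hypothetical example data: the second column contains a stored NA
with_na <- sparseMatrix(i = c(1, 2, 3), j = c(1, 2, 3), x = c(1, NA, 3), dims = c(3, 3))

# One logical value per column, TRUE where the column holds at least one NA
has_na <- unname(colAnyNAs(with_na))
has_na
```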