sparseMatrixStats

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Installation

You can install the release version of r BiocStyle::Biocpkg("sparseMatrixStats") from BioConductor:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("sparseMatrixStats")

Introduction

The sparseMatrixStats package implements a number of summary functions for sparse matrices from the r BiocStyle::CRANpkg("Matrix") package.

Let us load the package and create a synthetic sparse matrix.

library(sparseMatrixStats)
# Matrix defines the sparse Matrix class
# dgCMatrix that we will use
library(Matrix)
# For reproducibility
set.seed(1)

Create a synthetic table with customers, items, and how often they bought that item.

customer_ids <- seq_len(100)
item_ids <-  seq_len(30)
n_transactions <- 1000
customer <- sample(customer_ids, size = n_transactions, replace = TRUE,
                    prob = runif(100))
item <- sample(item_ids, size = n_transactions, replace = TRUE,
               prob = runif(30))

tmp <- table(paste0(customer, "-", item))
tmp2 <- strsplit(names(tmp), "-")
purchase_table <- data.frame(
  customer = as.numeric(sapply(tmp2, function(x) x[1])),
  item = as.numeric(sapply(tmp2, function(x) x[2])),
  n = as.numeric(tmp)
)

head(purchase_table, n = 10)

Let us turn the table into a matrix to simplify the analysis:

purchase_matrix <- sparseMatrix(purchase_table$customer, purchase_table$item, 
                x = purchase_table$n,
                dims = c(100, 30),
                dimnames = list(customer = paste0("Customer_", customer_ids),
                                item = paste0("Item_", item_ids)))
purchase_matrix[1:10, 1:15]

We can see that some customers did not buy anything, where as some bought a lot.

sparseMatrixStats can help us to identify interesting patterns in this data:

# How often was each item bough in total?
colSums2(purchase_matrix)

# What is the range of number of items each 
# customer bought?
head(rowRanges(purchase_matrix))

# What is the variance in the number of items
# each customer bought?
head(rowVars(purchase_matrix))

# How many items did a customer not buy at all, one time, 2 times,
# or exactly 4 times?
head(rowTabulates(purchase_matrix, values = c(0, 1, 2, 4)))

Alternative Matrix Creation

In the previous section, I demonstrated how to create a sparse matrix from scratch using the sparseMatrix() function. However, often you already have an existing matrix and want to convert it to a sparse representation.

mat <- matrix(0, nrow=10, ncol=6)
mat[sample(seq_len(60), 4)] <- 1:4
# Convert dense matrix to sparse matrix
sparse_mat <- as(mat, "dgCMatrix")
sparse_mat

The sparseMatrixStats package is a derivative of the r BiocStyle::CRANpkg("matrixStats") package and implements it's API for sparse matrices. For example, to calculate the variance for each column of mat you can do

apply(mat, 2, var)

However, this is quite inefficient and matrixStats provides the direct function

matrixStats::colVars(mat)

Now for sparse matrices, you can also just call

sparseMatrixStats::colVars(sparse_mat)

Benchmark

If you have a large matrix with many exact zeros, working on the sparse representation can considerably speed up the computations.

I generate a dataset with 10,000 rows and 50 columns that is 99% empty

big_mat <- matrix(0, nrow=1e4, ncol=50)
big_mat[sample(seq_len(1e4 * 50), 5000)] <- rnorm(5000)
# Convert dense matrix to sparse matrix
big_sparse_mat <- as(big_mat, "dgCMatrix")

I use the bench package to benchmark the performance difference:

bench::mark(
  sparseMatrixStats=sparseMatrixStats::colVars(big_sparse_mat),
  matrixStats=matrixStats::colVars(big_mat),
  apply=apply(big_mat, 2, var)
)

As you can see sparseMatrixStats is ca. 50 times fast than matrixStats, which in turn is 7 times faster than the apply() version.

Session Info

sessionInfo()


Try the sparseMatrixStats package in your browser

Any scripts or data that you put into this service are public.

sparseMatrixStats documentation built on Feb. 4, 2021, 2 a.m.