knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
This tutorial shows how to use
ondisc_matrix, the core class implemented by
ondisc_matrix is an R object that represents an expression matrix stored on-disk rather than in-memory. We cover the topics of initialization, querying basic information, subsetting, and pulling submatrices into memory. We begin by loading the
ondisc ships with several example datasets, stored in the "extdata" subdirectory of the package.
raw_data_dir <- system.file("extdata", package = "ondisc") list.files(raw_data_dir)
The files "gene_expression.mtx", "cell_barcodes.tsv," and "genes.tsv" together define a gene-by-cell expression matrix. We save the full paths to these files in the variables
mtx_fp <- paste0(raw_data_dir, "/gene_expression.mtx") barcodes_fp <- paste0(raw_data_dir, "/cell_barcodes.tsv") features_fp <- paste0(raw_data_dir, "/genes.tsv")
ondisc_matrix consists of two parts: an HDF5 (i.e., .h5) file that stores the expression data on-disk in a novel format, and an in-memory object that allows us to interact with the expression data from within R. The easiest way to initialize an
ondisc_matrix is by calling the function
create_ondisc_matrix_from_mtx. We pass to this function (i) a file path to the .mtx file storing the expression data, (ii) a file path to the .tsv file storing the cell barcodes, and (iii) a file path to the .tsv file storing the feature IDs and human-readable feature names. We optionally can specify the directory in which to store the initialized .h5 file, which in this tutorial we will take to be the temporary directory.
temp_dir <- tempdir() exp_mat_list <- create_ondisc_matrix_from_mtx(mtx_fp = mtx_fp, barcodes_fp = barcodes_fp, features_fp = features_fp, on_disk_dir = temp_dir)
create_ondisc_matrix_from_mtx returns a list of three elements: (i) an
ondisc_matrix representing the expression data, (ii) a cell-wise covariate matrix, and (iii) a feature-wise covariate matrix. The exact cell-wise and feature-wise covariate matrices that are computed depend on the inputs to
create_ondisc_matrix_from_mtx (see documentation via ?create_ondisc_matrix_from_mtx for full details). The advantage to computing the cell-wise and feature-wise covariates at initialization is that it obviates the need to load the entire dataset into memory a second time.
expression_mat <- exp_mat_list$ondisc_matrix head(expression_mat) cell_covariates <- exp_mat_list$cell_covariates head(cell_covariates) feature_covariates <- exp_mat_list$feature_covariates head(feature_covariates)
The initialized HDF5 file is named
ondisc_matrix_1.h5 and is located in the temporary directory.
"ondisc_matrix_1.h5" %in% list.files(temp_dir)
A strength of
create_ondisc_matrix_from_mtx is that it does not assume that entire expression matrix fits into memory. The optional argument
n_lines_per_chunk can be used to specify the number of lines to read from the .mtx file at a time. Additionally,
create_ondisc_matrix_from_mtx is fast: the novel algorithm that underlies this function is highly efficient and implemented in C++ for maximum speed. Typically,
create_ondisc_matrix_from_mtx takes aboout 4-8 minutes/GB to run. Finally, for a given dataset,
create_ondisc_matrix_from_mtx only needs to be run once, even after closing and opening new R sessions.
We can use the functions
get_cell_barcodes to obtain the feature IDs, feature names (if applicable), and cell barcodes, respectively, of an
feature_ids <- get_feature_ids(expression_mat) feature_names <- get_feature_names(expression_mat) cell_barcodes <- get_cell_barcodes(expression_mat) head(feature_ids) head(feature_names) head(cell_barcodes)
Additionally, we can use
ncol to obtain the dimension, number of rows (i.e., number of features), and number of columns (i.e., number of cells) of an
dim(expression_mat) nrow(expression_mat) ncol(expression_mat)
We can subset an
ondisc_matrix to obtain a new
ondisc_matrix that is a submatrix of the original. To subset an
ondisc_matrix, apply the
[ operator and pass a numeric, logical, or character vector indicating the cells or features to keep. Character vectors are assumed to refer to feature IDs (for rows) and cell barcodes (for columns).
# numeric vector examples # keep genes 100-110 x <- expression_mat[100:110,] # keep all cells except 10 and 20 x <- expression_mat[,-c(10,20)] # keep genes 50-100 and 200-250 and cells 300-500 x <- expression_mat[c(50:100, 200:250), 300:500] # character vector examples # keep genes ENSG00000107581, ENSG00000286857, and ENSG00000266371 x <- expression_mat[c("ENSG00000107581", "ENSG00000286857", "ENSG00000266371"),] # keep cells CGTTGGGCATGGCTGC-1 and GTAACCAGTACAGTTC-1 x <- expression_mat[,c("CGTTGGGCATGGCTGC-1", "GTAACCAGTACAGTTC-1")] # logical vector example # keep all genes except ENSG00000237832 and ENSG00000229637 x <- expression_mat[!(get_feature_ids(expression_mat) %in% c("ENSG00000237832", "ENSG00000229637")),]
ondisc_matrix leaves the original object unchanged.
This important property, called object persistence, makes programming with
ondisc_matrices intuitive. The underlying HDF5 file is not copied upon subset; instead, information is shared across
ondisc_matrix objects, making subsets fast.
We can pull a submatrix of an
ondisc_matrix into memory, allowing us to perform computations on a subset of the data. To pull a submatrix into memory, use the
[[ operator, passing a numeric, character, or logical vector indicating the cells or features to access. The data structure that underlies an
ondisc_matrix enables fast access to both rows and columns of the matrix.
# numeric vector examples # pull gene 6 m <- expression_mat[[6,]] # pull cells 200 - 250 m <- expression_mat[[,200:250]] # pull genes 50 - 100 and cells 200 - 250 m <- expression_mat[[50:100, 200:250]] # character vector examples # pull genes ENSG00000107581 and ENSG00000286857 m <- expression_mat[[c("ENSG00000107581", "ENSG00000286857"),]] # pull cells CGTTGGGCATGGCTGC-1 and GTAACCAGTACAGTTC-1 m <- expression_mat[[,c("CGTTGGGCATGGCTGC-1", "GTAACCAGTACAGTTC-1")]] # logical vector examples # subset the matrix, keeping genes ENSG00000107581, ENSG00000286857, and ENSG00000266371 x <- expression_mat[c("ENSG00000107581", "ENSG00000286857", "ENSG00000266371"),] # pull all genes except ENSG00000107581 m <- x[[get_feature_ids(x) != "ENSG00000107581",]]
The last example demonstrates that we can pull a submatrix of an
ondisc_matrix into memory after having subset the matrix.
One can remember the difference between
[[ by recalling R lists:
[ is used to subset a list, and
[[ is used to access elements stored within a list. Similarly,
[ is used to subset an
[[ is used to access a submatrix stored within an
As discussed previously, there are two components to an
ondisc_matrix: the HDF5 file stored on-disk, and the R object stored in memory. The latter contains a file path to the former, allowing us to interact with the expression data from within R.
To save an
ondisc_matrix, simply call
saveRDS on the
ondisc_matrix R object to create an .rds file.
saveRDS(object = expression_mat, file = paste0(temp_dir, "/expression_matrix.rds")) rm(expression_mat)
We then can load the
ondisc_matrix by calling
readRDS on the .rds file.
expression_mat <- readRDS(paste0(temp_dir, "/expression_matrix.rds"))
We also can use the constructor of the
ondisc_matrix class to create an
ondisc_matrix from an already-initialized HDF5 file.
h5_file <- paste0(temp_dir, "/ondisc_matrix_1.h5") expression_mat <- ondisc_matrix(h5_file)
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.