Introduction

The advent of single cell transcriptomics and epigenomics induces novel and challenging requirements for bioinformatic preprocessing, analysis throughput, and storage. It seems safe to say that biological variability at the single cell level will necessitate assays of hundreds to thousands of cells per individual to answer basic questions of interest. Consequently one can anticipate needs for storage and analysis infrastructure that are several orders of magnitude larger and faster than those currently in use.

As computational and statistical methods for analysis of "big data" evolve, biologists and statisticians must weigh various options for basic aspects of their work:

There is no standard toolbox to address all these concerns. We know of no clear criteria to help practitioners choose among available tools to solve new problems arising with high volume single cell genomics.

The Bioconductor project (cite orchestrating paper) has provided tools for preprocessing genome-scale assays, archiving collections of assays for interactive refinement and analysis, and using and adding to institutional archives of genome-scale annotation and experiments. In this report we review early responses of the project to opportunities and challenges arising in single-cell genomics.

Exemplar: The 10x million-neuron dataset

The 1.3 million neuron single cell dataset published by 10x genomics (url) helps set the stage for strategic considerations. The data consist of counts of reads for 27998 genes. The native representation is sparse -- all zero cells in the gene x cell matrix are omitted, and vectors of sample id, gene id, and count for gene in sample are assembled in an HDF5 archive.

Specifically, the 2.624 billion non-zero counts obtained from the 1.3 million cells form an integer vector with elements $c_i$. These are indexed using two vectors: $g$ (conformant with $c$) and $s$ (one entry per cell). Element $g_i$ tells which gene gives rise to count $c_i$, and the cell giving rise to this count is cell $k$ satisfying $s_k \leq i < s_{k+1}$.

The TENxGenomics package

The TENxGenomics package works with local access to the the 4.2 GB HDF5 file distributed by 10x. The file is maintained through BiocFileCache

suppressPackageStartupMessages({
    library(TENxGenomics)
    library(restfulSE)
    library(BiocSklearn)
})
library(TENxGenomics)
library(BiocFileCache)
bfc <- BiocFileCache()

oneM <- paste0(
    "https://s3-us-west-2.amazonaws.com/10x.files/",
    "samples/cell/1M_neurons/",
    "1M_neurons_filtered_gene_bc_matrices_h5.h5"
)
h5path <- bfcrpath(bfc, oneM)

txg = TENxGenomics(h5path)
txg

The txg object can be subsetted as if it were an R matrix:

txg[1:10,1:10]

The indexes used with bracket need not be numerical, as we will illustrate shortly.

Counts can be retrieved using the as.matrix method. To demonstrate this, we will obtain a vector of Ensembl gene identifiers that are reported by Tasic et al. 2016 as constitutively expressed in the cerebral cortex of the adult mouse.

library(restfulSE)
cn = tasicCortex()
head(cn)
as.matrix(txg[cn$GENEID[1:5],1:5])

A hallmark of Bioconductor is support for semantically rich, coordinated containers for genome-scale data. The SummarizedExperiment object design is discussed in some detail in [orch paper]. The neuron data can be represented in this form:

txse = tenxSummarizedExperiment(h5path)
txse

[NB at this time assay(txse) fails for me.]

The restfulSE package

We have utilized the HDF5 server technology [cite] to provide access to a dense re-expression of the million neuron dataset. Here we have 'rows' corresponding to cells and 'columns' corresponding to genes, and the $r, c$ element of the HDF5 array is the count for sample $c$ of gene $r$. Zeroes are present, but thanks to HDF5's compression, the archive size is 5.96 GB, an inflation of 1.9 GB over the sparse format.

The SummarizedExperiment representation for this remote resource is:

library(restfulSE)
remse = se1.3M()
remse

For targeted queries, the remote resource can be more performant than the local sparse resource, although this comparison will depend on network performance.

system.time(print(sum(as.matrix(txg[cn[1,2], 1:1000]))))
system.time(print(sum(assay(remse[cn[1,2], 1:1000]))))


mtmorgan/TENxGenomics documentation built on May 23, 2019, 8:19 a.m.