downsampleReads: Downsample reads in a 10X Genomics dataset
In DropletUtils: Utilities for Handling Single-Cell Droplet Data


Generate a UMI count matrix after downsampling reads from the molecule information file produced by CellRanger for 10X Genomics data.

downsampleReads(
  sample,
  prop,
  barcode.length = NULL,
  bycol = FALSE,
  features = NULL,
  use.library = NULL
)

`sample`	A string containing the path to the molecule information HDF5 file.
`prop`	A numeric scalar or, if `bycol=TRUE`, a numeric vector with one entry per cell/GEM combination, in the order returned by `get10xMolInfoStats`. All values should lie in [0, 1] and specify the proportion of reads to retain, either for the whole dataset or for each cell.
`barcode.length`	An integer scalar specifying the length of the cell barcode; see `read10xMolInfo` for details.
`bycol`	A logical scalar indicating whether downsampling should be performed on a column-by-column basis.
`features`	A character vector containing the names of the features on which to perform downsampling.
`use.library`	An integer vector specifying the library indices for which to extract molecules from `sample`. Alternatively, a character vector specifying the library type(s), e.g., `"Gene expression"`.

This function downsamples the reads for each molecule by the specified `prop`, using the information in `sample`. It then constructs a UMI count matrix based on the molecules with non-zero read counts. The aim is to eliminate differences in technical noise that can drive clustering by batch, as described in `downsampleMatrix`.
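The idea can be sketched in base R on a toy molecule table. This is only an illustration of the concept under simplifying assumptions, not the package's implementation: the `mol` data frame, its column names and the `downsample_reads_toy` helper are all hypothetical, and the reads are pooled across the whole toy dataset before sampling.

```r
# Toy molecule table: one row per molecule, with the number of reads
# supporting it. All names here are illustrative.
set.seed(100)
mol <- data.frame(
  cell  = c("A", "A", "A", "B", "B", "C"),
  gene  = c("g1", "g1", "g2", "g1", "g3", "g2"),
  reads = c(5L, 1L, 3L, 2L, 4L, 1L)
)

# Hypothetical helper: keep a proportion 'prop' of all reads, sampled
# without replacement, and return the surviving read count per molecule.
downsample_reads_toy <- function(reads, prop) {
  owner <- rep(seq_along(reads), reads)   # molecule index of each read
  n.keep <- round(prop * length(owner))
  kept <- owner[sample.int(length(owner), n.keep)]
  tabulate(kept, nbins = length(reads))
}

mol$new.reads <- downsample_reads_toy(mol$reads, prop = 0.5)

# A molecule contributes one UMI count only if at least one read survives.
survivors <- mol[mol$new.reads > 0, ]
umi.counts <- table(
  gene = factor(survivors$gene, levels = unique(mol$gene)),
  cell = factor(survivors$cell, levels = unique(mol$cell))
)
umi.counts
```

Molecules supported by many reads are more likely to retain at least one read, so they are more likely to remain in the count matrix after downsampling.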

Subsampling the reads with `downsampleReads` recapitulates the effect of differences in sequencing depth per cell. This provides an alternative to downsampling with the CellRanger `aggr` function or subsampling with the 10X Genomics R kit. Note that this differs from directly subsampling the UMI count matrix with `downsampleMatrix`.

If `bycol=FALSE`, downsampling without replacement is performed on all reads from the entire dataset. The total number of reads for each cell after downsampling may not be exactly equal to `prop` times the original value. This dataset-wide mode is the more natural approach and is the default here, which differs from the default used in `downsampleMatrix`.

If `bycol=TRUE`, sampling without replacement is performed on the reads for each cell. The total number of reads for each cell after downsampling is guaranteed to be `prop` times the original total (rounded to the nearest integer). Different proportions can be specified for different cells by setting `prop` to a vector, where each proportion corresponds to a cell/GEM combination in the order returned by `get10xMolInfoStats`.
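The per-cell mode can be illustrated with a similar base R sketch. Again, this is a conceptual example rather than the package's code: the `mol` table is made up, and the proportions are matched to cells by name for clarity, whereas the function itself expects them in the order given by `get10xMolInfoStats`.

```r
set.seed(42)
# Toy molecule table: one row per molecule with its read count.
mol <- data.frame(
  cell  = c("A", "A", "A", "B", "B", "C"),
  reads = c(5L, 1L, 4L, 2L, 4L, 3L)
)

# One proportion per cell (names are an illustrative convenience).
prop <- c(A = 0.5, B = 1, C = 0)

mol$new.reads <- 0L
for (cl in names(prop)) {
  rows <- which(mol$cell == cl)
  owner <- rep(rows, mol$reads[rows])     # row index of each read in this cell
  n.keep <- round(prop[[cl]] * length(owner))
  kept <- owner[sample.int(length(owner), n.keep)]
  mol$new.reads[rows] <- tabulate(kept, nbins = nrow(mol))[rows]
}

# Per-cell read totals before and after downsampling.
before <- tapply(mol$reads, mol$cell, sum)
after  <- tapply(mol$new.reads, mol$cell, sum)
rbind(before, after)
```

Because each cell is sampled separately, the totals after downsampling are fixed at 5, 6 and 0 reads for cells A, B and C, even though the allocation of reads among molecules within a cell is random.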

The `use.library` argument is intended for studies with multiple feature types, e.g., antibody capture or CRISPR tags. As the reads for each feature type are generated in a separate sequencing library, it is generally most appropriate to downsample reads for each feature type separately. This can be achieved by setting `use.library` to the name or index of the desired feature set. The features of interest can also be directly specified with `features`. (This will be intersected with any `use.library` choice if both are specified.)
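The selection logic described above can be sketched in base R as follows. This is a hedged illustration only: the `mol` table, the library labels `"gex"` and `"adt"`, and the `select_molecules` helper are hypothetical and do not reflect how the function reads the molecule information file.

```r
set.seed(7)
# Toy molecule table; the library labels "gex" and "adt" are made up.
mol <- data.frame(
  feature = c("GeneA", "GeneA", "GeneB", "CD3", "CD3", "CD4"),
  library = c("gex", "gex", "gex", "adt", "adt", "adt"),
  reads   = c(4L, 2L, 6L, 10L, 20L, 30L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: restrict molecules by library and/or feature name.
# When both are given, a molecule must satisfy both (the intersection).
select_molecules <- function(mol, use.library = NULL, features = NULL) {
  keep <- rep(TRUE, nrow(mol))
  if (!is.null(use.library)) keep <- keep & mol$library %in% use.library
  if (!is.null(features)) keep <- keep & mol$feature %in% features
  mol[keep, , drop = FALSE]
}

# Downsample only the "adt" library, so its reads are not pooled with "gex".
adt <- select_molecules(mol, use.library = "adt")
owner <- rep(seq_len(nrow(adt)), adt$reads)
kept <- owner[sample.int(length(owner), round(0.1 * length(owner)))]
adt$new.reads <- tabulate(kept, nbins = nrow(adt))
adt

# Combining both filters keeps only requested features in the chosen library.
both <- select_molecules(mol, use.library = "adt", features = c("CD3", "GeneA"))
both
```

In this toy data the antibody library is much deeper than the gene expression library, which is why pooling the two before sampling would be hard to interpret; handling each library on its own keeps the retained proportion meaningful for that feature type.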

A numeric sparse matrix containing the downsampled UMI counts for each gene (row) and barcode (column). If `features` is set, only the rows with names in `features` are returned.

Aaron Lun

`downsampleMatrix`, for more general downsampling of the count matrix.

`read10xMolInfo`, to read the contents of the molecule information file.

library(DropletUtils)

# Mocking up some 10X HDF5-formatted data.
out <- DropletUtils:::simBasicMolInfo(tempfile())

# Downsampling by the reads; the seed makes the random sampling reproducible.
set.seed(1000)
downsampleReads(out, barcode.length=4, prop=0.5)