emptyDropsCellRanger: CellRanger's emptyDrops variant

emptyDropsCellRangerR Documentation

CellRanger's emptyDrops variant

Description

An approximate implementation of the --soloCellFilter EmptyDrops_CR filtering approach, which itself was reverse-engineered from the behavior of CellRanger 3.

Usage

emptyDropsCellRanger(m, ...)

## S4 method for signature 'ANY'
emptyDropsCellRanger(
  m,
  n.expected.cells = 3000,
  max.percentile = 0.99,
  max.min.ratio = 10,
  umi.min = 500,
  umi.min.frac.median = 0.01,
  cand.max.n = 20000,
  ind.min = 45000,
  ind.max = 90000,
  round = TRUE,
  niters = 10000,
  BPPARAM = SerialParam()
)

## S4 method for signature 'SummarizedExperiment'
emptyDropsCellRanger(m, ..., assay.type = "counts")

Arguments

m

A numeric matrix-like object containing counts, where columns represent barcoded droplets and rows represent features. The matrix should only contain barcodes for an individual sample, prior to any filtering for barcodes. Alternatively, a SummarizedExperiment containing such an object.

...

Further arguments to pass to individual methods. Specifically, for the SummarizedExperiment method, further arguments to pass to the ANY method.

n.expected.cells

An integer scalar specifying the number of expected cells in a sample. Corresponds to the nExpectedCells argument in STARsolo.

max.percentile

A numeric scalar between 0 and 1 used to define the maximum UMI count in the simple filtering algorithm. Corresponds to the maxPercentile argument in STARsolo.

max.min.ratio

An integer scalar specifying the ratio of the maximum and minimum UMI count in the simple filtering algorithm. Corresponds to the maxMinRatio argument in STARsolo.

umi.min

An integer scalar specifying the minimum UMI count for inclusion of a barcode in the cell candidate pool. Corresponds to the umiMin argument in STARsolo.

umi.min.frac.median

A numeric scalar between 0 and 1 used to define the minimum UMI count for inclusion of a barcode in the cell candidate pool. Specifically, the minimum is defined as umi.min.frac.median times the median UMI count of the real cells assigned by the simple filtering algorithm. Corresponds to the umiMinFracMedian argument in STARsolo.

cand.max.n

An integer scalar specifying the maximum number of barcodes that can be included in the cell candidate pool. In effect, this applies a minimum threshold that is defined as the cand.max.n-th largest UMI count among all cells that are not selected by the simple filtering algorithm. Corresponds to the candMaxN in STARsolo.

ind.min

An integer scalar specifying the lowest UMI count ranking for inclusion of a barcode in the ambient profile. Corresponds to the indMin argument in STARsolo.

ind.max

An integer scalar specifying the highest UMI count ranking for inclusion of a barcode in the ambient profile. Corresponds to the indMin argument in STARsolo.

round

A logical scalar indicating whether to check for non-integer values in m and, if present, round them for ambient profile estimation (see ?ambientProfileEmpty) and the multinomial simulations.

niters

An integer scalar specifying the number of iterations to use for the Monte Carlo p-value calculations.

BPPARAM

A BiocParallelParam object indicating whether parallelization should be used.

assay.type

String or integer specifying the assay of interest.

Details

emptyDropsCellRanger splits each sample's barcodes into three subsets.

  1. The first subset contains barcodes that are selected by the “simple filtering algorithm”, which are regarded as high quality cells without any further filtering. The minimum threshold T for this subset is defined by taking the max.percentile percentile of the top n.expected.cells barcodes, and then dividing by the max.min.ratio to obtain a minimum UMI count. All barcodes here will have an FDR of zero.

  2. The second subset contains the ambient pool and is defined as all barcodes with rankings between ind.min and ind.max. The barcodes that fall in this category will be used to compute the ambient profile. None of these barcodes are considered to be potential cells.

  3. The third subset contains the pool of barcodes that are potential cells, i.e., cell candidates. This is defined as all barcodes with total counts below T and higher than all of the thresholds defined by umi.min, umi.min.frac.median and cand.max.n. Only the barcodes within this subset will be tested for signficant deviations from the ambient profile, i.e., FDR is not NaN.

As of time of writing, the arguments in STARsolo have a one-to-one correspondence with the arguments in emptyDropsCellRanger. All parameter defaults are set as the same as those used in STARsolo 2.7.9a.

The main differences between emptyDropsCellRanger and emptyDrops are:

  • emptyDropsCellRanger applies a simple filtering strategy to identify some real cells before any further investigation.

  • emptyDropsCellRanger takes barcodes whose total count ranks within a certain range - by default, (45,000, 90,000] - to compute the ambient profile. In contrast, emptyDrops only defines the upper bound using lower or by.rank.

  • emptyDropsCellRanger defines a cell candidate pool according to three parameters, umi.min,umi.min.frac.median and cand.max.n. In emptyDrops, this is only defined by lower.

Value

A DataFrame with the same fields as that returned by emptyDrops.

Author(s)

Dongze He, Rob Patro

References

Kaminow B, Yunusov D, Dobin A (2021). STARsolo: accurate, fast and versatile mapping/quantification of single-cell and single-nucleus RNA-seq data. https://www.biorxiv.org/content/10.1101/2021.05.05.442755v1

See Also

emptyDrops, for the original implementation.

Examples

# Mocking up some data:
set.seed(0)
my.counts <- DropletUtils:::simCounts(nempty=100000, nlarge=2000, nsmall=1000)

# Identify likely cell-containing droplets.
out <- emptyDropsCellRanger(my.counts)
out

is.cell <- out$FDR <= 0.01
sum(is.cell, na.rm=TRUE)

# Subsetting the matrix to the cell-containing droplets.
# (using 'which()' to handle NAs smoothly).
cell.counts <- my.counts[,which(is.cell),drop=FALSE]
dim(cell.counts)


MarioniLab/DropletUtils documentation built on July 25, 2024, 1:10 p.m.