barcodeRanks: Calculate barcode ranks

Description Usage Arguments Details Value Author(s) See Also Examples

View source: R/barcodeRanks.R

Description

Compute barcode rank statistics and identifry the knee and inflection points on the total count curve.

Usage

1
barcodeRanks(m, lower=100, fit.bounds=NULL, df=20, ...) 

Arguments

m

A real sparse matrix object, either a dgTMatrix or dgCMatrix. Columns represent barcoded droplets, rows represent genes.

lower

A numeric scalar specifying the lower bound on the total UMI count, at or below which all barcodes are assumed to correspond to empty droplets.

fit.bounds

A numeric vector of length 2, specifying the lower and upper bouunds on the total UMI count for spline fitting.

df, ...

Further arguments to pass to smooth.spline.

Details

Analyses of droplet-based scRNA-seq data often show a plot of the log-total count against the log-rank of each barcode, where the highest ranks have the largest totals. This is equivalent to a transposed empirical cumulative density plot with log-transformed axes, which focuses on the barcodes with the largest counts. The barcodeRanks function will compute these ranks for all barcodes. Barcodes with the same total count receive the same average rank to avoid problems with discrete runs of the same total.

The function will also identify a number of interesting points on the curve for downstream use, namely the inflection and knee points. Both of these points correspond to a sharp transition between two components of the total count distribution, presumably reflecting the difference between empty droplets with little RNA and cell-containing droplets with much more RNA.

We supply a default setting of df to avoid overfitting the spline, as this results in unstability in the higher derivatives (and thus the curvature). df and other arguments to smooth.spline can be tuned if the estimated knee point is not at an appropriate location. We also restrict the fit to lie within the bounds defined by fit.bounds to focus on the region containing the knee point. This allows us to obtain an accurate fit with low df, rather than attempting to model the entire curve.

If fit.bounds is not specified, the lower bound is automatically set to the inflection point, which should lie after the knee point. The upper bound is set to the point at which the first derivative is closest to zero, i.e., the “plateau” region before the knee point. Note that only points with total counts above lower will be considered, regardless of how fit.bounds is defined.

Value

A DataFrame where each row corresponds to a column of m, and containing the following fields:

rank:

Numeric, the rank of each barcode (averaged across ties).

total:

Numeric, the total counts for each barcode.

fitted:

Numeric, the fitted value from the spline for each barcode. This is NA for points with x outside of fit.bounds.

The metadata contains knee, a numeric scalar containing the total count at the knee point; and inflection, a numeric scalar containing the total count at the inflection point.

Author(s)

Aaron Lun

See Also

emptyDrops

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
# Mocking up some data: 
set.seed(2000)
my.counts <- DropletUtils:::simCounts()

# Computing barcode rank statistics:
br.out <- barcodeRanks(my.counts)
names(br.out)

# Making a plot.
plot(br.out$rank, br.out$total, log="xy", xlab="Rank", ylab="Total")
o <- order(br.out$rank)
lines(br.out$rank[o], br.out$fitted[o], col="red")
abline(h=metadata(br.out)$knee, col="dodgerblue", lty=2)
abline(h=metadata(br.out)$inflection, col="forestgreen", lty=2)
legend("bottomleft", lty=2, col=c("dodgerblue", "forestgreen"), 
    legend=c("knee", "inflection"))

DropletUtils documentation built on April 29, 2020, 3:36 a.m.