controlAmbience: Ambient contribution from controls
In DropletUtils: Utilities for Handling Single-Cell Droplet Data


Estimate the contribution of the ambient solution to a particular expression profile, based on the abundance of control features that should not be expressed in the latter.

controlAmbience(
  y,
  ambient,
  features,
  mode = c("scale", "profile", "proportion")
)

`y`	A numeric count matrix where each row represents a gene and each column represents an expression profile. The profile usually contains aggregated counts for multiple droplets in a sample, e.g., for a cluster of cells. This can also be a vector, in which case it is converted into a one-column matrix.
`ambient`	A numeric vector of length equal to `nrow(y)`, containing the proportions of transcripts for each gene in the ambient solution. Alternatively, a matrix where each row corresponds to a row of `y` and each column contains a specific ambient profile for the corresponding column of `y`.
`features`	A logical, integer or character vector specifying the control features in `y` and `ambient`. Alternatively, a list of vectors specifying mutually exclusive sets of features.
`mode`	String indicating the output to return - the scaling factor, the ambient profile or the proportion of each gene's counts in `y` that is attributable to ambient contamination.

Control features should be those that cannot be expressed in `y`, so that their counts are fully attributable to ambient contamination. This is most commonly determined a priori from the biological context and experimental system. For example, if spike-ins were introduced into the solution prior to cell capture, these would serve as a gold standard for ambient contamination in `y`. For single-nuclei sequencing, mitochondrial transcripts can serve a similar role under the assumption that all high-quality libraries are stripped nuclei.
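As an illustration of the single-nuclei case, the sketch below builds a logical vector of control features by matching gene symbols. The gene names and the human-style `MT-` prefix are assumptions made for this example; real data would use the row names of `y` and whatever naming convention the annotation follows.

```r
# Hypothetical gene symbols standing in for rownames(y).
genes <- c("MT-CO1", "MT-ND1", "ACTB", "GAPDH", "MT-ATP6")

# Logical vector flagging mitochondrial genes (assumes a human-style "MT-" prefix).
is.mito <- grepl("^MT-", genes)
is.mito

# With real data and DropletUtils loaded, this vector would be passed as:
# controlAmbience(y, ambient, features=is.mito)
```

A logical vector is used here because it stays aligned with the rows of `y` even when the matrix is subsetted or reordered alongside its row names.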

If `features` is a list, it is expected to contain multiple sets of mutually exclusive features. These features need not be controls, but each cell should only express features from at most one set. Expression of features from more than one set in the same profile can thus be attributed to ambient contamination. For this mode, an archetypal pairing is that of hemoglobins with immunoglobulins (Young and Behjati, 2018), which should not be co-expressed in any (known) cell type.
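A list of mutually exclusive sets might be assembled as in the sketch below. The gene symbols and regular expressions are illustrative assumptions, not a curated marker list; in practice the sets should be checked against the annotation in use.

```r
# Hypothetical gene symbols standing in for rownames(y).
genes <- c("HBB", "HBA1", "IGKC", "IGHG1", "ACTB")

# Two sets that should not be co-expressed in a single cell type.
# The patterns are assumptions for this toy example.
features <- list(
    hemoglobin = grep("^HB[AB]", genes, value=TRUE),
    immunoglobulin = grep("^IG[HKL]", genes, value=TRUE)
)

# Sanity check: no feature may belong to both sets.
overlap <- intersect(features$hemoglobin, features$immunoglobulin)
stopifnot(length(overlap) == 0L)

# With real data and DropletUtils loaded, the list would be passed as:
# controlAmbience(y, ambient, features=features)
```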

If `mode="scale"`, a numeric vector is returned quantifying the estimated “contribution” of the ambient solution to each column of `y`. Scaling columns of `ambient` by this vector yields the estimated ambient profile for each column of `y`, which can also be obtained by setting `mode="profile"`.
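The relationship between the scaling factor and the profile can be sketched in base R on toy data. The formula used for the scaling factor (total control counts divided by the total ambient proportion of the controls) is an assumption made for illustration; the package's actual computation may differ.

```r
# Toy data: 5 genes, 2 profiles; genes 1-2 are controls (purely ambient).
y <- cbind(A=c(10, 10, 40, 25, 60), B=c(4, 6, 30, 0, 12))
ambient <- c(0.1, 0.1, 0.3, 0.2, 0.3)
controls <- 1:2

# Expand the ambient vector so each column of 'y' has its own ambient column.
amb.mat <- matrix(ambient, nrow(y), ncol(y), dimnames=dimnames(y))

# ASSUMPTION: one scaling factor per column, chosen so that the scaled
# ambient profile reproduces the observed control counts.
scaling <- colSums(y[controls, , drop=FALSE]) /
    colSums(amb.mat[controls, , drop=FALSE])

# Scale each column of the ambient matrix by its factor.
profile <- sweep(amb.mat, 2, scaling, "*")
profile
```

By construction, the control rows of `profile` sum to the control counts in the matching column of `y`.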

If `mode="proportion"`, a numeric matrix is returned containing the estimated proportion of counts in `y` that are attributable to ambient contamination. This is computed by dividing the output of `mode="profile"` by `y` and capping all values at 1.
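The division and capping described above can be sketched in base R as follows. The `profile` and `y` values are made up for this example, and the treatment of zero counts shown here is simply what base R arithmetic produces, not necessarily what the package does.

```r
# Hypothetical estimated ambient profile and observed counts (5 genes, 2 profiles).
profile <- cbind(A=c(10, 10, 30, 20, 30), B=c(5, 5, 15, 10, 15))
y <- cbind(A=c(10, 10, 40, 25, 60), B=c(4, 6, 30, 0, 12))

# Divide the ambient estimate by the observed counts and cap at 1.
# A positive estimate over a zero count gives Inf, which the cap turns into 1.
proportion <- pmin(profile / y, 1)
proportion
```

Values of 1 mark genes whose counts could be explained entirely by the ambient estimate.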

Aaron Lun

Young MD and Behjati S (2018). SoupX removes ambient RNA contamination from droplet based single-cell RNA sequencing data. bioRxiv.

`estimateAmbience`, to obtain an estimate to use in `ambient`.

`maximumAmbience`, when control features are not available.

library(DropletUtils)

# Making up some data.
ambient <- c(runif(900, 0, 0.1), runif(100))
y <- rpois(1000, ambient * 50)
y <- y + c(integer(100), rpois(900, 5)) # actual biology, but first 100 genes silent.

# Using the first 100 genes as a control:
scaling <- controlAmbience(y, ambient, features=1:100)
scaling

# Estimating the ambient profile, i.e., the contribution of 'ambient' to 'y'.
contribution <- controlAmbience(y, ambient, features=1:100, mode="profile")
DataFrame(ambient=drop(contribution), total=y)