maximumAmbience: Maximum ambient contribution
In DropletUtils: Utilities for Handling Single-Cell Droplet Data


Compute the maximum contribution of the ambient solution to an expression profile for a group of droplets.

maximumAmbience(
  y,
  ambient,
  threshold = 0.1,
  dispersion = 0,
  num.points = 100,
  num.iter = 5,
  mode = c("scale", "profile", "proportion")
)

`y`	A numeric count matrix where each row represents a gene and each column represents an expression profile. The profile usually contains aggregated counts for multiple droplets in a sample, e.g., for a cluster of cells. This can also be a vector, in which case it is converted into a one-column matrix.
`ambient`	A numeric vector of length equal to `nrow(y)`, containing the proportions of transcripts for each gene in the ambient solution. Alternatively, a matrix where each row corresponds to a row of `y` and each column contains a specific ambient profile for the corresponding column of `y`.
`threshold`	Numeric scalar specifying the p-value threshold to use, see Details.
`dispersion`	Numeric scalar specifying the dispersion to use in the negative binomial model. Defaults to zero, i.e., a Poisson model.
`num.points`	Integer scalar specifying the number of points to use for the grid search.
`num.iter`	Integer scalar specifying the number of iterations to use for the grid search.
`mode`	String indicating the output to return - the scaling factor, the ambient profile or the proportion of each gene's counts in `y` that is attributable to ambient contamination.
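
As an illustration of what `ambient` might contain, the following base R sketch derives ambient proportions by summing counts across low-count barcodes. The simulated matrix `raw.counts` and the cutoff `lower` are assumptions made for this example only; in practice the estimate would more likely come from estimateAmbience or emptyDrops.

```r
# Hypothetical raw count matrix: 50 genes x 200 barcodes (made-up data).
set.seed(100)
lambda <- c(rep(0.2, 150), rep(20, 50))  # 150 near-empty barcodes, 50 cells
raw.counts <- matrix(rpois(50 * 200, rep(lambda, each = 50)), nrow = 50)

# Assumption: barcodes with total counts at or below 'lower' are empty droplets.
lower <- 100
totals <- colSums(raw.counts)
is.empty <- totals <= lower

# Sum counts across the presumed-empty barcodes and convert to proportions,
# so that the result has one entry per gene and sums to 1.
ambient.counts <- rowSums(raw.counts[, is.empty, drop = FALSE])
ambient <- ambient.counts / sum(ambient.counts)
```

Because the proportions sum to unity, a scaling factor applied to this vector can be read directly as a number of transcript molecules.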

On occasion, it is useful to estimate the maximum possible contribution of the ambient solution to a count profile. This represents the most pessimistic explanation of a particular expression pattern and can be used to identify and discard suspect genes or clusters prior to downstream analyses.

This function implements the following algorithm:

1. We compute the mean ambient contribution for each gene by scaling ambient by some factor. ambient itself is usually derived by summing counts across barcodes with low total counts, see the output of emptyDrops for an example.
2. We compute a p-value for each gene based on the probability of observing a count equal to or below that in y, using the lower tail of a negative binomial (or Poisson) distribution with mean set to the ambient contribution. The per-gene null hypothesis is that the expected count in y is equal to the sum of the scaled ambient proportion and some (non-negative) contribution from actual intra-cellular transcripts.
3. We combine p-values across all genes using Simes' method. This represents the evidence against the joint null hypothesis (that all of the per-gene nulls are true).
4. We find the largest scaling factor that fails to reject this joint null at the specified threshold. If sum(ambient) is equal to unity, this scaling factor can be interpreted as the maximum number of transcript molecules contributed to y by the ambient solution.
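
The first three steps can be sketched in base R for a single candidate scaling factor. This is a minimal illustration and not the package's own code: `simesAmbientP` is a hypothetical helper, and the mapping of the dispersion to `size = 1 / dispersion` in `pnbinom` is an assumption about the parameterization.

```r
# Combined (Simes) p-value for one candidate scaling factor.
# 'simesAmbientP' is a hypothetical helper, not a DropletUtils function.
simesAmbientP <- function(y, ambient, scale, dispersion = 0) {
    mu <- ambient * scale  # mean ambient contribution per gene
    if (dispersion > 0) {
        # Lower-tail negative binomial probability, P(X <= y).
        # Assumption: dispersion corresponds to 1/size.
        p <- pnbinom(y, mu = mu, size = 1 / dispersion)
    } else {
        # Lower-tail Poisson probability, P(X <= y).
        p <- ppois(y, lambda = mu)
    }
    # Simes' method: minimum of the sorted p-values scaled by n / rank.
    p.sorted <- sort(p)
    n <- length(p.sorted)
    min(1, min(p.sorted * n / seq_len(n)))
}

# Made-up data in the same style as the Examples.
set.seed(1)
ambient <- c(runif(900, 0, 0.1), runif(100))
y <- rpois(1000, ambient * 100)
y[1:100] <- y[1:100] + rpois(100, 20)

p.small <- simesAmbientP(y, ambient, scale = 50)   # below the simulated scale
p.large <- simesAmbientP(y, ambient, scale = 500)  # far above it
```

The combined p-value can only shrink as the scaling factor grows, since every per-gene lower-tail probability decreases with its mean. That monotonicity is what makes "the largest factor that fails to reject" a well-defined target.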

The process of going from a scaling factor to a combined p-value has no clean analytical solution, so we use an iterative grid search to identify the largest possible scaling factor at a decent resolution. num.points and num.iter control the resolution of the grid search, and generally do not need to be changed.
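
One way such an iterative grid search could look is sketched below in base R, using a Poisson model only. The helper names `combinedP` and `gridSearchScale`, the choice of starting bounds, and the doubling of the upper bound are all assumptions for illustration; the package's actual search may differ.

```r
# Hypothetical helper: Simes-combined lower-tail Poisson p-value for one scale.
combinedP <- function(scale, y, ambient) {
    p <- sort(ppois(y, lambda = ambient * scale))
    n <- length(p)
    min(1, min(p * n / seq_len(n)))
}

# Hypothetical helper: repeatedly narrow a grid around the largest scaling
# factor whose combined p-value is still at or above the threshold.
gridSearchScale <- function(y, ambient, threshold = 0.1,
                            num.points = 100, num.iter = 5) {
    stopifnot(any(ambient > 0))
    lower <- 0
    # Assumption: start from the scale at which ambient explains all counts,
    # then double the upper bound until the joint null is rejected there.
    upper <- max(1, sum(y) / sum(ambient))
    while (combinedP(upper, y, ambient) >= threshold) {
        upper <- upper * 2
    }
    for (i in seq_len(num.iter)) {
        grid <- seq(lower, upper, length.out = num.points)
        pvals <- vapply(grid, combinedP, numeric(1), y = y, ambient = ambient)
        keep <- max(which(pvals >= threshold))  # last point that is not rejected
        lower <- grid[keep]
        upper <- grid[min(keep + 1, num.points)]
    }
    lower
}

# Made-up data in the same style as the Examples.
set.seed(1)
ambient <- c(runif(900, 0, 0.1), runif(100))
y <- rpois(1000, ambient * 100)
y[1:100] <- y[1:100] + rpois(100, 20)

est <- gridSearchScale(y, ambient)
```

Each iteration shrinks the bracketing interval by roughly a factor of `num.points`, which is why a handful of iterations is enough for a fine resolution.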

If mode="scale", a numeric vector is returned quantifying the maximum “contribution” of the ambient solution to each column of y. Scaling columns of ambient by this vector yields the maximum ambient profile for each column of y, which can also be obtained by setting mode="profile".

If mode="proportion", a numeric matrix is returned containing the maximum proportion of counts in y that are attributable to ambient contamination. This is computed by simply dividing the output of mode="profile" by y and capping all values at 1.
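
The relationship between the three outputs can be illustrated with a toy calculation in base R. The gene names, counts and the value of `scale` below are made up, with `scale` standing in for whatever mode="scale" would return; how the function itself handles genes with zero counts is not stated here, so the treatment shown is an assumption.

```r
# Toy inputs (hypothetical values).
y <- c(GeneA = 50, GeneB = 4, GeneC = 0, GeneD = 200)
ambient <- c(GeneA = 0.1, GeneB = 0.2, GeneC = 0.05, GeneD = 0.65)
scale <- 40  # stand-in for the mode="scale" output

# Analogous to mode="profile": scale the ambient proportions.
profile <- ambient * scale

# Analogous to mode="proportion": divide by the counts and cap at 1.
# Note: a zero count with a positive profile gives Inf, which is capped to 1;
# a zero count with a zero profile would give NaN in this sketch.
proportion <- pmin(profile / y, 1)
proportion
# GeneA 0.08, GeneB 1, GeneC 1, GeneD 0.13
```

A value of 1 means that the ambient solution alone could account for every count observed for that gene.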

The above algorithm is rather ad hoc and offers little in the way of theoretical guarantees. The p-value is used as a score rather than providing any meaningful error control. Empirically, decreasing threshold will return a higher scaling factor by making the estimation more robust to drop-outs in y, at the cost of increasing the risk of over-estimation of the ambient contribution.

Our abuse of the p-value machinery means that the reported scaling often exceeds the actual contribution, especially at low counts where the reduced power fails to penalize overly large scaling factors. Hence, the function works best when y contains aggregated counts for one or more groups of droplets with the same expected expression profile, e.g., clusters of related cells. Higher counts provide more power to detect deviations, hopefully leading to a more accurate estimate of the scaling factor.
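
Aggregating counts per cluster before calling the function could be done along these lines in base R. The count matrix and cluster labels are simulated purely for illustration.

```r
# Hypothetical per-droplet counts: 20 genes x 12 droplets in three clusters.
set.seed(42)
counts <- matrix(rpois(20 * 12, lambda = 2), nrow = 20,
                 dimnames = list(paste0("Gene", 1:20), paste0("Cell", 1:12)))
clusters <- rep(c("A", "B", "C"), each = 4)

# rowsum() sums rows by group, so transpose to sum droplets (columns)
# within each cluster, then transpose back to genes x clusters.
aggregated <- t(rowsum(t(counts), group = clusters))
dim(aggregated)
```

The resulting matrix has one column per cluster and could be supplied as y, giving one scaling factor per cluster.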

Note that this function returns the maximum possible contribution of the ambient solution to y, not the actual contribution. In the most extreme case, if the ambient profile is similar to the expectation of y (e.g., due to sequencing a relatively homogeneous cell population), the maximum possible contribution of the ambient solution would be 100% of y, and subtraction would yield an empty count vector!

Aaron Lun

estimateAmbience, to obtain an estimate of the ambient profile to use in ambient.

controlAmbience, for a more accurate estimate of the ambient contribution when control features are available.

emptyDrops, which uses the ambient profile to call cells.

library(DropletUtils)

# Making up some data for, e.g., a single cluster.
ambient <- c(runif(900, 0, 0.1), runif(100))
y <- rpois(1000, ambient * 100)
y[1:100] <- y[1:100] + rpois(100, 20) # actual biology.

# Estimating the maximum possible scaling factor:
scaling <- maximumAmbience(y, ambient)
scaling

# Estimating the maximum contribution to 'y' by 'ambient'.
contribution <- maximumAmbience(y, ambient, mode="profile")
DataFrame(ambient=drop(contribution), total=y)