ambientContribMaximum: Ambient contribution by maximum scaling
In MarioniLab/DropletUtils: Utilities for Handling Single-Cell Droplet Data

Description

Compute the maximum contribution of the ambient solution to an expression profile for a group of droplets, by scaling the ambient profile and testing for significant deviations in the count profile.

Usage

maximumAmbience(...)

ambientContribMaximum(y, ...)

## S4 method for signature 'ANY'
ambientContribMaximum(
  y,
  ambient,
  threshold = 0.1,
  dispersion = 0,
  num.points = 100,
  num.iter = 5,
  mode = c("scale", "profile", "proportion"),
  BPPARAM = SerialParam()
)

## S4 method for signature 'SummarizedExperiment'
ambientContribMaximum(y, ..., assay.type = "counts")

Arguments

`...`	For the generic, further arguments to pass to individual methods. For the SummarizedExperiment method, further arguments to pass to the ANY method. For `maximumAmbience`, arguments to pass to `ambientContribMaximum`.
`y`	A numeric matrix-like object containing counts, where each row represents a gene and each column represents a cluster of cells (see Caveats). Alternatively, a SummarizedExperiment object containing such a matrix. `y` can also be a numeric vector of counts; this is coerced into a one-column matrix.
`ambient`	A numeric vector of length equal to `nrow(y)`, containing the proportions of transcripts for each gene in the ambient solution. Alternatively, a matrix where each row corresponds to a row of `y` and each column contains a specific ambient profile for the corresponding column of `y`.
`threshold`	Numeric scalar specifying the p-value threshold to use, see Details.
`dispersion`	Numeric scalar specifying the dispersion to use in the negative binomial model. Defaults to zero, i.e., a Poisson model.
`num.points`	Integer scalar specifying the number of points to use for the grid search.
`num.iter`	Integer scalar specifying the number of iterations to use for the grid search.
`mode`	String indicating the output to return, see Value.
`BPPARAM`	A BiocParallelParam object specifying how parallelization should be performed.
`assay.type`	Integer or string specifying the assay containing the count matrix.

Details

On occasion, it is useful to estimate the maximum possible contribution of the ambient solution to a count profile. This represents the most pessimistic explanation of a particular expression pattern and can be used to identify and discard suspect genes or clusters prior to downstream analyses.

This function implements the following algorithm:

1. We compute the mean ambient contribution for each gene by scaling ambient by some factor. ambient itself is usually derived by summing counts across barcodes with low total counts, see the output of emptyDrops for an example.
2. We compute a p-value for each gene based on the probability of observing a count equal to or below that in y, using the lower tail of a negative binomial (or Poisson) distribution with mean set to the ambient contribution. The per-gene null hypothesis is that the expected count in y is equal to the sum of the scaled ambient proportion and some (non-negative) contribution from actual intra-cellular transcripts.
3. We combine p-values across all genes using Simes' method. This represents the evidence against the joint null hypothesis (that all of the per-gene nulls are true).
4. We find the largest scaling factor that fails to reject this joint null at the specified threshold. If sum(ambient) is equal to unity, this scaling factor can be interpreted as the maximum number of transcript molecules contributed to y by the ambient solution.
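
The first three steps can be sketched in base R for a single candidate scaling factor. This is only an illustration of the description above, not the package's internal code; the helper name simes_for_scale and the simulated data are assumptions made for the example.

```r
# Hypothetical helper (not part of DropletUtils): combined p-value for one
# candidate scaling factor applied to the ambient profile.
simes_for_scale <- function(y, ambient, scale, dispersion = 0) {
    mu <- ambient * scale  # step 1: mean ambient contribution per gene

    # Step 2: lower-tail probability of a count at or below the observed one.
    if (dispersion == 0) {
        p <- ppois(y, lambda = mu)
    } else {
        p <- pnbinom(y, mu = mu, size = 1 / dispersion)
    }

    # Step 3: Simes' method, i.e., the minimum of n * p_(i) / i over ranks i.
    n <- length(p)
    min(1, min(n * sort(p) / seq_len(n)))
}

# Simulated data, assumed for illustration: the true scaling is about 100.
set.seed(100)
ambient <- c(runif(900, 0, 0.1), runif(100))
y <- rpois(1000, ambient * 100)

p.small <- simes_for_scale(y, ambient, scale = 50)    # scaling too small
p.large <- simes_for_scale(y, ambient, scale = 1000)  # scaling far too large
```

A scaling factor that is far too large makes the observed counts look implausibly low for many genes, so its combined p-value drops well below that of a modest scaling factor.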

The process of going from a scaling factor to a combined p-value has no clean analytical solution, so we use an iterative grid search to identify the largest possible scaling factor at a decent resolution. num.points and num.iter control the resolution of the grid search, and generally do not need to be changed.
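
One way to picture the iterative grid search is sketched below. It assumes the Poisson case only, and both the upper bound of the initial grid and the refinement rule are guesses made for illustration; the function name grid_search_scale is hypothetical and the package's own search may differ in all of these details.

```r
# Hypothetical sketch of an iterative grid search; the grid bounds and the
# refinement rule are assumptions, not the DropletUtils internals.
grid_search_scale <- function(y, ambient, threshold = 0.1,
                              num.points = 100, num.iter = 5) {
    # Combined (Simes) p-value for a given scaling factor, Poisson case only.
    combined_p <- function(scale) {
        p <- ppois(y, lambda = ambient * scale)
        n <- length(p)
        min(1, min(n * sort(p) / seq_len(n)))
    }

    lower <- 0
    upper <- 2 * sum(y) / sum(ambient)  # assumed generous upper bound
    for (i in seq_len(num.iter)) {
        grid <- seq(lower, upper, length.out = num.points)
        ok <- vapply(grid, combined_p, numeric(1)) >= threshold
        best <- max(which(ok))  # largest grid point that is not rejected
        # Zoom in on the interval between that point and its right neighbor.
        lower <- grid[best]
        upper <- grid[min(best + 1, num.points)]
    }
    lower
}

# Simulated data, assumed for illustration.
set.seed(100)
ambient <- c(runif(900, 0, 0.1), runif(100))
y <- rpois(1000, ambient * 100)
y[1:100] <- y[1:100] + rpois(100, 20)

s <- grid_search_scale(y, ambient)
```

Each iteration narrows the search to a single grid interval, so in this sketch the resolution improves by a factor of roughly num.points per iteration.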

maximumAmbience is soft-deprecated; use ambientContribMaximum instead.

Value

If mode="scale", a numeric vector is returned quantifying the maximum “contribution” of the ambient solution to each column of y. Scaling ambient by each entry yields the maximum ambient profile for the corresponding column of y.

If mode="profile", a numeric matrix is returned containing the maximum ambient profile for each column of y. This is computed by scaling as described above; if ambient is a matrix, each column is scaled by the corresponding entry of the scaling vector.

If mode="proportion", a numeric matrix is returned containing the maximum proportion of counts in y that are attributable to ambient contamination. This is computed by simply dividing the output of mode="profile" by y and capping all values at 1.
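
The relationship between the three modes can be illustrated with plain arithmetic. The numbers below are made up, and scale simply stands in for a mode="scale" result rather than being computed by the function.

```r
# Illustrative values only; 'scale' stands in for the mode="scale" output.
ambient <- c(0.5, 0.3, 0.2)
y <- cbind(A = c(10, 2, 4), B = c(100, 60, 5))
scale <- c(A = 8, B = 20)

# mode="profile": scale the ambient vector separately for each column of y.
profile <- outer(ambient, scale)

# mode="proportion": divide the profile by y and cap all values at 1.
proportion <- pmin(profile / y, 1)
```

Here the second gene in column A has a maximum ambient contribution (2.4) that exceeds its observed count (2), so its proportion is capped at 1.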

Caveats

The above algorithm is rather ad hoc and offers little in the way of theoretical guarantees. The p-value is used as a score rather than providing any meaningful error control. Empirically, decreasing threshold will return a higher scaling factor, as a smaller threshold makes the joint null harder to reject and thus makes the estimation more robust to drop-outs in y. This comes at the cost of increasing the risk of over-estimation of the ambient contribution.

Our abuse of the p-value machinery means that the reported scaling often exceeds the actual contribution, especially at low counts where the reduced power fails to penalize overly large scaling factors. Hence, the function works best when y contains aggregated counts for one or more groups of droplets with the same expected expression profile, e.g., clusters of related cells. Higher counts provide more power to detect deviations, hopefully leading to a more accurate estimate of the scaling factor. (On a practical note, this function is rather slow so it is more feasible to calculate on cluster-level profiles rather than per cell.)
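
As a rough sketch of the aggregation step, per-cell counts can be summed into one column per cluster with base R before being used as y. The count matrix and the cluster labels below are simulated assumptions; in practice they would come from an upstream clustering.

```r
# Simulated per-cell counts (genes x cells) and hypothetical cluster labels.
set.seed(42)
counts <- matrix(rpois(1000 * 20, lambda = 0.5), nrow = 1000)
clusters <- rep(c("c1", "c2"), each = 10)

# rowsum() sums rows by group, so transpose to group cells, then flip back.
y <- t(rowsum(t(counts), group = clusters))
dim(y)  # one row per gene, one column per cluster
```

The resulting matrix has the layout described for y in the Arguments, with each column holding the summed counts for one cluster.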

Note that this function returns the maximum possible contribution of the ambient solution to y, not the actual contribution. In the most extreme case, if the ambient profile is similar to the expectation of y (e.g., due to sequencing a relatively homogeneous cell population), the maximum possible contribution of the ambient solution would be 100% of y, and subtraction would yield an empty count vector!

Author(s)

Aaron Lun

Examples

library(DropletUtils)

# Making up some data for, e.g., a single cluster.
ambient <- c(runif(900, 0, 0.1), runif(100))
y <- rpois(1000, ambient * 100)
y[1:100] <- y[1:100] + rpois(100, 20) # actual biology.

# Estimating the maximum possible scaling factor:
scaling <- ambientContribMaximum(y, ambient)
scaling

# Estimating the maximum contribution to 'y' by 'ambient'.
contribution <- ambientContribMaximum(y, ambient, mode="profile")
DataFrame(ambient=drop(contribution), total=y)

MarioniLab/DropletUtils documentation built on July 16, 2025, 1:57 p.m.