# R/maximumAmbience.R In DropletUtils: Utilities for Handling Single-Cell Droplet Data

#### Documented in maximumAmbience

#' Maximum ambient contribution
#'
#' Compute the maximum contribution of the ambient solution to an expression profile for a group of droplets.
#'
#' @param y A numeric count matrix where each row represents a gene and each column represents an expression profile.
#' The profile usually contains aggregated counts for multiple droplets in a sample, e.g., for a cluster of cells.
#' This can also be a vector, in which case it is converted into a one-column matrix.
#' @param ambient A numeric vector of length equal to \code{nrow(y)},
#' containing the proportions of transcripts for each gene in the ambient solution.
#' Alternatively, a matrix where each row corresponds to a row of \code{y}
#' and each column contains a specific ambient profile for the corresponding column of \code{y}.
#' @param threshold Numeric scalar specifying the p-value threshold to use, see Details.
#' @param dispersion Numeric scalar specifying the dispersion to use in the negative binomial model.
#' Defaults to zero, i.e., a Poisson model.
#' @param num.points Integer scalar specifying the number of points to use for the grid search.
#' @param num.iter Integer scalar specifying the number of iterations to use for the grid search.
#' @param mode String indicating the output to return - the scaling factor, the ambient profile or the proportion of each gene's counts in \code{y} that is attributable to ambient contamination.
#'
#' @return
#' If \code{mode="scale"},
#' a numeric vector is returned quantifying the maximum \dQuote{contribution} of the ambient solution to each column of \code{y}.
#' Scaling columns of \code{ambient} by this vector yields the maximum ambient profile for each column of \code{y},
#' which can also be obtained by setting \code{mode="profile"}.
#'
#' If \code{mode="proportion"}, a numeric matrix is returned containing the maximum proportion of counts in \code{y} that are attributable to ambient contamination.
#' This is computed by simply dividing the output of \code{mode="profile"} by \code{y} and capping all values at 1.
#'
#' @details
#' On occasion, it is useful to estimate the maximum possible contribution of the ambient solution to a count profile.
#' This represents the most pessimistic explanation of a particular expression pattern
#' and can be used to identify and discard suspect genes or clusters prior to downstream analyses.
#'
#' This function implements the following algorithm:
#' \enumerate{
#' \item We compute the mean ambient contribution for each gene by scaling \code{ambient} by some factor.
#' \code{ambient} itself is usually derived by summing counts across barcodes with low total counts,
#' see the output of \code{\link{emptyDrops}} for an example.
#' \item We compute a p-value for each gene based on the probability of observing a count equal to or below that in \code{y}, using the lower tail of a negative binomial (or Poisson) distribution with mean set to the ambient contribution.
#' The per-gene null hypothesis is that the expected count in \code{y} is equal to the sum of the scaled ambient proportion and some (non-negative) contribution from actual intra-cellular transcripts.
#' \item We combine p-values across all genes using Simes' method.
#' This represents the evidence against the joint null hypothesis (that all of the per-gene nulls are true).
#' \item We find the largest scaling factor that fails to reject this joint null at the specified \code{threshold}.
#' If \code{sum(ambient)} is equal to unity, this scaling factor can be interpreted as the maximum number of transcript molecules contributed to \code{y} by the ambient solution.
#' }
#'
#' The process of going from a scaling factor to a combined p-value has no clean analytical solution,
#' so we use an iterative grid search to identify to largest possible scaling factor at a decent resolution.
#' \code{num.points} and \code{num.iter} control the resolution of the grid search,
#' and generally do not need to be changed.
#'
#' @section Caveats:
#' The above algorithm is rather \emph{ad hoc} and offers little in the way of theoretical guarantees.
#' The p-value is used as a score rather than providing any meaningful error control.
#' Empirically, increasing \code{threshold} will return a higher scaling factor by making the estimation more robust to drop-outs in \code{y}, at the cost of increasing the risk of over-estimation of the ambient contribution.
#'
#' Our abuse of the p-value machinery means that the reported scaling often exceeds the actual contribution, especially at low counts where the reduced power fails to penalize overly large scaling factors.
#' Hence, the function works best when \code{y} contains aggregated counts for one or more groups of droplets with the same expected expression profile, e.g., clusters of related cells.
#' Higher counts provide more power to detect deviations, hopefully leading to a more accurate estimate of the scaling factor.
#'
#' Note that this function returns the \emph{maximum possible} contribution of the ambient solution to \code{y}, not the actual contribution.
#' In the most extreme case, if the ambient profile is similar to the expectation of \code{y} (e.g., due to sequencing a relatively homogeneous cell population), the maximum possible contribution of the ambient solution would be 100\% of \code{y}, and subtraction would yield an empty count vector!
#'
#' @author Aaron Lun
#'
#' @seealso
#' \code{\link{estimateAmbience}}, to estimate the ambient profile.
#'
#' \code{\link{controlAmbience}}, for another method for estimating the ambient contribution.
#'
#' \code{\link{emptyDrops}}, which uses the ambient profile to call cells.
#'
#' @examples
#' # Making up some data for, e.g., a single cluster.
#' ambient <- c(runif(900, 0, 0.1), runif(100))
#' y <- rpois(1000, ambient * 100)
#' y[1:100] <- y[1:100] + rpois(100, 20) # actual biology.
#'
#' # Estimating the maximum possible scaling factor:
#' scaling <- maximumAmbience(y, ambient)
#' scaling
#'
#' # Estimating the maximum contribution to 'y' by 'ambient'.
#' contribution <- maximumAmbience(y, ambient, mode="profile")
#' DataFrame(ambient=drop(contribution), total=y)
#'
#' @seealso
#' \code{\link{estimateAmbience}}, to obtain an estimate to use in \code{ambient}.
#'
#' \code{\link{controlAmbience}}, for a more accurate estimate when control features are available.
#' @export
#' @importFrom stats p.adjust ppois pnbinom
maximumAmbience <- function(y, ambient, threshold=0.1, dispersion=0,
num.points=100, num.iter=5, mode=c("scale", "profile", "proportion"))
{
mode <- match.arg(mode)
args <- list(threshold=threshold, dispersion=dispersion, num.points=num.points, num.iter=num.iter, mode=mode)

if (is.null(dim(y))) {
y <- cbind(y)
colnames(y) <- NULL
}

collated <- vector("list", ncol(y))
names(collated) <- colnames(y)

for (i in seq_along(collated)) {
if (is.null(dim(ambient))) {
A <- ambient
} else {
A <- ambient[,i]
}
collated[[i]] <- do.call(.maximum_ambience, c(list(y=y[,i], ambient=A), args))
}

if (mode=="scale") {
unlist(collated)
} else {
do.call(cbind, collated)
}
}

.maximum_ambience <- function(y, ambient, threshold, dispersion, num.points, num.iter, mode) {
if (dispersion==0) {
FUN <- function(y, mu) {
ppois(y, lambda=mu)
}
} else {
FUN <- function(y, mu) {
pnbinom(y, mu=mu, size=1/dispersion)
}
}

if (!identical(names(y), names(ambient))) {
warning("'y' and 'ambient' do not have the same feature names")
}

# Removing all-zero genes in the ambient profile.
original.y <- y
original.ambient <- ambient

strip <- ambient==0
y <- y[!strip]
ambient <- ambient[!strip]

# Executing a grid search.
lower <- 0
upper <- sum(y)/sum(ambient)
iter <- 0

while (iter <= num.iter) {
scaling <- seq(from=lower, to=upper, length.out=num.points)
triggered <- FALSE

for (i in 2:num.points) {
p <- FUN(y, ambient * scaling[i])

if (any(p <= threshold)) {
lower <- scaling[i-1]
upper <- scaling[i]
triggered <- TRUE
break
}
}

if (!triggered) {
lower <- scaling[num.points-1]
upper <- scaling[num.points]
}

iter <- iter+1L
}

scale <- (lower+upper)/2

if (mode=="scale") {
scale
} else {
profile <- scale * original.ambient
if (mode=="profile") {
profile
} else {
prop <- .clean_amb_props(profile, original.y)
names(prop) <- names(original.ambient)
prop
}
}
}

.clean_amb_props <- function(s, y) {
p <- s/y
p[p > 1] <- 1
p[y == 0] <- NaN # for fairness's sake.
p
}


## Try the DropletUtils package in your browser

Any scripts or data that you put into this service are public.

DropletUtils documentation built on Feb. 4, 2021, 2:01 a.m.