R/cleanSizeFactors.R

#' Clean size factors
#'
#' Coerce non-positive size factors to positive values based on the number of detected features.
#'
#' @param size.factors A numeric vector containing size factors for all libraries.
#' @param num.detected A numeric vector of the same length as \code{size.factors}, containing the number of features detected in each library.
#' @param control Argument passed to \code{\link{nls}} to control the fitting; see \code{?\link{nls.control}} for details.
#' @param iterations Integer scalar specifying the number of robustness iterations.
#' @param nmads Numeric scalar specifying the multiple of MADs to use for the tricube bandwidth in robustness iterations.
#' @param ... Further arguments to pass to \code{\link{nls}}.
#'
#' @details
#' This function will first fit a non-linear curve of the form
#' \deqn{y = \frac{ax}{1 + bx}}{y = ax/(1 + bx)}
#' where \code{y} is \code{num.detected} and \code{x} is \code{size.factors} for all positive size factors.
#' This is a purely empirical expression, chosen because it passes through the origin, is linear near zero and asymptotes to \code{a/b} at large \code{x}.
#' The fitting is done robustly with iterations of tricube weighting to eliminate outliers.
#' 
#' We then consider the number of detected features for all samples with non-positive size factors.
#' This is treated as \code{y} and used to solve for \code{x} based on the curve fitted above.
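#' Explicitly, rearranging the fitted curve gives
#' \deqn{x = \frac{y}{a - by}}{x = y/(a - by)}
#' for each library with a non-positive size factor.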
#' The result is the \dQuote{cleaned} size factor, which is always positive provided \code{y < a/b}.
#' For \code{y > a/b}, no solution exists, so the cleaned size factor is instead defined as the largest positive value in \code{size.factors}.
#' 
#' Negative size factors can occasionally be generated by \code{\link{computeSumFactors}}; see the documentation there for more details.
#' By coercing them to positive values, we can proceed to normalization and downstream analyses.
#' Here, we use the number of detected features as the predictor, as this is more robust to differential expression than the library size.
#' Of course, this approach is not theoretically guaranteed to yield the correct size factor, but a rough guess is better than a negative value.
#' 
#' @return 
#' A numeric vector identical to \code{size.factors} but with all non-positive size factors replaced with fitted values from the curve.
#'
#' @author
#' Aaron Lun
#'
#' @seealso
#' \code{\link{computeSumFactors}}, which can occasionally generate negative size factors.
#'
#' \code{\link{nls}}, which performs the curve fitting.
#'
#' @examples
#' set.seed(100)
#' counts <- matrix(rpois(20000, lambda=1), ncol=100)
#' 
#' library(scuttle)
#' sf <- librarySizeFactors(counts)
#' ngenes <- colSums(counts > 0)
#' 
#' # Adding negative size factor values to be cleaned.
#' out <- cleanSizeFactors(c(-1, -1, sf), c(100, 50, ngenes))
#' head(out)
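#'
#' # All cleaned size factors should now be positive.
#' all(out > 0)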
#'
#' @export
#' @importFrom stats nls nls.control residuals median coef lm
cleanSizeFactors <- function(size.factors, num.detected, control=nls.control(warnOnly=TRUE), iterations=3, nmads=3, ...) {
    if (length(size.factors)!=length(num.detected)) {
        stop("'size.factors' and 'num.detected' should be the same length")
    }

    keep <- size.factors > 0
    if (all(keep)) {
        return(size.factors)
    }
    X <- size.factors[keep]
    Y <- num.detected[keep]
    if (length(X) < 3) {
        stop("need at least three positive values for trend fitting")
    }

    # Solving for initial conditions, using the near-linear regime
    # below the median size factor to estimate the slope 'A'.
    lower <- X < median(X)
    fit <- lm(Y[lower] ~ 0 + X[lower])
    A <- unname(coef(fit))

    below <- which(Y < A*X)
    top <- below[which.max(Y[below])]
    B <- unname(A / Y[top] - 1 / X[top]) # must be positive due to 'below'.
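    # This follows from forcing the curve through (X[top], Y[top]):
    # Y = A*X/(1 + B*X) rearranges to B = A/Y - 1/X, which is positive
    # because Y[top] < A*X[top] by construction of 'below'.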

    # Using logs to enforce positivity.
    init <- log(c(logA=A, logB=B)) 
    lY <- log(Y)
    lX <- log(X)

    # Robustifying iterations with tricube weights.
    weights <- rep(1, length(Y))
    for (i in seq_len(iterations+1L)) {
        # Log-transforming both sides avoids large libraries dominating the least-squares.
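        # The model below is just the original curve on the log scale:
        # log(y) = logA + log(x) - log(1 + exp(logB)*x).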
        fit <- try(nls(lY ~ logA + lX - log(X * exp(logB) + 1), start=init, weights=weights, control=control, ...), silent=TRUE)

        # Brutal and dirty fallback to avoid errors.
        if (inherits(fit, "try-error")) {
            size.factors[!keep] <- min(size.factors[keep])
            return(size.factors)
        }

        init <- coef(fit)
        resids <- abs(residuals(fit))
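        # The median absolute residual serves as a robust scale estimate
        # around the fitted curve; observations with residuals beyond
        # 'nmads' times this scale receive zero weight in the next iteration.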
        bandwidth <- pmax(1e-8, median(resids) * nmads)
        weights <- (1-pmin(resids/bandwidth, 1)^3)^3
    }

    # Fitting negative size factors to this trend.
    failed <- num.detected[!keep]
    coefs <- coef(fit)
    new.sf <- failed/(exp(coefs["logA"]) - exp(coefs["logB"]) * failed)
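    # This is the exact inverse of the fitted curve, i.e., x = y/(a - b*y)
    # with a = exp(logA) and b = exp(logB).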

    # If any failed >= A/B (very unlikely), there is no finite positive
    # solution, so we fall back to the maximum positive size factor.
    new.sf[new.sf < 0 | !is.finite(new.sf)] <- max(X)

    size.factors[!keep] <- new.sf
    size.factors
}
