fitTrendVar: Fit a trend to the variances of log-counts
In scran: Methods for Single-Cell RNA-Seq Data Analysis

Description Usage Arguments Details Value Filtering by mean Weighting by density Author(s) See Also Examples

Fit a mean-dependent trend to the variances of the log-normalized expression values derived from count data.

fitTrendVar(
  means,
  vars,
  min.mean = 0.1,
  parametric = TRUE,
  lowess = TRUE,
  density.weights = TRUE,
  nls.args = list(),
  ...
)

`means`	A numeric vector containing the mean log-expression value for each gene.
`vars`	A numeric vector containing the variance of log-expression values for each gene.
`min.mean`	A numeric scalar specifying the minimum mean to use for trend fitting.
`parametric`	A logical scalar indicating whether a parametric fit should be attempted.
`lowess`	A logical scalar indicating whether a LOWESS fit should be attempted.
`density.weights`	A logical scalar indicating whether to use inverse density weights.
`nls.args`	A list of parameters to pass to `nls` if `parametric=TRUE`.
`...`	Further arguments to pass to `weightedLowess` for LOWESS fitting.

This function fits a mean-dependent trend to the variance of the log-normalized expression for the selected features. The fitted trend can then be used to decompose the variance of each gene into biological and technical components, as done in modelGeneVar and modelGeneVarWithSpikes.

The default fitting process follows a two-step procedure when parametric=TRUE and lowess=TRUE:

A non-linear curve of the form

y = ax/(x^n + b)

is fitted to the variances against the means using nls. Starting values and the number of iterations are automatically set if not explicitly specified in nls.args.
weightedLowess is applied to the log-ratios of the variance of each gene to the corresponding fitted value from the non-linear curve. The final trend is defined as the product of the fitted values from the non-linear curve and the exponential of the LOWESS fitted value. If any tuning is necessary, the most important parameter here is span, which can be passed in the ... arguments.

The aim is to use the parametric curve to reduce the sharpness of the expected mean-variance relationship for easier smoothing. Conversely, the parametric form is not exact, so the smoothers will model any remaining trends in the residuals.

If parametric=FALSE, LOWESS smoothing is performed directly on the log-variances using weightedLowess. This may be helpful in situations where the data does not follow the expected parametric curve, e.g., for transformed data that spans negative values where the expression is not defined. (Indeed, the expression above is purely empirical, chosen simply as it matched the shape of the observed trend in many datasets.)

If lowess=FALSE, the LOWESS smoothing step is omitted and the parametric fit is used directly. This may be necessary in situations where the LOWESS overfits, e.g., because of very few points at high abundances.

A named list is returned containing:

trend:: A function that returns the fitted value of the trend at any value of the mean.
std.dev:: A numeric scalar containing the robust standard deviation of the ratio of var to the fitted value of the trend across all features used for trend fitting.

Genes with mean log-expression below min.mean are not used in trend fitting. This aims to remove the majority of low-abundance genes and preserve the sensitivity of span-based smoothers at moderate-to-high abundances. It also protects against discreteness, which can interfere with estimation of the variability of the variance estimates and accurate scaling of the trend.

Filtering is applied on the mean log-expression to avoid introducing spurious trends at the filter boundary. The default threshold is chosen based on the point at which discreteness is observed in variance estimates from Poisson-distributed counts. For heterogeneous droplet data, a lower threshold of 0.001-0.01 may be more appropriate, though this usually does not matter all too much.

When extrapolating to values below the smallest observed mean (or min.mean), the output function will approach zero as the mean approaches zero. This reflects the fact that the variance should be zero at a log-expression of zero (assuming a pseudo-count of 1 was used). When extrapolating to values above the largest observed mean, the output function will be set to the fitted value of the trend at the largest mean.

All fitting (with nls and weightedLowess) is performed by weighting each observation according to the inverse of the density of observations at the same mean. This ensures that parts of the curve with few points are fitted as well as parts of the trend with many points. Otherwise, differences in the distribution of means would favor good fits in highly dense intervals at the expense of sparser intervals. (Note that these densities are computed after filtering on min.mean, so the high density of points at zero has no effect.)

That being said, the density weighting can give inappropriate weight to very sparse intervals, especially those at high abundance. This results in overfitting where the trend is compelled to pass through each point at these intervals. For most part, this is harmless as most high-abundance genes are not highly variable so an overfitted trend is actually appropriate. However, if high-abundance genes are variable, it may be better to set density.weights=FALSE to avoid this overfitting effect.

Aaron Lun

modelGeneVar and modelGeneVarWithSpikes, where this function is used.

library(scuttle)
sce <- mockSCE()
sce <- logNormCounts(sce)

# Fitting a trend:
library(DelayedMatrixStats)
means <- rowMeans(logcounts(sce))
vars <- rowVars(logcounts(sce))
fit <- fitTrendVar(means, vars)

# Comparing the two trend fits:
plot(means, vars, pch=16, cex=0.5, xlab="Mean", ylab="Variance")
curve(fit$trend(x), add=TRUE, col="dodgerblue", lwd=3)