trendVar: Fit a variance trend

Description Usage Arguments Details Value Trend fitting options Handling uninteresting factors of variation Additional notes on row selection Warning on size factor centring Author(s) References See Also Examples

Description

Fit a mean-dependent trend to the gene-specific variances in single-cell RNA-seq data.

Usage

1
2
3
4
5
6
7
8
## S4 method for signature 'ANY'
trendVar(x, method=c("loess", "spline"), parametric=FALSE, 
    loess.args=list(), spline.args=list(), nls.args=list(),
    span=NULL, family=NULL, degree=NULL, df=NULL, start=NULL, 
    block=NULL, design=NULL, weighted=TRUE, min.mean=0.1, subset.row=NULL) 

## S4 method for signature 'SingleCellExperiment'
trendVar(x, subset.row=NULL, ..., assay.type="logcounts", use.spikes=TRUE)

Arguments

x

A numeric matrix-like object of normalized log-expression values, where each column corresponds to a cell and each row corresponds to a spike-in transcript. Alternatively, a SingleCellExperiment object that contains such values.

method

A string specifying the algorithm to use for smooth trend fitting.

parametric

A logical scalar indicating whether a parametric curve should be fitted prior to smoothing.

loess.args

A named list of arguments to pass to loess when method="loess".

spline.args

A named list of arguments to pass to robustSmoothSpline when method="spline".

nls.args

A named list of arguments to pass to nls when parametric=TRUE.

span, family, degree

Deprecated arguments to pass to loess.

df

Deprecated argument to pass to ns.

start

Deprecated argument to pass to nls.

block

A factor specifying the blocking level for each cell.

design

A numeric matrix describing the uninteresting factors contributing to expression in each cell.

weighted

A logical scalar indicated whether weighted trend fitting should be performed when block!=NULL.

min.mean

A numeric scalar specifying the minimum mean log-expression in order for a gene to be used for trend fitting.

subset.row

A logical, integer or character scalar indicating the rows of x to use.

...

Additional arguments to pass to trendVar,ANY-method.

assay.type

A string specifying which assay values in x to use.

use.spikes

A logical scalar specifying whether the trend should be fitted to variances for spike-in transcripts or endogenous genes.

Details

This function fits an abundance-dependent trend to the variance of the log-normalized expression for the spike-in transcripts. For SingleCellExperiment objects, these expression values are computed by normalize after setting the size factors, e.g., with computeSpikeFactors. Log-transformed values are used as these are more robust to genes/transcripts with strong expression in only one or two outlier cells. It also allows the fitted trend to be applied in downstream procedures that use log-transformed counts.

The mean and variance of the normalized log-counts is calculated for each spike-in transcript, and a trend is fitted to the variance against the mean for all transcripts. The fitted value of this trend represents technical variability due to sequencing, drop-outs during capture, etc. at a given mean. This assumes that a constant amount of spike-in RNA was added to each cell, such that any differences in observed expression are purely due to measurement error. Variance decomposition to biological and technical components for endogenous genes can then be performed later with decomposeVar.

Value

A named list is returned, containing:

mean:

A numeric vector of mean log-expression values for all spike-in transcripts, if block=NULL. Otherwise, a numeric matrix of means where each row corresponds to a spike-in and each column corresponds to a level of block.

var:

A numeric vector of the variances of log-expression values for all spike-in transcripts, if block=NULL. Otherwise, a numeric matrix of variances where each row corresponds to a spike-in and each column corresponds to a level of block.

resid.df:

An integer scalar specifying the residual d.f. used for variance estimation of each spike-in transcript, if block=NULL. Otherwise, a integer vector where each entry specifies the residual d.f. used in each level of block.

block:

A factor identical to the input block, only returned if it was not NULL.

design:

A numeric matrix identical to the input design, only returned if it was not NULL and block=NULL.

trend:

A function that returns the fitted value of the trend at any mean.

df2:

A numeric scalar, specifying the second degrees of freedom for a scaled F-distribution describing the variability of variance estimates around the trend.

Trend fitting options

If parametric=FALSE, smoothing is performed directly on the log-variances. This is the default as it provides the most stable performance on arbitrary mean-variance relationships.

If parametric=TRUE, a non-linear curve of the form

y = ax/(x^n + b)

is fitted to the variances against the means using nls. Starting values and the number of iterations are automatically set if not explicitly specified in nls.args. A smoothing algorihtm is then applied to the log-ratios of the variance to the fitted value for each gene. The aim is to use the parametric curve to reduce the sharpness of the expected mean-variance relationship[for easier smoothing. Conversely, the parametric form is not exact, so the smoothers will model any remaining trends in the residuals.

The method argument specifies the smoothing algorithm to be applied on the log-ratios/variances. By default, a robust loess curve is used for trend fitting via loess. This provides a fairly flexible fit while protecting against genes with very large or very small variances. Arguments to loess are specified with loess.args, with defaults of span=0.3, family="symmetric" and degree=1 unless otherwise specified. Some experimentation with these parameters may be required to obtain satisfactory results.

If method="spline", smoothing will instead be performed using the smooth.spline function Arguments are specified with spline.args, with a default degrees of freedom of df=4 unless otherwise specified. Splines can be more effective than loess at capturing smooth curves with strong non-linear gradients.

The trendVar function will produce an output trend function with which fitted values can be computed. When extrapolating to values below the smallest observed mean, the output function will approach zero. When extrapolating to values above the largest observed mean, the output function will be set to the fitted value of the trend at the largest mean.

Handling uninteresting factors of variation

There are three approaches to handling unwanted factors of variation. The simplest approach is to use a design matrix containing the uninteresting factors can be specified in design. This will fit a linear model to the log-expression values for each gene, yielding an estimate for the residual variance. The trend is then fitted to the residual variance against the mean for each spike-in transcripts.

Another approach is to use block, where all cells in each level of the blocking factor are treated as a separate group. Means and variances are estimated within each group and the resulting sets of means/variances are pooled across all groups. The trend is then fitted to the pooled observations, where observations from different levels are weighted according to the residual d.f. used for variance estimation. This effectively multiplies the number of points by the number of levels in block. If both block and design are specified, block will take priority and design will be ignored.

The final approach is to subset the data set for each level of the blocking factor, re-run normalize for each subset to centre the size factors (see below), and run trendVar and decomposeVar for each subset separately. Results from all levels are then consolidated using the combineVar function. This is the most correct approach if there are systematic differences in the size factors (spike-in or endogenous) between levels. With the other two methods, such differences would be normalized out in the full log-expression matrix, preventing proper estimation of the level-specific abundance.

Assuming there are no differences in the size factors between levels, we suggest using block wherever possible instead of design. This is because the use of block preserves differences in the means/variances between levels of the factor. In contrast, using design will effectively compute an average mean/variance. This may yield an inaccurate representation of the trend, as the fitted value at an average mean may not be equal to the average variance for non-linear trends. Nonetheless, we still support design as it can accommodate additive models, whereas block only handles one-way layouts.

Additional notes on row selection

The selection of spike-in transcripts can be adjusted in trendVar,SingleCellExperiment-method using the use.spikes method.

If use.spikes=FALSE, this implies that trendVar will be applied to the endogenous genes in the SingleCellExperiment object. For trendVar,ANY-method, it is equivalent to manually supplying a matrix of normalized expression for endogenous genes. This assumes that most genes exhibit technical variation and little biological variation, e.g., in a homogeneous population.

Low-abundance genes with mean log-expression below min.mean are not used in trend fitting, to preserve the sensitivity of span-based smoothers at moderate-to-high abundances. It also protects against discreteness, which can interfere with estimation of the variability of the variance estimates and accurate scaling of the trend. The default threshold is chosen based on the point at which discreteness is observed in variance estimates from Poisson-distributed counts. For heterogeneous droplet data, a lower threshold of 0.001-0.01 should be used.

Users can also directly specify which rows to use with subset.row. All of these parameters - including min.mean and use.spikes - interact with each other sensibly, by taking the intersection of rows that fulfill all criteria. For example, if subset.row is specified and use.spikes=TRUE, rows are only used if they are both spike-in transcripts and in subset.row. Otherwise, if use.spikes=FALSE, only rows in subset.row that are not spike-in transcripts are used.

Warning on size factor centring

If assay.type="logcounts", trendVar,SingleCellExperiment-method will attempt to determine if the expression values were computed from counts via normalize. If so, a warning will be issued if the size factors are not centred at unity. This is because different size factors are typically used for endogenous genes and spike-in transcripts. If these size factor sets are not centred at the same value, there will be systematic differences in abundance between these features. This precludes the use of a spike-in fitted trend with abundances for endogenous genes in decomposeVar.

For other expression values and in trendVar,ANY-method, the onus is on the user to ensure that normalization (i) does not introduce differences in abundance between spike-in and endogenous features, while (ii) preserving differences in abundance within the set of endogenous or spike-in features. In short, the scaling factors used to normalize each feature should have the same mean across all cells. This ensures that spurious differences in abundance are not introduced by the normalization process.

Author(s)

Aaron Lun

References

Lun ATL, McCarthy DJ and Marioni JC (2016). A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor. F1000Res. 5:2122

See Also

nls, loess, decomposeVar, computeSpikeFactors, computeSumFactors, normalize

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
example(computeSpikeFactors) # Using the mocked-up data 'y' from this example.

# Normalizing (gene-based factors for genes, spike-in factors for spike-ins)
y <- computeSumFactors(y) 
y <- computeSpikeFactors(y, general.use=FALSE)
y <- normalize(y)

# Fitting a trend to the spike-ins.
fit <- trendVar(y)
plot(fit$mean, fit$var)
curve(fit$trend(x), col="red", lwd=2, add=TRUE)

# Fitting a trend to the endogenous genes. 
fit.g <- trendVar(y, use.spikes=FALSE)
plot(fit.g$mean, fit.g$var)
curve(fit.g$trend(x), col="red", lwd=2, add=TRUE)

scran documentation built on May 21, 2018, 6 p.m.

Related to trendVar in scran...