Description Usage Arguments Details Value Trend fitting options Handling uninteresting factors of variation Additional notes on row selection Warning on size factor centring Author(s) References See Also Examples
Fit a mean-dependent trend to the gene-specific variances in single-cell RNA-seq data.
1 2 3 4 5 6 7 8
## S4 method for signature 'ANY' trendVar(x, method=c("loess", "spline"), parametric=FALSE, loess.args=list(), spline.args=list(), nls.args=list(), span=NULL, family=NULL, degree=NULL, df=NULL, start=NULL, block=NULL, design=NULL, weighted=TRUE, min.mean=0.1, subset.row=NULL) ## S4 method for signature 'SingleCellExperiment' trendVar(x, subset.row=NULL, ..., assay.type="logcounts", use.spikes=TRUE)
A numeric matrix-like object of normalized log-expression values, where each column corresponds to a cell and each row corresponds to a spike-in transcript. Alternatively, a SingleCellExperiment object that contains such values.
A string specifying the algorithm to use for smooth trend fitting.
A logical scalar indicating whether a parametric curve should be fitted prior to smoothing.
A named list of arguments to pass to
A named list of arguments to pass to
A named list of arguments to pass to
Deprecated arguments to pass to
Deprecated argument to pass to
Deprecated argument to pass to
A factor specifying the blocking level for each cell.
A numeric matrix describing the uninteresting factors contributing to expression in each cell.
A logical scalar indicated whether weighted trend fitting should be performed when
A numeric scalar specifying the minimum mean log-expression in order for a gene to be used for trend fitting.
A logical, integer or character scalar indicating the rows of
Additional arguments to pass to
A string specifying which assay values in
A logical scalar specifying whether the trend should be fitted to variances for spike-in transcripts or endogenous genes.
This function fits an abundance-dependent trend to the variance of the log-normalized expression for the spike-in transcripts.
For SingleCellExperiment objects, these expression values are computed by
normalize after setting the size factors,
Log-transformed values are used as these are more robust to genes/transcripts with strong expression in only one or two outlier cells.
It also allows the fitted trend to be applied in downstream procedures that use log-transformed counts.
The mean and variance of the normalized log-counts is calculated for each spike-in transcript, and a trend is fitted to the variance against the mean for all transcripts.
The fitted value of this trend represents technical variability due to sequencing, drop-outs during capture, etc. at a given mean.
This assumes that a constant amount of spike-in RNA was added to each cell, such that any differences in observed expression are purely due to measurement error.
Variance decomposition to biological and technical components for endogenous genes can then be performed later with
A named list is returned, containing:
A numeric vector of mean log-expression values for all spike-in transcripts, if
Otherwise, a numeric matrix of means where each row corresponds to a spike-in and each column corresponds to a level of
A numeric vector of the variances of log-expression values for all spike-in transcripts, if
Otherwise, a numeric matrix of variances where each row corresponds to a spike-in and each column corresponds to a level of
An integer scalar specifying the residual d.f. used for variance estimation of each spike-in transcript, if
Otherwise, a integer vector where each entry specifies the residual d.f. used in each level of
A factor identical to the input
block, only returned if it was not
A numeric matrix identical to the input
design, only returned if it was not
A function that returns the fitted value of the trend at any mean.
A numeric scalar, specifying the second degrees of freedom for a scaled F-distribution describing the variability of variance estimates around the trend.
parametric=FALSE, smoothing is performed directly on the log-variances.
This is the default as it provides the most stable performance on arbitrary mean-variance relationships.
parametric=TRUE, a non-linear curve of the form
y = ax/(x^n + b)
is fitted to the variances against the means using
Starting values and the number of iterations are automatically set if not explicitly specified in
A smoothing algorihtm is then applied to the log-ratios of the variance to the fitted value for each gene.
The aim is to use the parametric curve to reduce the sharpness of the expected mean-variance relationship[for easier smoothing.
Conversely, the parametric form is not exact, so the smoothers will model any remaining trends in the residuals.
method argument specifies the smoothing algorithm to be applied on the log-ratios/variances.
By default, a robust loess curve is used for trend fitting via
This provides a fairly flexible fit while protecting against genes with very large or very small variances.
loess are specified with
loess.args, with defaults of
degree=1 unless otherwise specified.
Some experimentation with these parameters may be required to obtain satisfactory results.
method="spline", smoothing will instead be performed using the
Arguments are specified with
spline.args, with a default degrees of freedom of
df=4 unless otherwise specified.
Splines can be more effective than loess at capturing smooth curves with strong non-linear gradients.
trendVar function will produce an output
trend function with which fitted values can be computed.
When extrapolating to values below the smallest observed mean, the output function will approach zero.
When extrapolating to values above the largest observed mean, the output function will be set to the fitted value of the trend at the largest mean.
There are three approaches to handling unwanted factors of variation.
The simplest approach is to use a design matrix containing the uninteresting factors can be specified in
This will fit a linear model to the log-expression values for each gene, yielding an estimate for the residual variance.
The trend is then fitted to the residual variance against the mean for each spike-in transcripts.
Another approach is to use
block, where all cells in each level of the blocking factor are treated as a separate group.
Means and variances are estimated within each group and the resulting sets of means/variances are pooled across all groups.
The trend is then fitted to the pooled observations, where observations from different levels are weighted according to the residual d.f. used for variance estimation.
This effectively multiplies the number of points by the number of levels in
design are specified,
block will take priority and
design will be ignored.
The final approach is to subset the data set for each level of the blocking factor, re-run
normalize for each subset to centre the size factors (see below),
decomposeVar for each subset separately.
Results from all levels are then consolidated using the
This is the most correct approach if there are systematic differences in the size factors (spike-in or endogenous) between levels.
With the other two methods, such differences would be normalized out in the full log-expression matrix, preventing proper estimation of the level-specific abundance.
Assuming there are no differences in the size factors between levels, we suggest using
block wherever possible instead of
This is because the use of
block preserves differences in the means/variances between levels of the factor.
In contrast, using
design will effectively compute an average mean/variance.
This may yield an inaccurate representation of the trend, as the fitted value at an average mean may not be equal to the average variance for non-linear trends.
Nonetheless, we still support
design as it can accommodate additive models, whereas
block only handles one-way layouts.
The selection of spike-in transcripts can be adjusted in
trendVar,SingleCellExperiment-method using the
use.spikes=TRUE which means that only rows labelled as spike-ins with
isSpike(x) will be used.
An error will be raised if no rows are labelled as spike-in transcripts.
use.spikes=FALSE, only the rows not labelled as spike-in transcripts will be used.
use.spikes=NA, every row will be used for trend fitting, regardless of whether it corresponds to a spike-in transcript or not.
use.spikes=FALSE, this implies that
trendVar will be applied to the endogenous genes in the SingleCellExperiment object.
trendVar,ANY-method, it is equivalent to manually supplying a matrix of normalized expression for endogenous genes.
This assumes that most genes exhibit technical variation and little biological variation, e.g., in a homogeneous population.
Low-abundance genes with mean log-expression below
min.mean are not used in trend fitting, to preserve the sensitivity of span-based smoothers at moderate-to-high abundances.
It also protects against discreteness, which can interfere with estimation of the variability of the variance estimates and accurate scaling of the trend.
The default threshold is chosen based on the point at which discreteness is observed in variance estimates from Poisson-distributed counts.
For heterogeneous droplet data, a lower threshold of 0.001-0.01 should be used.
Users can also directly specify which rows to use with
All of these parameters - including
use.spikes - interact with each other sensibly, by taking the intersection of rows that fulfill all criteria.
For example, if
subset.row is specified and
use.spikes=TRUE, rows are only used if they are both spike-in transcripts and in
use.spikes=FALSE, only rows in
subset.row that are not spike-in transcripts are used.
trendVar,SingleCellExperiment-method will attempt to determine if the expression values were computed from counts via
If so, a warning will be issued if the size factors are not centred at unity.
This is because different size factors are typically used for endogenous genes and spike-in transcripts.
If these size factor sets are not centred at the same value, there will be systematic differences in abundance between these features.
This precludes the use of a spike-in fitted trend with abundances for endogenous genes in
For other expression values and in
trendVar,ANY-method, the onus is on the user to ensure that normalization
(i) does not introduce differences in abundance between spike-in and endogenous features, while
(ii) preserving differences in abundance within the set of endogenous or spike-in features.
In short, the scaling factors used to normalize each feature should have the same mean across all cells.
This ensures that spurious differences in abundance are not introduced by the normalization process.
Lun ATL, McCarthy DJ and Marioni JC (2016). A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor. F1000Res. 5:2122
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
example(computeSpikeFactors) # Using the mocked-up data 'y' from this example. # Normalizing (gene-based factors for genes, spike-in factors for spike-ins) y <- computeSumFactors(y) y <- computeSpikeFactors(y, general.use=FALSE) y <- normalize(y) # Fitting a trend to the spike-ins. fit <- trendVar(y) plot(fit$mean, fit$var) curve(fit$trend(x), col="red", lwd=2, add=TRUE) # Fitting a trend to the endogenous genes. fit.g <- trendVar(y, use.spikes=FALSE) plot(fit.g$mean, fit.g$var) curve(fit.g$trend(x), col="red", lwd=2, add=TRUE)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.