Description Usage Arguments Details Value Accounting for uninteresting factors Feature selection Author(s) References See Also Examples

Decompose the gene-specific variance into biological and technical components for single-cell RNA-seq data.

1 2 3 4 5 6 | ```
## S4 method for signature 'ANY,list'
decomposeVar(x, fit, block=NA, design=NA, subset.row=NULL, ...)
## S4 method for signature 'SingleCellExperiment,list'
decomposeVar(x, fit, subset.row=NULL, ...,
assay.type="logcounts", get.spikes=NA)
``` |

`x` |
A numeric matrix-like object of normalized log-expression values, where each column corresponds to a cell and each row corresponds to an endogenous gene. Alternatively, a SingleCellExperiment object containing such a matrix. |

`fit` |
A list containing the output of |

`block` |
A factor containing the level of a blocking factor for each cell. |

`design` |
A numeric matrix describing the uninteresting factors contributing to expression in each cell. |

`subset.row` |
A logical, integer or character scalar indicating the rows of |

`...` |
For |

`assay.type` |
A string specifying which assay values to use from |

`get.spikes` |
A logical scalar specifying whether decomposition should be performed for spike-ins. |

This function computes the variance of the normalized log-counts for each endogenous gene.
The technical component of the variance for each gene is determined by interpolating the fitted trend in `fit`

at the mean log-count for that gene.
This represents variance due to sequencing noise, variability in capture efficiency, etc.
The biological component is determined by subtracting the technical component from the total variance.

Highly variable genes (HVGs) can be identified as those with large biological components. Unlike other methods for decomposition, this approach estimates the variance of the log-counts rather than of the counts themselves. The log-transformation blunts the impact of large positive outliers and ensures that HVGs are driven by strong log-fold changes between cells, not differences in counts. Interpretation is not compromised – HVGs will still be so, regardless of whether counts or log-counts are considered.

If `assay.type="logcounts"`

and the size factors are not centred at unity, a warning will be raised - see `?trendVar`

for details.

A DataFrame is returned where each row corresponds to and is named after a row of `x`

(if `subset.row=NULL`

; otherwise, each row corresponds to an element of `subset.row`

).
This contains the numeric fields:

`mean`

:Mean normalized log-expression per gene.

`total`

:Variance of the normalized log-expression per gene.

`bio`

:Biological component of the variance.

`tech`

:Technical component of the variance.

`p.value, FDR`

:Raw and adjusted p-values for the test against the null hypothesis that

`bio=0`

.

If `get.spikes=NA`

, the `p.value`

and `FDR`

fields will be set to `NA`

for rows corresponding to spike-in transcripts.

The metadata field of the output DataFrame also contains `num.cells`

, an integer scalar storing the number of cells in `x`

;
and `resid.df`

, an integer scalar specifying the residual d.f. used for variance estimation.

To account for uninteresting factors of variation, either `block`

or `design`

can be specified:

Setting

`block`

will estimate the mean and variance of each gene for cells in each group (i.e., each level of`block`

) separately. The technical component is also estimated for each group separately, based on the group-specific mean. Group-specific statistics are combined to obtain a single value per gene. For means and variances, this is done by taking a weighted average across groups, with weighting based on the residual d.f. (for variances) or number of cells (for means). For p-values, Stouffer's method is used on the group-specific p-values returned by`testVar`

, with the residual d.f. used as weights.Alternatively, uninteresting factors can be used to construct a design matrix to pass to the function via

`design`

. In this case, a linear model is fitted to the expression profile for each gene, and the variance is calculated from the residual variance of the fit. The technical component is estimated as the fitted value of the trend at the mean expression across all cells for that gene. This approach is useful for covariates or additive models that cannot be expressed as a one-way layout for use in`block`

.

If either of these arguments are `NA`

, they will be extracted from `fit`

, assuming that the same cells were used to fit the trend.
If `block`

is specified, this will override any setting of `design`

.
Use of `block`

is generally favoured as group-specific means result in a better estimate of the technical component than an average mean across all groups.

Note that the use of either `block`

or `design`

assumes that there are no systematic differences in the size factors across levels of an uninteresting factor.
If such differences are present, we suggest using `combineVar`

instead, see the discussion in `?trendVar`

for more details.

If `get.spikes=NA`

in `decomposeVar,SingleCellExperiment-method`

, the p-value and FDR will not be returned for spike-in transcripts.
This is the default as it returns the other variance statistics for diagnostic purposes, but ensures that the spike-ins are not treated as candidate HVGs.
If `get.spikes=FALSE`

, spike-in transcripts are filtered out from `x`

and no statistics are returned.
If `get.spikes=TRUE`

, all statistics are computed for all rows, regardless of spike-in status.

Users can also directly specify which rows to use with `subset.row`

.
This is equivalent to running `decomposeVar`

on `x[subset.row,]`

, but is more efficient as it avoids the construction of large temporary matrices.

If `x`

is not supplied, all genes used to fit the trend in `fit`

will be used instead for the variance decomposition.
This may be useful when a trend is fitted to all genes in `trendVar`

, such that the statistics for all genes will already be available in `fit`

.
By not specifying `x`

, users can avoid redundant calculations, which is particularly helpful for very large data sets.

Aaron Lun

Lun ATL, McCarthy DJ and Marioni JC (2016).
A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor.
*F1000Res.* 5:2122

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 | ```
example(computeSpikeFactors) # Using the mocked-up data 'y' from this example.
y <- computeSumFactors(y) # Size factors for the the endogenous genes.
y <- computeSpikeFactors(y, general.use=FALSE) # Size factors for spike-ins.
y <- normalize(y) # Normalizing the counts by the size factors.
# Decomposing technical and biological noise.
fit <- trendVar(y)
results <- decomposeVar(y, fit)
head(results)
plot(results$mean, results$total)
o <- order(results$mean)
lines(results$mean[o], results$tech[o], col="red", lwd=2)
plot(results$mean, results$bio)
# A trend fitted to endogenous genes can also be used, pending assumptions.
fit.g <- trendVar(y, use.spikes=FALSE)
results.g <- decomposeVar(y, fit.g)
head(results.g)
``` |

scran documentation built on May 21, 2018, 6 p.m.

Embedding an R snippet on your website

Add the following code to your website.

For more information on customizing the embed code, read Embedding Snippets.