modelGeneCV2WithSpikes: Model the per-gene CV2 with spike-ins
In scran: Methods for Single-Cell RNA-Seq Data Analysis

Description Usage Arguments Details Value Computing p-values Default size factor choices Explanation for size factor rescaling Handling uninteresting factors Author(s) See Also Examples

Model the squared coefficient of variation (CV2) of the normalized expression profiles for each gene, using spike-ins to estimate the baseline level of technical noise at each abundance.

modelGeneCV2WithSpikes(x, ...)

## S4 method for signature 'ANY'
modelGeneCV2WithSpikes(
  x,
  spikes,
  size.factors = NULL,
  spike.size.factors = NULL,
  block = NULL,
  subset.row = NULL,
  ...,
  equiweight = TRUE,
  method = "fisher",
  BPPARAM = SerialParam()
)

## S4 method for signature 'SummarizedExperiment'
modelGeneCV2WithSpikes(x, ..., assay.type = "counts")

## S4 method for signature 'SingleCellExperiment'
modelGeneCV2WithSpikes(
  x,
  spikes,
  size.factors = NULL,
  spike.size.factors = NULL,
  ...,
  assay.type = "counts"
)

`x`	A numeric matrix of counts where rows are (usually endogenous) genes and columns are cells. Alternatively, a SummarizedExperiment or SingleCellExperiment containing such a matrix.
`...`	For the generic, further arguments to pass to each method. For the ANY method, further arguments to pass to `fitTrendCV2`. For the SummarizedExperiment method, further arguments to pass to the ANY method. For the SingleCellExperiment method, further arguments to pass to the SummarizedExperiment method.
`spikes`	A numeric matrix of counts where each row corresponds to a spike-in transcript. This should have the same number of columns as `x`. Alternatively, for the SingleCellExperiment method, this can be a string or an integer scalar specifying the `altExp` containing the spike-in count matrix.
`size.factors`	A numeric vector of size factors for each cell in `x`, to be used for scaling gene expression.
`spike.size.factors`	A numeric vector of size factors for each cell in `spikes`, to be used for scaling spike-ins.
`block`	A factor specifying the blocking levels for each cell in `x`. If specified, variance modelling is performed separately within each block and statistics are combined across blocks.
`subset.row`	See `?"scran-gene-selection"`, specifying the rows for which to model the variance. Defaults to all genes in `x`.
`equiweight`	A logical scalar indicating whether statistics from each block should be given equal weight. Otherwise, each block is weighted according to its number of cells. Only used if `block` is specified.
`method`	String specifying how p-values should be combined when `block` is specified, see `combinePValues`.
`BPPARAM`	A BiocParallelParam object indicating whether parallelization should be performed across genes.
`assay.type`	String or integer scalar specifying the assay containing the counts. For the SingleCellExperiment method, this is used to retrieve both the endogenous and spike-in counts.

For each gene and spike-in transcript, we compute the variance and CV2 of the normalized expression values. A trend is fitted to the CV2 against the mean for spike-in transcripts using fitTrendCV2. The value of the trend at the abundance of each gene is used to define the variation attributable to technical noise. The ratio to the trend is then used to define overdispersion corresponding to interesting biological heterogeneity.

This function is almost the same as modelGeneCV2, with the only theoretical difference being that the trend is fitted on spike-in CV2 rather than using the means and CV2 of endogenous genes. This is because there are certain requirements for how normalization is performed when comparing spike-in transcripts with endogenous genes - see comments in “Explanation for size factor rescaling”. We enforce this by centering the size factors for both sets of features and recomputing normalized expression values.

A DataFrame is returned where each row corresponds to a gene in x (or in subset.row, if specified). This contains the numeric fields:

mean:: Mean normalized expression per gene.
total:: Squared coefficient of variation of the normalized expression per gene.
trend:: Fitted value of the trend.
ratio:: Ratio of total to trend.
p.value, FDR:: Raw and adjusted p-values for the test against the null hypothesis that ratio<=1.

If block is not specified, the metadata of the DataFrame contains the output of running fitTrendCV2 on the spike-in transcripts, along with the mean and cv2 used to fit the trend.

If block is specified, the output contains another per.block field. This field is itself a DataFrame of DataFrames, where each internal DataFrame contains statistics for the variance modelling within each block and has the same format as described above. Each internal DataFrame's metadata contains the output of fitTrendCV2 for the cells of that block.

The p-value for each gene is computed by assuming that the CV2 estimates are normally distributed around the trend, and that the standard deviation of the CV2 distribution is proportional to the value of the trend. This is used to construct a one-sided test for each gene based on its ratio, under the null hypothesis that the ratio is equal to 1. The proportionality constant for the standard deviation is set to the std.dev returned by fitTrendCV2. This is estimated from the spread of CV2 values for spike-in transcripts, so the null hypothesis effectively becomes “is this gene more variable than spike-in transcripts of the same abundance?”

If no size factors are supplied, they are automatically computed depending on the input type:

If size.factors=NULL for the ANY method, the sum of counts for each cell in x is used to compute a size factor via the librarySizeFactors function.
If spike.size.factors=NULL for the ANY method, the sum of counts for each cell in spikes is used to compute a size factor via the librarySizeFactors function.
If size.factors=NULL for the SingleCellExperiment method, sizeFactors(x) is used if available. Otherwise, it defaults to library size-derived size factors.
If spike.size.factors=NULL for the SingleCellExperiment method and spikes is not a matrix, sizeFactors(altExp(x, spikes) is used if available. Otherwise, it defaults to library size-derived size factors.

If size.factors or spike.size.factors are supplied, they will override any size factors present in x.

The use of a spike-in-derived trend makes several assumptions. The first is that a constant amount of spike-in RNA was added to each cell, such that any differences in observed expression of the spike-in transcripts can be wholly attributed to technical noise. The second is that endogenous genes and spike-in transcripts follow the same mean-variance relationship, i.e., a spike-in transcript captures the technical noise of an endogenous gene at the same mean count.

Here, the spike-in size factors across all cells are scaled so that their mean is equal to that of the gene-based size factors for the same set of cells. This ensures that the average normalized abundances of the spike-in transcripts are comparable to those of the endogenous genes, allowing the trend fitted to the former to be used to determine the biological component of the latter. Otherwise, differences in scaling of the size factors would shift the normalized expression values of the former away from the latter, violating the second assumption.

If block is specified, rescaling is performed separately for all cells in each block. This aims to avoid problems from (frequent) violations of the first assumption where there are differences in the quantity of spike-in RNA added to each batch. Without scaling, these differences would lead to systematic shifts in the spike-in abundances from the endogenous genes when fitting a batch-specific trend (even if there is no global difference in scaling across all batches).

Setting block will estimate the mean and variance of each gene for cells in each level of block separately. The trend is fitted separately for each level, and the variance decomposition is also performed separately. Per-level statistics are then combined to obtain a single value per gene:

For means and CV2 values, this is done by taking the geometric mean across blocking levels. If equiweight=FALSE, a weighted average is used where the value for each level is weighted by the number of cells. By default, all levels are equally weighted when combining statistics.
Per-level p-values are combined using combinePValues according to method. By default, Fisher's method is used to identify genes that are highly variable in any batch. Whether or not this is responsive to equiweight depends on the chosen method.
Blocks with fewer than 2 cells are completely ignored and do not contribute to the combined mean, variance component or p-value.

Aaron Lun

fitTrendCV2, for the trend fitting options.

modelGeneCV2, for modelling variance without spike-in controls.

library(scuttle)
sce <- mockSCE()

# Using spike-ins.
spk <- modelGeneCV2WithSpikes(sce, "Spikes")
spk

plot(spk$mean, spk$total)
points(metadata(spk)$mean, metadata(spk)$var, col="red", pch=16)
curve(metadata(spk)$trend(x), add=TRUE, col="dodgerblue")

# With blocking (and spike-ins).
block <- sample(LETTERS[1:2], ncol(sce), replace=TRUE)
blk <- modelGeneCV2WithSpikes(sce, "Spikes", block=block)
blk

par(mfrow=c(1,2))
for (i in colnames(blk$per.block)) {
    current <- blk$per.block[[i]]
    plot(current$mean, current$total)
    points(metadata(current)$mean, metadata(current)$var, col="red", pch=16)
    curve(metadata(current)$trend(x), add=TRUE, col="dodgerblue")
}