mnnDeltaVariance: Computes the variance of the paired MNN deltas

View source: R/mnnDeltaVariance.R

mnnDeltaVarianceR Documentation

Computes the variance of the paired MNN deltas

Description

Computes the variance of the paired MNN deltas

Usage

mnnDeltaVariance(
  ...,
  pairs,
  cos.norm = FALSE,
  subset.row = NULL,
  compute.all = FALSE,
  assay.type = "logcounts",
  BPPARAM = SerialParam(),
  trend.args = list()
)

Arguments

...

One or more log-expression matrices where genes correspond to rows and cells correspond to columns. Alternatively, one or more SingleCellExperiment objects can be supplied containing a log-expression matrix in the assay.type assay. Each object should contain the same number of rows, corresponding to the same genes in the same order. Objects of different types can be mixed together.

If multiple objects are supplied, each object is assumed to contain all and only cells from a single batch. If a single object is supplied, it is assumed to contain cells from all batches, so batch should also be specified.

Alternatively, one or more lists of matrices or SingleCellExperiments can be provided; this is flattened as if the objects inside each list were passed directly to ....

pairs

A DataFrame or list of DataFrames containing MNN pairing information. Each row of each DataFrame specifies an MNN pair; each DataFrame should have two columns containing the column indices of paired cells. This is typically produced by fastMNN, see that documentation for more information.

cos.norm

A logical scalar indicating whether cosine normalization should be performed on the expression values.

subset.row

A vector specifying which genes to use for the calculation. This is typically the same as the subset.row passed to fastMNN, if any.

compute.all

Logical scalar indicating whether statistics should be returned for all genes when subset.row is specified, see DEtails.

assay.type

A string or integer scalar specifying the assay containing the log-expression values. Only used for SingleCellExperiment inputs.

BPPARAM

A BiocParallelParam object specifying whether the calculations should be parallelized.

trend.args

Named list of further arguments to pass to fitTrendVar from the scran package.

Details

The “MNN delta” is defined as the difference in the uncorrected expression values of MNN-paired cells. We compute the variance of these values across all pairs for each gene; genes with highly variable deltas are strongly affected by the correction beyond a simple shift. By looking through the top genes, we may conclude that the correction did something unusual if, e.g., we find important markers in the top set. Conversely, if key genes do not have unusually high variances in their delta, we may gain some confidence in the reliability of the correction - at least with respect to the behavior of those genes.

We compute the variance of the deltas rather than using their magnitude, as a shift in expression space is not particularly concerning (and is, in fact, par for the course for batch correction). Rather, we are interested in identifying non-linear changes that might be indicative of correction errors, e.g., inappropriate merging of two different subpopulations. This would manifest as high variances for the relevant marker genes. Of course, whether or not this is actually an error can be a subjective decision - loss of some biological heterogeneity is often an acceptable cost for flexible correction.

To eliminate the mean-variance relationship, we fit a trend to the variances across all genes with respect to the mean log-expression in each MNN pair. We then define an adjusted variance based on the residuals of the fitted curve; this is a more appropriate measure to use for ranking affected genes, as it ensures that genes are not highly ranked due to their abundance alone. Fitting is done using the fitTrendVar function from the scran package - further arguments can be passed via trend.args.

If a list is passed to pairs, each entry of the list is assumed to correspond to a merge step. The variance and trend fitting is done separately for each merge step, and the results are combined by computing the average variance across all steps. This avoids considering the variance of the deltas across merge steps, which is not particularly concerning, e.g., if different batches require different translations.

Value

A DataFrame with one row per gene in ... (or as specified by subset.row), containing the following columns:

  • mean, the mean of the mean log-expressions across all MNN pairs. This may contain repeated contributions from the same cell if it is involved in many MNN pairs.

  • total, the total variance of the deltas across all MNN pairs.

  • trend, the fitted values of the trend in total with respect to mean.

  • adjusted, the adjusted variance, i.e., the residuals of the fitted trend.

The metadata contains the trend fitting statistics returned by fitTrendVar.

If pairs is a list of length greater than 1, the returned DataFrame will also contain per.block, a nested DataFrames of nested DataFrames. Each one of these contains statistics for the individual merge steps and has the same structure as that described above.

Improving consistency with fastMNN

If cos.norm=TRUE, cosine normalization is performed on each cell in ... with cosineNorm. This mimics what is done inside fastMNN for greater consistency with that function's internal calculations. Some further scaling is performed so that the magnitude of the normalized values is mostly unaffected.

If subset.row is specified, variances are only computed for the requested subset of genes, most typically the set of highly variable genes used in fastMNN. This also implies that the normalization and trend fitting is limited to the specified subset. However, if compute.all=TRUE, the scaling factor and fitted trend are extrapolated to compute adjusted variances for all other genes. This is useful for picking up genes outside of the subset used in the correction.

In most applications, it is not necessary to have strict consistency with the internal values computed by fastMNN. Nonetheless, it may be helpful for diagnostics in more difficult situations.

Author(s)

Aaron Lun

See Also

fastMNN and related functions, to obtain the MNN pairs in the first place.

Examples

means <- 2^rexp(200) * 10
B1 <- matrix(rpois(10000, means), ncol=50) # Batch 1 
B2 <- matrix(rpois(10000, means), ncol=50) # Batch 2

# Spiking in some population structure. Each batch is split into two halves 
# corresponding to 2 subpopulations. These populations match up across batches
# but the group in batch 1 also upregulates the first gene.
B1[1:2,1:25] <- B1[1:2,1:25] * 50 
B2[2,1:25] <- B2[2,1:25] * 50 

# Quick log-transformation and correction. We'll omit the multiBatchNorm as the
# sequencing depth is the same across batches anyway.
library(scuttle)
B1 <- normalizeCounts(B1)
B2 <- normalizeCounts(B2)
out <- fastMNN(B1, B2)

# First gene should have high variance of deltas, as the upregulation in B1 is 
# removed to enable the alignment of subpopulations across batches.
mnnDeltaVariance(B1, B2, pairs=metadata(out)$merge.info$pairs[[1]])


LTLA/batchelor documentation built on Jan. 19, 2024, 6:33 p.m.