multiBatchPCA: Multi-batch PCA

View source: R/multiBatchPCA.R

multiBatchPCAR Documentation

Multi-batch PCA

Description

Perform a principal components analysis across multiple gene expression matrices to project all cells to a common low-dimensional space.

Usage

multiBatchPCA(
  ...,
  batch = NULL,
  d = 50,
  subset.row = NULL,
  weights = NULL,
  get.all.genes = FALSE,
  get.variance = FALSE,
  preserve.single = FALSE,
  assay.type = "logcounts",
  BSPARAM = IrlbaParam(),
  deferred = TRUE,
  BPPARAM = SerialParam()
)

Arguments

...

Two or more matrices containing expression values (usually log-normalized). Each matrix is assumed to represent one batch and should contain the same number of rows, corresponding to the same genes in the same order.

Alternatively, two or more SingleCellExperiment objects containing these matrices. Note the same restrictions described above for gene expression matrix inputs.

Alternatively, one matrix or SingleCellExperiment can be supplied containing cells from all batches. This requires batch to also be specified.

Alternatively, one or more lists of matrices or SingleCellExperiments can be provided; this is flattened as if the objects inside were passed directly to ....

batch

A factor specifying the batch identity of each cell in the input data. Ignored if ... contains more than one argument.

d

An integer scalar specifying the number of dimensions to keep from the PCA. Alternatively NA, in which case the PCA step is omitted entirely - see details below.

subset.row

A vector specifying which features to use for correction.

weights

Numeric vector, logical scalar or list specifying the weighting scheme to use, see below for details.

get.all.genes

A logical scalar indicating whether the reported rotation vectors should include genes that are excluded by a non-NULL value of subset.row.

get.variance

A logical scalar indicating whether to return the (weighted) variance explained by each PC.

preserve.single

A logical scalar indicating whether to combine the results into a single matrix if only one object was supplied in ....

assay.type

A string or integer scalar specifying the assay containing the expression values, if SingleCellExperiment objects are present in ....

BSPARAM

A BiocSingularParam object specifying the algorithm to use for PCA, see runSVD for details.

deferred

A logical scalar used to overwrite the deferred status of BSPARAM for greater speed. Set to NULL to use the supplied status in BSPARAM directly.

BPPARAM

A BiocParallelParam object specifying whether the SVD should be parallelized.

Details

This function is roughly equivalent to cbinding all matrices in ... and performing PCA on the merged matrix. The main difference is that each sample is forced to contribute equally to the identification of the rotation vectors. Specifically, the mean vector used for centering is defined as the grand mean of the mean vectors within each batch. Each batch's contribution to the gene-gene covariance matrix is also divided by the number of cells in that batch.

Our approach is to effectively weight the cells in each batch to mimic the situation where all batches have the same number of cells. This ensures that the low-dimensional space can distinguish subpopulations in smaller batches. Otherwise, batches with a large number of cells would dominate the PCA, i.e., the definition of the mean vector and covariance matrix. This may reduce resolution of unique subpopulations in smaller batches that differ in a different dimension to the subspace of the larger batches.

It is usually recommended to set subset.row to a subset of interesting features (e.g., highly variable genes). This reduces computational time and eliminates uninteresting noise that could interfere with identification of the most relevant axes of variation. Setting get.all.genes=TRUE will report rotation vectors that span all genes, even when only a subset of genes are used for the PCA. This is done by projecting all non-used genes into the low-dimensional “cell space” defined by the first d components.

Value

A List of numeric matrices is returned where each matrix corresponds to a batch and contains the first d PCs (columns) for all cells in the batch (rows).

If preserve.single=TRUE and ... contains a single object, the List will only contain a single matrix. This contains the first d PCs (columns) for all cells in the same order as supplied in the single input object.

The metadata contains rotation, a matrix of rotation vectors, which can be used to construct a low-rank approximation of the input matrices. This has number of rows equal to the number of genes after any subsetting, except if get.all.genes=TRUE, where the number of rows is equal to the genes before subsetting.

If get.variance=TRUE, the metadata will also contain var.explained, a numeric vector containing the weighted variance explained by each of d PCs; and var.total, the total variance across all components after weighting.

Tuning the weighting

By default, weights=NULL or TRUE will use the default weights, i.e., the reciprocal of the number of cells in each batch. This equalizes the contribution of batches with different numbers of cells as described above.

If weights=FALSE, no weighting is performed. This means that larger batches will drive the PCA, which may be desirable when dealing with technical replicates where there is no concern about unique subpopulations in smaller batches.

If weights is a numeric vector, it is expected to be of the same length (and, if named, have the same names) as the entries in .... Each entry of the vector is used to scale the default weight of the corresponding batch. This allows users to fine-tune the contribution of each batch in situations with multiple levels of heterogeneity. For example, consider merging data from multiple donors where each donor contains a variable number of batches. In such cases, it may be more appropriate to ensure that each donor has equal weight, rather than each batch. This is done by assigning a value of weights to each replicate that is inversely proportional to the number of batches for the same donor - see Examples.

Alternatively, weights can be a list representing a tree-like structure, identical to the tree controlling the merge order in fastMNN. Here, weights are scaled so that each partition in the tree yields subtrees with identical scaled weights (summed across each subtree's children). This allows us to easily adjust the weighting scheme for hierarchical batch structures like the replicate-study scenario described above.

Performing the PCA

Most of the parameters related to the PCA are controlled by BSPARAM, which determines the choice and parameterization of SVD algorithms. The default is to use IrlbaParam though RandomParam is often faster (at the cost of some accuracy). Most choices of BSPARAM will interactly sanely with BPPARAM to achieve parallelization during the PCA itself.

With the default deferred=TRUE, the per-gene centering and per-cell scaling will be deferred during matrix multiplication in approximate SVD algorithms. This is much faster when the input matrices are sparse, as deferred operations avoids loss of sparsity at the cost of numerical precision. If deferred=NULL, the use of deferred scaling is determined by the setting within BSPARAM itself - see ?bsdeferred for details.

If d=NA, no PCA is performed. Instead, the centered matrices are transposed and returned directly, while the rotation matrix is set to an identity matrix. This allows developers to easily switch between PCA-based approximations versus the underlying dataset in their functions by simply changing d. (In some settings, one can even interpret d=NA as the maximum d, as Euclidean distance calculations between cells will be identical to a full-rank PC matrix.) Note that, in this mode, no projection will be done with get.all.genes=TRUE; rather, genes not in subset.row will simply have rotation values of zero. If get.variance=TRUE, the weighted variances of the individual genes are returned.

Author(s)

Aaron Lun

See Also

runSVD

Examples

d1 <- matrix(rnorm(5000), ncol=100)
d1[1:10,1:10] <- d1[1:10,1:10] + 2 # unique population in d1
d2 <- matrix(rnorm(2000), ncol=40)
d2[11:20,1:10] <- d2[11:20,1:10] + 2 # unique population in d2

# PCA defaults to IRLBA, so we need to set the seed.
set.seed(10)
out <- multiBatchPCA(d1, d2, d=10)

# Examining results.
xlim <- range(c(out[[1]][,1], out[[2]][,1]))
ylim <- range(c(out[[1]][,2], out[[2]][,2]))
plot(out[[1]][,1], out[[1]][,2], col="red", xlim=xlim, ylim=ylim)
points(out[[2]][,1], out[[2]][,2], col="blue") 

# Using the weighting scheme, assuming that 'd2' and 'd3'
# are replicates and should contribute the same combined
# weight as 'd1'.
d3 <- d2 + 5
set.seed(10)
out <- multiBatchPCA(d1, d2, d3, d=10, weights=c(1, 0.5, 0.5))

set.seed(10)
alt <- multiBatchPCA(d1, d2, d3, d=10, weights=list(1, list(2, 3))) 
stopifnot(all.equal(out, alt)) # As they are the same.


LTLA/batchelor documentation built on Jan. 19, 2024, 6:33 p.m.