multiBatchPCA: Multi-batch PCA
In batchelor: Single-Cell Batch Correction Methods

Description Usage Arguments Details Value Tuning the weighting Author(s) See Also Examples

Perform a principal components analysis across multiple gene expression matrices to project all cells to a common low-dimensional space.

multiBatchPCA(
  ...,
  batch = NULL,
  d = 50,
  subset.row = NULL,
  weights = NULL,
  get.all.genes = FALSE,
  get.variance = FALSE,
  preserve.single = FALSE,
  assay.type = "logcounts",
  BSPARAM = IrlbaParam(),
  deferred = TRUE,
  BPPARAM = SerialParam()
)

`...`	Two or more matrices containing expression values (usually log-normalized). Each matrix is assumed to represent one batch and should contain the same number of rows, corresponding to the same genes in the same order. Alternatively, two or more SingleCellExperiment objects containing these matrices. Note the same restrictions described above for gene expression matrix inputs. Alternatively, one matrix or SingleCellExperiment can be supplied containing cells from all batches. This requires `batch` to also be specified. Alternatively, one or more lists of matrices or SingleCellExperiments can be provided; this is flattened as if the objects inside were passed directly to `...`.
`batch`	A factor specifying the batch identity of each cell in the input data. Ignored if `...` contains more than one argument.
`d`	An integer scalar specifying the number of dimensions to keep from the PCA.
`subset.row`	A vector specifying which features to use for correction.
`weights`	Numeric vector, logical scalar or list specifying the weighting scheme to use, see below for details.
`get.all.genes`	A logical scalar indicating whether the reported rotation vectors should include genes that are excluded by a non-`NULL` value of `subset.row`.
`get.variance`	A logical scalar indicating whether to return the (weighted) variance explained by each PC.
`preserve.single`	A logical scalar indicating whether to combine the results into a single matrix if only one object was supplied in `...`.
`assay.type`	A string or integer scalar specifying the assay containing the expression values, if SingleCellExperiment objects are present in `...`.
`BSPARAM`	A BiocSingularParam object specifying the algorithm to use for PCA, see `runSVD` for details.
`deferred`	A logical scalar used to overwrite the `deferred` status of `BSPARAM` for greater speed. Set to `NULL` to use the supplied status in `BSPARAM` directly.
`BPPARAM`	A BiocParallelParam object specifying whether the SVD should be parallelized.

This function is roughly equivalent to cbinding all matrices in ... and performing PCA on the merged matrix. The main difference is that each sample is forced to contribute equally to the identification of the rotation vectors. Specifically, the mean vector used for centering is defined as the grand mean of the mean vectors within each batch. Each batch's contribution to the gene-gene covariance matrix is also divided by the number of cells in that batch.

Our approach is to effectively weight the cells in each batch to mimic the situation where all batches have the same number of cells. This ensures that the low-dimensional space can distinguish subpopulations in smaller batches. Otherwise, batches with a large number of cells would dominate the PCA, i.e., the definition of the mean vector and covariance matrix. This may reduce resolution of unique subpopulations in smaller batches that differ in a different dimension to the subspace of the larger batches.

Setting get.all.genes=TRUE will report rotation vectors that span all genes, even when only a subset of genes are used for the PCA. This is done by projecting all non-used genes into the low-dimensional “cell space” defined by the first d components.

With the default deferred=TRUE, the per-gene centering and per-cell scaling will be deferred during matrix multiplication. This can greatly improve speeds when the input matrices are sparse, as deferred operations avoids loss of sparsity at the cost of numerical precision. If deferred=NULL, the use of deferred scaling is determined by the setting within BSPARAM itself.

A List of numeric matrices is returned where each matrix corresponds to a batch and contains the first d PCs (columns) for all cells in the batch (rows).

If preserve.single=TRUE and ... contains a single object, the List will only contain a single matrix. This contains the first d PCs (columns) for all cells in the same order as supplied in the single input object.

The metadata contains rotation, a matrix of rotation vectors, which can be used to construct a low-rank approximation of the input matrices. This has number of rows equal to the number of genes after any subsetting, except if get.all.genes=TRUE, where the number of rows is equal to the genes before subsetting.

If get.variance=TRUE, the metadata will also contain var.explained, the weighted variance explained by each PC; and var.total, the total variance after weighting.

By default, weights=NULL or TRUE will use the default weights, i.e., the reciprocal of the number of cells in each batch. This equalizes the contribution of batches with different numbers of cells as described above.

If weights=FALSE, no weighting is performed. This means that larger batches will drive the PCA, which may be desirable when dealing with technical replicates where there is no concern about unique subpopulations in smaller batches.

If weights is a numeric vector, it is expected to be of the same length (and, if named, have the same names) as the entries in .... Each entry of the vector is used to scale the default weight of the corresponding batch. This allows users to fine-tune the contribution of each batch in situations with multiple levels of heterogeneity. For example, consider merging data from multiple donors where each donor contains a variable number of batches. In such cases, it may be more appropriate to ensure that each donor has equal weight, rather than each batch. This is done by assigning a value of weights to each replicate that is inversely proportional to the number of batches for the same donor - see Examples.

Alternatively, weights can be a list representing a tree-like structure, identical to the tree controlling the merge order in fastMNN. Here, weights are scaled so that each partition in the tree yields subtrees with identical scaled weights (summed across each subtree's children). This allows us to easily adjust the weighting scheme for hierarchical batch structures like the replicate-study scenario described above.

Aaron Lun

runSVD

d1 <- matrix(rnorm(5000), ncol=100)
d1[1:10,1:10] <- d1[1:10,1:10] + 2 # unique population in d1
d2 <- matrix(rnorm(2000), ncol=40)
d2[11:20,1:10] <- d2[11:20,1:10] + 2 # unique population in d2

# PCA defaults to IRLBA, so we need to set the seed.
set.seed(10)
out <- multiBatchPCA(d1, d2, d=10)

# Examining results.
xlim <- range(c(out[[1]][,1], out[[2]][,1]))
ylim <- range(c(out[[1]][,2], out[[2]][,2]))
plot(out[[1]][,1], out[[1]][,2], col="red", xlim=xlim, ylim=ylim)
points(out[[2]][,1], out[[2]][,2], col="blue") 

# Using the weighting scheme, assuming that 'd2' and 'd3'
# are replicates and should contribute the same combined
# weight as 'd1'.
d3 <- d2 + 5
set.seed(10)
out <- multiBatchPCA(d1, d2, d3, d=10, weights=c(1, 0.5, 0.5))

set.seed(10)
alt <- multiBatchPCA(d1, d2, d3, d=10, weights=list(1, list(2, 3))) 
stopifnot(all.equal(out, alt)) # As they are the same.