View source: R/multiBatchPCA.R

multiBatchPCA | R Documentation |

Perform a principal components analysis across multiple gene expression matrices to project all cells to a common low-dimensional space.

```
multiBatchPCA(
...,
batch = NULL,
d = 50,
subset.row = NULL,
weights = NULL,
get.all.genes = FALSE,
get.variance = FALSE,
preserve.single = FALSE,
assay.type = "logcounts",
BSPARAM = IrlbaParam(),
deferred = TRUE,
BPPARAM = SerialParam()
)
```

`...` |
Two or more matrices containing expression values (usually log-normalized). Each matrix is assumed to represent one batch and should contain the same number of rows, corresponding to the same genes in the same order. Alternatively, two or more SingleCellExperiment objects containing these matrices. Note the same restrictions described above for gene expression matrix inputs. Alternatively, one matrix or SingleCellExperiment can be supplied containing cells from all batches.
This requires Alternatively, one or more lists of matrices or SingleCellExperiments can be provided;
this is flattened as if the objects inside were passed directly to |

`batch` |
A factor specifying the batch identity of each cell in the input data.
Ignored if |

`d` |
An integer scalar specifying the number of dimensions to keep from the PCA.
Alternatively |

`subset.row` |
A vector specifying which features to use for correction. |

`weights` |
Numeric vector, logical scalar or list specifying the weighting scheme to use, see below for details. |

`get.all.genes` |
A logical scalar indicating whether the reported rotation vectors should include genes
that are excluded by a non- |

`get.variance` |
A logical scalar indicating whether to return the (weighted) variance explained by each PC. |

`preserve.single` |
A logical scalar indicating whether to combine the results into a single matrix if only one object was supplied in |

`assay.type` |
A string or integer scalar specifying the assay containing the expression values, if SingleCellExperiment objects are present in |

`BSPARAM` |
A BiocSingularParam object specifying the algorithm to use for PCA, see |

`deferred` |
A logical scalar used to overwrite the |

`BPPARAM` |
A BiocParallelParam object specifying whether the SVD should be parallelized. |

This function is roughly equivalent to `cbind`

ing all matrices in `...`

and performing PCA on the merged matrix.
The main difference is that each sample is forced to contribute equally to the identification of the rotation vectors.
Specifically, the mean vector used for centering is defined as the grand mean of the mean vectors within each batch.
Each batch's contribution to the gene-gene covariance matrix is also divided by the number of cells in that batch.

Our approach is to effectively weight the cells in each batch to mimic the situation where all batches have the same number of cells. This ensures that the low-dimensional space can distinguish subpopulations in smaller batches. Otherwise, batches with a large number of cells would dominate the PCA, i.e., the definition of the mean vector and covariance matrix. This may reduce resolution of unique subpopulations in smaller batches that differ in a different dimension to the subspace of the larger batches.

It is usually recommended to set `subset.row`

to a subset of interesting features (e.g., highly variable genes).
This reduces computational time and eliminates uninteresting noise that could interfere with identification of the most relevant axes of variation.
Setting `get.all.genes=TRUE`

will report rotation vectors that span all genes, even when only a subset of genes are used for the PCA.
This is done by projecting all non-used genes into the low-dimensional “cell space” defined by the first `d`

components.

A List of numeric matrices is returned where each matrix corresponds to a batch and contains the first `d`

PCs (columns) for all cells in the batch (rows).

If `preserve.single=TRUE`

and `...`

contains a single object, the List will only contain a single matrix.
This contains the first `d`

PCs (columns) for all cells in the same order as supplied in the single input object.

The metadata contains `rotation`

, a matrix of rotation vectors, which can be used to construct a low-rank approximation of the input matrices.
This has number of rows equal to the number of genes after any subsetting, except if `get.all.genes=TRUE`

, where the number of rows is equal to the genes before subsetting.

If `get.variance=TRUE`

, the metadata will also contain `var.explained`

, a numeric vector containing the weighted variance explained by each of `d`

PCs;
and `var.total`

, the total variance across all components after weighting.

By default, `weights=NULL`

or `TRUE`

will use the default weights,
i.e., the reciprocal of the number of cells in each batch.
This equalizes the contribution of batches with different numbers of cells as described above.

If `weights=FALSE`

, no weighting is performed.
This means that larger batches will drive the PCA,
which may be desirable when dealing with technical replicates where there is no concern about unique subpopulations in smaller batches.

If `weights`

is a numeric vector, it is expected to be of the same length
(and, if named, have the same names) as the entries in `...`

.
Each entry of the vector is used to scale the default weight of the corresponding batch.
This allows users to fine-tune the contribution of each batch in situations with multiple levels of heterogeneity.
For example, consider merging data from multiple donors where each donor contains a variable number of batches.
In such cases, it may be more appropriate to ensure that each donor has equal weight, rather than each batch.
This is done by assigning a value of `weights`

to each replicate that is inversely proportional to the number of batches for the same donor - see Examples.

Alternatively, `weights`

can be a list representing a tree-like structure,
identical to the tree controlling the merge order in `fastMNN`

.
Here, weights are scaled so that each partition in the tree yields subtrees with identical scaled weights (summed across each subtree's children).
This allows us to easily adjust the weighting scheme for hierarchical batch structures like the replicate-study scenario described above.

Most of the parameters related to the PCA are controlled by `BSPARAM`

, which determines the choice and parameterization of SVD algorithms.
The default is to use `IrlbaParam`

though `RandomParam`

is often faster (at the cost of some accuracy).
Most choices of `BSPARAM`

will interactly sanely with `BPPARAM`

to achieve parallelization during the PCA itself.

With the default `deferred=TRUE`

, the per-gene centering and per-cell scaling will be deferred during matrix multiplication in approximate SVD algorithms.
This is much faster when the input matrices are sparse, as deferred operations avoids loss of sparsity at the cost of numerical precision.
If `deferred=NULL`

, the use of deferred scaling is determined by the setting within `BSPARAM`

itself - see `?bsdeferred`

for details.

If `d=NA`

, no PCA is performed.
Instead, the centered matrices are transposed and returned directly, while the rotation matrix is set to an identity matrix.
This allows developers to easily switch between PCA-based approximations versus the underlying dataset in their functions by simply changing `d`

.
(In some settings, one can even interpret `d=NA`

as the maximum `d`

, as Euclidean distance calculations between cells will be identical to a full-rank PC matrix.)
Note that, in this mode, no projection will be done with `get.all.genes=TRUE`

; rather, genes not in `subset.row`

will simply have rotation values of zero.
If `get.variance=TRUE`

, the weighted variances of the individual genes are returned.

Aaron Lun

`runSVD`

```
d1 <- matrix(rnorm(5000), ncol=100)
d1[1:10,1:10] <- d1[1:10,1:10] + 2 # unique population in d1
d2 <- matrix(rnorm(2000), ncol=40)
d2[11:20,1:10] <- d2[11:20,1:10] + 2 # unique population in d2
# PCA defaults to IRLBA, so we need to set the seed.
set.seed(10)
out <- multiBatchPCA(d1, d2, d=10)
# Examining results.
xlim <- range(c(out[[1]][,1], out[[2]][,1]))
ylim <- range(c(out[[1]][,2], out[[2]][,2]))
plot(out[[1]][,1], out[[1]][,2], col="red", xlim=xlim, ylim=ylim)
points(out[[2]][,1], out[[2]][,2], col="blue")
# Using the weighting scheme, assuming that 'd2' and 'd3'
# are replicates and should contribute the same combined
# weight as 'd1'.
d3 <- d2 + 5
set.seed(10)
out <- multiBatchPCA(d1, d2, d3, d=10, weights=c(1, 0.5, 0.5))
set.seed(10)
alt <- multiBatchPCA(d1, d2, d3, d=10, weights=list(1, list(2, 3)))
stopifnot(all.equal(out, alt)) # As they are the same.
```

Embedding an R snippet on your website

Add the following code to your website.

For more information on customizing the embed code, read Embedding Snippets.