regressBatches: Regress out batch effects
In batchelor: Single-Cell Batch Correction Methods


Fit a linear model to each gene to regress out uninteresting factors of variation, returning a matrix of residuals.

regressBatches(
  ...,
  batch = NULL,
  design = NULL,
  keep = NULL,
  restrict = NULL,
  subset.row = NULL,
  correct.all = FALSE,
  assay.type = "logcounts",
  d = NA,
  BSPARAM = IrlbaParam(),
  deferred = TRUE,
  BPPARAM = SerialParam()
)

`...`	One or more log-expression matrices where genes correspond to rows and cells correspond to columns. Alternatively, one or more SingleCellExperiment objects can be supplied containing a log-expression matrix in the `assay.type` assay. Each object should contain the same number of rows, corresponding to the same genes in the same order. Objects of different types can be mixed together. If multiple objects are supplied, each object is assumed to contain all and only cells from a single batch. If a single object is supplied, it is assumed to contain cells from all batches, so `batch` should also be specified. Alternatively, one or more lists of matrices or SingleCellExperiments can be provided; these are flattened as if the objects inside each list were passed directly to `...`.
`batch`	A factor specifying the batch of origin for all cells when only a single object is supplied in `...`. This is ignored if multiple objects are present.
`design`	A numeric design matrix with number of rows equal to the total number of cells, specifying the experimental factors to remove. Each row corresponds to a cell in the order supplied in `...`.
`keep`	Integer vector specifying the coefficients of `design` that should not be regressed out; see the `ResidualMatrix` constructor for more details.
`restrict`	A list of length equal to the number of objects in `...`. Each entry of the list corresponds to one batch and specifies the cells to use when computing the correction.
`subset.row`	A vector specifying which features to use for correction.
`correct.all`	Logical scalar indicating whether corrected expression values should be computed for genes not in `subset.row`. Only relevant if `subset.row` is not `NULL`.
`assay.type`	A string or integer scalar specifying the assay containing the log-expression values. Only used for SingleCellExperiment inputs.
`d`	Numeric scalar specifying the number of dimensions to use for PCA via `multiBatchPCA`. If `NA`, no PCA is performed.
`BSPARAM`	A BiocSingularParam object specifying the algorithm to use for PCA in `multiBatchPCA`.
`deferred`	Logical scalar indicating whether to defer centering/scaling; see `multiBatchPCA` for details.
`BPPARAM`	A BiocParallelParam object specifying whether the PCA should be parallelized.
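The different ways of supplying input through `...` can be sketched as follows. This is an illustration on simulated data, assuming that batchelor and SingleCellExperiment are installed; the object names (`B1`, `B2`, `combined`) are made up for the example.

```r
library(batchelor)
library(SingleCellExperiment) # attaches assay() for inspecting the output

set.seed(3)
B1 <- matrix(rnorm(100 * 20), nrow = 100)            # batch 1: 100 genes, 20 cells
B2 <- matrix(rnorm(100 * 30, mean = 1), nrow = 100)  # batch 2: 100 genes, 30 cells

# Several objects, each holding all cells from one batch.
out.multi <- regressBatches(B1, B2)

# One combined object, with 'batch' giving the batch of origin of each column.
combined <- cbind(B1, B2)
batch <- factor(rep(c("first", "second"), c(ncol(B1), ncol(B2))))
out.single <- regressBatches(combined, batch = batch)

# A list of objects is flattened as if its elements were passed directly.
out.list <- regressBatches(list(B1, B2))
```

All three calls describe the same cells and the same one-way layout, so they should produce the same residuals.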

This function fits a linear model to the log-expression values for each gene and returns the residuals. By default, the model is parameterized as a one-way layout with the batch of origin, so the residuals represent the expression values after correcting for the batch effect. The novelty of this function is that it returns a ResidualMatrix as the "corrected" assay. This avoids explicitly computing the residuals, which would result in a loss of sparsity or similar problems. Rather, residuals are either computed as needed or are never explicitly computed at all (e.g., during matrix multiplication). This means that regressBatches is faster and lighter than naive regression or even rescaleBatches.
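The two ideas in this paragraph can be illustrated in base R. The sketch below does not use batchelor or the ResidualMatrix class; it shows that residuals from a one-way layout amount to centering each gene within each batch, and that a product with the residual matrix can be obtained without ever forming the residuals. All object names are made up for the illustration.

```r
set.seed(1)

# Toy log-expression matrix: 5 genes (rows) by 10 cells (columns), two batches.
batch <- factor(rep(c("A", "B"), c(4, 6)))
logexp <- matrix(rnorm(5 * length(batch)), nrow = 5)

# One-way layout on the batch of origin.
design <- model.matrix(~ 0 + batch)

# lm.fit() expects observations (cells) in rows, hence the transposition.
fit <- lm.fit(x = design, y = t(logexp))
resid.lm <- t(fit$residuals)

# The same residuals, obtained by centering each gene within each batch.
resid.manual <- logexp
for (b in levels(batch)) {
    idx <- batch == b
    resid.manual[, idx] <- logexp[, idx] - rowMeans(logexp[, idx, drop = FALSE])
}
same.residuals <- isTRUE(all.equal(resid.lm, resid.manual, check.attributes = FALSE))

# Deferred multiplication: with Q an orthonormal basis for the design,
# t(residuals) = (I - QQ') t(logexp), so t(residuals) %*% v can be computed
# from the original matrix and two small products.
Q <- qr.Q(qr(design))
v <- rnorm(nrow(logexp))
Yv <- t(logexp) %*% v                 # only touches the original matrix
lazy <- Yv - Q %*% crossprod(Q, Yv)   # projection applied to a single vector
same.product <- isTRUE(all.equal(lazy, t(resid.lm) %*% v, check.attributes = FALSE))
```

Both `same.residuals` and `same.product` are `TRUE` here. The second identity is what allows the original matrix to stay sparse during operations such as PCA.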

More complex designs should be explicitly specified with the design argument, e.g., to regress out a covariate. This can be any full-column-rank matrix that is typically constructed with model.matrix. If design is specified with a single object in ..., batch is ignored. If design is specified with multiple objects, regression is applied to the matrix obtained by cbinding all of those objects together; this means that the first few rows of design correspond to the cells from the first object, then the next rows correspond to the second object and so on.
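A custom design might be set up as in the following sketch, which assumes batchelor is installed and uses a made-up continuous covariate. The key point is that the rows of `design` follow the order of the cells after the inputs are combined column-wise.

```r
library(batchelor)

set.seed(2)
B1 <- matrix(rnorm(100 * 20), nrow = 100)  # first object: 20 cells
B2 <- matrix(rnorm(100 * 30), nrow = 100)  # second object: 30 cells

# One entry per cell: the first 20 entries describe B1, the next 30 describe B2.
batch <- factor(rep(c("first", "second"), c(ncol(B1), ncol(B2))))
covariate <- runif(length(batch))  # hypothetical per-cell covariate

# Regress out both the batch and the covariate.
design <- model.matrix(~ batch + covariate)
full.rank <- qr(design)$rank == ncol(design)

out <- regressBatches(B1, B2, design = design)
```

Checking the rank of `design` before the call is a cheap way to catch confounded or redundant terms.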

Like rescaleBatches, this function assumes that the batch effect is orthogonal to the interesting factors of variation. For example, each batch is assumed to have the same composition of cell types. The same reasoning applies to any uninteresting factors specified in design, including continuous variables. For example, if one were to use this function to regress out cell cycle, the assumption is that all cell types are similarly distributed across cell cycle phases. If this is not true, the correction will not only be incomplete but can introduce spurious differences.
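The consequence of violating this assumption can be seen in a small base R sketch. It uses a single made-up gene with no batch effect at all, where the two batches differ only in their cell type composition.

```r
# One gene, two cell types: type X has log-expression 0, type Y has 10.
# There is no batch effect; the batches differ only in composition.
celltype <- c(rep(c("X", "Y"), c(8, 2)),   # batch 1: mostly X
              rep(c("X", "Y"), c(2, 8)))   # batch 2: mostly Y
batch <- factor(rep(c("b1", "b2"), each = 10))
logexp <- ifelse(celltype == "X", 0, 10)

# Residuals from a one-way layout on batch.
res <- residuals(lm(logexp ~ 0 + batch))

# Mean residual for each cell type in each batch.
spurious <- tapply(res, list(celltype, batch), mean)
print(spurious)
```

The batch means are 2 and 8, so type X cells end up at -2 in the first batch and -8 in the second. Cells that were identical before regression now differ by 6 units between batches.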

See ?"batchelor-restrict" for a description of the restrict argument. Specifically, this function will compute the model coefficients using only the specified subset of cells. The regression will then be applied to all cells in each batch.
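As a rough sketch of how `restrict` might be used with multiple objects, assuming batchelor is installed: the list has one entry per batch, and each entry picks the cells used to estimate the coefficients. The choice of cells below is arbitrary and purely illustrative.

```r
library(batchelor)

set.seed(4)
B1 <- matrix(rnorm(100 * 20), nrow = 100)
B2 <- matrix(rnorm(100 * 30, mean = 1), nrow = 100)

# Estimate the coefficients from the first 10 cells of B1 and the first
# 15 cells of B2, e.g., cells known to be comparable across batches.
keep.cells <- list(1:10, 1:15)
out <- regressBatches(B1, B2, restrict = keep.cells)
```

The output still contains every cell from both batches; only the model fit is limited to the chosen cells.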

If set, the d option will perform a PCA via multiBatchPCA. This is provided for convenience as efficiently executing a PCA on a ResidualMatrix is not always intuitive. (Specifically, BiocSingularParam objects must be set up with deferred=TRUE for best performance.) The arguments BSPARAM, deferred and BPPARAM only have an effect when d is set to a non-NA value.

All genes are used with the default setting of subset.row=NULL. If a subset of genes is specified, residuals are only returned for that subset. Similarly, if d is set, only the genes in the subset are used to perform the PCA. If additionally correct.all=TRUE, residuals are returned for all genes but only the subset is used for the PCA.
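The effect of `subset.row` and `correct.all` on the output dimensions can be sketched as follows, assuming batchelor is installed. The chosen genes are arbitrary; in practice they might be a set of highly variable genes.

```r
library(batchelor)

set.seed(5)
B1 <- matrix(rnorm(200 * 20), nrow = 200)
B2 <- matrix(rnorm(200 * 30, mean = 1), nrow = 200)

chosen <- 1:50  # hypothetical gene subset

# Residuals are returned only for the chosen genes.
out.sub <- regressBatches(B1, B2, subset.row = chosen)

# Residuals are returned for all genes.
out.all <- regressBatches(B1, B2, subset.row = chosen, correct.all = TRUE)

c(subset = nrow(out.sub), all = nrow(out.all))
```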

A SingleCellExperiment object containing the corrected assay. This contains the computed residuals for each gene (row) in each cell (column) in each batch. A batch field is present in the column data, specifying the batch of origin for each cell.

Cells in the output object are always ordered in the same manner as supplied in .... For a single input object, cells will be reported in the same order as they are arranged in that object. In cases with multiple input objects, the cell identities are simply concatenated from successive objects, i.e., all cells from the first object (in their provided order), then all cells from the second object, and so on.
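The ordering and the `batch` field can be checked with a short sketch like the one below, assuming batchelor is installed. The batch sizes are chosen to differ so that the run lengths are easy to read.

```r
library(batchelor)

set.seed(6)
B1 <- matrix(rnorm(100 * 20), nrow = 100)  # 20 cells
B2 <- matrix(rnorm(100 * 30), nrow = 100)  # 30 cells

out <- regressBatches(B1, B2)

# Consecutive runs of the batch label: all cells from the first object,
# followed by all cells from the second.
runs <- rle(as.character(out$batch))
runs$lengths
```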

If d is not NA, a PCA is performed on the residual matrix via multiBatchPCA, and an additional corrected field is present in the reducedDims of the output object.
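Requesting and retrieving the PCA results might look like the sketch below, which assumes that batchelor and SingleCellExperiment are installed and uses simulated data. The choice of 10 dimensions is arbitrary.

```r
library(batchelor)
library(SingleCellExperiment)

set.seed(7)
B1 <- matrix(rnorm(200 * 50), nrow = 200)
B2 <- matrix(rnorm(200 * 50, mean = 1), nrow = 200)

# Regress out the batch and compute the first 10 PCs of the residuals.
out <- regressBatches(B1, B2, d = 10)

# One row per cell, one column per retained dimension.
pcs <- reducedDim(out, "corrected")
dim(pcs)
```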

Aaron Lun

rescaleBatches, for another approach to regressing out the batch effect.

The ResidualMatrix class, for the class of the residual matrix.

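library(batchelor)

# Simulate counts for 200 genes in two batches of 50 cells each.
means <- 2^rgamma(200, 2, 1) # one mean per gene, recycled across cells
A1 <- matrix(rpois(10000, lambda=means), ncol=50) # Batch 1
A2 <- matrix(rpois(10000, lambda=means*runif(200, 0, 2)), ncol=50) # Batch 2, with a gene-specific batch effect

# Log-transform and regress out the batch effect.
B1 <- log2(A1 + 1)
B2 <- log2(A2 + 1)
out <- regressBatches(B1, B2)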