rescaleBatches: Scale counts across batches

rescaleBatches (R Documentation)

Scale counts across batches

Description

Scale counts so that the average count within each batch is the same for each gene.

Usage

rescaleBatches(
  ...,
  batch = NULL,
  restrict = NULL,
  log.base = 2,
  pseudo.count = 1,
  subset.row = NULL,
  correct.all = FALSE,
  assay.type = "logcounts"
)

Arguments

...

One or more log-expression matrices where genes correspond to rows and cells correspond to columns. Alternatively, one or more SingleCellExperiment objects can be supplied containing a log-expression matrix in the assay.type assay. Each object should contain the same number of rows, corresponding to the same genes in the same order. Objects of different types can be mixed together.

If multiple objects are supplied, each object is assumed to contain all and only cells from a single batch. If a single object is supplied, it is assumed to contain cells from all batches, so batch should also be specified.

Alternatively, one or more lists of matrices or SingleCellExperiments can be provided; these are flattened as if the objects inside each list were passed directly to ....

batch

A vector or factor specifying the batch of origin for all cells when only a single object is supplied in .... This is ignored if multiple objects are present.

restrict

A list of length equal to the number of objects in .... Each entry of the list corresponds to one batch and specifies the cells to use when computing the correction.

log.base

A numeric scalar specifying the base of the log-transformation.

pseudo.count

A numeric scalar specifying the pseudo-count used for the log-transformation.

subset.row

A vector specifying which features to use for correction.

correct.all

Logical scalar indicating whether corrected expression values should be computed for genes not in subset.row. Only relevant if subset.row is not NULL.

assay.type

A string or integer scalar specifying the assay containing the log-expression values. Only used for SingleCellExperiment inputs.
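The calling conventions described by the arguments above can be illustrated with a short sketch. It assumes the batchelor package is installed; the simulated matrices and the names B1, B2, combined and batch are made up for illustration.

```r
library(batchelor)

set.seed(1)
# Two small batches of log-expression values: 100 genes x 20 cells each.
B1 <- log2(matrix(rpois(2000, lambda = 5), ncol = 20) + 1)
B2 <- log2(matrix(rpois(2000, lambda = 10), ncol = 20) + 1)

# Form 1: one object per batch.
out1 <- rescaleBatches(B1, B2)

# Form 2: a single combined object, with 'batch' giving each cell's origin.
combined <- cbind(B1, B2)
batch <- rep(c("a", "b"), each = 20)
out2 <- rescaleBatches(combined, batch = batch)

# Restrict the correction to the first 50 genes.
out3 <- rescaleBatches(B1, B2, subset.row = 1:50)
dim(out3)
```

With the default correct.all=FALSE, the last call should only report the genes named in subset.row.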

Details

This function assumes that the log-expression values were computed by log-transforming normalized counts after adding a pseudo-count. It reverses the log-transformation and scales the underlying counts in each batch so that the average (normalized) count for each gene is equal across batches. This relies on the assumption that each batch contains the same population composition, so that any scaling difference between batches is technical and should be removed.
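The procedure described above can be sketched in base R. This is an illustrative approximation under the default settings (base-2 logs, pseudo-count of 1), not the package's actual implementation; rescale_sketch and the simulated data are hypothetical and exist only for this example.

```r
# Hypothetical helper, for illustration only: rescale two batches of
# log-expression values so that each gene's average count matches.
rescale_sketch <- function(B1, B2, log.base = 2, pseudo.count = 1) {
    # Reverse the log-transformation to recover the (normalized) counts.
    C1 <- log.base^B1 - pseudo.count
    C2 <- log.base^B2 - pseudo.count

    # Per-gene average count in each batch, and the lower of the two.
    m1 <- rowMeans(C1)
    m2 <- rowMeans(C2)
    target <- pmin(m1, m2)

    # Per-gene scaling factors; a batch whose average is zero is left as-is.
    s1 <- ifelse(m1 > 0, target / m1, 1)
    s2 <- ifelse(m2 > 0, target / m2, 1)

    # Scale each row (vector recycling runs down columns, so row i is
    # multiplied by the i-th factor), then re-apply the log-transformation.
    list(
        log(C1 * s1 + pseudo.count, base = log.base),
        log(C2 * s2 + pseudo.count, base = log.base)
    )
}

set.seed(100)
A1 <- matrix(rpois(1000, lambda = 5), ncol = 50)   # 20 genes x 50 cells
A2 <- matrix(rpois(1000, lambda = 10), ncol = 50)  # same genes, higher counts
res <- rescale_sketch(log2(A1 + 1), log2(A2 + 1))

# The per-gene average counts should now agree between the two batches.
all.equal(rowMeans(2^res[[1]] - 1), rowMeans(2^res[[2]] - 1))
```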

This function is approximately equivalent to centering in log-expression space, the simplest application of linear regression methods for batch correction. However, by scaling the counts rather than centering the log-values, it avoids the loss of sparsity that centering would cause. It also mitigates artificial differences in variance introduced by the log-transformation: each gene is always downscaled to its lowest average expression across batches, so that differences in variance are dampened by the addition of the pseudo-count.
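The sparsity argument can be checked with a small base R comparison for a single gene. This is a sketch under the default base-2 log and pseudo-count of 1; all variable names are hypothetical and nothing here calls the package itself.

```r
set.seed(42)
# One gene, two batches of 100 cells; batch 2 has twice the average count.
x1 <- log2(rpois(100, lambda = 0.5) + 1)
x2 <- log2(rpois(100, lambda = 1) + 1)

# Centering in log-space: subtract each batch's mean log-expression.
cen1 <- x1 - mean(x1)
cen2 <- x2 - mean(x2)

# Scaling in count space: downscale each batch to the lower average count.
c1 <- 2^x1 - 1
c2 <- 2^x2 - 1
target <- min(mean(c1), mean(c2))
sc1 <- log2(c1 * target / mean(c1) + 1)
sc2 <- log2(c2 * target / mean(c2) + 1)

# Check whether zeros in the input survive each transformation.
zero2 <- x2 == 0
c(scaled = all(sc2[zero2] == 0), centered = any(cen2[zero2] == 0))
```

Zero counts stay at zero after scaling because multiplying zero by any factor leaves it unchanged, whereas centering shifts every value by the batch mean.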

Use of rescaleBatches assumes that the batch effect is orthogonal to the interesting factors of variation. For example, each batch is assumed to have the same composition of cell types. If this is not true, the correction will not only be incomplete but may also introduce spurious differences.

The output values are always re-log-transformed with the same log.base and pseudo.count. These can be used directly in place of the input values for downstream operations.

All genes are used with the default setting of subset.row=NULL. Users can set subset.row to subset the inputs, though this is purely for convenience as each gene is processed independently of other genes.

See ?"batchelor-restrict" for a description of the restrict argument. Specifically, the function will compute the scaling differences using only the specified subset of cells, and then apply the re-scaling to all cells in each batch.
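A minimal sketch of the restrict argument follows, assuming the batchelor package is installed. The simulated data and the choice of the first 10 cells in each batch are hypothetical and serve only to show the shape of the argument.

```r
library(batchelor)

set.seed(2)
# Two small batches of log-expression values: 100 genes x 20 cells each.
B1 <- log2(matrix(rpois(2000, lambda = 5), ncol = 20) + 1)
B2 <- log2(matrix(rpois(2000, lambda = 10), ncol = 20) + 1)

# Hypothetical scenario: only the first 10 cells in each batch are assumed
# to be comparable, so only they are used to estimate the scaling.
out <- rescaleBatches(B1, B2, restrict = list(1:10, 1:10))

# The rescaling is still applied to every cell, so all 40 cells are returned.
ncol(out)
```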

Value

A SingleCellExperiment object containing the corrected assay. This contains corrected log-expression values for each gene (row) in each cell (column) in each batch. A batch field is present in the column data, specifying the batch of origin for each cell.

Cells in the output object are always ordered in the same manner as supplied in .... For a single input object, cells will be reported in the same order as they are arranged in that object. In cases with multiple input objects, the cell identities are simply concatenated from successive objects, i.e., all cells from the first object (in their provided order), then all cells from the second object, and so on.
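The returned object can be inspected with the standard SummarizedExperiment accessors. The sketch below assumes that the batchelor and SummarizedExperiment packages are installed; the simulated data are hypothetical, and the first assay is retrieved by position rather than by name.

```r
library(batchelor)
library(SummarizedExperiment)

set.seed(3)
# Two small batches of log-expression values: 100 genes x 20 cells each.
B1 <- log2(matrix(rpois(2000, lambda = 5), ncol = 20) + 1)
B2 <- log2(matrix(rpois(2000, lambda = 10), ncol = 20) + 1)
out <- rescaleBatches(B1, B2)

# List the available assays and extract the first one.
assayNames(out)
corrected <- assay(out)
dim(corrected)

# Batch of origin for each cell, in the order the batches were supplied.
table(out$batch)
```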

Author(s)

Aaron Lun

See Also

regressBatches, for a residual calculation based on a fitted linear model.

applyMultiSCE, to apply this function over multiple altExps.

Examples

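library(batchelor)

# Simulate 200 genes x 50 cells per batch. 'means' has one entry per gene and
# is recycled down each column, so every gene keeps the same mean across cells.
ngenes <- 200
means <- 2^rgamma(ngenes, 2, 1)
A1 <- matrix(rpois(ngenes * 50, lambda=means), ncol=50) # Batch 1
A2 <- matrix(rpois(ngenes * 50, lambda=means*runif(ngenes, 0, 2)), ncol=50) # Batch 2

# Log-transform with the same base and pseudo-count as the function defaults.
B1 <- log2(A1 + 1)
B2 <- log2(A2 + 1)
out <- rescaleBatches(B1, B2)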


LTLA/batchelor documentation built on Jan. 19, 2024, 6:33 p.m.