mnnCorrect: Mutual nearest neighbors correction

View source: R/mnnCorrect.R

mnnCorrectR Documentation

Mutual nearest neighbors correction

Description

Correct for batch effects in single-cell expression data using the mutual nearest neighbors method.

Usage

mnnCorrect(
  ...,
  batch = NULL,
  restrict = NULL,
  k = 20,
  prop.k = NULL,
  sigma = 0.1,
  cos.norm.in = TRUE,
  cos.norm.out = TRUE,
  svd.dim = 0L,
  var.adj = TRUE,
  subset.row = NULL,
  correct.all = FALSE,
  merge.order = NULL,
  auto.merge = FALSE,
  assay.type = "logcounts",
  BSPARAM = ExactParam(),
  BNPARAM = KmknnParam(),
  BPPARAM = SerialParam()
)

Arguments

...

One or more log-expression matrices where genes correspond to rows and cells correspond to columns. Alternatively, one or more SingleCellExperiment objects can be supplied containing a log-expression matrix in the assay.type assay. Each object should contain the same number of rows, corresponding to the same genes in the same order. Objects of different types can be mixed together.

If multiple objects are supplied, each object is assumed to contain all and only cells from a single batch. If a single object is supplied, it is assumed to contain cells from all batches, so batch should also be specified.

Alternatively, one or more lists of matrices or SingleCellExperiments can be provided; this is flattened as if the objects inside each list were passed directly to ....

batch

A vector or factor specifying the batch of origin for all cells when only a single object is supplied in .... This is ignored if multiple objects are present.

restrict

A list of length equal to the number of objects in .... Each entry of the list corresponds to one batch and specifies the cells to use when computing the correction.

k

An integer scalar specifying the number of nearest neighbors to consider when identifying MNNs.

prop.k

A numeric scalar in (0, 1) specifying the proportion of cells in each dataset to use for mutual nearest neighbor searching. If set, the number of nearest neighbors used for the MNN search in each batch is redefined as max(k, prop.k*N) where N is the number of cells in that batch.

sigma

A numeric scalar specifying the bandwidth of the Gaussian smoothing kernel used to compute the correction vector for each cell.

cos.norm.in

A logical scalar indicating whether cosine normalization should be performed on the input data prior to calculating distances between cells.

cos.norm.out

A logical scalar indicating whether cosine normalization should be performed prior to computing corrected expression values.

svd.dim

An integer scalar specifying the number of dimensions to use for summarizing biological substructure within each batch.

var.adj

A logical scalar indicating whether variance adjustment should be performed on the correction vectors.

subset.row

A vector specifying which features to use for correction.

correct.all

A logical scalar specifying whether correction should be applied to all genes, even if only a subset is used for the MNN calculations.

merge.order

An integer vector containing the linear merge order of batches in .... Alternatively, a list of lists representing a tree structure specifying a hierarchical merge order.

auto.merge

Logical scalar indicating whether to automatically identify the “best” merge order.

assay.type

A string or integer scalar specifying the assay containing the log-expression values. Only used for SingleCellExperiment inputs.

BSPARAM

A BiocSingularParam object specifying the algorithm to use for PCA in multiBatchPCA.

BNPARAM

A BiocNeighborParam object specifying the nearest neighbor algorithm.

BPPARAM

A BiocParallelParam object specifying whether the PCA and nearest-neighbor searches should be parallelized.

Details

This function is designed for batch correction of single-cell RNA-seq data where the batches are partially confounded with biological conditions of interest. It does so by identifying pairs of mutual nearest neighbors (MNN) in the high-dimensional log-expression space. Each MNN pair represents cells in different batches that are of the same cell type/state, assuming that batch effects are mostly orthogonal to the biological manifold. Correction vectors are calculated from the pairs of MNNs and corrected (log-)expression values are returned for use in clustering and dimensionality reduction.

For each MNN pair, a pairwise correction vector is computed based on the difference in the log-expression profiles. The correction vector for each cell is computed by applying a Gaussian smoothing kernel with bandwidth sigma is the pairwise vectors. This stabilizes the vectors across many MNN pairs and extends the correction to those cells that do not have MNNs. The choice of sigma determines the extent of smoothing - a value of 0.1 is used by default, corresponding to 10% of the radius of the space after cosine normalization.

Value

A SingleCellExperiment object containing the corrected assay. This contains corrected expression values for each gene (row) in each cell (column) in each batch. A batch field is present in the column data, specifying the batch of origin for each cell.

Cells in the output object are always ordered in the same manner as supplied in .... For a single input object, cells will be reported in the same order as they are arranged in that object. In cases with multiple input objects, the cell identities are simply concatenated from successive objects, i.e., all cells from the first object (in their provided order), then all cells from the second object, and so on.

The metadata of the SingleCellExperiment contains merge.info, a DataFrame where each row corresponds to a merge step. See “Merge diagnostics” for more information.

Choosing the gene set

All genes are used with the default setting of subset.row=NULL. Users can set subset.row to subset the inputs to highly variable genes or marker genes. This may provide more meaningful identification of MNN pairs by reducing the noise from irrelevant genes. Note that users should not be too restrictive with subsetting, as high dimensionality is required to satisfy the orthogonality assumption in MNN detection.

If subset.row is specified and correct.all=TRUE, corrected values are returned for all genes. This is possible as subset.row is only used to identify the MNN pairs and other cell-based distance calculations. Correction vectors between MNN pairs can then be computed in for all genes in the supplied matrices. Note that setting correct.all=TRUE will not alter the corrected expression values for the subsetted genes.

Expected type of input data

The input expression values should generally be log-transformed, e.g., log-counts, see logNormCounts for details. They should also be normalized within each data set to remove cell-specific biases in capture efficiency and sequencing depth. Users may also consider using the multiBatchNorm functino to mitigate the effect of differences in sequencing depth between batches.

By default, a further cosine normalization step is performed on the supplied expression data to eliminate gross scaling differences between data sets.

  • When cos.norm.in=TRUE, cosine normalization is performed on the matrix of expression values used to compute distances between cells. This can be turned off when there are no scaling differences between data sets.

  • When cos.norm.out=TRUE, cosine normalization is performed on the matrix of values used to calculate correction vectors (and on which those vectors are applied). This can be turned off to obtain corrected values on the log-scale, similar to the input data.

The cosine normalization is achieved using the cosineNorm function.

Further options

The function depends on a shared biological manifold, i.e., one or more cell types/states being present in multiple batches. If this is not true, MNNs may be incorrectly identified, resulting in over-correction and removal of interesting biology. Some protection can be provided by removing components of the correction vectors that are parallel to the biological subspaces in each batch. The biological subspace in each batch is identified with a SVD on the expression matrix to obtain svd.dim dimensions. (By default, this option is turned off by setting svd.dim=0.)

If var.adj=TRUE, the function will adjust the correction vector to equalize the variances of the two data sets along the batch effect vector. In particular, it avoids “kissing” effects whereby MNN pairs are identified between the surfaces of point clouds from different batches. Naive correction would then bring only the surfaces into contact, rather than fully merging the clouds together. The adjustment ensures that the cells from the two batches are properly intermingled after correction. This is done by identifying each cell's position on the correction vector, identifying corresponding quantiles between batches, and scaling the correction vector to ensure that the quantiles are matched after correction.

See ?"batchelor-restrict" for a description of the restrict argument. Specifically, mnnCorrect will only use the restricted subset of cells in each batch to identify MNN pairs (and to perform variance adjustment, if var.adj=TRUE), and then apply the correction to all cells in each batch.

Merge diagnostics

Each merge step combines two mutually exclusive sets of cells, a “left” set and “right” set. The metadata thus contains the following fields:

  • left, a List of integer or character vectors. Each vector specifies the batches in the left set at a given merge step.

  • right, a similar List of integer or character vectors. Each vector specifies the batches in the right set at a given merge step.

  • pairs, a List of DataFrames specifying which pairs of cells were identified as MNNs at each step. In each DataFrame, each row corresponds to a single MNN pair and specifies the paired cells that were in the left and right sets, respectively. Note that the indices refer to those paired cells in the output ordering of cells, i.e., users can identify the paired cells at each step by column-indexing the output of the mnnCorrect function.

Specifying the number of neighbors

The threshold to define nearest neighbors is defined by k, which is passed to findMutualNN to identify MNN pairs. The size of k can be roughly interpreted as the anticipated minimum size of a shared subpopulation in each batch. If a batch has fewer than k cells of a shared subpopulation, there is an increased risk that its counterparts in other batches will form incorrect MNN pairs.

From the perspective of the algorithm, larger values allow for more MNN pairs to be obtained, which improves the stability of the correction vectors. Larger values also increase robustness against non-orthogonality, by ignoring a certain level of biological variation when identifying pairs. This can be used to avoid the kissing problem where MNN pairs are only detected on the “surface” of the distribution. However, values of k should not be too large, as this would result in MNN pairs being inappropriately identified between biologically distinct populations.

In practice, increasing k will generally result in more aggressive merging as the algorithm is more generous in matching subpopulations across batches. We suggest starting with the default k and increasing it if one is confident that the same cell types are not adequately merged across batches. This is better than starting with a large k as incorrect merging is much harder to diagnose than insufficient merging.

An additional consideration is that the effect of any given k will vary with the number of cells in each batch. With more cells, a larger k may be preferable to achieve better merging in the presence of non-orthogonality. We can achieve this by setting prop.k, e.g., prop.k=0.05 will set k to 5% of the number of cells in each batch. This allows the choice of k to adapt to the size of each batch at each merge step and handles asymmetry in batch sizes (via the k1 and k2 arguments in findMutualNN).

Controlling the merge order

By default, batches are merged in the user-supplied order in ..., i.e., the first batch is merged with the second batch, the third batch is merged with the combined first-second batch, the fourth batch is merged with the combined first-second-third batch and so on. We refer to this approach as a progressive merge. When batch is supplied for a single object in ..., the ordering of batches in a progressive merge is determined by the ordering of factor levels in batch.

If merge.order is an integer vector, it is treated as an ordering permutation with which to perform a progressive merge. For example, if merge.order=c(4,1,3,2), batches 4 and 1 in ... are merged first; batch 3 is merged with the combined 4+1 batch; and then batch 2 is merged with the combined 4+1+3 batch. This is often more convenient than changing the order manually in ..., which would alter the order of batches in the output corrected matrix.

If merge.order is a character vector, it is treated as an ordering permutation for named batches. If merge.order is a factor, it is coerced into a character vector and also treated as a permutation of names.

If merge.order is a nested list, it is treated as a tree that specifies a hierarchical merge. Each element of the list should either be a string or integer scalar, corresponding to a leaf node that specifies a batch; or another list, corresponding to an internal node that should contain at least two children; or an integer or character vector of length 2 or more, again corresponding to an internal node.

  • For example, list(list(1,2), list(3,4)) indicates that batch 1 should be merged with batch 2; batch 3 should be merged with batch 4; and that, finally, the combined batches 1+2 and 3+4 should be merged.

  • More than two children per node are supported and will result in a progressive merge within that node. For example, list(list(1,2,3), list(4,5,6)) will merge batch 1 with 2, then 1+2 with 3; batch 4 with 5, and then 4+5 with 6; and finally, 1+2+3 with 4+5+6.

  • The same approach can be used for integer or character vectors, e.g., list(1:3, 4:6) has the same effect as above.

Note that, while batches can be specified by name (character) or index (integer), users cannot use both in the same tree. Factors can be provided and are coerced to character vectors.

The merge order may occasionally be important as it determines the number of MNN pairs available at each merge step. MNN pairs results in greater stability of the batch vectors and increased likelihood of identifying shared subpopulations, which are important to the precision and accuracy of the MNN-based correction, respectively.

  • In a progressive merge, the reference increases in size at each step, ensuring that more cells are available to identify MNN pairs in later merges. We suggest setting the largest, most heterogeneous batch as the first reference, which favors detection of sufficient MNN pairs between the first and other batches. Conversely, if two small batches without shared populations are supplied first, the wrong MNN pairs will be detected and the result of the merge will be incorrect.

  • A merge tree is useful for merging together batches that are known to be more closely related (e.g., replicates) before attempting difficult merges involving more dissimilar batches. The idea is to increase the number of cells and thus MNN pairs prior to merging batches with few shared subpopulations. By comparison, performing the more difficult merges first is more likely to introduce errors whereby distinct subpopulations are incorrectly placed together, which is propagated to later steps as the initial merge is used as a reference for subsequent merges.

  • If auto.merge=TRUE, merge steps are chosen to maximize the number of MNN pairs at each step. The aim is to improve the stability of the correction by first merging more similar batches with more MNN pairs. This can be somewhat time-consuming as MNN pairs need to be iteratively recomputed for all possible batch pairings.

The order of cells in the output is never affected by the setting of merge.order. It depends only on the order of objects in ... and the order of cells within each object.

Author(s)

Laleh Haghverdi, with modifications by Aaron Lun

References

Haghverdi L, Lun ATL, Morgan MD, Marioni JC (2018). Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol. 36(5):421

See Also

fastMNN for a faster equivalent.

Examples

B1 <- matrix(rnorm(10000), ncol=50) # Batch 1 
B2 <- matrix(rnorm(10000), ncol=50) # Batch 2
out <- mnnCorrect(B1, B2) # corrected values


LTLA/batchelor documentation built on Jan. 19, 2024, 6:33 p.m.