clusterMNN: Cluster-based MNN
In batchelor: Single-Cell Batch Correction Methods

Description Usage Arguments Details Value Author(s) References See Also Examples

Perform MNN correction based on cluster centroids, using the corrected centroid coordinates to correct the per-cell expression values with a variable bandwidth Gaussian kernel.

clusterMNN(
  ...,
  batch = NULL,
  restrict = NULL,
  clusters,
  cluster.d = 50,
  cos.norm = TRUE,
  merge.order = NULL,
  auto.merge = FALSE,
  min.batch.skip = 0,
  subset.row = NULL,
  correct.all = FALSE,
  assay.type = "logcounts",
  BSPARAM = IrlbaParam(),
  BNPARAM = KmknnParam(),
  BPPARAM = SerialParam()
)

`...`	One or more log-expression matrices where genes correspond to rows and cells correspond to columns. Alternatively, one or more SingleCellExperiment objects can be supplied containing a log-expression matrix in the `assay.type` assay. Each object should contain the same number of rows, corresponding to the same genes in the same order. Objects of different types can be mixed together. If multiple objects are supplied, each object is assumed to contain all and only cells from a single batch. If a single object is supplied, it is assumed to contain cells from all batches, so `batch` should also be specified. Alternatively, one or more lists of matrices or SingleCellExperiments can be provided; this is flattened as if the objects inside each list were passed directly to `...`.
`batch`	A factor specifying the batch of origin for all cells when only a single object is supplied in `...`. This is ignored if multiple objects are present.
`restrict`	A list of length equal to the number of objects in `...`. Each entry of the list corresponds to one batch and specifies the cells to use when computing the correction.
`clusters`	A list of length equal to `...` containing the assigned cluster for each cell in each batch. Alternatively, a BlusterParam object from the bluster package, specifying the clustering to be applied to each batch.
`cluster.d`	Integer scalar indicating how many PCs should be used in clustering. Only used if `clusters` is not a list. If `NA`, no PCA is performed and clustering is applied directly to the (cosine-normalized) gene expression values.
`cos.norm`	A logical scalar indicating whether cosine normalization should be performed on the input data prior to PCA.
`merge.order`	An integer vector containing the linear merge order of batches in `...`. Alternatively, a list of lists representing a tree structure specifying a hierarchical merge order.
`auto.merge`	Logical scalar indicating whether to automatically identify the “best” merge order.
`min.batch.skip`	Numeric scalar specifying the minimum relative magnitude of the batch effect, below which no correction will be performed at a given merge step.
`subset.row`	A vector specifying which features to use for correction.
`correct.all`	Logical scalar indicating whether corrected expression values should be computed for genes not in `subset.row`. Only relevant if `subset.row` is not `NULL`.
`assay.type`	A string or integer scalar specifying the assay containing the log-expression values. Only used for SingleCellExperiment inputs.
`BSPARAM`	A BiocSingularParam object specifying the algorithm to use for PCA. Only used if `clusters` is not a list.
`BNPARAM`	A BiocNeighborParam object specifying the nearest neighbor algorithm.
`BPPARAM`	A BiocParallelParam object specifying whether the PCA and nearest-neighbor searches should be parallelized.

These functions are motivated by the scenario where each batch has been clustered separately and each cluster has already been annotated with some meaningful biological state. We want to identify which biological states match to each other across batches; this is achieved by identifying mutual nearest neighbors based on the cluster centroids with reducedMNN.

MNN pairs are identified with k=1 to ensure that each cluster has no more than one match in another batch. This reduces the risk of inadvertently merging together different clusters from the same batch. By comparison, higher values of k may result in many-to-one mappings between batches such that the correction will implicitly force different clusters together.

Using this guarantee of no-more-than-one mappings across batches, we can form meta-clusters by identifying all components of the resulting MNN graph. Each meta-cluster can be considered to represent some biological state (e.g., cell type), and all of its constituents are the matching clusters within each batch.

As an extra courtesy, clusterMNN will also compute corrected values for each cell. This is done by applying a Gaussian kernel to the correction vectors for the centroids, where the bandwidth is proportional to the distance between that cell and the closest cluster centroid. This yields a smooth correction function that avoids edge effects at cluster boundaries.

If clusters is set to a BlusterParam object (see the bluster package), a PCA is performed in each batch with the specified BSPARAM. The PCs are then used in clustering with clusterRows to obtain a list of clusters. This can be used to mimic per-cell batch correction in the absence of a priori clusters.

A SingleCellExperiment containing per-cell expression values where each row is a gene and each column is a cell. This has the same format as the output of fastMNN but with an additional cluster field in the colData containing the cluster identity of each cell. The metadata contains:

merge.info, a DataFrame with the same format as the output of fastMNN. However, the pairs and lost.var refer to the cluster centroids, not the cells.
clusters, a DataFrame with one row for each cluster across all batches in .... This can be row-indexed by the values in pairs to determine the identity of the clusters in each MNN pair. An additional meta column is provided that describes the meta-cluster to which each cluster belongs.

Aaron Lun

Lun ATL (2019). Cluster-based mutual nearest neighbors correction https://marionilab.github.io/FurtherMNN2018/theory/clusters.html

Lun ATL (2019). A discussion of the known failure points of the fastMNN algorithm. https://marionilab.github.io/FurtherMNN2018/theory/failure.html

reducedMNN, which is used internally to perform the correction.

# Mocking up some data for multiple batches:
means <- matrix(rnorm(3000), ncol=3)
colnames(means) <- LETTERS[1:3]

B1 <- means[,sample(LETTERS[1:3], 500, replace=TRUE)]
B1 <- B1 + rnorm(length(B1))

B2 <- means[,sample(LETTERS[1:3], 500, replace=TRUE)]
B2 <- B2 + rnorm(length(B2)) + rnorm(nrow(B2)) # batch effect.

# Applying the correction with some made-up clusters:
cluster1 <- kmeans(t(B1), centers=10)$cluster
cluster2 <- kmeans(t(B2), centers=10)$cluster
out <- clusterMNN(B1, B2, clusters=list(cluster1, cluster2)) 

rd <- reducedDim(out, "corrected") 
plot(rd[,1], rd[,2], col=out$batch)

# Obtaining the clusters internally.
out2 <- clusterMNN(B1, B2, clusters=bluster::NNGraphParam())
rd2 <- reducedDim(out2, "corrected") 
plot(rd2[,1], rd2[,2], col=out$batch)