createClusterMST: Minimum spanning trees on cluster centroids
In TSCAN: Tools for Single-Cell Analysis

Description Usage Arguments Value Introducing an outgroup Confidence on the edges An alternative distance calculation Author(s) References Examples

Build a MST where each node is a cluster centroid and each edge is weighted by the Euclidean distance between centroids. This represents the most parsimonious explanation for a particular trajectory and has the advantage of being directly intepretable with respect to any pre-existing clusters.

createClusterMST(x, ...)

## S4 method for signature 'ANY'
createClusterMST(
  x,
  clusters,
  outgroup = FALSE,
  outscale = 3,
  columns = NULL,
  with.mnn = FALSE,
  mnn.k = 50,
  BNPARAM = NULL,
  BPPARAM = NULL
)

## S4 method for signature 'SummarizedExperiment'
createClusterMST(x, ..., assay.type = "logcounts")

## S4 method for signature 'SingleCellExperiment'
createClusterMST(
  x,
  clusters = colLabels(x, onAbsence = "error"),
  ...,
  use.dimred = NULL
)

`x`	A numeric matrix of coordinates where each row represents a cell/sample and each column represents a dimension (usually a PC or another low-dimensional embedding, but features or genes can also be used). Alternatively, a SummarizedExperiment or SingleCellExperiment object containing such a matrix in its `assays`, as specified by `assay.type`. This will be transposed prior to use. Alternatively, for SingleCellExperiments, this matrix may be extracted from its `reducedDims`, based on the `use.dimred` specification. In this case, no transposition is performed. Alternatively, if `clusters=NULL`, a numeric matrix of coordinates for cluster centroids, where each row represents a cluster and each column represents a dimension Each row should be named with the cluster name. This mode can also be used with assays/matrices extracted from SummarizedExperiments and SingleCellExperiments.
`...`	For the generic, further arguments to pass to the specific methods. For the SummarizedExperiment method, further arguments to pass to the ANY method. For the SingleCellExperiment method, further arguments to pass to the SummarizedExperiment method (if `use.dimred` is specified) or the ANY method (otherwise).
`clusters`	A factor-like object of the same length as `nrow(x)`, specifying the cluster identity for each cell in `x`. If `NULL`, `x` is assumed to already contain coordinates for the cluster centroids.
`outgroup`	A logical scalar indicating whether an outgroup should be inserted to split unrelated trajectories. Alternatively, a numeric scalar specifying the distance threshold to use for this splitting.
`outscale`	A numeric scalar specifying the scaling to apply to the median distance between centroids to define the threshold for outgroup splitting. Only used if `outgroup=TRUE`.
`columns`	A character, logical or integer vector specifying the columns of `x` to use. If `NULL`, all provided columns are used by default.
`with.mnn`	A logical scalar indicating whether to use distances computed from mutual nearest neighbor pairs, see Details.
`mnn.k`	An integer scalar specifying the number of nearest neighbors to consider for the MNN-based distance calculation. See `findMutualNN` for more details.
`BNPARAM`	A BiocNeighborParam object specifying how the nearest-neighbor search should be performed when `with.mnn=TRUE`, see the BiocNeighbors package for more details.
`BPPARAM`	A BiocParallelParam object specifying whether the nearest neighbor search should be parallelized when `with.mnn=TRUE`, see the BiocNeighbors package for more details.
`assay.type`	An integer or string specifying the assay to use from a SummarizedExperiment `x`.
`use.dimred`	An integer or string specifying the reduced dimensions to use from a SingleCellExperiment `x`.

A graph object containing an MST computed on centers. Each node corresponds to a cluster centroid and has a numeric vector of coordinates in the coordinates attribute. The edge weight is set to the Euclidean distance and the confidence is stored as the gain attribute.

If outgroup=TRUE, we add an outgroup to avoid constructing a trajectory between “unrelated” clusters. This is done by adding an extra row/column to the distance matrix corresponding to an artificial outgroup cluster, where the distance to all of the other real clusters is set to ω/2. Large jumps in the MST between real clusters that are more distant than ω will then be rerouted through the outgroup, allowing us to break up the MST into multiple subcomponents by removing the outgroup.

The default ω value is computed by constructing the MST from the original distance matrix, computing the median edge length in that MST, and then scaling it by outscale. This adapts to the magnitude of the distances and the internal structure of the dataset while also providing some margin for variation across cluster pairs. Alternatively, outgroup can be set to a numeric scalar in which case it is used directly as ω.

For the MST, we obtain a measure of the confidence in each edge by computing the distance gained if that edge were not present. Ambiguous parts of the tree will be less penalized from deletion of an edge, manifesting as a small distance gain. In contrast, parts of the tree with clear structure will receive a large distance gain upon deletion of an obvious edge.

For each edge, we divide the distance gain by the length of the edge to normalize for cluster resolution. This avoids overly penalizing edges in parts of the tree involving broad clusters while still retaining sensitivity to detect distance gain in overclustered regions. As an example, a normalized gain of unity for a particular edge means that its removal requires an alternative path that increases the distance travelled by that edge's length.

The normalized gain is reported as the "gain" attribute in the edges of the MST from createClusterMST. Note that the "weight" attribute represents the edge length.

While distances between centroids are usually satisfactory for gauging cluster “closeness”, they do not consider the behavior at the boundaries of the clusters. Two clusters that are immediately adjacent (i.e., intermingling at the boundaries) may have a large distance between their centroids if the clusters themselves span a large region of the coordinate space. This may preclude the obvious edge from forming in the MST.

In such cases, we can use an alternative distance calculation based on the distance between mutual nearest neighbors (MNNs). An MNN pair is defined as two cells in separate clusters that are each other's nearest neighbors in the other cluster. For each pair of clusters, we identify all MNN pairs and compute the median distance between them. This distance is then used in place of the distance between centroids to construct the MST. In this manner, we focus on cluster pairs that are close at their boundaries rather than at their centers.

This mode can be enabled by setting with.mnn=TRUE, while the stringency of the MNN definition can be set with mnn.k. Similarly, the performance of the nearest neighbor search can be controlled with BPPARAM and BSPARAM. Note that this mode performs a cell-based search and so cannot be used when x already contains aggregated profiles.

Aaron Lun

Ji Z and Ji H (2016). TSCAN: Pseudo-time reconstruction and evaluation in single-cell RNA-seq analysis. Nucleic Acids Res. 44, e117

# Mocking up a Y-shaped trajectory.
centers <- rbind(c(0,0), c(0, -1), c(1, 1), c(-1, 1))
rownames(centers) <- seq_len(nrow(centers))
clusters <- sample(nrow(centers), 1000, replace=TRUE)
cells <- centers[clusters,]
cells <- cells + rnorm(length(cells), sd=0.5)

# Creating the MST:
mst <- createClusterMST(cells, clusters)
plot(mst)

# We could also do it on the centers:
mst2 <- createClusterMST(centers, clusters=NULL)
plot(mst2)

# Works if the expression matrix is in a SE:
library(SummarizedExperiment)
se <- SummarizedExperiment(t(cells), colData=DataFrame(group=clusters))
mst3 <- createClusterMST(se, se$group, assay.type=1)
plot(mst3)