createClusterMST: Minimum spanning trees on cluster centroids

createClusterMSTR Documentation

Minimum spanning trees on cluster centroids

Description

Build a MST where each node is a cluster centroid and each edge is weighted by the Euclidean distance between centroids. This represents the most parsimonious explanation for a particular trajectory and has the advantage of being directly intepretable with respect to any pre-existing clusters.

Usage

createClusterMST(x, ...)

## S4 method for signature 'ANY'
createClusterMST(
  x,
  clusters,
  use.median = FALSE,
  outgroup = FALSE,
  outscale = 1.5,
  endpoints = NULL,
  columns = NULL,
  dist.method = c("simple", "scaled.full", "scaled.diag", "slingshot", "mnn"),
  with.mnn = FALSE,
  mnn.k = 50,
  BNPARAM = NULL,
  BPPARAM = NULL
)

## S4 method for signature 'SummarizedExperiment'
createClusterMST(x, ..., assay.type = "logcounts")

## S4 method for signature 'SingleCellExperiment'
createClusterMST(
  x,
  clusters = colLabels(x, onAbsence = "error"),
  ...,
  use.dimred = NULL
)

Arguments

x

A numeric matrix of coordinates where each row represents a cell/sample and each column represents a dimension (usually a PC or another low-dimensional embedding, but features or genes can also be used).

Alternatively, a SummarizedExperiment or SingleCellExperiment object containing such a matrix in its assays, as specified by assay.type. This will be transposed prior to use.

Alternatively, for SingleCellExperiments, this matrix may be extracted from its reducedDims, based on the use.dimred specification. In this case, no transposition is performed.

Alternatively, if clusters=NULL, a numeric matrix of coordinates for cluster centroids, where each row represents a cluster and each column represents a dimension Each row should be named with the cluster name. This mode can also be used with assays/matrices extracted from SummarizedExperiments and SingleCellExperiments.

...

For the generic, further arguments to pass to the specific methods.

For the SummarizedExperiment method, further arguments to pass to the ANY method.

For the SingleCellExperiment method, further arguments to pass to the SummarizedExperiment method (if use.dimred is specified) or the ANY method (otherwise).

clusters

A factor-like object of the same length as nrow(x), specifying the cluster identity for each cell in x. If NULL, x is assumed to already contain coordinates for the cluster centroids.

Alternatively, a matrix with number of rows equal to nrow(x), containing soft assignment weights for each cluster (column). All weights should be positive and sum to 1 for each row.

use.median

A logical scalar indicating whether cluster centroid coordinates should be computed using the median rather than mean.

outgroup

A logical scalar indicating whether an outgroup should be inserted to split unrelated trajectories. Alternatively, a numeric scalar specifying the distance threshold to use for this splitting.

outscale

A numeric scalar specifying the scaling of the median distance between centroids, used to define the threshold for outgroup splitting. Only used if outgroup=TRUE.

endpoints

A character vector of clusters that must be endpoints, i.e., nodes of degree 1 or lower in the MST.

columns

A character, logical or integer vector specifying the columns of x to use. If NULL, all provided columns are used by default.

dist.method

A string specifying the distance measure to be used, see Details.

with.mnn

Logical scalar, deprecated; use dist.method="mnn" instead.

mnn.k

An integer scalar specifying the number of nearest neighbors to consider for the MNN-based distance calculation when dist.method="mnn". See findMutualNN for more details.

BNPARAM

A BiocNeighborParam object specifying how the nearest-neighbor search should be performed when dist.method="mnn", see the BiocNeighbors package for more details.

BPPARAM

A BiocParallelParam object specifying whether the nearest neighbor search should be parallelized when dist.method="mnn", see the BiocNeighbors package for more details.

assay.type

An integer or string specifying the assay to use from a SummarizedExperiment x.

use.dimred

An integer or string specifying the reduced dimensions to use from a SingleCellExperiment x.

Value

A graph object containing an MST computed on centers. Each node corresponds to a cluster centroid and has a numeric vector of coordinates in the coordinates attribute. The edge weight is set to the Euclidean distance and the confidence is stored as the gain attribute.

Computing the centroids

By default, the cluster centroid is defined by taking the mean value across all of its cells for each dimension. If clusters is a matrix, a weighted mean is used instead. This treats the column of weights as fractional identities of each cell to the corresponding cluster.

If use.median=TRUE, the median across all cells in each cluster is used to compute the centroid coordinate for each dimension. (With a matrix-like clusters, a weighted median is calculated.) This protects against outliers but is less stable than the mean. Enabling this option is advisable if one observes that the default centroid is not located near any of its points due to outliers. Note that the centroids computed in this manner is not a true medoid, which was too much of a pain to compute.

Introducing an outgroup

If outgroup=TRUE, we add an outgroup to avoid constructing a trajectory between “unrelated” clusters (Street et al., 2018). This is done by adding an extra row/column to the distance matrix corresponding to an artificial outgroup cluster, where the distance to all of the other real clusters is set to \omega/2. Large jumps in the MST between real clusters that are more distant than \omega will then be rerouted through the outgroup, allowing us to break up the MST into multiple subcomponents (i.e., a minimum spanning forest) by removing the outgroup.

The default \omega value is computed by constructing the MST from the original distance matrix, computing the median edge length in that MST, and then scaling it by outscale. This adapts to the magnitude of the distances and the internal structure of the dataset while also providing some margin for variation across cluster pairs. The default outscale=1.5 will break any branch that is 50% longer than the median length.

Alternatively, outgroup can be set to a numeric scalar in which case it is used directly as \omega.

Forcing endpoints

If certain clusters are known to be endpoints (e.g., because they represent terminal states), we can specify them in endpoints. This ensures that the returned graph will have such clusters as nodes of degree 1, i.e., they terminate the path. The function uses an exhaustive search to identify the MST with these constraints. If no configuration can be found, an error is raised - this will occur if all nodes are specified as endpoints, for example.

If outgroup=TRUE, the function is allowed to connect two endpoints together to create a two-node subcomponent. This will result in the formation of a minimum spanning forest if there are more than two clusters in x. Of course, if there are only two nodes and both are specified as endpoints, a two-node subcomponent will be formed regardless of outgroup.

Note that edges involving endpoint nodes will have infinite confidence values (see below). This reflects the fact that they are forced to exist during graph construction.

Confidence on the edges

For the MST, we obtain a measure of the confidence in each edge by computing the distance gained if that edge were not present. Ambiguous parts of the tree will be less penalized from deletion of an edge, manifesting as a small distance gain. In contrast, parts of the tree with clear structure will receive a large distance gain upon deletion of an obvious edge.

For each edge, we divide the distance gain by the length of the edge to normalize for cluster resolution. This avoids overly penalizing edges in parts of the tree involving broad clusters while still retaining sensitivity to detect distance gain in overclustered regions. As an example, a normalized gain of unity for a particular edge means that its removal requires an alternative path that increases the distance travelled by that edge's length.

The normalized gain is reported as the "gain" attribute in the edges of the MST from createClusterMST. Note that the "weight" attribute represents the edge length.

Distance measures

Distances between cluster centroids may be calculated in multiple ways:

  • The default is "simple", which computes the Euclidean distance between cluster centroids.

  • With "scaled.diag", we downscale the distance between the centroids by the sum of the variances of the two corresponding clusters (i.e., the diagonal of the covariance matrix). This accounts for the cluster “width” by reducing the effective distances between broad clusters.

  • With "scaled.full", we repeat this scaling with the full covariance matrix. This accounts for the cluster shape by considering correlations between dimensions, but cannot be computed when there are more cells than dimensions.

  • The "slingshot" option will typically be equivalent to the "scaled.full" option, but switches to "scaled.diag" in the presence of small clusters (fewer cells than dimensions in the reduced dimensional space).

  • For "mnn", see the more detailed explanation below.

If clusters is a matrix with "scaled.diag", "scaled.full" and "slingshot", a weighted covariance is computed to account for the assignment ambiguity. In addition, a warning will be raised if use.median=TRUE for these choices of dist.method; the Mahalanobis distances will not be correctly computed when the centers are medians instead of means.

Alternative distances with MNN pairs

While distances between centroids are usually satisfactory for gauging cluster “closeness”, they do not consider the behavior at the boundaries of the clusters. Two clusters that are immediately adjacent (i.e., intermingling at the boundaries) may have a large distance between their centroids if the clusters themselves span a large region of the coordinate space. This may preclude the obvious edge from forming in the MST.

In such cases, we can use an alternative distance calculation based on the distance between mutual nearest neighbors (MNNs). An MNN pair is defined as two cells in separate clusters that are each other's nearest neighbors in the other cluster. For each pair of clusters, we identify all MNN pairs and compute the median distance between them. This distance is then used in place of the distance between centroids to construct the MST. In this manner, we focus on cluster pairs that are close at their boundaries rather than at their centers.

This mode can be enabled by setting dist.method="mnn", while the stringency of the MNN definition can be set with mnn.k. Similarly, the performance of the nearest neighbor search can be controlled with BPPARAM and BSPARAM. Note that this mode performs a cell-based search and so cannot be used when x already contains aggregated profiles.

Author(s)

Aaron Lun

References

Ji Z and Ji H (2016). TSCAN: Pseudo-time reconstruction and evaluation in single-cell RNA-seq analysis. Nucleic Acids Res. 44, e117

Street K et al. (2018). Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics. BMC Genomics, 477.

Examples

# Mocking up a Y-shaped trajectory.
centers <- rbind(c(0,0), c(0, -1), c(1, 1), c(-1, 1))
rownames(centers) <- seq_len(nrow(centers))
clusters <- sample(nrow(centers), 1000, replace=TRUE)
cells <- centers[clusters,]
cells <- cells + rnorm(length(cells), sd=0.5)

# Creating the MST:
mst <- createClusterMST(cells, clusters)
plot(mst)

# We could also do it on the centers:
mst2 <- createClusterMST(centers, clusters=NULL)
plot(mst2)

# Works if the expression matrix is in a SE:
library(SummarizedExperiment)
se <- SummarizedExperiment(t(cells), colData=DataFrame(group=clusters))
mst3 <- createClusterMST(se, se$group, assay.type=1)
plot(mst3)


LTLA/TrajectoryUtils documentation built on Feb. 5, 2024, 11:56 a.m.