createClusterMST: Minimum spanning trees on cluster centroids
In scran: Methods for Single-Cell RNA-Seq Data Analysis

Description Usage Arguments Details Value Introducing an outgroup Confidence on the edges Interpreting the pseudotime matrix Author(s) References See Also Examples

View source: R/createClusterMST.R

Perform basic trajectory analyses with minimum spanning trees (MST) computed on cluster centroids, based on the methodology in the TSCAN package. These functions are now deprecated as they have been moved to the TSCAN package itself.

createClusterMST(centers, outgroup = FALSE, outscale = 3)

connectClusterMST(centers, mst, combined = TRUE)

orderClusterMST(x, ids, centers, mst, start = NULL)

`centers`	A numeric matrix of cluster centroids where each row represents a cluster and each column represents a dimension (usually a PC or another low-dimensional embedding). Each row should be named with the cluster name.
`outgroup`	A logical scalar indicating whether an outgroup should be inserted to split unrelated trajectories. Alternatively, a numeric scalar specifying the distance threshold to use for this splitting.
`outscale`	A numeric scalar specifying the scaling to apply to the median distance between centroids to define the threshold for outgroup splitting. Only used if `outgroup=TRUE`.
`mst`	A graph object containing a MST, typically the output of `createClusterMST(centers)`. For `connectClusterMSTNodes`, the MST may be computed from a different `centers`.
`combined`	Logical scalar indicating whether a single data.frame of edge coordinates should be returned.
`x`	A numeric matrix of per-cell coordinates where each row represents a cell and each column represents a dimension (again, usually a low-dimensional embedding).
`ids`	A character vector of length equal to the number of cells, specifying the cluster to which each cell is assigned.
`start`	A character vector specifying the starting node from which to compute pseudotimes in each component of `mst`. Defaults to an arbitrarily chosen node of degree 1 or lower in each component.

These functions represent some basic utilities for a simple trajectory analysis based on the algorithm in the TSCAN package.

createClusterMST builds a MST where each node is a cluster centroid and each edge is weighted by the Euclidean distance between centroids. This represents the most parsimonious explanation for a particular trajectory and has the advantage of being directly intepretable with respect to any pre-existing clusters.

connectClusterMST provides the coordinates of the start and end of every edge. This is mostly useful for plotting purposes in segments or the equivalent ggplot2 functionality. We suggest using aggregateAcrossCells to obtain centers for multiple low-dimensional results at once.

orderClusterMST will map each cell to the closest edge involving the cluster to which it is assigned. (Here, edges are segments terminated by their nodes, so some cells may simply be mapped to the edge terminus.) It will then calculate the distance of that cell along the MST from the starting node specified by start. This distance represents the pseudotime for that cell and can be used in further quantitative analyses.

createClusterMST returns a graph object containing an MST computed on centers.

connectClusterMST returns, by default, a data.frame containing the start and end coordinates of segments representing all the edges in mst. If combined=FALSE, a list of two data.frames is returned where corresponding rows represent the start and end coordinates of the same edge.

orderClusterMST returns a numeric matrix containing the pseudotimes of all cells (rows) across all paths (columns) through mst.

If outgroup=TRUE, we add an outgroup to avoid constructing a trajectory between “unrelated” clusters. This is done by adding an extra row/column to the distance matrix corresponding to an artificial outgroup cluster, where the distance to all of the other real clusters is set to ω/2. Large jumps in the MST between real clusters that are more distant than ω will then be rerouted through the outgroup, allowing us to break up the MST into multiple subcomponents by removing the outgroup.

The default ω value is computed by constructing the MST from the original distance matrix, computing the median edge length in that MST, and then scaling it by outscale. This adapts to the magnitude of the distances and the internal structure of the dataset while also providing some margin for variation across cluster pairs. Alternatively, outgroup can be set to a numeric scalar in which case it is used directly as ω.

For the MST, we obtain a measure of the confidence in each edge by computing the distance gained if that edge were not present. Ambiguous parts of the tree will be less penalized from deletion of an edge, manifesting as a small distance gain. In contrast, parts of the tree with clear structure will receive a large distance gain upon deletion of an obvious edge.

For each edge, we divide the distance gain by the length of the edge to normalize for cluster resolution. This avoids overly penalizing edges in parts of the tree involving broad clusters while still retaining sensitivity to detect distance gain in overclustered regions. As an example, a normalized gain of unity for a particular edge means that its removal requires an alternative path that increases the distance travelled by that edge's length.

The normalized gain is reported as the "gain" attribute in the edges of the MST from createClusterMST. Note that the "weight" attribute represents the edge length.

The pseudotimes are returned as a matrix where each row corresponds to cell in x and each column corresponds to a path through the MST from start to all nodes of degree 1. (If start is itself a node of degree 1, then paths are only considered to all other such nodes.) This format is inspired by that from the slingshot package and provides a compact representation of branching events.

Each branching event in the MST results in a new path and thus a new column in the pseudotime matrix. For any given row in this matrix, entries are either NA or they are identical. This reflects the fact that multiple paths will share a section of the MST for which the pseudotimes are the same.

The starting node in start is completely arbitrarily chosen by orderClusterMST, as directionality is impossible to infer from the expression matrix alone. However, it is often possible to use prior biological knowledge to pick an appropriate cluster as the starting node.

Aaron Lun

Ji Z and Ji H (2016). TSCAN: Pseudo-time reconstruction and evaluation in single-cell RNA-seq analysis. Nucleic Acids Res. 44, e117

quickPseudotime, a wrapper to quickly perform these calculations.

# Mocking up a Y-shaped trajectory.
centers <- rbind(c(0,0), c(0, -1), c(1, 1), c(-1, 1))
rownames(centers) <- seq_len(nrow(centers))
clusters <- sample(nrow(centers), 1000, replace=TRUE)
cells <- centers[clusters,]
cells <- cells + rnorm(length(cells), sd=0.5)

# Creating the MST first:
mst <- createClusterMST(centers)
plot(mst)

# Also plotting the MST on top of existing visualizations:
edges <- connectClusterMST(centers, mst, combined=FALSE)
plot(cells[,1], cells[,2], col=clusters)
segments(edges$start$dim1, edges$start$dim2, edges$end$dim1, 
     edges$end$dim2, lwd=5)

# Finally, obtaining pseudo-time orderings.
ordering <- orderClusterMST(cells, clusters, centers, mst)
unified <- rowMeans(ordering, na.rm=TRUE)
plot(cells[,1], cells[,2], col=topo.colors(21)[cut(unified, 21)], pch=16)