getClusteredPCs: Use clusters to choose the number of PCs
In scran: Methods for Single-Cell RNA-Seq Data Analysis

Cluster cells using varying numbers of PCs, and pick the number of PCs using a heuristic based on the number of clusters.

getClusteredPCs(
  pcs,
  FUN = NULL,
  ...,
  min.rank = 5,
  max.rank = ncol(pcs),
  by = 1
)

`pcs`	A numeric matrix of PCs, where rows are cells and columns are dimensions representing successive PCs.
`FUN`	A clustering function that takes a numeric matrix with rows as cells and returns a vector containing a cluster label for each cell.
`...`	Further arguments to pass to `FUN`.
`min.rank`	Integer scalar specifying the minimum number of PCs to use.
`max.rank`	Integer scalar specifying the maximum number of PCs to use.
`by`	Integer scalar specifying the step size between the numbers of PCs to test, from `min.rank` up to `max.rank`.
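As a rough illustration of how `min.rank`, `max.rank` and `by` interact, the sketch below assumes that the tested numbers of PCs form a regular sequence. This is an assumption made for illustration, not a statement about the function's internals, and the values used are arbitrary.

```r
# Assumption for illustration: tested values form a regular sequence.
min.rank <- 5
max.rank <- 20
by <- 5

# Each element is a number of leading PCs that would be clustered.
tested <- seq(min.rank, max.rank, by = by)
tested
```

Under this assumption, a larger `by` reduces the number of clustering runs at the cost of a coarser search.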

Assume that the data contains multiple subpopulations, each of which is separated from the others on a different axis. For example, each subpopulation could be defined by a unique set of marker genes that drives separation on its own PC. If we had x subpopulations, we would need at least x-1 PCs to successfully distinguish all of them. This motivates the choice of the number of PCs provided we know the number of subpopulations in the data.

In practice, we do not know the number of subpopulations so we use the number of clusters as a proxy instead. We apply a clustering function FUN on the first d PCs, and only consider the values of d that yield no more than d+1 clusters. If we see more clusters with fewer dimensions, we consider this to represent overclustering rather than distinct subpopulations, as multiple subpopulations should not be distinguishable on the same axes (based on the assumption above).

We choose the d that satisfies the constraint above and maximizes the number of clusters. The idea is that more PCs should include more biological signal, allowing FUN to detect more distinct subpopulations. This holds only up to the point where the added noise at high dimensions outweighs the extra signal, such that resolution decreases and it becomes more difficult for FUN to distinguish between subpopulations.
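The selection rule described above can be sketched in a few lines of base R. The helper `choose_pcs` and the example values are hypothetical and not part of scran. Resolving ties toward the smallest d is an assumption of this sketch.

```r
# Hypothetical helper (not part of scran): pick d under the heuristic.
choose_pcs <- function(n.pcs, n.clusters) {
    # Keep only values of d that yield no more than d + 1 clusters.
    ok <- n.clusters <= n.pcs + 1
    if (!any(ok)) {
        stop("no tested number of PCs satisfies the constraint")
    }
    # Among those, take the d with the most clusters.
    # which.max() returns the first maximum, so ties go to the smallest d
    # when n.pcs is in increasing order (an assumption of this sketch).
    n.pcs[ok][which.max(n.clusters[ok])]
}

# Made-up results for d = 5, ..., 10.
n.pcs <- 5:10
n.clusters <- c(8L, 7L, 8L, 9L, 9L, 8L)
chosen <- choose_pcs(n.pcs, n.clusters)
chosen
```

Here d = 5 is rejected because 8 clusters exceeds d + 1 = 6. Among the remaining values, d = 8 and d = 9 both give 9 clusters, and the sketch returns the smaller of the two.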

Any FUN that automatically chooses the number of clusters based on the data can be used. The default is a graph-based clustering method using buildSNNGraph and cluster_walktrap, where arguments in ... are passed to the former. Users should not supply a FUN where the number of clusters is fixed in advance (e.g., k-means, or hierarchical clustering with a known k in cutree).
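To illustrate the interface expected of FUN, the following base R sketch defines a hypothetical clustering function, `height_cut`, that cuts a hierarchical clustering tree at a data-driven height rather than at a fixed number of clusters. The function name, the threshold of half the tallest merge and the simulated matrix are all assumptions chosen for illustration. This is not the default method.

```r
# Hypothetical custom FUN: takes a matrix with cells as rows and returns
# one cluster label per cell. The number of clusters depends on the data.
height_cut <- function(x, ...) {
    tree <- hclust(dist(x), method = "average")
    # Arbitrary illustrative threshold: half the height of the tallest merge.
    unname(cutree(tree, h = 0.5 * max(tree$height)))
}

set.seed(100)
# Simulated stand-in for a PC matrix: 60 cells, 6 dimensions,
# with three groups of 20 cells separated along the first dimension.
pcs <- matrix(rnorm(60 * 6, sd = 0.5), ncol = 6)
pcs[, 1] <- pcs[, 1] + rep(c(0, 10, 20), each = 20)

# Apply the function to the first 5 dimensions, as would be done for d = 5.
labels <- height_cut(pcs[, 1:5])
table(labels)
```

With scran installed, a function of this form could be supplied as `getClusteredPCs(pcs, FUN = height_cut)`.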

The identities of the output clusters are returned at each step for comparison, e.g., using methods like clustree.

A DataFrame with one row per tested number of PCs. This contains the fields:

n.pcs: Integer scalar specifying the number of PCs used.
n.clusters: Integer scalar specifying the number of clusters identified.
clusters: A List containing the cluster identities for this number of PCs.

The metadata of the DataFrame contains chosen, an integer scalar specifying the “ideal” number of PCs to use.

Aaron Lun

runPCA, to compute the PCs in the first place.

buildSNNGraph, for arguments to use with the default FUN.

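library(scran)    # provides getClusteredPCs
library(scuttle)  # provides mockSCE and logNormCounts

# Simulate a small dataset and compute log-normalized expression values.
sce <- mockSCE()
sce <- logNormCounts(sce)

# Compute the PCs, then cluster on increasing numbers of them.
sce <- scater::runPCA(sce)
output <- getClusteredPCs(reducedDim(sce))
output

# The chosen number of PCs is stored in the metadata.
metadata(output)$chosen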