getClusteredPCs | R Documentation |
Cluster cells after using varying number of PCs, and pick the number of PCs using a heuristic based on the number of clusters.
getClusteredPCs(
pcs,
FUN = NULL,
...,
BLUSPARAM = NNGraphParam(),
min.rank = 5,
max.rank = ncol(pcs),
by = 1
)
pcs |
A numeric matrix of PCs, where rows are cells and columns are dimensions representing successive PCs. |
FUN |
A clustering function that takes a numeric matrix with rows as cells and
returns a vector containing a cluster label for each cell.
Defaults to |
... |
Further arguments to pass to |
BLUSPARAM |
A BlusterParam object specifying the clustering to use when |
min.rank |
Integer scalar specifying the minimum number of PCs to use. |
max.rank |
Integer scalar specifying the maximum number of PCs to use. |
by |
Integer scalar specifying what intervals should be tested between |
Assume that the data contains multiple subpopulations, each of which is separated from the others on a different axis.
For example, each subpopulation could be defined by a unique set of marker genes that drives separation on its own PC.
If we had x subpopulations, we would need at least x-1 PCs to successfully distinguish all of them.
This motivates the choice of the number of PCs provided we know the number of subpopulations in the data.
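One way to see where the x-1 bound comes from is a small geometric check in base R. This is only an illustrative sketch and is not part of getClusteredPCs; the variable names (n_subpops, n_genes, centroids) are hypothetical, and the subpopulations are reduced to noise-free centroids for simplicity.

```r
# Illustrative sketch (not part of scran): x subpopulation centroids in
# general position span only x-1 dimensions once the data are centered.
set.seed(100)
n_subpops <- 4   # hypothetical number of subpopulations (x)
n_genes <- 10    # hypothetical number of features

# One row per subpopulation centroid, one column per feature.
centroids <- matrix(rnorm(n_subpops * n_genes), nrow = n_subpops)

# prcomp() centers the columns by default, which removes one dimension.
pca <- prcomp(centroids)

# Count the PCs with non-negligible variance.
n_informative <- sum(pca$sdev > 1e-8)
n_informative
```

With real data, noise within each subpopulation adds further low-variance PCs, so x-1 should be read as a lower bound rather than an exact answer.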
In practice, we do not know the number of subpopulations so we use the number of clusters as a proxy instead.
We apply a clustering function FUN on the first d PCs, and only consider the values of d that yield no more than d+1 clusters.
If we see more clusters with fewer dimensions,
we consider this to represent overclustering rather than distinct subpopulations,
as multiple subpopulations should not be distinguishable on the same axes (based on the assumption above).
We choose the d that satisfies the constraint above and maximizes the number of clusters.
The idea is that more PCs should include more biological signal, allowing FUN to detect more distinct subpopulations,
up to the point where the added noise at high dimensions outweighs the extra signal,
such that resolution decreases and it becomes more difficult for FUN to distinguish between subpopulations.
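The selection rule described above can be sketched in base R as follows. This is a minimal illustration, not the package's internal code: the helper name choose_n_pcs is hypothetical, the input vectors are made up, and breaking ties in favour of the smallest d is an assumption.

```r
# Minimal sketch of the heuristic (hypothetical helper, not the scran internals).
# n.pcs: tested numbers of PCs (d); n.clusters: clusters found for each d.
choose_n_pcs <- function(n.pcs, n.clusters) {
    # Keep only values of d that yield no more than d+1 clusters.
    keep <- n.clusters <= n.pcs + 1
    if (!any(keep)) {
        return(NA_integer_)
    }
    candidates <- n.pcs[keep]
    # Among those, take the d with the most clusters.
    # which.max() returns the first maximum, so ties go to the smallest d
    # (an assumption made for this sketch).
    as.integer(candidates[which.max(n.clusters[keep])])
}

# Made-up example: d = 5 gives 7 clusters, which exceeds d+1 = 6 and is
# treated as overclustering, so it is excluded from consideration.
n.pcs <- 5:8
n.clusters <- c(7, 7, 6, 5)
chosen <- choose_n_pcs(n.pcs, n.clusters)
chosen
```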
Any FUN can be used, provided that it automatically chooses the number of clusters based on the data.
The default is a graph-based clustering method using makeSNNGraph and cluster_walktrap, where arguments in ... are passed to the former.
Users should not supply a FUN where the number of clusters is fixed in advance (e.g., k-means, or hierarchical clustering with a known k in cutree).
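As a hedged illustration of a FUN that meets this requirement, the sketch below cuts a hierarchical clustering tree at a fixed height rather than at a fixed k, so the number of clusters depends on the data. The function name height_cut_clusters and the default height of 2 are hypothetical choices for this example, and a sensible height would depend on the scale of the PCs.

```r
# Hypothetical clustering function: rows are cells, and the return value is
# one cluster label per cell. Cutting at a height (h) rather than at a fixed
# number of clusters (k) lets the data determine how many clusters appear.
height_cut_clusters <- function(x, h = 2) {
    tree <- hclust(dist(x), method = "complete")
    unname(cutree(tree, h = h))
}

# Toy input with two well-separated groups of 20 cells each.
set.seed(42)
toy <- rbind(
    matrix(rnorm(40, mean = 0, sd = 0.1), ncol = 2),
    matrix(rnorm(40, mean = 10, sd = 0.1), ncol = 2)
)
labels <- height_cut_clusters(toy)
length(unique(labels))
```

Such a function could then be supplied as getClusteredPCs(pcs, FUN = height_cut_clusters).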
The identities of the output clusters are returned at each step for comparison, e.g., using methods like clustree.
A DataFrame with one row per tested number of PCs. This contains the fields:
n.pcs: Integer scalar specifying the number of PCs used.
n.clusters: Integer scalar specifying the number of clusters identified.
clusters: A List containing the cluster identities for this number of PCs.
The metadata of the DataFrame contains chosen, an integer scalar specifying the “ideal” number of PCs to use.
Aaron Lun
runPCA, to compute the PCs in the first place.
clusterRows and BlusterParam, for possible choices of BLUSPARAM.
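library(scuttle)
library(scran) # provides getClusteredPCs()

# Simulate a small dataset and compute log-normalized expression values.
sce <- mockSCE()
sce <- logNormCounts(sce)
# Compute PCs, then cluster on an increasing number of them.
sce <- scater::runPCA(sce)
output <- getClusteredPCs(reducedDim(sce))
output
# The number of PCs chosen by the heuristic.
metadata(output)$chosen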