Cluster similar cells based on their expression profiles, using either log-expression values or ranks.
quickCluster(x, ...)

## S4 method for signature 'ANY'
quickCluster(
  x,
  min.size = 100,
  method = c("igraph", "hclust"),
  use.ranks = FALSE,
  d = NULL,
  subset.row = NULL,
  min.mean = NULL,
  graph.fun = "walktrap",
  BSPARAM = bsparam(),
  BPPARAM = SerialParam(),
  block = NULL,
  block.BPPARAM = SerialParam(),
  ...
)

## S4 method for signature 'SummarizedExperiment'
quickCluster(x, ..., assay.type = "counts")
A numeric count matrix where rows are genes and columns are cells.
Alternatively, a SummarizedExperiment object containing such a matrix.
For the generic, further arguments to pass to specific methods.
For the ANY method, additional arguments to be passed to
For the SummarizedExperiment method, additional arguments to pass to the ANY method.
An integer scalar specifying the minimum size of each cluster.
String specifying the clustering method to use.
A logical scalar indicating whether clustering should be performed on the rank matrix, i.e., based on Spearman's rank correlation.
An integer scalar specifying the number of principal components to retain.
A numeric scalar specifying the filter to be applied on the average count for each gene prior to computing ranks.
Only used when use.ranks=TRUE.
A function specifying the community detection algorithm to use on the nearest neighbor graph when method="igraph".
A BiocSingularParam object specifying the algorithm to use for PCA, if
A BiocParallelParam object to use for parallel processing within each block.
A factor of length equal to
A BiocParallelParam object specifying whether and how parallelization should be performed across blocks,
A string specifying which assay values to use.
This function provides a convenient wrapper to quickly define clusters of a minimum size min.size.
Its intended use is to generate “quick and dirty” clusters for use in computeSumFactors.
Two clustering strategies are available:
If method="hclust", a distance matrix is constructed; hierarchical clustering is performed using Ward's criterion; and cutreeDynamic is used to define clusters of cells.
If method="igraph", a shared nearest neighbor graph is constructed using the buildSNNGraph function.
This is used to define clusters based on highly connected communities in the graph, using the community detection algorithm specified by graph.fun.
quickCluster will apply these clustering algorithms on the principal component (PC) scores generated from the log-expression values.
These are obtained by running
denoisePCA on HVGs detected using the trend fitted to endogenous genes with
If d is specified, the PCA is directly performed on the entire x and the specified number of PCs is retained.
It is also possible to use the clusters from this function for actual biological interpretation.
In such cases, users should set min.size=0 to avoid aggregation of small clusters.
However, it is often better to call the relevant functions (e.g., buildSNNGraph) manually, as this provides more opportunities for diagnostics when the meaning of the clusters is important.
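As a rough illustration of the manual route, the sketch below builds the graph and runs community detection as separate steps so that intermediate objects can be inspected. It assumes the scran, scuttle and igraph packages are installed; the mock dataset, `k = 10` and the walktrap algorithm are illustrative choices rather than recommendations.

```r
library(scran)    # buildSNNGraph()
library(scuttle)  # mockSCE(), logNormCounts()

set.seed(42)
sce <- mockSCE()
sce <- logNormCounts(sce)

# Build the shared nearest neighbor graph on the log-expression values.
# k = 10 is an assumed, illustrative setting.
g <- buildSNNGraph(sce, k = 10)

# Detect communities; other igraph algorithms could be swapped in here.
comm <- igraph::cluster_walktrap(g)
clusters <- factor(comm$membership)

# Diagnostics that the wrapper does not expose directly.
table(clusters)
igraph::modularity(comm)
```

Keeping `g` and `comm` around makes it possible to examine the graph or try alternative community detection algorithms without recomputing the neighbor search.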
A character vector of cluster identities for each cell in x.
We can break up the dataset by specifying block to cluster cells, usually within each batch or run.
This generates clusters within each level of block, which is entirely adequate for applications like computeSumFactors where the aim of clustering is to separate dissimilar cells rather than group together all similar cells.
Blocking reduces computational work considerably by allowing each level to be processed independently, without compromising performance provided that there are enough cells within each batch.
Indeed, for applications like computeSumFactors, we can use block even in the absence of any known batch structure.
Specifically, we can set it to an arbitrary factor such as block=cut(seq_len(ncol(x)), 10) to split the cells into ten batches of roughly equal size.
This aims to improve speed, especially when combined with block.BPPARAM to parallelize processing of the independent levels.
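A minimal sketch of arbitrary blocking follows, assuming scran, scuttle and BiocParallel are installed. The mock dataset, the choice of four blocks and the lowered min.size are assumptions made to keep the demo small; SerialParam() is used so the code runs anywhere, and a parallel backend could be substituted.

```r
library(scran)
library(scuttle)
library(BiocParallel)

set.seed(100)
sce <- mockSCE(ncells = 400)

# Arbitrary blocking factor: four contiguous chunks of roughly equal size.
block <- cut(seq_len(ncol(sce)), 4)
table(block)

# Swap in e.g. MulticoreParam(workers = 2) to process the levels in parallel.
bp <- SerialParam()

clusters <- quickCluster(sce, min.size = 20, block = block,
    block.BPPARAM = bp)
table(clusters)
```

Each block here holds about 100 cells, so min.size is lowered accordingly; with real data the number of blocks should leave enough cells per level for clustering to remain meaningful.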
If use.ranks=TRUE, clustering is instead performed on PC scores obtained from scaled and centred ranks generated by
This effectively means that clustering uses distances based on the Spearman's rank correlation between two cells.
In addition, if
x is a dgCMatrix and
ranks will be computed without loss of sparsity to improve speed and memory efficiency during PCA.
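The link between scaled, centred ranks and Spearman's correlation can be checked numerically in base R. The sketch below is not the function's internal code: it assumes each cell's centred ranks are scaled to an L2 norm of 1/2, under which the Euclidean distance between two cells equals sqrt(0.5 * (1 - rho)), the metric form discussed by van Dongen and Enright (2012).

```r
set.seed(1)

# Toy matrix: 50 genes (rows) by 2 cells (columns); low counts create ties.
counts <- matrix(rpois(100, lambda = 2), nrow = 50)

# Rank genes within each cell (average ranks for ties), then centre each column.
rk <- apply(counts, 2, rank, ties.method = "average")
rk <- sweep(rk, 2, colMeans(rk))

# Scale each column to an L2 norm of 1/2 (assumed convention).
rk <- sweep(rk, 2, 2 * sqrt(colSums(rk^2)), "/")

# Euclidean distance between the two cells in rank space.
d_rank <- sqrt(sum((rk[, 1] - rk[, 2])^2))

# Spearman's rho computed directly from the counts.
rho <- cor(counts[, 1], counts[, 2], method = "spearman")

# The two quantities agree up to floating-point error.
stopifnot(isTRUE(all.equal(d_rank, sqrt(0.5 * (1 - rho)))))
```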
If use.ranks=TRUE, the function will filter out genes with average counts (as defined by
min.mean prior to computing ranks.
This removes low-abundance genes with many tied ranks, especially due to zeros, which may reduce the precision of the clustering.
We suggest setting min.mean to 1 for read count data and 0.1 for UMI data; the function will automatically try to determine this from the data if min.mean=NULL.
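To see why an abundance filter reduces ties, consider the base R sketch below. It uses a plain row mean as the average count, which is a simplifying assumption and may differ from how the function computes averages; the simulated Poisson counts are likewise hypothetical.

```r
set.seed(10)
ncells <- 50

# 100 low-abundance genes (mean 0.05) and 100 higher-abundance genes (mean 5).
lambda <- rep(c(0.05, 5), each = 100)
counts <- matrix(rpois(length(lambda) * ncells, lambda), nrow = length(lambda))

# Keep genes whose simple average count reaches the threshold.
min.mean <- 1
keep <- rowMeans(counts) >= min.mean
filtered <- counts[keep, , drop = FALSE]

# Proportion of zero entries, all of which tie at the lowest rank within a cell.
mean(counts == 0)
mean(filtered == 0)
```

The filtered matrix contains far fewer zeros, so the within-cell ranks carry more information about relative expression.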
Clustering with use.ranks=TRUE is invariant to scaling normalization and avoids circularity between normalization and clustering, e.g., in computeSumFactors.
However, the default is to use the log-expression values with use.ranks=FALSE, as this yields finer and more precise clusters.
With method="hclust", cutreeDynamic is used to ensure that all clusters contain a minimum number of cells.
However, some cells may not be assigned to any cluster and are assigned identities of "0" in the output vector.
In most cases, this is because those cells belong in a separate cluster with fewer than min.size cells.
The function will not be able to call this as a cluster as the minimum threshold on the number of cells has not been passed.
Users are advised to check that the unassigned cells do indeed form their own cluster.
Otherwise, it may be necessary to use a different clustering algorithm.
With method="igraph", clusters are first identified using the specified graph.fun.
If the smallest cluster contains fewer cells than min.size, it is merged with the closest neighbouring cluster.
In particular, the function will attempt to merge the smallest cluster with each other cluster.
The merge that maximizes the modularity score is selected, and a new merged cluster is formed.
This process is repeated until all (merged) clusters are larger than min.size.
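The merging procedure can be sketched with igraph as follows. This is an illustration of the idea, not the function's actual implementation: `merge_small_clusters` is a hypothetical helper, edge weights are ignored, and merging stops once every cluster has at least min.size members.

```r
library(igraph)

# Hypothetical helper: repeatedly merge the smallest cluster into whichever
# other cluster gives the highest modularity, until all reach min.size.
merge_small_clusters <- function(graph, membership, min.size) {
    membership <- as.integer(factor(membership))
    repeat {
        sizes <- table(membership)
        if (length(sizes) <= 1L || min(sizes) >= min.size) break
        smallest <- as.integer(names(sizes)[which.min(sizes)])
        others <- setdiff(as.integer(names(sizes)), smallest)
        # Modularity of each candidate merge.
        scores <- vapply(others, function(o) {
            trial <- membership
            trial[trial == smallest] <- o
            modularity(graph, as.integer(factor(trial)))
        }, numeric(1))
        membership[membership == smallest] <- others[which.max(scores)]
        membership <- as.integer(factor(membership))
    }
    membership
}

# Toy graph: two cliques of 10 vertices and one of 3, with the small clique
# attached to the second large one by a single edge.
g <- disjoint_union(make_full_graph(10), make_full_graph(10), make_full_graph(3))
g <- add_edges(g, c(10, 11, 20, 21))
initial <- rep(1:3, c(10, 10, 3))

merged <- merge_small_clusters(g, initial, min.size = 5)
table(merged)
```

In this toy graph the three-vertex cluster is absorbed by the clique it shares an edge with, since that merge yields the higher modularity.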
Aaron Lun and Karsten Bach
van Dongen S and Enright AJ (2012). Metric distances derived from cosine similarity and Pearson and Spearman correlations. arXiv 1208.3145
Lun ATL, Bach K and Marioni JC (2016). Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biol. 17:75
quickSubCluster, for a related function that uses a similar approach for subclustering.
library(scran)    # provides quickCluster()
library(scuttle)  # provides mockSCE()
sce <- mockSCE()

# Basic application (lowering min.size for this small demo):
clusters <- quickCluster(sce, min.size=50)
table(clusters)

# Operating on ranked expression values:
clusters2 <- quickCluster(sce, min.size=50, use.ranks=TRUE)
table(clusters2)

# Using hierarchical clustering:
clusters <- quickCluster(sce, min.size=50, method="hclust")
table(clusters)