quickCluster: Quick clustering of cells

Description Usage Arguments Details Value Enforcing cluster sizes Gene selection Obtaining the scaled and centred ranks Author(s) References See Also Examples

Description

Cluster similar cells based on rank correlations in their gene expression profiles.

Usage

1
2
3
4
5
6
## S4 method for signature 'ANY'
quickCluster(x, min.size=200, max.size=NULL, method=c("hclust", "igraph"),
    pc.approx=TRUE, get.ranks=FALSE, subset.row=NULL, min.mean=1, ...)

## S4 method for signature 'SingleCellExperiment'
quickCluster(x, subset.row=NULL, ..., assay.type="counts", get.spikes=FALSE)

Arguments

x

A numeric count matrix where rows are genes and columns are cells. Alternatively, a SingleCellExperiment object containing such a matrix.

min.size

An integer scalar specifying the minimum size of each cluster.

max.size

An integer scalar specifying the maximum size of each cluster.

get.ranks

A logical scalar specifying whether a matrix of adjusted ranks should be returned.

method

A string specifying the clustering method to use.

pc.approx

Argument passed to buildSNNGraph when method="igraph", otherwise ignored.

subset.row

A logical, integer or character scalar indicating the rows of x to use.

min.mean

A numeric scalar specifying the filter to be applied on the average count for each filter prior to computing ranks. Disabled by setting to NULL.

...

For quickCluster,ANY-method, additional arguments to be passed to cutreeDynamic for method="hclust", or buildSNNGraph for method="igraph". For quickCluster,SingleCellExperiment-method, additional arguments to pass to quickCluster,ANY-method.

assay.type

A string specifying which assay values to use, e.g., "counts" or "logcounts".

get.spikes

A logical specifying whether spike-in transcripts should be used.

Details

This function provides a correlation-based approach to quickly define clusters of a minimum size min.size. Two clustering strategies are available:

A correlation-based approach is preferred here as it is invariant to scaling normalization. This avoids circularity between normalization and clustering, e.g., in computeSumFactors.

Value

If get.ranks=FALSE, a character vector of cluster identities for each cell in counts is returned.

If get.ranks=TRUE, a numeric matrix is returned where each column contains ranks for the expression values in each cell.

Enforcing cluster sizes

With method="hclust", cutreeDynamic is used to ensure that all clusters contain a minimum number of cells. However, some cells may not be assigned to any cluster and are assigned identities of "0" in the output vector. In most cases, this is because those cells belong in a separate cluster with fewer than min.size cells. The function will not be able to call this as a cluster as the minimum threshold on the number of cells has not been passed. Users are advised to check that the unassigned cells do indeed form their own cluster. Otherwise, it may be necessary to use a different clustering algorithm.

When using method="igraph", clusters are first identified using cluster_fast_greedy. If the smallest cluster contains fewer cells than min.size, it is merged with the closest neighbouring cluster. In particular, the function will attempt to merge the smallest cluster with each other cluster. The merge that maximizes the modularity score is selected, and a new merged cluster is formed. This process is repeated until all (merged) clusters are larger than min.size.

Gene selection

In quickCluster,SingleCellExperiment-method, spike-in transcripts are not used by default as they provide little information on the biological similarities between cells. This may not be the case if subpopulations differ by total RNA content, in which case setting get.spikes=TRUE may provide more discriminative power.

Users can also set subset.row to specify which rows of x are to be used to calculate correlations. This is equivalent to but more efficient than subsetting x directly, as it avoids constructing a (potentially large) temporary matrix. Note that if subset.row is specified, it will intersect with any setting of get.spikes.

By default, the function will also filter out genes with average counts (as defined by calcAverage) below min.mean. This removes low-abundance genes with many tied ranks, especially due to zeros, which may reduce the precision of the clustering. We suggest setting min.mean to 1 for read count data and 0.1 for UMI data. This can be disabled completely by setting it to NULL.

Obtaining the scaled and centred ranks

Users can also set get.ranks=TRUE, in which case a matrix of ranks will be returned. Each column contains the ranks for the expression values within a single cell after standardization and mean-centring. Computing Euclidean distances between the rank vectors for pairs of cells will yield the same correlation-based distance as that used above. This allows users to apply their own clustering algorithms on the ranks, which protects against outliers and is invariant to scaling (at the cost of sensitivity).

Author(s)

Aaron Lun and Karsten Bach

References

van Dongen S and Enright AJ (2012). Metric distances derived from cosine similarity and Pearson and Spearman correlations. arXiv 1208.3145

Lun ATL, Bach K and Marioni JC (2016). Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biol. 17:75

See Also

cutreeDynamic, computeSumFactors, buildSNNGraph

Examples

1
2
3
4
5
6
7
8
set.seed(100)
popsize <- 200
ngenes <- 1000
all.facs <- 2^rnorm(popsize, sd=0.5)
counts <- matrix(rnbinom(ngenes*popsize, mu=all.facs, size=1), ncol=popsize, byrow=TRUE)

clusters <- quickCluster(counts, min.size=20)
clusters <- quickCluster(counts, method="igraph")

Example output

Loading required package: BiocParallel
Loading required package: scater
Loading required package: Biobase
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, sd, var, xtabs

The following objects are masked from 'package:base':

    Filter, Find, Map, Position, Reduce, anyDuplicated, append,
    as.data.frame, cbind, colMeans, colSums, colnames, do.call,
    duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
    lapply, lengths, mapply, match, mget, order, paste, pmax, pmax.int,
    pmin, pmin.int, rank, rbind, rowMeans, rowSums, rownames, sapply,
    setdiff, sort, table, tapply, union, unique, unsplit, which,
    which.max, which.min

Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Loading required package: ggplot2

Attaching package: 'scater'

The following object is masked from 'package:stats':

    filter

scran documentation built on May 3, 2018, 6 p.m.

Related to quickCluster in scran...