doubletCluster: Detect doublet clusters
In scran: Methods for Single-Cell RNA-Seq Data Analysis

Description Usage Arguments Details Value Author(s) References See Also Examples

Identify potential clusters of doublet cells based on whether they have intermediate expression profiles, i.e., their profiles lie between two other “source” clusters. This function is now deprecated, use findDoubletClusters from scDblFinder instead.

doubletCluster(x, ...)

## S4 method for signature 'ANY'
doubletCluster(x, clusters, subset.row = NULL, threshold = 0.05, ...)

## S4 method for signature 'SummarizedExperiment'
doubletCluster(x, ..., assay.type = "counts")

## S4 method for signature 'SingleCellExperiment'
doubletCluster(x, clusters = colLabels(x, onAbsence = "error"), ...)

`x`	A numeric matrix-like object of count values, where each column corresponds to a cell and each row corresponds to an endogenous gene. Alternatively, a SummarizedExperiment or SingleCellExperiment object containing such a matrix.
`...`	For the generic, additional arguments to pass to specific methods. For the ANY method, additional arguments to pass to `findMarkers`. For the SummarizedExperiment method, additional arguments to pass to the ANY method. For the SingleCellExperiment method, additional arguments to pass to the SummarizedExperiment method.
`clusters`	A vector of length equal to `ncol(x)`, containing cluster identities for all cells. If `x` is a SingleCellExperiment, this is taken from `colLabels(x)` by default.
`subset.row`	See `?"scran-gene-selection"`.
`threshold`	A numeric scalar specifying the FDR threshold with which to identify significant genes.
`assay.type`	A string specifying which assay values to use, e.g., `"counts"` or `"logcounts"`.

This function detects clusters of doublet cells in a manner similar to the method used by Bach et al. (2017). For each “query” cluster, we examine all possible pairs of “source” clusters, hypothesizing that the query consists of doublets formed from the two sources. If so, gene expression in the query cluster should be strictly intermediate between the two sources after library size normalization.

We apply pairwise t-tests to the normalized log-expression profiles (see logNormCounts) to reject this null hypothesis. This is done by identifying genes that are consistently up- or down-regulated in the query compared to both of the sources. We count the number of genes that reject the null hypothesis at the specified FDR threshold. For each query cluster, the most likely pair of source clusters is that which minimizes the number of significant genes.

Potential doublet clusters are identified using the following characteristics:

Low number of significant genes, i.e., N in the output DataFrame. The threshold can be identified by looking for small outliers in log(N) across all clusters, under the assumption that most clusters are not doublets (and thus should have high N).
A reasonable proportion of cells in the cluster, i.e., prop. This requires some expectation of the doublet rate in the experimental protocol.
Library sizes of the source clusters that are below that of the query cluster, i.e., lib.size* values below unity. This assumes that the doublet cluster will contain more RNA and have more counts than either of the two source clusters.

For each query cluster, the function will only report the pair of source clusters with the lowest N. It is possible that a source pair with slightly higher (but still low) value of N may have more appropriate lib.size* values. Thus, it may be valuable to examine all.pairs in the output, especially in over-clustered data sets with closely neighbouring clusters.

The reported p.value is of little use in a statistical sense, and is only provided for inspection. Technically, it could be treated as the Simes combined p-value against the doublet hypothesis for the query cluster. However, this does not account for the multiple testing across all pairs of clusters for each chosen cluster, especially as we are chosing the pair that is most concordant with the doublet null hypothesis.

We use library size normalization (via librarySizeFactors) even if existing size factors are present. This is because intermediate expression of the doublet cluster is not guaranteed for arbitrary size factors. For example, expression in the doublet cluster will be higher than that in the source clusters if normalization was performed with spike-in size factors.

A DataFrame containing one row per query cluster with the following fields:

source1:: String specifying the identity of the first source cluster.
source2:: String specifying the identity of the second source cluster.
N:: Integer, number of genes that are significantly non-intermediate in the query cluster compared to the two putative source clusters.
best:: String specifying the identify of the top gene with the lowest p-value against the doublet hypothesis for this combination of query and source clusters.
p.value:: Numeric, containing the adjusted p-value for the best gene.
lib.size1:: Numeric, ratio of the median library sizes for the first source cluster to the query cluster.
lib.size2:: Numeric, ratio of the median library sizes for the second source cluster to the query cluster.
prop:: Numeric, proportion of cells in the query cluster.
all.pairs:: A SimpleList object containing the above statistics for every pair of potential source clusters.

Each row is named according to its query cluster.

Aaron Lun

Bach K, Pensa S, Grzelak M, Hadfield J, Adams DJ, Marioni JC and Khaled WT (2017). Differentiation dynamics of mammary epithelial cells revealed by single-cell RNA sequencing. Nat Commun. 8, 1:2128.

Lun ATL (2018). Detecting clusters of doublet cells in scran. https://ltla.github.io/SingleCellThoughts/software/doublet_detection/bycluster.html

doubletCells, which provides another approach for doublet detection.

findMarkers, to detect DE genes between clusters.

# Mocking up an example.
ngenes <- 100
mu1 <- 2^rexp(ngenes)
mu2 <- 2^rnorm(ngenes)

counts.1 <- matrix(rpois(ngenes*100, mu1), nrow=ngenes)
counts.2 <- matrix(rpois(ngenes*100, mu2), nrow=ngenes)
counts.m <- matrix(rpois(ngenes*20, mu1+mu2), nrow=ngenes)

counts <- cbind(counts.1, counts.2, counts.m)
clusters <- rep(1:3, c(ncol(counts.1), ncol(counts.2), ncol(counts.m)))

# Compute doublet-ness of each cluster:
dbl <- doubletCluster(counts, clusters)
dbl

# Narrow this down to clusters with very low 'N':
library(scuttle)
isOutlier(dbl$N, log=TRUE, type="lower") 

# Get help from "lib.size" below 1.
dbl$lib.size1 < 1 & dbl$lib.size2 < 1