Cluster-based correction diagnostics


Diagnostics for batch correction that use clustering information, usually obtained by clustering cells from all batches in the corrected data.


clusterAbundanceTest(x, batch)

clusterAbundanceVar(x, batch, pseudo.count = 10)



x: A factor or vector specifying the assigned cluster for each cell in each batch in the corrected data. Alternatively, a matrix or table containing the number of cells in each cluster (row) and batch (column).


batch: A factor or vector specifying the batch of origin for each cell. Ignored if x is a matrix or table.


pseudo.count: A numeric scalar containing the pseudo-count to use for the log-transformation.
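
The two accepted forms of the cluster input can be illustrated with a small base R sketch. The vectors `cluster` and `batch` below are made-up values for illustration only; the point is that cross-tabulating them gives a table with one row per cluster and one column per batch, which is the layout described for the matrix or table form.

```r
# Hypothetical per-cell assignments for six cells (illustration only).
cluster <- c("A", "A", "B", "B", "B", "C")
batch   <- c(1, 2, 1, 1, 2, 2)

# Cross-tabulate: rows are clusters, columns are batches.
tab <- table(cluster, batch)
tab
```

Under the description above, passing `cluster` and `batch` as vectors or passing `tab` alone should describe the same data, with the batch argument ignored in the latter case.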


For clusterAbundanceTest, the null hypothesis for each cluster is that the distribution of cells across batches is proportional to the total number of cells in each batch. We then use chisq.test to test for deviations from the expected proportions, which may indicate imperfect mixing across batches. This works best for technical replicates where the population composition should be identical across batches. However, the p-value is difficult to interpret for experiments with greater biological variability between batches, as genuine differences in composition will also cause the null hypothesis to be rejected.
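
As a rough illustration of the test described above, the following base R sketch applies chisq.test to each row of a cluster-by-batch count table, using the batch totals as the expected proportions. The counts in `tab` are invented for the example, and this is a sketch of the idea under those assumptions rather than the function's actual source.

```r
# Hypothetical counts: rows are clusters, columns are batches.
tab <- matrix(c(50, 50,
                30, 30,
                80, 10),
              nrow = 3, byrow = TRUE,
              dimnames = list(paste("Cluster", 1:3), c("batch1", "batch2")))

# Under the null, each cluster's cells are split across batches
# in proportion to the total number of cells in each batch.
batch.totals <- colSums(tab)

# Goodness-of-fit test per cluster; rescale.p=TRUE turns totals into proportions.
pvals <- vapply(seq_len(nrow(tab)), function(i) {
    chisq.test(tab[i, ], p = batch.totals, rescale.p = TRUE)$p.value
}, numeric(1))
names(pvals) <- rownames(tab)
pvals
```

In this toy table, the third cluster is concentrated in the first batch and so receives the lowest p-value.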

For clusterAbundanceVar, we compute log-normalized abundances for each cluster using normalizeCounts. We then compute the variance of the log-abundances across batches for each cluster. Large variances indicate that there are strong relative differences in abundance across batches, indicative of either imperfect mixing or genuine batch-specific subpopulations. The idea is to rank clusters by their variance to prioritize them for manual inspection to decide between these two possibilities. We use a large pseudo.count by default to avoid spuriously large variances when the counts are low.
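
The calculation can be approximated in base R as shown below. This sketch assumes that the normalization divides each batch by a size factor proportional to its total cell count (scaled to a mean of 1) before adding the pseudo-count and taking log2; that is an assumption about the normalization step made for illustration, and the counts in `tab` are invented.

```r
# Hypothetical counts: rows are clusters, columns are batches.
tab <- matrix(c(50, 50,
                30, 30,
                80, 10),
              nrow = 3, byrow = TRUE,
              dimnames = list(paste("Cluster", 1:3), c("batch1", "batch2")))
pseudo.count <- 10

# Assumed size factors: batch totals scaled to have a mean of 1.
sf <- colSums(tab) / mean(colSums(tab))

# Divide each batch (column) by its size factor, then log2-transform.
log.abund <- log2(sweep(tab, 2, sf, "/") + pseudo.count)

# Variance of the log-abundances across batches, one value per cluster.
vars <- apply(log.abund, 1, var)

# Rank clusters so that the most variable ones are inspected first.
sort(vars, decreasing = TRUE)
```

Lowering `pseudo.count` in this sketch shrinks the log-abundances toward the raw ratios, which inflates the variance for clusters with few cells; this is the behaviour the large default is meant to dampen.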


For clusterAbundanceTest, a named numeric vector of p-values from applying Pearson's chi-squared test on each cluster.

For clusterAbundanceVar, a named numeric vector of variances of log-abundances across batches for each cluster.


library(batchelor)

# Simulate counts for two batches of 50 cells each (cells in columns).
means <- 2^rgamma(1000, 2, 1)
A1 <- matrix(rpois(10000, lambda=means), ncol=50) # Batch 1
A2 <- matrix(rpois(10000, lambda=means*runif(1000, 0, 2)), ncol=50) # Batch 2

B1 <- log2(A1 + 1)
B2 <- log2(A2 + 1)
out <- fastMNN(B1, B2) 

cluster1 <- kmeans(t(B1), centers=10)$cluster
cluster2 <- kmeans(t(B2), centers=10)$cluster
merged.cluster <- kmeans(reducedDim(out, "corrected"), centers=10)$cluster

# Low p-values indicate unexpected differences in abundance.
clusterAbundanceTest(paste("Cluster", merged.cluster), out$batch)

# High variances indicate differences in normalized abundance.
clusterAbundanceVar(paste("Cluster", merged.cluster), out$batch)

