diagnostics-cluster | R Documentation |

A variety of correction diagnostics that make use of clustering information, usually obtained by clustering on cells from all batches in the corrected data.

```
clusterAbundanceTest(x, batch)
clusterAbundanceVar(x, batch, pseudo.count = 10)
```

`x` |
A factor or vector specifying the assigned cluster for each cell in each batch in the corrected data. Alternatively, a matrix or table containing the number of cells in each cluster (row) and batch (column). |

`batch` |
A factor or vector specifying the batch of origin for each cell.
Ignored if `x` is a matrix or table. |

`pseudo.count` |
A numeric scalar containing the pseudo-count to use for the log-transformation. |

For `clusterAbundanceTest`, the null hypothesis for each cluster is that the distribution of cells across batches is proportional to the total number of cells in each batch.
We then use `chisq.test` to test for deviations from the expected proportions, possibly indicative of imperfect mixing across batches.
This works best for technical replicates, where the population composition should be identical across batches.
The p-value is harder to interpret for experiments with more biological variability between batches, as genuine differences in composition will also yield low p-values.
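
The test described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's own code: the toy `cluster` and `batch` vectors and the names `counts`, `expected.prop` and `pvals` are hypothetical.

```
# Hypothetical toy data: batch of origin and cluster label for 300 cells.
set.seed(42)
batch <- rep(c("A", "B"), times = c(100, 200))
cluster <- sample(c("c1", "c2", "c3"), length(batch), replace = TRUE)

# Cluster-by-batch count table (clusters in rows, batches in columns).
counts <- table(cluster, batch)

# Expected proportions under the null: each batch's share of all cells.
expected.prop <- colSums(counts) / sum(counts)

# One chi-squared goodness-of-fit test per cluster (row).
pvals <- apply(counts, 1, function(n) chisq.test(n, p = expected.prop)$p.value)
pvals
```

Because the toy labels are assigned independently of the batch, the resulting p-values should not be systematically low.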

For `clusterAbundanceVar`, we compute log-normalized abundances for each cluster using `normalizeCounts`.
We then compute the variance of the log-abundances across batches for each cluster.
Large variances indicate that there are strong relative differences in abundance across batches, indicative of either imperfect mixing or genuine batch-specific subpopulations.
The idea is to rank clusters by their variance to prioritize them for manual inspection to decide between these two possibilities.
We use a large `pseudo.count` by default to avoid spuriously large variances when the counts are low.
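
The variance calculation can be approximated in base R as follows. This is a rough sketch and may not match `normalizeCounts` exactly: it assumes size factors equal to each batch's total cell count scaled to a mean of 1, and a log2 transformation. The `counts` matrix and the names `size.factors`, `log.abundance` and `vars` are hypothetical.

```
# Hypothetical cluster-by-batch counts (clusters in rows, batches in columns).
counts <- matrix(c(50, 100,
                   45,  90,
                    5, 110),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(c("c1", "c2", "c3"), c("A", "B")))
pseudo.count <- 10

# Assumed size factors: each batch's total cell count, scaled to a mean of 1.
size.factors <- colSums(counts) / mean(colSums(counts))

# Divide each column by its size factor, add the pseudo-count, take log2.
log.abundance <- log2(sweep(counts, 2, size.factors, "/") + pseudo.count)

# Variance of the log-abundances across batches for each cluster.
vars <- apply(log.abundance, 1, var)

# Rank clusters so that the most variable ones are inspected first.
sort(vars, decreasing = TRUE)
```

In this toy table, cluster `c3` is strongly depleted in batch `A` relative to the other clusters, so it ranks first.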

For `clusterAbundanceTest`, a named numeric vector of p-values from applying Pearson's chi-squared test on each cluster.

For `clusterAbundanceVar`, a named numeric vector of variances of log-abundances across batches for each cluster.

Aaron Lun

```
library(batchelor)            # fastMNN() and the cluster diagnostics
library(SingleCellExperiment) # reducedDim()

# Simulate two batches of Poisson counts; batch 2 has perturbed means.
set.seed(1000)
means <- 2^rgamma(1000, 2, 1)
A1 <- matrix(rpois(10000, lambda=means), ncol=50) # Batch 1
A2 <- matrix(rpois(10000, lambda=means*runif(1000, 0, 2)), ncol=50) # Batch 2

# Log-transform and correct with fastMNN.
B1 <- log2(A1 + 1)
B2 <- log2(A2 + 1)
out <- fastMNN(B1, B2)

# Per-batch clusterings (not used by the two functions below).
cluster1 <- kmeans(t(B1), centers=10)$cluster
cluster2 <- kmeans(t(B2), centers=10)$cluster

# Cluster all cells together in the corrected space.
merged.cluster <- kmeans(reducedDim(out, "corrected"), centers=10)$cluster

# Low p-values indicate unexpected differences in abundance.
clusterAbundanceTest(paste("Cluster", merged.cluster), out$batch)

# High variances indicate differences in normalized abundance.
clusterAbundanceVar(paste("Cluster", merged.cluster), out$batch)
```
