knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = TRUE )

The `r BiocStyle::Biocpkg("CellMixS")`

package is a toolbox to
explore and compare group effects in single-cell RNA-seq data.
It has two major applications:

- Detection of batch effects and biases in single-cell RNA-seq data.
- Evaluation and comparison of data integration (e.g. after batch effect correction).

For this purpose it introduces two new metrics:

**Cellspecific Mixing Score (CMS)**: A test for batch effects within k-nearest neighbouring cells.**Local Density Differences (ldfDiff)**: A score describing the change in relative local cell densities by data integration or projection.

It also provides implementations and wrappers for a set of metrics with a similar purpose: entropy, the inverse Simpson index [@Korsunsky2018], and Seurat's mixing metric and local structure metric [@Stuart2018]. Besides this, several exploratory plotting functions enable evaluation of key integration and mixing features.

`r BiocStyle::Biocpkg("CellMixS")`

can be installed from Bioconductor as follows.

if (!requireNamespace("BiocManager")) install.packages("BiocManager") BiocManager::install("CellMixS")

After installation the package can be loaded into R.

```
library(CellMixS)
```

`r BiocStyle::Biocpkg("CellMixS")`

uses the `SingleCellExperiment`

class from the `r BiocStyle::Biocpkg("SingleCellExperiment")`

Bioconductor
package as the format for input data.

The package contains example data named **sim50**, a list of simulated
single-cell RNA-seq data with varying batch effect strength and unbalanced
batch sizes.

Batch effects were introduced by sampling 0%, 20% or 50% of gene expression values from a distribution with modified mean value (e.g. 0% - 50% of genes were affected by a batch effect).

All datasets consist of *3 batches*, one with *250 cells* and the others with
half of its size (*125 cells*). The simulation is modified after
[@Buttner2019] and described in sim50.

# Load required packages suppressPackageStartupMessages({ library(SingleCellExperiment) library(cowplot) library(limma) library(magrittr) library(dplyr) library(purrr) library(ggplot2) library(scater) })

# Load sim_list example data sim_list <- readRDS(system.file(file.path("extdata", "sim50.rds"), package = "CellMixS")) names(sim_list) sce50 <- sim_list[["batch50"]] class(sce50) table(sce50[["batch"]])

Often batch effects can already be detected by visual inspection and simple
visualization (e.g. in a normal tSNE or UMAP plot) depending on the strength. `r BiocStyle::Biocpkg("CellMixS")`

contains various plotting functions to
visualize group label and mixing scores aside. Results are `ggplot`

objects and can be further customized
using `r BiocStyle::CRANpkg("ggplot2")`

. Other packages, such as
`r BiocStyle::Biocpkg("scater")`

, provide similar plotting functions and could
be used instead.

# Visualize batch distribution in sce50 visGroup(sce50, group = "batch")

# Visualize batch distribution in other elements of sim_list batch_names <- c("batch0", "batch20") vis_batch <- lapply(batch_names, function(name){ sce <- sim_list[[name]] visGroup(sce, "batch") + ggtitle(paste0("sim_", name)) }) plot_grid(plotlist = vis_batch, ncol = 2)

Not all batch effects or group differences are obvious using visualization. Also, in single-cell experiments celltypes and cells can be affected in different ways by experimental conditions causing batch effects (e.g. some cells are more robust to storing conditions than others).

Furthermore the range of methods for data integration and batch effect removal gives rise to the question of which method performs best on which data, and thereby a need to quantify batch effects.

The **cellspecific mixing score** `cms`

tests for each cell the hypothesis that
batch-specific distance distributions towards it's k-nearest neighbouring (knn)
cells are derived from the same unspecified underlying distribution using the
Anderson-Darling test [@Scholz1987]. Results from the `cms`

function are two
scores *cms* and *cms_smooth*, where the latter is the weighted mean of the cms
within each cell's neighbourhood. They can be interpreted as the data's
probability within an equally mixed neighbourhood. A high `cms`

score refers to
good mixing, while a low score indicates batch-specific bias.
The test considers differences in the number of cells from each batch.
This facilitates the `cms`

score to differentiate between unbalanced batches
(e.g. one celltype being more abundant in a specific batch) and a biased
distribution of cells. The `cms`

function takes a `SingleCellExperiment`

object (described in `r BiocStyle::Biocpkg("SingleCellExperiment")`

) as input
and appends results to the colData slot.

# Call cell-specific mixing score for sce50 # Note that cell_min is set to 4 due to the low number of cells and small k. # Usually default parameters are recommended!! sce50 <- cms(sce50, k = 30, group = "batch", res_name = "unaligned", n_dim = 3, cell_min = 4) head(colData(sce50)) # Call cell-specific mixing score for all datasets sim_list <- lapply(batch_names, function(name){ sce <- sim_list[[name]] sce <- cms(sce, k = 30, group = "batch", res_name = "unaligned", n_dim = 3, cell_min = 4) }) %>% set_names(batch_names) # Append cms50 sim_list[["batch50"]] <- sce50

A key question when evaluating dataset structures is how to define neighbourhoods,
or in this case, which cells to use to calculate the mixing.
`cms`

provides 3 options to define cells included in each Anderson-Darling test:

**Number of knn**: This is the default setting and can be set by the parameter`k`

. The optimal choice depends on the application, as with a small`k`

focus is on local mixing, while with a large`k`

mixing with regard to more global structures is evaluated. In general`k`

should not exceed the size of the smallest cell population as including cells from different cell populations can conflict with the underlying assumptions of distance distributions.**Density based neighbourhoods**: In cases of unequal cell population sizes the optimal number of neighbours might vary. Using the size of the smallest population can lead to suboptimal neighbourhoods for larger populations in these cases, as the power of the AD-test increases with cell numbers. For these cases or others where a local adaptation of the neighbourhood is desired the`k_min`

parameter is provided. It denotes the minimum number of cells that define a neighbourhood. Starting with the*knn*as defined by`k`

the`cms`

function will cut neighbourhoods by their first local minimum in the*overall distance distribution*of the*knn*cells. Only cells within a distance smaller than the first local minimum are included in the AD-test, but at least`k_min`

cells.**Fixed minimum per batch**: Another option to define a dynamic neighbourhood is provided by the`batch_min`

parameter. It defines the minimum number of cells for each batch that shall be included into each neighbourhood. In this case the neighbourhoods will include an increasing number of neighbours until at least`batch_min`

cells from each batch are included.

For smoothing, either `k`

or (if specified) `k_min`

cells are included to get a
weighted mean of `cms`

-scores. Weights are defined by the reciprocal distance
towards the respective cell, with 1 as weight of the respective cell itself.

Another important parameter is the subspace to use to calculate cell distances.
This can be set using the `dim_red`

parameter. By default the *PCA* subspace will be
used and calculated if not present. Some *data integration methods* provide
embeddings of a *common subspace* instead of "corrected counts". `cms`

scores
can be calculated within these by defining them with the `dim_red`

argument (see \@ref(di1)).
In general all reduced dimension representations can be specified, but only
*PCA* will be computed automatically, while other methods need to be
precomputed.

An overall summary of `cms`

scores can be visualized as a histogram. As `cms`

scores are
*p.values* from hypothesis testing, without any batch effect the p.value
histogram should be flat. An increased number of very small p-values
indicates the presence of a batch-specific bias within data.

# p-value histogram of cms50 visHist(sce50) # p-value histogram sim30 # Combine cms results in one matrix batch_names <- names(sim_list) cms_mat <- batch_names %>% map(function(name) sim_list[[name]]$cms.unaligned) %>% bind_cols() %>% set_colnames(batch_names) visHist(cms_mat, n_col = 3)

Results of `cms`

can be visualized in a cell-specific way and alongside any
metadata.

# cms only cms20 sce20 <- sim_list[["batch20"]] metric_plot <- visMetric(sce20, metric_var = "cms_smooth.unaligned") # group only cms20 group_plot <- visGroup(sce20, group = "batch") plot_grid(metric_plot, group_plot, ncol = 2)

# Add random celltype assignments as new metadata sce20[["celltype"]] <- rep(c("CD4+", "CD8+", "CD3"), length.out = ncol(sce20)) %>% as.factor visOverview(sce20, "batch", other_var = "celltype")

Systematic differences (e.g. celltype differences) can be further explored using
`visCluster`

. Here we do not expect any systematic difference as celltypes were
randomly assigned.

visCluster(sce20, metric_var = "cms.unaligned", cluster_var = "celltype")

To remove batch effects when integrating different single-cell RNAseq datasets,
a range of methods can be used. The `cms`

function can be used to evaluate the
effect of these methods, using a cell-specific mixing score. Some of them
(e.g. `fastMNN`

from the `r BiocStyle::Biocpkg("batchelor")`

package) provide a
"common subspace" with integrated embeddings. Other methods like
`r BiocStyle::Biocpkg("limma")`

give "batch-corrected data" as results.
Both work as input for `cms`

.

# MNN - embeddings are stored in the reducedDims slot of sce reducedDimNames(sce20) sce20 <- cms(sce20, k = 30, group = "batch", dim_red = "MNN", res_name = "MNN", n_dim = 3, cell_min = 4) # Run limma sce20 <- scater::logNormCounts(sce20) limma_corrected <- removeBatchEffect(logcounts(sce20), batch = sce20$batch) # Add corrected counts to sce assay(sce20, "lim_corrected") <- limma_corrected # Run cms sce20 <- cms(sce20, k = 30, group = "batch", assay_name = "lim_corrected", res_name = "limma", n_dim = 3, cell_min = 4) names(colData(sce20))

To compare different methods, summary plots from `visIntegration`

(see \@ref(ldf)) and p-value histograms from `visHist`

can be used. Local
patterns within single methods can be explored as described above.

# As pvalue histograms visHist(sce20, metric = "cms.", n_col = 3)

Here both methods `r BiocStyle::Biocpkg("limma")`

and `fastMNN`

from the `r BiocStyle::Biocpkg("scran")`

package flattened the p.value distribution.
So cells are better mixed after batch effect removal.

Besides successful batch "mixing", data integration should also preserve the data's internal structure and variability without adding new sources of variability or removing underlying structures. Especially for methods that result in "corrected counts" it is important to understand how much of the dataset's internal structures are preserved.

`ldfDiff`

calculates the differences between each cell's
**local density factor** before and after data integration [@Latecki2007].
The local density factor is a relative measure of the cell density around a cell
compared to the densities within its neighbourhood. Local density factors are
calculated on the same set of k cells from the cell's knn before integration.
In an optimal case relative densities (according to the same set of cells)
should not change by integration and the `ldfDiff`

score should be close to 0.
In general the overall distribution of `ldfDiff`

should be centered around 0
without long tails.

# Prepare input # List with single SingleCellExperiment objects # (Important: List names need to correspond to batch levels! See ?ldfDiff) sce_pre_list <- list("1" = sce20[,sce20$batch == "1"], "2" = sce20[,sce20$batch == "2"], "3" = sce20[,sce20$batch == "3"]) sce20 <- ldfDiff(sce_pre_list, sce_combined = sce20, group = "batch", k = 70, dim_red = "PCA", dim_combined = "MNN", assay_pre = "counts", n_dim = 3, res_name = "MNN") sce20 <- ldfDiff(sce_pre_list, sce_combined = sce20, group = "batch", k = 70, dim_red = "PCA", dim_combined = "PCA", assay_pre = "counts", assay_combined = "lim_corrected", n_dim = 3, res_name = "limma") names(colData(sce20))

Results from `ldfDiff`

can be visualized in a similar way as results from `cms`

.

# ldfDiff score summarized visIntegration(sce20, metric = "diff_ldf", metric_name = "ldfDiff")

`ldfDiff`

shows a clear difference between the two methods.
While `r BiocStyle::Biocpkg("limma")`

is able to preserve the batch internal
structure within batches, `fastMNN`

clearly changes it.
Even if batches are well mixed (see \@ref(di2)), `fastMNN`

does not work
for batch effect removal on these simulated data.
Again this is in line with expectations due to the small number of genes in
the example data. One of MNN’s assumptions is that batch effects should be much
smaller than biological variation, which does not hold true in this small
example dataset.

Often it is useful to check different aspects of data mixing and integration by
the use of different metrics, as many of them emphasize different features of
mixing. To provide an easy interface for thorough investigation of batch effects
and data integration a wrapper function of a variety of metrics is included into
`r BiocStyle::Biocpkg("CellMixS")`

. `evalIntegration`

calls one or all of `cms`

,
`ldfDiff`

, `entropy`

or equivalents to `mixingMetric`

, `localStruct`

from the
`r BiocStyle::CRANpkg("Seurat")`

package or `isi`

, a simplfied version of the
local inverse Simpson index as suggested by [@Korsunsky2018]. `entropy`

calculates the Shannon entropy within each cell's *knn* describing the
randomness of the batch variable.
`isi`

calculates the inverse Simpson index within each cell's *knn*.
The Simpson index describes the probability that two entities are taken at
random from the dataset and its inverse represents the effective number of
batches in the neighbourhood. A simplified version of the distance based
weightening as proposed by [@Korsunsky2018] is provided by the weight option.
As before the resulting scores are included into the colData slot of the input
`SingleCellExperiment`

object and can be visualized with `visMetric`

and other
plotting functions.

sce50 <- evalIntegration(metrics = c("isi", "entropy"), sce50, group = "batch", k = 30, n_dim = 2, cell_min = 4, res_name = c("weighted_isi", "entropy")) visOverview(sce50, "batch", metric = c("cms_smooth.unaligned", "weighted_isi", "entropy"), prefix = FALSE)

```
sessionInfo()
```

**Any scripts or data that you put into this service are public.**

Embedding an R snippet on your website

Add the following code to your website.

For more information on customizing the embed code, read Embedding Snippets.