evalIntegration: evalIntegration
In CellMixS: Evaluate Cellspecific Mixing


Function to evaluate single-cell data integration, providing a common framework for different metrics. It includes metrics that evaluate mixing as well as metrics that evaluate preservation of the local/individual structure.

evalIntegration(
  metrics,
  sce,
  group,
  dim_red = "PCA",
  assay_name = "logcounts",
  n_dim = 10,
  res_name = NULL,
  k = NULL,
  k_min = NA,
  smooth = TRUE,
  cell_min = 10,
  batch_min = NULL,
  unbalanced = FALSE,
  weight = TRUE,
  k_pos = 5,
  sce_pre_list = NULL,
  dim_combined = dim_red,
  assay_pre = "logcounts",
  n_combined = 10,
  BPPARAM = SerialParam()
)

`metrics`	Character vector. Names of the metrics to apply. Must be one or more of 'cms', 'ldfDiff', 'isi', 'mixingMetric', 'localStructure', 'entropy'.
`sce`	`SingleCellExperiment` object, with the integrated data.
`group`	Character. Name of group/batch variable. Needs to be one of `names(colData(sce))`.
`dim_red`	Character. Name of embedding to use as subspace for distance distributions. Default is "PCA".
`assay_name`	Character. Name of the assay to use for PCA. Only relevant if no existing 'dim_red' is provided. Must be one of `names(assays(sce))`. Default is "logcounts".
`n_dim`	Numeric. Number of dimensions to include to define the subspace.
`res_name`	Character vector. Suffix appended to the result score's name (e.g. the method used to combine batches). Must either have the same length as `metrics` or be NULL.
`k`	Numeric. Number of k-nearest neighbours (knn) to use.
`k_min`	Numeric. Minimum number of knn to include (see `cms`). Relevant for metrics: 'cms'.
`smooth`	Logical. Indicates whether cms results should be smoothed within each neighbourhood using the weighted mean. Relevant for metric: 'cms'.
`cell_min`	Numeric. Minimum number of cells from each group to be included in the AD test. Should be > 4. Relevant for metric: 'cms'.
`batch_min`	Numeric. Minimum number of cells per batch to include in the AD test. If set, neighbours will be included until batch_min cells from each batch are present. Relevant for metric: 'cms'.
`unbalanced`	Boolean. If TRUE, neighbourhoods with only one batch present will be set to NA, so that they are not included in any summaries or smoothing. Relevant for metric: 'cms'.
`weight`	Boolean. If TRUE, batch probabilities to calculate the isi score are weighted by the mean distance of their cells towards the cell of interest. Relevant for metrics: 'isi'.
`k_pos`	Numeric. Position of the cell to be used as reference within the mixing metric. See `MixingMetric` for details. Relevant for metric: 'mixingMetric'.
`sce_pre_list`	A list of `SingleCellExperiment` objects containing the single datasets before integration. Names should correspond to the levels in `colData(sce)[, group]`. Relevant for metric: 'ldfDiff'.
`dim_combined`	Character. Name of embeddings to use as subspace to calculate LDF after integration. Default is `dim_red`. Relevant for metric 'ldfDiff'.
`assay_pre`	Character. Name of the assay to use for PCA. Only relevant if no existing 'dim_red' is provided. Must be one of `names(assays(sce_pre))`. Default is "logcounts". Relevant for metric 'ldfDiff'.
`n_combined`	Number of PCs to use in original space. See `LocalStruct` for details. Relevant for metric 'localStructure'.
`BPPARAM`	A BiocParallelParam object specifying whether cms scores shall be calculated in parallel. Relevant for metric: 'cms'.

evalIntegration is a wrapper function for different metrics that help to assess the results of integrating single-cell datasets. Broadly, there are metrics that evaluate the *mixing* of datasets, that is, whether a dataset-specific bias remains after integration. In addition, there are metrics that evaluate how well each dataset's internal structure has been retained, that is, whether integration has removed (potentially biological) signal or added noise.

A SingleCellExperiment with the chosen metric's score within colData.

Here we provide the following metrics:

cms: Cellspecific Mixing Score. Metric that tests the hypothesis that group-specific distance distributions of knn cells have the same underlying unspecified distribution. The score can be interpreted as the data's probability within an equally mixed neighbourhood according to the batch variable (see cms).
isi: Inverse Simpson Index. Metric that uses the Inverse Simpson’s Index to calculate the diversification within a specified neighbourhood. The Simpson index describes the probability that two entities taken at random from the dataset belong to the same group, and its inverse represents the effective number of batches in a neighbourhood. The inverse Simpson index has been proposed as a diversity score for batch mixing in single cell RNAseq by Korsunsky et al. They provide a distance-based neighbourhood weighting in their Lisi package.
mixingMetric: Mixing Metric. Metric using the median position of the kth cell from each batch within its knn as a score. The lower the score, the better mixed the neighbourhood. We implemented a version equivalent to the one in the Seurat package (see MixingMetric and mixMetric).
entropy: Shannon entropy. Metric calculating the Shannon entropy of the batch/group variable within each cell's k-nearest neighbours. For balanced batches, the entropy is closer to 1 the higher the variable's randomness. For unbalanced batches, entropy should only be used as a relative metric in a comparative setting (see entropy).
ldfDiff: Local density factor differences. Metric that determines cell-specific changes in the Local Density Factor before and after data integration. A metric/difference close to 0 indicates no distortion of the previous structure (see ldfDiff).
localStructure: Local structure. Metric that compares the intersection of knn from the same batch before and after integration, returning the average across all groups. The higher the score, the more neighbours were reproduced after integration. Here we implemented a version equivalent to the one in the Seurat package (see LocalStruct and locStructure).
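
The neighbourhood-based mixing scores described above can be illustrated with a minimal base R sketch for a single cell. This is not the CellMixS implementation: the function names `inverse_simpson`, `shannon_entropy` and `mixing_position` and the toy vector `knn_batch` are hypothetical, the inverse Simpson index is shown without distance weighting, and normalising the entropy by the log of the number of batches is an assumption chosen to match the stated range of up to 1 for balanced batches.

```r
# Hypothetical toy data: batch labels of one cell's 10 nearest neighbours,
# ordered by increasing distance from the cell of interest.
knn_batch <- c("1", "2", "1", "1", "2", "2", "1", "2", "2", "1")

# Batch proportions within a neighbourhood (zero proportions are dropped).
batch_props <- function(labels) {
  p <- as.numeric(table(labels)) / length(labels)
  p[p > 0]
}

# Unweighted inverse Simpson index: 1 / sum(p_b^2),
# interpretable as the effective number of batches in the neighbourhood.
inverse_simpson <- function(labels) {
  p <- batch_props(labels)
  1 / sum(p^2)
}

# Shannon entropy, normalised by log(n_batches) (an assumption of this sketch)
# so that an evenly mixed neighbourhood scores 1 and a single-batch one scores 0.
shannon_entropy <- function(labels, n_batches) {
  p <- batch_props(labels)
  -sum(p * log(p)) / log(n_batches)
}

# Mixing metric for one cell: position of the k_pos-th neighbour of each batch
# in the ordered neighbour list, summarised by the median across batches.
# A batch with fewer than k_pos neighbours is assigned the list length
# (an assumption of this sketch).
mixing_position <- function(labels, batches, k_pos = 5) {
  pos <- vapply(batches, function(b) {
    idx <- which(labels == b)
    as.numeric(if (length(idx) >= k_pos) idx[k_pos] else length(labels))
  }, numeric(1))
  median(pos)
}

isi_score     <- inverse_simpson(knn_batch)
entropy_score <- shannon_entropy(knn_batch, n_batches = 2)
mixing_score  <- mixing_position(knn_batch, batches = c("1", "2"), k_pos = 3)
```

In this toy neighbourhood both batches contribute five cells, so the inverse Simpson index equals 2 (two effective batches) and the normalised entropy equals 1. A neighbourhood containing a single batch would give 1 and 0, respectively.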

Korsunsky I, Fan J, Slowikowski K, Zhang F, Wei K, et al. (2018). Fast, sensitive, and accurate integration of single cell data with Harmony. bioRxiv (preprint).

Stuart T, Butler A, Hoffman P, Hafemeister C, Papalexi E, et al. (2019). Comprehensive Integration of Single-Cell Data. Cell.

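library(SingleCellExperiment)
library(CellMixS)

# Load the simulated example data shipped with CellMixS and keep a subset of cells
sim_list <- readRDS(system.file("extdata/sim50.rds", package = "CellMixS"))
sce <- sim_list[[1]][, c(1:15, 300:320, 16:30)]

# Split by batch to obtain a named list of per-batch objects (needed for 'ldfDiff')
sce_batch1 <- sce[, colData(sce)$batch == "1"]
sce_batch2 <- sce[, colData(sce)$batch == "2"]
pre <- list("1" = sce_batch1, "2" = sce_batch2)

# Mixing metrics: scores are added to colData(sce)
sce <- evalIntegration(metrics = c("cms", "mixingMetric", "isi", "entropy"), sce, "batch", k = 20)
# Structure-preservation metric, using the per-batch objects
sce <- evalIntegration("ldfDiff", sce, "batch", k = 20, sce_pre_list = pre)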