scPOP
is a lightweight, low dependency R package which brings together multiple metrics to evaluate batch correction for single cell RNA-seq.
The package includes the Local Simpson Index (LISI) and Average Silhouette Width (ASW) metrics from Harmony and kBET, respectively, as well as the Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI) algorithms.
Install with the following:
library(devtools) devtools::install_github('vinay-swamy/scPOP')
Note that to install this package, you may require additional software to compile Rcpp code:
The metrics we include are :
scPOP
It's important to note that based on the type of label being evaluated, the "optimal" score a given metric may change. For example, when calculating LISI based on batch, a highscore is better(multiple batches close together), but when calculating based on Cell Type, a low score is better( the same celltypes are close together.)
The ideal use case for scPOP
to generate metrics on dataset for which multiple rounds of batch correction have been calculated. These metrics can be used to rank different
We provide a toy dataset in .h5ad format
download.file('https://hpc.nih.gov/~mcgaugheyd/scEiaD/colab/scEiaD_all_anndata_mini_ref.h5ad', 'scEiaD_all_anndata_mini_ref.h5ad')
We recommend calculating all metrics at once using run_all_metrics
. This function requires a matrix of reduced dimensions, a data.frame containing metadata, and the names of 3 columns
batch_key
: column corresponds to batch for each celllabel1_key
: primary label for each cell, ie Cell Typesecondary_label
: for each cell, ie ClusterWe recommend the zellkonverter
for reading .h5ad formatted into R. The example we provide uses data in the SingleCellExperiment format, but as scPOP only requires vectors of labels and matrices of reduced dimensions, data from other frameworks like Seurat can be easily used.
library(zellkonverter,quietly = T) library(SingleCellExperiment,quietly = T) library(scPOP) sce <- zellkonverter::readH5AD('scEiaD_all_anndata_mini_ref.h5ad') sce
## run this to make example data library(tidyverse) library() column_labels <- colData(sce) %>% as_tibble %>% select(Barcode, batch, cluster, subcluster, CellType, CellType_predict) dimred_data <- reducedDim(sce) %>% as.data.frame %>% rownames_to_column('Barcode') colnames(dimred_data)[-1] <- paste0('scviDim_', 1:8) sceiad_subset_data <- inner_join(column_labels, dimred_data) %>% sample_n(10000) %>% as.data.frame usethis::use_data(sceiad_subset_data, overwrite = T)
Running scPOP
metrics <-run_all_metrics(reduction = reducedDim(sce, 'X_scvi'), metadata = colData(sce), batch_key = 'batch', label1_key = 'CellType_predict', label2_key = 'cluster', run_name = 'example') metrics
These metrics can be applied to multiple integration runs to determine the optimal integration method/parameters. To illustrate this, we'll generate some fake data.
multi_run_example <- lapply(c(23232, 23423423, 66774, 2341345, 56733), function(i){ set.seed(i) sce_shuffled <- sce sce_shuffled$batch <- sample(sce_shuffled$batch, ncol(sce)) sce_shuffled$CellType_predict <- sample(sce_shuffled$CellType_predict, ncol(sce)) sce_shuffled$cluster <- sample(sce_shuffled$cluster, ncol(sce)) run_all_metrics(reduction = reducedDim(sce_shuffled, 'X_scvi'), metadata = colData(sce_shuffled), batch_key = 'batch', label1_key = 'CellType_predict', label2_key = 'cluster', run_name = as.character(i), sil_width_prop = .25, sil_width_group_key = 'CellType_predict', quietly=T) }) run_metrics <- do.call(rbind, multi_run_example) run_metrics
We provide a method, calc_sumZscore
, to aggregate these metrics together across multiple runs to generate a single score for each run
run_metrics$sumZscore <- calc_sumZscore(run_metrics, 'batch') run_metrics[,c('run', 'sumZscore')]
We also provide functions for running each function individually
ari_score <- ari(sce$batch, sce$CellType_predict) ari_score
nmi_score <- nmi(sce$batch, sce$CellType_predict) nmi_score
Lisi requires about 10gb of memory for 100K cells and scales linearly with number of cells
lisi_score <- lisi(X = reducedDim(sce, 'X_scvi'), meta_data = as.data.frame(colData(sce)), label_colnames = 'CellType_predict' ) head(lisi_score)
For some the silhouette_width
, a distance matrix mist be calculated, which requires significant memory usage. We provide the function stratified_sample
to downsample data based on a grouping variable
idx <- stratified_sample(colnames(sce), sce$batch) sce_ds <- sce[,idx] sil_score <- silhouette_width(reduction = reducedDim(sce_ds, 'X_scvi'), meta.data = colData(sce_ds), keys ='CellType_predict')
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.