knitr::opts_chunk$set( collapse = TRUE, comment = "#>", warning = FALSE, message = FALSE )
Single-cell RNA-seq integration methods aim to remove technical batch effects while preserving biological variation. CIDER provides a ground-truth-free approach to:
This vignette focuses how showing the process using the example data of dendritic cells.
Apart from CIDER, the following packages also need to be loaded:
library(CIDER) library(Seurat) library(cowplot) library(ggplot2)
The example data can be downloaded from https://figshare.com/s/d5474749ca8c711cc205. This dataset contains 26593 genes and 564 cells, and comprises four dendritic cell subtypes (CD141, CD1C, DoubleNeg, and pDC) from two batches. The raw count matrix and sample information were downloaded from a curated set, and cells with fewer than 500 detected genes have been removed.
# Download the data data_url <- "https://figshare.com/ndownloader/files/52116197" data_file <- file.path(tempdir(), "dendritic.rds") if (!file.exists(data_file)) { message("Downloading data...") download.file(data_url, destfile = data_file, mode = "wb") } dendritic_reduced <- load(data_file)
# load("../data/dendritic.rda") dendritic <- CreateSeuratObject(counts = dendritic@assays$RNA@counts, meta.data = dendritic@meta.data)
# Verify batch composition table(dendritic$Batch)
First an integration method$^1$ is applied on the dendritic data. You
can apply other integration methods to the your data, as long as the
correct PCs are stored in your Seurat object, i.e.
Reductions(seu.integrated, "pca")
or seu.integrated@reductions$pca
.
seu.list <- SplitObject(dendritic, split.by = "Batch") for (i in 1:length(seu.list)) { seu.list[[i]] <- NormalizeData(seu.list[[i]], verbose = FALSE) seu.list[[i]] <- FindVariableFeatures(seu.list[[i]], selection.method = "vst", nfeatures = 1000, verbose = FALSE) } seu.anchors <- FindIntegrationAnchors(object.list = seu.list, dims = 1:15, verbose = FALSE) seu.integrated <- IntegrateData(anchorset = seu.anchors, dims = 1:15, verbose = FALSE) DefaultAssay(seu.integrated) <- "integrated" seu.integrated <- ScaleData(seu.integrated, verbose = FALSE) seu.integrated <- RunPCA(seu.integrated, verbose = FALSE) seu.integrated <- RunTSNE(seu.integrated, reduction = "pca", dims = 1:5)
Clear the intermediate outcome.
rm(seu.list, seu.anchors) gc()
CIDER evaluates integration results in three steps.
Clustering based on the corrected PCs (hdbscan.seurat
). This step uses
HDBSCAN, which is a density-based clustering algorithm$^2$. The
clustering results are stored in seu.integrated$dbscan_cluster
.
Clusters are further divided into batch-specific clusters by
concatenating dbscan_cluster and batch, stored in
seu.integrated$initial_cluster
.
seu.integrated <- hdbscan.seurat(seu.integrated)
Compute IDER-based similarity matrix (getIDEr
) among the
batch-specific initial clusters. If multiple CPUs are availble, you can
set use.parallel = TRUE
and n.cores
to the number of available cores
to speed it up.
ider <- getIDEr(seu.integrated, use.parallel = FALSE, verbose = FALSE)
Assign the similarity and estimate empirical p values (estimateProb
)
for the correctness of integration. High similarity values and low p
values indicate that the cell are similar to the surrounding cells and
likely integrated correctly.
seu.integrated <- estimateProb(seu.integrated, ider)
The evaluation scores can be viewed by the scatterPlot
as below. As
shown cells with dbscan_cluster of 2 and 3 have low regional similarity
and high empirical p values, suggesting that they can be incorrectly
integrated.
p1 <- scatterPlot(seu.integrated, "tsne", "dbscan_cluster") p2 <- scatterPlot(seu.integrated, "tsne", colour.by = "similarity") + labs(fill = "Similarity") p3 <- scatterPlot(seu.integrated, "tsne", colour.by = "pvalue") + labs(fill = "Prob of \nrejection") plot_grid(p1, p2, p3, ncol = 3)
Interpretation Guide:
✅ High similarity + Low p-value: Well-integrated regions
❌ Low similarity + High p-value: Potential integration errors
To have more insight, we can view the IDER-based similarity matrix by
functions plotNetwork
or plotHeatmap
. Both of them require the input
of a Seurat object and the output of getIDEr
. In this example,
1_Batch1 and 1_Batch2 as well as 4_Batch1 and 4_Batch2 have high
similarity.
plotNetwork
generates a graph where vertexes are initial clusters and
edge widths are similarity values. The parameter weight.factor
controls the scale of edge widths; larger weight.factor
will give
bolder edges proportionally.
plotNetwork(seu.integrated, ider, weight.factor = 3)
plotHeatmap
generates a heatmap where each cell is coloured and
labeled by the similarity values.
plotHeatmap(seu.integrated, ider)
So far the evaluation have completed and CIDER has not used the ground truth at all!
Let's peep at the ground truth before the closure of this vignette. As shown in the figure below, the clusters having low IDER-based similarity and high p values actually have at least two populations (CD1C and CD141), verifying that CIDER spots the wrongly integrated cells.
scatterPlot(seu.integrated, "tsne", colour.by = "Group") + labs(fill = "Group\n (ground truth)")
Parameter Tuning:
Adjust hdbscan.seurat
parameters if initial clustering is too
granular
Modify cutree.h
in estimateProb
to change confidence thresholds
Interpretation Tips:
Always validate suspicious joint clusters with marker genes
Scalability:
For large datasets (>10k cells), enable parallel processing with
use.parallel=TRUE
sessionInfo()
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.