In this vignette, we demonstrate how to use Canek
to correct batch effects from SingleCellExperiment
objects following the recommendations from Bioconductor multi-sample analysis.
The toy data sets used in this vignette are included in the Canek
R package.
knitr::opts_chunk$set( fig.width = 7, fig.align = "center", collapse = TRUE, comment = "#>" )
library(Canek) library(SingleCellExperiment) library(scater) library(batchelor) library(scran) library(ggplot2) library(patchwork)
#Create independent SingleCellExperiment objects sce <- lapply(names(Canek::SimBatches$batches), function(batch) { counts <- SimBatches$batches[[batch]] #get the counts SingleCellExperiment(list(counts=counts), #create the sce object mainExpName = "example_sce") }) #Add batch labels names_batch <- c("B1", "B2") names(sce) <- names_batch sce$B1[["batch"]] <- "B1" sce$B2[["batch"]] <- "B2" #Add cell-type labels (included in Canek's package) celltypes <- Canek::SimBatches$cell_types ct_B1_idx <- 1:ncol(sce$B1) ct_B2_idx <- (ncol(sce$B1)+1):length(celltypes) sce$B1[["celltype"]] <- celltypes[ct_B1_idx] sce$B2[["celltype"]] <- celltypes[ct_B2_idx]
Let's check the cells distribution across batches.
table(sce$B1$batch) table(sce$B2$batch)
In the pre-processing steps, we perform the normalization of counts and the variance analysis of genes independently for each batch. Then, we find common variable genes to use in downstream analyses.
We perform scaling normalization within each batch using the multiBatchNorm
function from batchelor Bioconductor package.
sce <- batchelor::multiBatchNorm(sce[["B1"]], sce[["B2"]]) names(sce) <- names_batch
We use the combineVar
function from scran Bioconductor package. This function receives independent variance models of genes for each of the batches. This allows us to select highly variable genes while preserving within-batch differences.
We found 273 combined variable features.
#Get independent features' variant models sce_var <- lapply(X = sce, FUN = scran::modelGeneVar) #Combine the independent variables combined_features <- scran::combineVar(sce_var[["B1"]], sce_var[["B2"]]) #Select features using the `bio` statistics (see combineVar documentation for further details) combined_features <- combined_features$bio > 0 sum(combined_features)
Then, we merge the two datasets and use the combined variable features for the Principal Components Analysis.
uncorrected <- cbind(sce[["B1"]], sce[["B2"]]) uncorrected <- scater::runPCA(x = uncorrected[combined_features,], scale = TRUE)
Let's check out the elbow plot of the PCs.
pca_uncorrected <- SingleCellExperiment::reducedDim(x = uncorrected, type = "PCA") variance <- apply(X = pca_uncorrected, MARGIN = 2, FUN = var) plot(x = 1:30, y = variance[1:30], xlab = "Principal components", ylab = "Variance")
The variance stabilizes after the first seven PCs. In this test, we will use 10 PCs to calculate the UMAP visualization, but feel free to try other numbers.
set.seed(777) uncorrected <- scater::runUMAP(x = uncorrected, dimred = "PCA", pca = 10)
We can now visualize the uncorrected data by batch and cell-type labels.
p1 <- scater::plotUMAP(object = uncorrected, colour_by="batch") p2 <- scater::plotUMAP(object = uncorrected, colour_by="celltype") p1 + p2
We can observe that the cells labeled as Cell Type 1 got divided into two groups that correlated with batch labels. Let's minimize this batch differences with Canek.
Canek accepts a SingleCellExperiments
object with a batch
identifier. We can pass a vector of variable features to use in the integration.
features <- rownames(uncorrected)[combined_features] #RunCanek in a list of SingleCellExperiment objects #corrected <- Canek::RunCanek(x = sce, features = features) #RunCanek in a single object with a batch identifier corrected <- Canek::RunCanek(x = uncorrected, batches = "batch", features = features)
To perform PCA it's important to use the corrected log counts. These are saved in the assay Canek
in the SingleCellExperiment
object and can be specified by changing the exprs_values
parameter.
corrected <- scater::runPCA(x = corrected, scale = TRUE, exprs_values = "Canek")
Let's check out the elbow plot after correction.
pca_corrected <- SingleCellExperiment::reducedDim(x = corrected, type = "PCA") variance <- apply(X = pca_corrected, MARGIN = 2, FUN = var) plot(x = 1:30, y = variance[1:30], xlab = "Principal components", ylab = "Variance")
After correction, the elbow plot has changed and 5 PCs will be enough to calculate the UMAP visualization.
set.seed(777) corrected <- scater::runUMAP(x = corrected, dimred = "PCA", pca = 5)
We can now visualize the corrected data by batch and cell-type labels. We can observe that after batch-correction with Canek, we minimize the batch difference in cells from Cell Type 1 while preserving the other cell types.
p1 <- scater::plotUMAP(object = corrected, colour_by="batch") p2 <- scater::plotUMAP(object = corrected, colour_by="celltype") p1 + p2
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.