Getting Start with Assisted CIDER (asCIDER)

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

This vignette performs asCIDER, a meta-clustering method, on a cross-species pancreas dataset. AsCIDER is aimed to achieve the clustering task in data confounded by unwanted variables. In this scenario, the unwanted variable is species effects.

asCIDER is short for assisted CIDER, assisted by the prior batch-specific annotations or clustering results.

Set up

In addition to CIDER, we will load the following packages:

library(CIDER)
library(Seurat)
library(pheatmap)
library(ggplot2)
library(cowplot)
library(viridis)

Pancreas data

The example data can be downloaded from https://figshare.com/s/d5474749ca8c711cc205.

This data$^1$ contain single-cell RNA-seq pancreatic data from human (8241 cells) and mouse (1886 cells) with 12474 rows (genes).

# Download the data
data_url <- "https://figshare.com/ndownloader/files/31469387"
data_file <- file.path(tempdir(), "pancreas_counts.RData")

if (!file.exists(data_file)) {
  message("Downloading count data to temporary directory...")
  download.file(data_url, destfile = data_file, mode = "wb")
}
# Load counts matrix and metadata
data_env <- new.env()
loaded_objects <- load(file = data_file, envir = data_env)
pancreas_counts <- data_env[[loaded_objects[1]]]

load("../data/pancreas_meta.RData") # meta data/cell information
seu <- CreateSeuratObject(counts = pancreas_counts, meta.data = pancreas_meta)
table(seu$Batch)

Exam if data are confounded

Prior to use CIDER (or other integration methods for clustering) it is important to exam if clustering will be confounded by the cross-species factors.

Her we first perform the conventional Seurat$^2$ clustering pipeline.

seu <- NormalizeData(seu, verbose = FALSE)
seu <- FindVariableFeatures(seu, selection.method = "vst", nfeatures = 2000, verbose = FALSE)
seu <- ScaleData(seu, verbose = FALSE)
seu <- RunPCA(seu, npcs = 20, verbose = FALSE)
seu <- RunTSNE(seu, reduction = "pca", dims = 1:12)

Confounded dimension reduction

The PCA and t-SNE plots both showed that the data are confounded by species.

The scatterPlot function is used to generate dimension reduction figures. It takes input of a Seurat object (seu here), the name of reduction (pca and tsne here), the variable deciding dot colours (Batch here) and the title of plots. See more information by ?scatterPlot

p1 <- scatterPlot(seu, "pca",colour.by = "Batch", title = "PCA") 
p2 <- scatterPlot(seu, "tsne",colour.by = "Batch", title = "t-SNE") 
plot_grid(p1, p2)

Confounded clustering results

seu <- FindNeighbors(seu, dims = 1:12)
seu <- FindClusters(seu)

scatterPlot(seu, "tsne",colour.by = "seurat_clusters", title = "t-SNE") 
res <- data.frame(table(seu$seurat_clusters, seu$Batch))
ggplot(res, aes(fill=Var2, y=Freq, x=Var1)) + 
    geom_bar(position="stack", stat="identity") + xlab("CIDER_cluster") + ylab("Proportions")

asCIDER

Prepare initial clusters

asCIDER uses existing within-batch clustering results. Here we concatenate the batch ID and the within-batch cluster ID to obtain cluster-specific groups (i.e. initial clusters).

seu$initial_cluster <- paste(seu$Group, seu$Batch, sep = "_")
table(seu$initial_cluster)

Calculate of IDER similarity matrix

The function getIDEr calculate the IDER-based distance matrix.

By default, it will use the column called "initial_cluster" as initial clusters, and "Batch" as batch. If you are using columns other than this two, please revise these two parameters.

For this step, you can choose to use parallel computation by setting use.parallel = TRUE or not (default use.parallel = FALSE). The default number of cores used for parallel computation is detectCores(logical = FALSE) - 1.

ider <- getIDEr(seu, 
                group.by.var = "initial_cluster",
                batch.by.var = "Batch",
                downsampling.size = 35, 
                use.parallel = FALSE, verbose = FALSE)

Visualise the similarity matrix

groups <- c("alpha","beta","delta", "gamma","ductal","endothelial", "activated_stellate", "quiescent_stellate", "macrophage")
idx1 <- paste0(groups, "_human")
idx2 <- paste0(groups, "_mouse")

pheatmap::pheatmap(
  ider[[1]][idx1, idx2],
  color = inferno(10),
  border_color = NA,
  display_numbers = TRUE,
  cluster_rows = FALSE,
  cluster_cols = FALSE,
  width = 7,
  height = 5,
  cellwidth = 22,
  cellheight = 22
)

Perform final Clustering

Next we put both Seurat object and the similarity matrix list into finalClustering. You can either cut trees by height (set cutree.by = 'h) or by k (set cutree.by = 'k). The default is cutting by the height of 0.45.

The final clustering results are stored in Colume cider_final_cluster of Seurat object metadata and can be extracted using seu$cider_final_cluster.

seu <- finalClustering(seu, ider, cutree.h = 0.45)
head(seu@meta.data)

Visualise clustering results

plot_list <- list()
plot_list[[1]] <- scatterPlot(seu, "tsne", colour.by = "CIDER_cluster", title = "asCIDER clustering results") 
plot_list[[2]] <- scatterPlot(seu, "tsne", colour.by = "Group", title = "Ground truth of cell populations") 
plot_grid(plotlist = plot_list, ncol = 2)
res <- data.frame(table(seu$CIDER_cluster, seu$Batch))
ggplot(res, aes(fill=Var2, y=Freq, x=Var1)) + 
    geom_bar(position="stack", stat="identity") + xlab("CIDER_cluster") + ylab("Proportions")

Reproducibility

sessionInfo()

References

  1. Baron, M. et al. A Single-Cell Transcriptomic Map of the Human and Mouse Pancreas Reveals Inter- and Intra-cell Population Structure. Cell Syst 3, 346–360.e4 (2016).
  2. Satija R, et al. Spatial reconstruction of single-cell gene expression data. Nature Biotechnology 33, 495-502 (2015).
  3. The count matrix and sample information were downloaded from NCBI GEO accession GSE84133.


Try the CIDER package in your browser

Any scripts or data that you put into this service are public.

CIDER documentation built on April 4, 2025, 2:28 a.m.