In bioinfoDZ/RISC: Robust Integration of Single-Cell RNA-Seq Datasets

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

Single-cell RNA sequencing (scRNA-seq) has become an essential genomic technology for resolving gene expression heterogeneity in single cells and has been widely used in many biological domains. It remains challenging to integrate scRNA-seq datasets with inter-sample heterogeneity, for example, different cell subpopulation compositions among datasets or gene expression difference in the same cell groups across datasets.

We find that the distortion in the integration of heterogeneous data is due to the lack of a consistent global reference space for projecting all cells from individual datasets. To overcome this issue, we develop a novel approach, named reference principal component integration (RPCI) and implemente it in a new scRNA-seq analysis package called “RISC”, for robust integration of scRNA-seq data.

library(RISC)
library(RColorBrewer)

data("raw.mat")
mat0 = raw.mat[[1]]
coldata0 = raw.mat[[2]]

coldata1 = coldata0[coldata0$Batch0 == 'Batch1',]
coldata2 = coldata0[coldata0$Batch0 == 'Batch2',]
coldata3 = coldata0[coldata0$Batch0 == 'Batch3',]
coldata4 = coldata0[coldata0$Batch0 == 'Batch4',]
coldata5 = coldata0[coldata0$Batch0 == 'Batch5',]
coldata6 = coldata0[coldata0$Batch0 == 'Batch6',]
mat1 = mat0[,rownames(coldata1)]
mat2 = mat0[,rownames(coldata2)]
mat3 = mat0[,rownames(coldata3)]
mat4 = mat0[,rownames(coldata4)]
mat5 = mat0[,rownames(coldata5)]
mat6 = mat0[,rownames(coldata6)]

Heterogeneous Simulated data

To show the advantages of RPCI and evaluate its performances quantitatively, we start with simulated data and control the degrees of gene expression difference in two of the three cell groups between datasets: more DE genes in c/c' than that in b/b', and no DE genes in group a. As expected, the cell groups with increasing DE genes displayed a gradual reduction in cell similarity.

Create RISC objects

We generate RISC objects from the gene-cell matrix (mat), the data frame of the cells (coldata), and the data frame of the genes, using "readsc" function. The RISC objects can also be generated by "read10X_mtx" or "read10X_h5" function for 10X Genomics data.

sce1 = readsc(count = mat1, cell = coldata1, gene = data.frame(Symbol = rownames(mat1), row.names = rownames(mat1)), is.filter = FALSE)
sce2 = readsc(count = mat2, cell = coldata2, gene = data.frame(Symbol = rownames(mat2), row.names = rownames(mat2)), is.filter = FALSE)
sce3 = readsc(count = mat3, cell = coldata3, gene = data.frame(Symbol = rownames(mat3), row.names = rownames(mat3)), is.filter = FALSE)
sce4 = readsc(count = mat4, cell = coldata4, gene = data.frame(Symbol = rownames(mat4), row.names = rownames(mat4)), is.filter = FALSE)
sce5 = readsc(count = mat5, cell = coldata5, gene = data.frame(Symbol = rownames(mat5), row.names = rownames(mat5)), is.filter = FALSE)
sce6 = readsc(count = mat6, cell = coldata6, gene = data.frame(Symbol = rownames(mat6), row.names = rownames(mat6)), is.filter = FALSE)

Processing RISC data

After create RISC objects, we next process the RISC data, here we show the standard processes: (1) filter the cells, remove cells with extremely low or high UMIs and discard cells with extremely low number of expressed genes. Here we use simulated data, so we do not filter out any cell. (2) normalize gene expression, removing the effect of RNA sequencing depth. (3) scale gene expression, the scaled counts merely contain gene signal information for individual cells, and yield column-wise zero empirical mean for each column, thus satisfying the requirement for PCA and SVD. (4) find highly variable genes, identify highly variable genes by Quasi-Poisson model and utilize them for gene-cell matrix decomposition and data integration.

process0 <- function(obj0){

  # filter cells
  obj0 = scFilter(obj0, min.UMI = 0, max.UMI = Inf, min.gene = 10, min.cell = 3, is.filter = FALSE)

  # normalize data
  obj0 = scNormalize(obj0, method = 'robust')

  # find highly variable genes
  obj0 = scDisperse(obj0)

  # here replace highly variable genes by all the genes for integraton
  obj0@vargene = rownames(sce1@rowdata)

  return(obj0)

}

sce1 = process0(sce1)
sce2 = process0(sce2)
sce3 = process0(sce3)
sce4 = process0(sce4)
sce5 = process0(sce5)
sce6 = process0(sce6)

RPCI integration

The core principle of RPCI is very different from existing methods, RPCI introduces an effective formula to calibrate cell similarity by a global reference, and directly projects all cells into a reference RPCI space.

set.seed(1)
var.genes = rownames(sce1@assay$count)
pcr0 = list(sce1, sce2, sce3, sce4, sce5, sce6)
pcr0 = scMultiIntegrate(pcr0, eigens = 9, var.gene = var.genes, align = 'OLS', npc = 15)
# pcr0 = scLargeIntegrate(pcr0, var.gene = var.genes, align = 'Predict', npc = 8)
pcr0 =scUMAP(pcr0, npc = 9, use = 'PLS', dist = 0.001, neighbors = 15)

pcr0@coldata$Group = factor(pcr0@coldata$Group0, levels = c('Group1', 'Group2', 'Group2*', 'Group3', 'Group3*'), labels = c("a", "b", "b'", "c", "c'"))
pcr0@coldata$Set0 = factor(pcr0@coldata$Set, levels = c('Set1', 'Set2', 'Set3', 'Set4', 'Set5', 'Set6'), labels = c('Set1 rep.1', 'Set2 rep.1', 'Set3 rep.1', 'Set1 rep.2', 'Set2 rep.2', 'Set3 rep.2'))
pcr0 = scCluster(pcr0, slot = "cell.umap", k = 4, method = "density", dc = 0.3)

UMAP plot

The dissimilarity in c/c' is larger than that in b/b' based on our original design, and this cell-cell relationship can be directly reflected in the UMAP plots of the RPCI-integrated data. And the difference in c/c' can be re-clustered from the integrated data.

Here the simulated data contain three sets, each set includes three cell groups and two replicates, with the batches existing among sets and duplicates.

DimPlot(pcr0, colFactor = 'Set0', size = 2)
DimPlot(pcr0, colFactor = 'Group', size = 2, Colors = brewer.pal(5, "Set1"))
DimPlot(pcr0, colFactor = 'Cluster', size = 2, Colors = brewer.pal(6, "Dark2"))