In bioinfoDZ/RISC: Robust Integration of Single-Cell RNA-Seq Datasets

knitr::opts_chunk$set(echo = TRUE)

Introduction

Integrated analysis of single cell RNA-sequencing (scRNA-seq) data from multiple batches or studies is often necessary in order to learn functional changes in cellular states upon experimental perturbations or cell type relationships in a developmental lineage. Here we introduce a new algorithm (RPCI) that uses the gene-eigenvectors from a reference dataset to establish a global frame for projecting all datasets, with a clear advantage in preserving genuine gene expression differences in matching cell types between samples, such as those present in cells at distinct developmental stages or in perturbated vs control studies. This R package “RISC” (Robust Integration of Sinlgle Cell RNA-seq data implements the RPCI algorithm, with additional functions for scRNA-seq analysis, such as clustering cells, identifying cluster marker genes, detecting differentially expressed genes between experimental conditions, and importantly outputting integrated gene expression values for downstream data analysis.

This vignette shows the key steps in analyzing example scRNA-seq datasets from the basal or squamous carcinoma patients before and after anti-PD-1 therapy (GSE123813). The data is consisted of two parts, one was derived from T cells of basal cell carcinoma (BCC) patients by single-cell RNA and T cell receptor (TCR) sequencing, while the other from site-matched tumor cells. The marker CD45+ and CD3+ were used to label tumor-infiltrating T cells, CD45+ and CD3- for tumor-infiltrating lymphocytes, and tumor cells identified by CD45- and CD3-. In the original study, 16 scRNA-seq datasets were collected from eight patients. For each patient, the study contained scRNA-seq data before and after anti-PD-1 therapy. However, the datasets from some patients only had few cells or low number of expressed genes; therefore, we filtered them out and chose eight datasets from four patients for this test, including pre- and post- anti-PD-1 therapy.

The public data can be obtained from the GEO (GSE123813), this vignette needs three files (you could download and unzip the files):
(a) the matrix file ("bcc_scRNA_counts.txt")
https://ftp.ncbi.nlm.nih.gov/geo/series/GSE123nnn/GSE123813/suppl/GSE123813%5Fbcc%5FscRNA%5Fcounts%2Etxt%2Egz
(b) meta_cell file ("bcc_all_metadata.txt")
https://ftp.ncbi.nlm.nih.gov/geo/series/GSE123nnn/GSE123813/suppl/GSE123813%5Fbcc%5Fall%5Fmetadata%2Etxt%2Egz
(c) annotation file (“bcc_annotation.tsv”), enclosed in our manuscript revision and the RISC release

The three files will be required in this test and need to be put in the path below:

## Create a directory named "GSE123813" and 
## a sub-directory named "Raw_Data" in the directory "GSE123813".
# Point the path to "GSE123813"
print("PATH = '/Your path to directory/GSE123813'")

## The vignette needs the packages below.
# install.packages(c("Matrix", "ggplot2", "RColorBrewer"))

## The packages are required to install RISC
# install.packages(c(
#   "Matrix", "matrixStats", "irlba", "doParallel", "foreach", 
#   "data.table", "Rtsne", "umap", "MASS", "pbmcapply", "Rcpp", 
#   "RcppEigen", "densityClust", "FNN", "igraph", "RColorBrewer", 
#   "ggplot2", "gridExtra", "pheatmap"
# ))

## Install RISC package locally, 
## RISC is developed in R (v3.6.3)
# install.packages("/your path to/RISC_1.0.tar.gz", repos = NULL, type = "source")

Notice, RISC package is developed in R (v3.6.3), we test this vignette in the same R version.

This vignette contains two parts: (1) the codes in R environment, (2) the codes in python environment

## In R environment ### Prepare individual RISC objects We first read the raw counts and filter out the cells we do not use in this test. We also check the number of expressed genes and make sure each gene expressed at least in one cell. Then, we create RISC objects, each object contains gene expression values in gene-cell matrix, the information of cells and of expressed genes. We show an example of dat1 below wzxhzdk:2 #### Data quality We check the data quality before pre-processing wzxhzdk:3
#### Proportion of UMIs per gene We further examine the UMI/gene distributions in the datasets and check UMI distribution of individual genes in each dataset. In most case, the proportion < 0.1 (i.e. the UMIs of any gene is less than 10% of the total UMIs). If any gene with extremely high UMIs, we will consider to remove this gene from the gene-cell matrix. It is most likely to be artificial and will affect the biological complexity of datasets. wzxhzdk:4
### Processing RISC objects After creating RISC objects, we next process the RISC data, here we show the standard processes:
(1) Filter the cells. we remove cells with extremely low or high UMIs and discard cells with extremely low number of expressed genes.
(2) Normalize gene expression. Here we normalize the raw counts to remove the effect of RNA sequencing depth, using cosine normalization method here, and we also have other methods to normalize the counts, see help file.
(3) Find highly variable genes. We identify highly variable genes by Quasi-Poisson model, and utilize them to decompose gene-cell matrix and to generate gene-eigenvectors for data integration.
wzxhzdk:5 #### Check data quality after processing wzxhzdk:6
#### Check gene expression variance When we search the highly variable genes, the meta-data of gene expression variance data is already calculated and kept in RISC object wzxhzdk:7
### RPCI integration The core step. RPCI introduces an effective formula to calibrate cell similarity by a global reference gene-eigenvectors, and directly projects all cells into a reference RPCI space. We also have a function "InPlot" to help users to choose a good reference dataset and select the sufficient number of eigenvectors for data integration. #### Select the reference We utilize the "InPlot" function to find the optimal reference. The set in the first place will be the reference during data integration. wzxhzdk:8
The outputs of "InPlot" function show several information: (1) how many eigenvectors in each dataset can explain 95% gene expression variance, indicating the number of eigenvectors we want to use in data integration, (2) which datasets contain most cell populations, cell clustering based on "louvain" method, (3) in which dataset the gene expression variance would be explained by more eigenvectors, (4) whether the eigenvectors in each dataset are distributed normally or not. Each plot contain the corresponding score, we choose the reference based on the scores, weighted by "Cluster Num." score > "Stv by PCs" score > "Kolmogorov-Smirnov" score, but if "Kolmogorov-Smirnov" value of one set (the red values in the violin plot) are extremely high, we do not use the set as the reference, since the eigenvector distribution in this set is abnormal. #### Integration We change the order of datasets and make sure the reference in the first position. wzxhzdk:9 ### Annotation and cell clustering We can directly utilized the integrated cell-eigenvectors to cluster the cells. Here we use the cell-type annotation to label the cells for convenience. #### UMAP The calculation of UMAP is based on the integrated cell-eigenvectors. wzxhzdk:10 The integrated cell-eigenvectors. wzxhzdk:11 #### Annotation wzxhzdk:12 #### Patient and treatment wzxhzdk:13
#### Cell populations wzxhzdk:14
#### Cell clustering by igraph The integrated cell-eigenvectors can be extracted and be used to run cell clustering, using other packages. wzxhzdk:15 #### Cell clustering wzxhzdk:16 ### Identify marker genes As the gene expression values are already corrected during data integration, so we could use them directly to find the marker genes or detect DE genes. wzxhzdk:17 #### Expression patterns wzxhzdk:18
### Detect DE genes The differentially expressed (DE) genes are detected from the integrated data directly, using "Negative Binomial" model or "Quasi-Poisson" model. #### Output corrected gene expression matrix The corrected gene expression values can also be used by other packages wzxhzdk:19 #### Find DE genes in macrophage before and after therapy wzxhzdk:20 #### Heatmap of DE genes wzxhzdk:21
### Output for PAGA We output the integrated data (cell eigenvectors and gene expression values) for scanpy. wzxhzdk:22 The part in python environment is shown in next page.

## In Python environment ### Trajectory of macrophage We construct a trajectory of macrophage according to integrated cell-eigenvectors and corrected gene expression values. wzxhzdk:23 wzxhzdk:24

### Session Information wzxhzdk:25

bioinfoDZ/RISC documentation built on March 30, 2024, 9:19 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com