knitr::opts_chunk$set(echo = TRUE)
Integrated analysis of single cell RNA-sequencing (scRNA-seq) data from multiple batches or studies is often necessary in order to learn functional changes in cellular states upon experimental perturbations or cell type relationships in a developmental lineage. Here we introduce a new algorithm (RPCI) that uses the gene-eigenvectors from a reference dataset to establish a global frame for projecting all datasets, with a clear advantage in preserving genuine gene expression differences in matching cell types between samples, such as those present in cells at distinct developmental stages or in perturbated vs control studies. This R package “RISC” (Robust Integration of Sinlgle Cell RNA-seq data implements the RPCI algorithm, with additional functions for scRNA-seq analysis, such as clustering cells, identifying cluster marker genes, detecting differentially expressed genes between experimental conditions, and importantly outputting integrated gene expression values for downstream data analysis.
This vignette shows the key steps in analyzing example scRNA-seq datasets from the basal or squamous carcinoma patients before and after anti-PD-1 therapy (GSE123813). The data is consisted of two parts, one was derived from T cells of basal cell carcinoma (BCC) patients by single-cell RNA and T cell receptor (TCR) sequencing, while the other from site-matched tumor cells. The marker CD45+ and CD3+ were used to label tumor-infiltrating T cells, CD45+ and CD3- for tumor-infiltrating lymphocytes, and tumor cells identified by CD45- and CD3-. In the original study, 16 scRNA-seq datasets were collected from eight patients. For each patient, the study contained scRNA-seq data before and after anti-PD-1 therapy. However, the datasets from some patients only had few cells or low number of expressed genes; therefore, we filtered them out and chose eight datasets from four patients for this test, including pre- and post- anti-PD-1 therapy.
The public data can be obtained from the GEO (GSE123813), this vignette needs three files (you could download and unzip the files):
(a) the matrix file ("bcc_scRNA_counts.txt")
https://ftp.ncbi.nlm.nih.gov/geo/series/GSE123nnn/GSE123813/suppl/GSE123813%5Fbcc%5FscRNA%5Fcounts%2Etxt%2Egz
(b) meta_cell file ("bcc_all_metadata.txt")
https://ftp.ncbi.nlm.nih.gov/geo/series/GSE123nnn/GSE123813/suppl/GSE123813%5Fbcc%5Fall%5Fmetadata%2Etxt%2Egz
(c) annotation file (“bcc_annotation.tsv”), enclosed in our manuscript revision and the RISC release
The three files will be required in this test and need to be put in the path below:
## Create a directory named "GSE123813" and ## a sub-directory named "Raw_Data" in the directory "GSE123813". # Point the path to "GSE123813" print("PATH = '/Your path to directory/GSE123813'") ## The vignette needs the packages below. # install.packages(c("Matrix", "ggplot2", "RColorBrewer")) ## The packages are required to install RISC # install.packages(c( # "Matrix", "matrixStats", "irlba", "doParallel", "foreach", # "data.table", "Rtsne", "umap", "MASS", "pbmcapply", "Rcpp", # "RcppEigen", "densityClust", "FNN", "igraph", "RColorBrewer", # "ggplot2", "gridExtra", "pheatmap" # )) ## Install RISC package locally, ## RISC is developed in R (v3.6.3) # install.packages("/your path to/RISC_1.0.tar.gz", repos = NULL, type = "source")
This vignette contains two parts: (1) the codes in R environment, (2) the codes in python environment
## In R environment
### Prepare individual RISC objects
We first read the raw counts and filter out the cells we do not use in this test. We also check the number of expressed genes and make sure each gene expressed at least in one cell. Then, we create RISC objects, each object contains gene expression values in gene-cell matrix, the information of cells and of expressed genes. We show an example of dat1 below
wzxhzdk:2
#### Data quality
We check the data quality before pre-processing
wzxhzdk:3
#### Proportion of UMIs per gene
We further examine the UMI/gene distributions in the datasets and check UMI distribution of individual genes in each dataset. In most case, the proportion < 0.1 (i.e. the UMIs of any gene is less than 10% of the total UMIs). If any gene with extremely high UMIs, we will consider to remove this gene from the gene-cell matrix. It is most likely to be artificial and will affect the biological complexity of datasets.
wzxhzdk:4
### Processing RISC objects
After creating RISC objects, we next process the RISC data, here we show the standard processes:
(1) Filter the cells. we remove cells with extremely low or high UMIs and discard cells with extremely low number of expressed genes.
(2) Normalize gene expression. Here we normalize the raw counts to remove the effect of RNA sequencing depth, using cosine normalization method here, and we also have other methods to normalize the counts, see help file.
(3) Find highly variable genes. We identify highly variable genes by Quasi-Poisson model, and utilize them to decompose gene-cell matrix and to generate gene-eigenvectors for data integration.
wzxhzdk:5
#### Check data quality after processing
wzxhzdk:6
#### Check gene expression variance
When we search the highly variable genes, the meta-data of gene expression variance data is already calculated and kept in RISC object
wzxhzdk:7
### RPCI integration
The core step. RPCI introduces an effective formula to calibrate cell similarity by a global reference gene-eigenvectors, and directly projects all cells into a reference RPCI space. We also have a function "InPlot" to help users to choose a good reference dataset and select the sufficient number of eigenvectors for data integration.
#### Select the reference
We utilize the "InPlot" function to find the optimal reference. The set in the first place will be the reference during data integration.
wzxhzdk:8
The outputs of "InPlot" function show several information: (1) how many eigenvectors in each dataset can explain 95% gene expression variance, indicating the number of eigenvectors we want to use in data integration, (2) which datasets contain most cell populations, cell clustering based on "louvain" method, (3) in which dataset the gene expression variance would be explained by more eigenvectors, (4) whether the eigenvectors in each dataset are distributed normally or not. Each plot contain the corresponding score, we choose the reference based on the scores, weighted by "Cluster Num." score > "Stv by PCs" score > "Kolmogorov-Smirnov" score, but if "Kolmogorov-Smirnov" value of one set (the red values in the violin plot) are extremely high, we do not use the set as the reference, since the eigenvector distribution in this set is abnormal.
#### Integration
We change the order of datasets and make sure the reference in the first position.
wzxhzdk:9
### Annotation and cell clustering
We can directly utilized the integrated cell-eigenvectors to cluster the cells. Here we use the cell-type annotation to label the cells for convenience.
#### UMAP
The calculation of UMAP is based on the integrated cell-eigenvectors.
wzxhzdk:10
The integrated cell-eigenvectors.
wzxhzdk:11
#### Annotation
wzxhzdk:12
#### Patient and treatment
wzxhzdk:13
#### Cell populations
wzxhzdk:14
#### Cell clustering by igraph
The integrated cell-eigenvectors can be extracted and be used to run cell clustering, using other packages.
wzxhzdk:15
#### Cell clustering
wzxhzdk:16
### Identify marker genes
As the gene expression values are already corrected during data integration, so we could use them directly to find the marker genes or detect DE genes.
wzxhzdk:17
#### Expression patterns
wzxhzdk:18
### Detect DE genes
The differentially expressed (DE) genes are detected from the integrated data directly, using "Negative Binomial" model or "Quasi-Poisson" model.
#### Output corrected gene expression matrix
The corrected gene expression values can also be used by other packages
wzxhzdk:19
#### Find DE genes in macrophage before and after therapy
wzxhzdk:20
#### Heatmap of DE genes
wzxhzdk:21
### Output for PAGA
We output the integrated data (cell eigenvectors and gene expression values) for scanpy.
wzxhzdk:22
The part in python environment is shown in next page.
## In Python environment ### Trajectory of macrophage We construct a trajectory of macrophage according to integrated cell-eigenvectors and corrected gene expression values. wzxhzdk:23 wzxhzdk:24
### Session Information
wzxhzdk:25
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.