This vignette demonstrates how to run CIPR on Seurat objects. If you use CIPR, please cite:

CIPR: a web-based R/shiny app and R package to annotate cell clusters in single cell RNA sequencing experiments

H. Atakan Ekiz, Christopher J. Conley, W. Zac Stephens & Ryan M. O'Connell

BMC Bioinformatics, 2020.

doi: 10.1186/s12859-020-3538-2

Github: https://github.com/atakanekiz/CIPR-Package

knitr::opts_chunk$set(
  tidy = TRUE,
  tidy.opts = list(width.cutoff = 95),
  message = FALSE,
  warning = FALSE
)
remotes::install_github("atakanekiz/CIPR-Package")

Summary

This vignette describes how to use the CIPR package with the 3k PBMC data freely available from 10x Genomics. Here, we reuse the code from Seurat's guided clustering tutorial to help users perform the analysis from scratch. Using this dataset, we demonstrate how CIPR annotates cell clusters in single cell RNAseq (scRNAseq) experiments. For further information about other clustering methods, please see Seurat's comprehensive website.

Install CIPR

# The installation calls below use remotes, so make sure it is available
if (!requireNamespace("remotes", quietly = TRUE))
    install.packages("remotes")

# Use this option if you want to build vignettes during installation
# This can take a long time due to the installation of suggested packages.
remotes::install_github("atakanekiz/CIPR-Package", build_vignettes = TRUE)

# Use this if you would like to install the package without vignettes
# remotes::install_github("atakanekiz/CIPR-Package")

Seurat pipeline

Setup Seurat object

library(dplyr)
library(Seurat)
library(SeuratData)
library(CIPR)
# Load data
InstallData("pbmc3k")
pbmc <- pbmc3k

Pre-processing

The steps below encompass the standard pre-processing workflow for scRNA-seq data in Seurat. These represent the selection and filtration of cells based on QC metrics, data normalization and scaling, and the detection of highly variable features.

# Calculate mitochondrial gene representation (indicative of low quality cells)
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")

# Filter out cells with feature counts outside the 200-2500 range or with >5% mitochondrial reads
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
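
The filtering condition can be checked in isolation before applying it to the Seurat object. The sketch below is a minimal base R illustration on a made-up QC table (the cell names and values are hypothetical, not taken from the PBMC data); it only mirrors the logical expression used in the `subset()` call.

```r
# Toy per-cell QC table; all values are invented for illustration only.
qc <- data.frame(
  cell         = c("c1", "c2", "c3", "c4"),
  nFeature_RNA = c(150, 900, 3000, 1200),
  percent.mt   = c(2.0, 3.5, 1.0, 8.0),
  stringsAsFactors = FALSE
)

# Same logical condition as the subset() call:
# keep cells with 200 < nFeature_RNA < 2500 and < 5% mitochondrial reads.
keep <- qc$nFeature_RNA > 200 & qc$nFeature_RNA < 2500 & qc$percent.mt < 5

kept_cells <- qc$cell[keep]
print(kept_cells)
```

Here c1 has too few features, c3 has too many, and c4 exceeds the mitochondrial cutoff, so only c2 passes.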

Normalizing data

pbmc <- NormalizeData(pbmc)

Variable gene detection and scaling

pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)

Perform PCA

pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
ElbowPlot(pbmc)

Cluster cells

pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- FindClusters(pbmc, resolution = 0.5)

Run non-linear dimensionality reduction (tSNE)

pbmc <- RunTSNE(pbmc, dims = 1:10)
pbmc$unnamed_clusters <- Idents(pbmc)
# saveRDS(pbmc, "pbmc.rds")

Find differentially expressed genes

This is the step where we generate the input for CIPR's log fold change (logFC) comparison methods.

allmarkers <- FindAllMarkers(pbmc)

Calculate average gene expression per cluster

This is the step where we generate the input for CIPR's all-genes correlation methods.

avgexp <- AverageExpression(pbmc)
avgexp <- as.data.frame(x = avgexp$RNA)
avgexp$gene <- rownames(avgexp)

Visualize Seurat object

DimPlot(pbmc)

CIPR analysis

The user can select one of the 7 provided reference data sets:

| Reference                                 | reference argument |
|-------------------------------------------|--------------------|
| Immunological Genome Project (ImmGen)     | "immgen"           |
| Presorted cell RNAseq (various tissues)   | "mmrnaseq"         |
| Blueprint/ENCODE                          | "blueprint"        |
| Human Primary Cell Atlas                  | "hpca"             |
| Database of Immune Cell Expression (DICE) | "dice"             |
| Hematopoietic differentiation             | "hema"             |
| Presorted cell RNAseq (PBMC)              | "hsrnaseq"         |
| User-provided custom reference            | "custom"           |

Standard logFC comparison method

In this method, CIPR takes the allmarkers data frame created above and scores the logFC values of each cluster against logFC values calculated from the reference data.
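
To make the idea of a logFC dot product concrete, here is a minimal base R sketch on toy data. It is not CIPR's internal code: the gene and sample names are hypothetical, and it assumes that reference logFC values are taken relative to the per-gene mean across all reference samples.

```r
# Toy log-normalized reference expression (genes x reference samples).
ref <- matrix(
  c(5, 1, 1,
    1, 5, 1,
    1, 1, 5),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("GeneA", "GeneB", "GeneC"),
                  c("T_cell", "B_cell", "Monocyte"))
)

# Assumption: reference logFC = expression minus the per-gene mean
# across the whole reference.
ref_logfc <- ref - rowMeans(ref)

# logFC values of one unknown cluster (hypothetical numbers).
cluster_logfc <- c(GeneA = 2, GeneB = -1, GeneC = -1)

# Identity score per reference sample: dot product over the shared genes.
shared <- intersect(names(cluster_logfc), rownames(ref_logfc))
scores <- colSums(ref_logfc[shared, , drop = FALSE] * cluster_logfc[shared])

print(scores)
print(names(which.max(scores)))
```

Genes that are up in both the cluster and a reference sample (or down in both) add to the score, while discordant genes subtract from it, so the toy cluster scores highest against T_cell.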

Plot all identity scores per cluster-reference cell pairs

The code below performs the analysis using sorted human PBMC RNAseq data as the reference and summarizes the CIPR results for each cluster in scatter plots.

CIPR(input_dat = allmarkers,
     comp_method = "logfc_dot_product", 
     reference = "hsrnaseq", 
     plot_ind = TRUE,
     plot_top = FALSE, 
     global_results_obj = TRUE, 
     global_plot_obj = TRUE
     # Additional arguments are passed to ggplot2::theme() to change plotting parameters, e.g.:
     # , axis.text.x = element_text(color = "red")
     )

Plot identity scores for a select cluster

The ind_clu_plots object is created in the global environment so that users can visualize results for a desired cluster and adjust graphing parameters. ggplot2 functions can be added to individual plots to create annotations and other customizations.

library(ggplot2)
ind_clu_plots$cluster6 +
    theme(axis.text.y = element_text(color="red"),
          axis.text.x = element_text(color="blue")) +
    labs(fill="Reference")+
    ggtitle("Figure S4a. Automated cluster annotation results are shown for cluster 6") +
    annotate("text", label="2 sd range", x=10, y= 700, size=8, color = "steelblue")+
    annotate("text", label= "1 sd range", x=10, y=200, size=8, color ="orange2")+
  geom_rect(aes(xmin=94, xmax=99, ymin=1000, ymax=1300), fill=NA, size=3, color="red")

Plot top scoring reference subsets for each cluster

CIPR(input_dat = allmarkers,
     comp_method = "logfc_dot_product", 
     reference = "hsrnaseq", 
     plot_ind = FALSE,
     plot_top = TRUE, 
     global_results_obj = TRUE, 
     global_plot_obj = TRUE)

Tabulate CIPR results

CIPR results (both top 5 scoring reference types per cluster and the entire analysis) are saved as global objects (CIPR_top_results and CIPR_all_results respectively) to allow users to explore the outputs and generate specific plots and tables.

head(CIPR_top_results)
head(CIPR_all_results)

Standard all-genes correlation method

CIPR also implements a simple correlation approach in which overall correlation in gene expression is calculated for the pairs of unknown clusters and the reference samples (regardless of the differential expression status of the gene). This approach is conceptually similar to some other automated identity prediction pipelines such as SingleR and scMCA.
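
As a rough illustration of the all-genes correlation idea, the base R sketch below correlates the average expression of one toy cluster with each column of a toy reference using Spearman's rank correlation. All names and numbers are hypothetical, and this is a simplified stand-in rather than CIPR's actual implementation.

```r
# Toy average expression of one unknown cluster (named by gene).
cluster_avg <- c(GeneA = 9, GeneB = 1, GeneC = 4, GeneD = 2)

# Toy reference expression (genes x reference samples).
ref <- cbind(
  T_cell   = c(GeneA = 8, GeneB = 2, GeneC = 5, GeneD = 3),
  Monocyte = c(GeneA = 1, GeneB = 9, GeneC = 2, GeneD = 6)
)

# Use only genes present in both the cluster and the reference.
shared <- intersect(names(cluster_avg), rownames(ref))

# One correlation coefficient per reference sample, over all shared genes.
scores <- apply(ref[shared, , drop = FALSE], 2, function(x) {
  cor(cluster_avg[shared], x, method = "spearman")
})

print(scores)
```

Because every shared gene contributes regardless of whether it is differentially expressed, this approach needs the per-cluster average expression table (avgexp) rather than the marker table.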

Plot all identity scores per cluster-reference cell pairs

The code below performs the analysis using sorted human PBMC RNAseq data as the reference and summarizes the CIPR results for each cluster in scatter plots.

CIPR(input_dat = avgexp,
     comp_method = "all_genes_spearman", 
     reference = "hsrnaseq", 
     plot_ind = TRUE,
     plot_top = FALSE, 
     global_results_obj = TRUE, 
     global_plot_obj = TRUE)

Plot top scoring reference subsets for each cluster

CIPR(input_dat = avgexp,
     comp_method = "all_genes_spearman", 
     reference = "hsrnaseq", 
     plot_ind = FALSE,
     plot_top = TRUE, 
     global_results_obj = TRUE, 
     global_plot_obj = TRUE)

Tabulate CIPR results

CIPR results (both top 5 scoring reference types per cluster and the entire analysis) are saved as global objects (CIPR_top_results and CIPR_all_results respectively) to allow users to explore the outputs and generate specific plots and tables.

head(CIPR_top_results)
head(CIPR_all_results)

Limiting analysis to the select subsets of reference data

Sometimes excluding irrelevant reference cell types from the analysis can be helpful. Especially when the logFC comparison methods are used, removing irrelevant subsets may improve discrimination of closely related subsets, since the reference logFC values are calculated after subsetting the data frame. Filtering out reference subsets should not impact the results of the all-genes correlation methods, but it can make the graphical outputs easier to read.

The 3k PBMC dataset may not be the best example to demonstrate the benefits of reference dataset subsetting, but the code below serves as an example of this functionality.

CIPR(input_dat = allmarkers,
     comp_method = "logfc_dot_product", 
     reference = "hsrnaseq", 
     plot_ind = TRUE,
     plot_top = FALSE, 
     global_results_obj = TRUE, 
     global_plot_obj = TRUE,
     select_ref_subsets = c("CD4+ T cell", "CD8+ T cell", "Monocyte", "NK cell"))
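
The effect of subsetting on reference logFC values can be seen in a small base R sketch. The data and column names are hypothetical, and the sketch assumes that reference logFC is computed relative to the per-gene mean of whichever reference samples remain; it is meant only to show why the values change, not to reproduce CIPR's code.

```r
# Toy log-normalized reference expression (genes x reference samples).
ref <- cbind(
  CD4_T    = c(GeneA = 6, GeneB = 2),
  CD8_T    = c(GeneA = 4, GeneB = 2),
  Monocyte = c(GeneA = 2, GeneB = 8)
)

# Assumption: logFC = expression minus the per-gene mean of the columns in use.
logfc_vs_mean <- function(m) m - rowMeans(m)

full_logfc   <- logfc_vs_mean(ref)
subset_logfc <- logfc_vs_mean(ref[, c("CD4_T", "CD8_T")])

print(full_logfc)
print(subset_logfc)
```

With the full toy reference, GeneA has a logFC of 0 in CD8_T and GeneB mostly reflects the monocyte signal. After dropping the Monocyte column, GeneB becomes uninformative (logFC 0 in both T cell columns) and GeneA now separates CD4_T from CD8_T with opposite signs.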

Filtering out lowly variable genes

Genes that have a low expression variance across the reference data frame have weaker discriminatory potential. Thus, excluding these genes from the analysis can reduce the noise and improve the prediction scores, especially when using all-genes correlation based methods.

We implemented a variance filtering parameter, keep_top_var, which allows users to keep the top N% most variable reference genes in the analysis. For instance, by setting this argument to 10, CIPR can be instructed to use only the top 10% most variable genes in identity score calculations. In our experience (Ekiz et al., BMC Bioinformatics, 2020), limiting the analysis to highly variable genes does not significantly impact the identity scores of the top-scoring reference cell subsets, but it reduces the identity scores of intermediate- and low-scoring reference cells, leading to an improvement of z-scores. The "best" value for this parameter remains to be determined by the user in individual studies.

CIPR(input_dat = avgexp,
     comp_method = "all_genes_spearman", 
     reference = "hsrnaseq", 
     plot_ind = TRUE,
     plot_top = FALSE, 
     global_results_obj = TRUE, 
     global_plot_obj = TRUE,
     keep_top_var = 10)
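
For intuition about what a top-N% variance filter does, the base R sketch below ranks the genes of a toy reference by variance across samples and keeps the top fraction. The data are synthetic, and the ranking-based selection shown here is an assumption made for illustration; the source does not state how CIPR applies the cutoff internally.

```r
# Toy reference: 10 genes x 3 samples, where gene i has values 0, i, 2i,
# so its variance across samples grows with i.
n_genes <- 10
ref <- t(sapply(seq_len(n_genes), function(i) c(s1 = 0, s2 = i, s3 = 2 * i)))
rownames(ref) <- paste0("Gene", seq_len(n_genes))

# Per-gene variance across reference samples.
gene_var <- apply(ref, 1, var)

# Keep the top 20% most variable genes (hypothetical value for this toy example).
keep_top_var <- 20
n_keep <- ceiling(nrow(ref) * keep_top_var / 100)
top_genes <- names(sort(gene_var, decreasing = TRUE))[seq_len(n_keep)]

ref_filtered <- ref[top_genes, , drop = FALSE]
print(top_genes)
```

Only the filtered rows would then enter the correlation, which is why low-variance genes stop diluting the identity scores.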

Session Info

sessionInfo()


