Compiled date: r Sys.Date()
Last edited: 2020-03-26
License: r packageDescription("matchSCore2")[["License"]]
knitr::opts_chunk$set( collapse = TRUE, comment = "#>", error = FALSE, warning = FALSE, eval = TRUE, message = FALSE, fig.width = 10, fig.height = 8, fig.align = "center" ) options(width = 100)
This vignette describes how to use the r BiocStyle::Biocpkg("matchSCore2")
package for the comparison of single cell RNA-seq data across experiments.
The package allows a gene marker-based projection of single cells onto a reference sample and, thus, the identification of cell types in unknown cells.
A more detailed version of the method is at the bioRxiv paper: http://dx.doi.org/10.1101/630087 [@Mereu2019].
By using a reference dataset in which the cellular type of individual cells is defined and gene markers are computed, r BiocStyle::Biocpkg("matchSCore2")
trains a multinomial logistic model to classify unknown cells in new datasets from similar tissues.
The method consists of the following steps:
knitr::include_graphics("overview_package.png")
To install this package, start R and enter:
if (!requireNamespace("BiocManager", quietly = TRUE)) { install.packages("BiocManager") } BiocManager::install("matchSCore2") # the development version is available also on GitHub BiocManager::install("elimereu/matchSCore2")
Once installed, the package can be loaded and attached to your current workspace as follows:
library("matchSCore2")
In this vignette, we showcase the functionality of r BiocStyle::Biocpkg("matchSCore2")
using the PBMC+HEK293T data from our benchmarking of 13 protocols [@Mereu2019].
The list of all seurat objects, including UMAPs and tSNEs are provided here: https://www.dropbox.com/sh/wgu6kcioqkkva4g/AABHJaFsiqYZeYc-xv6qtg3ba?dl=0.
From the same link, you can also download the SingleCellExperiment
object from which you can extract the counts from each protocol by following the instructions below.
library(scater) url <- "https://www.dropbox.com/s/lrfoux7uxundcy1/sce.all_classified.technologies.RData?raw=true" download.file(url = url, destfile = "~/Downloads/sce.all_classified.technologies.RData", mode = "wb") load(file = "~/Downloads/sce.all_classified.technologies.RData") ## load the data
After you load it into R, if you check the colData(object)
there are three metadata, which are: nnet2
, ident
, and batch
.
nnet2
is the annotation from the MatchSCore2()
classifier. ident
is the Seurat clustering result. The clusters are manually annotated by looking at the expression of known gene markers.batch
is the sequencing protocol. colData(sce) ### give access to the metadata DataFrame table(colData(sce)$nnet2) ## Number of cells from each classified cell type (by matchSCore2) table(colData(sce)$ident) ## Number of cells from each Seurat cluster table(colData(sce)$batch) ## Number of cells from each protocol
You can extract counts, PCA and UMAP of the integrated dataset, by the following commands:
counts <- sce@assays$data$counts logcounts <- sce@assays$data$logcounts umap <- sce@reducedDims$UMAP pca <- sce@reducedDims$PCA ## if you are interested in a specific protocol (e.g. Chromium) chromium <- filter(sce, batch == "Chromium") logcounts <- chromium@assays$data@listData$logcounts
In case you are interested in working with the seurat objects, all datasets are in the data.list
list, whose elements represent the protocols.
Let us suppose Chromium is the reference and Smart-seq2 the test data we want to annotate.
The cell type specific markers from each protocol are also provided.
library(Seurat) url <- "https://www.dropbox.com/s/2ketqbof6tvnv07/data.list_seurat.obj_all_datasets.RData?raw=true" download.file( url = url, destfile = "~/Downloads/data.list_seurat.obj_all_datasets.RData", mode = "wb" ) load(file = "~/Downloads/data.list_seurat.obj_all_datasets.RData") ref <- data.list$Chromium nnet2 <- ref$nnet2 smartseq2 <- data.list$`Smart-Seq2`
To train the multinomial logistic model implemented in r BiocStyle::Biocpkg("matchSCore2")
, we use the function train_model
, which requires the following inputs:
scale.data
: A matrix of log-normalized or scaled gene expression values from the reference dataset (in the manuscript we used the scale.data slot in the seurat objects.
Make sure that the markers are all included in the scale.data).clus
: A named factor with reference identities (like in the "ident" slot in the Seurat object).gene_cl.ref
: A named list of markers.
Each element of the list contains cell type specific gene markers (Usually top100 ranked markers of each cell type).
If you have the output of FindAllMarkers
from Seurat, you could use the cut_markers function to get gene_cl.ref
by the function cut_markers
.library("matchSCore2") library("nnet") url <- "https://www.dropbox.com/s/wijdjhbm17fd7fp/markers_all_datasets.RData?raw=true" download.file(url = url, destfile = "~/Downloads/markers_all_datasets.RData", mode = "wb") load(file = "~/Downloads/markers_all_datasets.RData") ref.markers <- markers$Chromium gene_cl.ref <- cut_markers(levels(ref.markers$cluster), markers = ref.markers, ntop = 100 ) scaled <- ref@assays$RNA@scale.data
mod <- train_model( scale.data = scaled, clus = nnet2, gene_cl.ref = gene_cl.ref, prop = 0.5 )
Once the model is done, it can be used to predict the identity of unknown cells from another dataset.
Here, we will test the model in MARS-seq data.
With the function identity_map
, we assign the cell identity in the test data based on the highest probability predicted by the model.
scaled <- ScaleData(smartseq2, features = unlist(gene_cl.ref)) scaled <- scaled@assays$RNA@scale.data ## Cell projection out <- identity_map( scale.data = scaled, model = mod, gene_cl.ref = gene_cl.ref ) identity_heatmap(out)
knitr::include_graphics("identity_heatmaps.png")
From the heatmap, you can observe the probability of cells to be assigned to each cell identity. Cells assigned to a specific identity have the highest probability (magenta color) for that class and probability close to zero (white-blue colors) in other classes.
matchSCore2
and clustering annotationsr BiocStyle::Biocpkg("matchSCore2")
can be used to improve the clustering annotation, detecting subtle differences between cell types that are difficult to be detected by clustering (e.g. NK and CD8 T cells, CD4 and CD8 T cells, etc..).
smseq.clus <- smartseq2$ident col <- c("aquamarine", "orange", "green4", "blueviolet", "black", "maroon", "coral2", "deepskyblue3", "lightgray") summary_barplot(class.fac = out$ids, obs.fac = smseq.clus) + scale_fill_manual(values = col)
knitr::include_graphics("summary_barplots.png")
Here, you can see the distribution of the assigned cell identities per cluster. This is helpful to visualize the agreement between clustering and cell identity projection. Also, it highlights differences in cluster resolution (e.g. the cluster of monocytes contains both types CD14+ and FCGR3A+).
We can also measure the grade of matching across clusters by looking at the level of similarity of their cluster-specific gene markers, measured by the Jaccard Index. For example, you could use the top 100 ranked markers per cluster from the Chromium and MARS-Seq dataset.
## And the MARS-Seq as test test.markers <- markers$`Smart-Seq2` gene_cl.test <- cut_markers(levels(test.markers$cluster), markers = test.markers, ntop = 100) ## The matchSCore2 function computes the clustering comparison and produce the heatmap table with Jaccard Indexes for each group combination matchSCore2(gene_cl.ref = gene_cl.ref, gene_cl.obs = gene_cl.test, ylab = "Chromium", xlab = "Smart-seq2")
knitr::include_graphics("Cluster-sp_markers_comparison.png")
In the heatmap, we can observe the Jaccard similarity between cluster-specific markers from the two experiments. Notably, you can see how the jaccard indexes between markers reflect the cellular composition of clusters, in agreement with the previous plot.
While with the previous approach we can fastly transfer annotations from one dataset to another, sometimes we are interested in comparing gene measurements between cells coming from different experiments (e.g. WT vs KO, disease vs control, different protocols, etc..).
In the r BiocStyle::Biocpkg("matchSCore2")
package, we provide a mathematical framework to align datasets, by projecting them to the same coordinate space of an assigned reference dataset.
The method is based on the single value decomposition and allows a direct comparison of datasets, by finding the optimal linear transformation between data points.
By using the function align_run
two or more gene expression matrixes are compared and aligned to the same reference dataset.
In this example, we want to align 3 of the PBMC+HEK293T experiments sequenced with different protocols: Chromium, Smart-Seq2 and ddSEQ. In this case, the choice of the reference is not so important, we will choose the Chromium dataset. The method requires the clusters (more than 1 cell), they will be used to compute a centroid point at each dataset, useful for their comparison.
url <- "https://www.dropbox.com/s/lrfoux7uxundcy1/sce.all_classified.technologies.RData?dl=0" download.file(url = url, destfile = "~/Downloads/sce.all_classified.technologies.RData", mode = "wb") load(file = "~/Downloads/sce.all_classified.technologies.RData") ## load the data chromium <- scater::filter(sce, batch == "Chromium") #chromium <- sce[,sce$batch=="Chromium"] smartseq2 <- scater::filter(sce, batch == "Smart-Seq2") #smartseq2 <- sce[,sce$batch=="Smart-Seq2"] ddseq <- scater::filter(sce, batch == "ddSEQ") # chromium$cluster <- factor(chromium$nnet2) smartseq2$cluster <- factor(smartseq2$nnet2) ddseq$cluster <- factor(ddseq$nnet2) list_data <- list(Chromium = chromium, Smartseq2 = smartseq2, ddseq = ddseq) markers.dds <- markers$ddSEQ gene_cl.dds <- cut_markers(levels(markers.dds$cluster), markers = markers.dds, ntop = 100) markers_list <- unique(c(unlist(gene_cl.ref), unlist(gene_cl.test), unlist(gene_cl.dds))) out_align <- align_run(dataset_list = list_data, marker_list = markers_list, ref = "Chromium")
The output contains a SingleCellExperiment
object with the joint raw counts and the integrated matrix from the 3 datasets.
See the manual of the align_run
function for more details about the output.
Once data are integrated, you can use the normalized matrix to cluster your cells.
For example, if you want to use Seurat, you can provide that matrix in the @data
slot, as in the seurat3_run
function.
col2 <- c("maroon", "orange", "cyan") data <- seurat3_run( out_align = out_align, dims = 1:10, res = 0.2, col_anno = col, col_data = col2 ) DimPlot(data, group.by = "cluster", cols = col) + theme(legend.text = element_text(size = 14)) DimPlot(data, group.by = "dataset", cols = col2) + theme(legend.text = element_text(size = 14))
knitr::include_graphics("UMAP_celltype_after_integration.png")
knitr::include_graphics("UMAP_dataset_after_integration.png")
We can explore some of the markers we used for the integration with the Seurat FeaturePlot
function:
FeaturePlot(data, features = c( "CD79B", "IL7R", "CD3D", "NKG7", "GNLY", "CD14", "FCGR3A", "CTTN" ))
knitr::include_graphics("UMAP_markers_after_integration.png")
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.