`MatchSCore2` is an R package for comparing single-cell RNA-seq data across tools and experiments. The package performs a gene marker-based projection of single cells onto a reference sample, which allows cell types to be identified in unannotated cells. The method is described in more detail in the bioRxiv paper: http://dx.doi.org/10.1101/630087. The code to reproduce the downstream analyses of the paper is in the folder Benchmarking_of_scRNA-seq_protocols in this repository. The snakemake workflow and additional code used for the manuscript are available on GitHub: https://github.com/ati-lz/HCA_Benchmarking.
If you want to use our PBMC+HEK293T QCed data (UMI counts + `MatchSCore2` annotations) from all 13 protocols, you can download the data object from the following Dropbox link: https://www.dropbox.com/s/i8mwmyymchx8mn8/sce.all_classified.technologies.RData?dl=0.
It is a single `SingleCellExperiment` object from which you can extract the counts for each protocol with the following commands.
```{r,eval=FALSE}
library(scater)
load(file="sce.all_classified.technologies.RData") ## load the data
```
After you load it into R, `colData(sce)` contains three metadata columns: `nnet2`, `ident` and `batch`.
`nnet2` is the annotation from the `MatchSCore2` classifier. `ident` is the Seurat clustering result; the clusters were manually annotated by looking at known gene markers. `batch` is the protocol.
```{r,eval=FALSE}
colData(sce) ## Access the metadata DataFrame
table(colData(sce)$batch) ## Number of cells from each protocol
table(colData(sce)$nnet2) ## Number of cells from each classified cell type (by matchSCore2 classification)
table(colData(sce)$ident) ## Number of cells from each Seurat cluster
```
If you want to filter a specific technology, for example Quartz-Seq2:
```{r,eval=FALSE}
## New object with the cells from the Quartz-Seq2 protocol (standard column subsetting)
quartzseq2 <- sce[, colData(sce)$batch == "Quartz-Seq2"]
counts <- counts(quartzseq2) ## Quartz-Seq2 UMI counts
logcounts <- logcounts(quartzseq2) ## Quartz-Seq2 logcounts
meta.data <- as.data.frame(colData(quartzseq2)) ## Quartz-Seq2 meta.data dataframe
```
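Once the metadata is in a data frame, the classifier annotation (`nnet2`) can be compared with the cluster annotation (`ident`). The sketch below uses a small hypothetical data frame in place of the real `meta.data`; the cell type labels and values are invented for illustration only.

```r
# Hypothetical stand-in for meta.data: same three columns, invented values.
meta <- data.frame(
  nnet2 = c("B cells", "B cells", "CD4 T cells", "HEK cells", "CD4 T cells", "HEK cells"),
  ident = c("B cells", "B cells", "CD4 T cells", "HEK cells", "B cells", "HEK cells"),
  batch = c("Quartz-Seq2", "Chromium", "Quartz-Seq2", "Chromium", "Chromium", "Quartz-Seq2"),
  stringsAsFactors = FALSE
)

# Cross-tabulate classifier labels against cluster labels.
agreement <- table(classifier = meta$nnet2, cluster = meta$ident)
print(agreement)

# Fraction of cells where both annotations agree, overall and per protocol.
agree_rate <- mean(meta$nnet2 == meta$ident)
by_batch <- tapply(meta$nnet2 == meta$ident, meta$batch, mean)
print(agree_rate)
print(by_batch)
```

With the real object, `meta.data` from the commands above would take the place of `meta`.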
`MatchSCore2` is a new version of the older matchSCore package. This is a development version running on R (>= 3.5.1). The package makes use of the nnet, Matrix, splatter, MixSim and ggplot2 libraries. Other libraries are also used for clustering and marker identification (e.g. Seurat).

`MatchSCore2` can be installed with devtools:
```{r,eval=FALSE}
library("devtools")
install_github("elimereu/matchSCore2")
```

## Functionalities of `MatchSCore2`
`MatchSCore2` has the following main functions:
1. Annotation of cells through a supervised classification model based on a reference dataset. (**Cell annotations**)
2. Weighted GO pathway enrichment of cluster-specific markers based on the Kolmogorov-Smirnov test of gene p-values. (**GOannotation function**)
3. Measure and visualize the matching between two clusterings. (**Cluster matching**)
4. Track the accuracy trend of a tool in clustering and marker identification compared with the optimal solution provided by a simulated data set. In this case, matchSCore2 works in combination with the Splatter package - https://github.com/Oshlack/splatter/blob/master/vignettes/splatter.Rmd - (**Benchmarking**)
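
To give a feel for the Kolmogorov-Smirnov idea in item 2, here is a minimal sketch in base R. It is not the package's `GOannotation` code: the gene names, the pathway membership and the simulated p-values are all hypothetical, and the package's weighting scheme is not reproduced. The sketch only shows how a two-sample KS test can compare the p-value distribution of genes inside a pathway with that of the remaining genes.

```r
set.seed(42)

# Hypothetical marker p-values for 500 genes (uniform, i.e. no signal).
n_genes <- 500
p_values <- runif(n_genes)
names(p_values) <- paste0("gene", seq_len(n_genes))

# Hypothetical pathway of 40 genes whose p-values are skewed towards zero.
pathway_genes <- paste0("gene", 1:40)
p_values[pathway_genes] <- rbeta(40, shape1 = 0.3, shape2 = 1)

# Two-sample KS test: pathway genes versus all other genes.
in_pathway <- names(p_values) %in% pathway_genes
ks <- ks.test(p_values[in_pathway], p_values[!in_pathway])
print(ks$p.value)
```

A small KS p-value suggests that the pathway genes have a different p-value distribution from the background.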
## Usage
To predict cell identities `MatchSCore2` requires:
1. `scale.data`: A matrix of log-normalized and scaled gene expression values from the reference dataset (like the matrix scale.data in Seurat object).
2. `clus`: A named factor with reference identities (like in the @ident slot in Seurat object).
3. `gene_cl.ref`: A named list of markers. Each element of the list contains cell type-specific gene markers (usually the top 100 ranked markers of each cell type). An example of `gene_cl.ref` can be found in the data folder of this repository. If you have the output of `FindAllMarkers` from Seurat, you can use the `cut_markers` function to get `gene_cl.ref` with the command `cut_markers(levels(output.seu$cluster),output.seu,ntop=100)`.
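
To illustrate the shape of a named marker list like `gene_cl.ref`, the sketch below builds one from a small hypothetical marker table in base R. It is not the implementation of `cut_markers`: the helper `top_markers`, the toy genes and the choice to rank by adjusted p-value are assumptions made for this example.

```r
# Hypothetical marker table with a few of the columns FindAllMarkers returns.
markers <- data.frame(
  cluster = factor(c("B", "B", "B", "T", "T", "T")),
  gene = c("CD79A", "MS4A1", "CD74", "CD3D", "CD3E", "IL7R"),
  p_val_adj = c(1e-50, 1e-40, 1e-10, 1e-60, 1e-30, 1e-20),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the ntop best-ranked genes of each cluster.
top_markers <- function(markers, ntop) {
  ordered <- markers[order(markers$p_val_adj), ]        # best markers first
  by_cluster <- split(ordered$gene, ordered$cluster)    # one vector per cluster
  lapply(by_cluster, head, n = ntop)                    # truncate each vector
}

gene_cl.example <- top_markers(markers, ntop = 2)
str(gene_cl.example)
```

The result is a list with one character vector of gene names per cluster, named by cluster.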
![Scheme](inst/extdata/matchSCore2_Overview.png)
```{r,eval=FALSE}
library(matchSCore2)
library(nnet)
library(Matrix)
### Training of the model
mod <- train_model(scale.data = ref@scale.data,clus = ref@ident,gene_cl.ref = gene_cl.ref,prop = 0.75)
## Cell projection
out <- identity_map(scale.data = obs@scale.data,model = mod,gene_cl.ref)
### cell identities
ids <- out$ids
## Each cell is assigned a probability for every possible identity class
probabilities <- out$fit.prob
```
`MatchSCore2` can be used to improve the clustering annotation, detecting subtle differences between cell types.
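
The train-then-project workflow can be pictured with a generic supervised classifier. The sketch below is not the code behind `train_model` or `identity_map`. It assumes a multinomial model fitted with `nnet::multinom` (the package lists nnet as a dependency, but the exact model is not stated here) and assumes that `prop = 0.75` refers to the fraction of cells used for training. The built-in `iris` data stands in for a reference expression matrix.

```r
library(nnet)

set.seed(1)

# Toy stand-in for a reference dataset: rows play the role of cells,
# numeric columns the role of marker genes, Species the role of cell identity.
n <- nrow(iris)
train_idx <- sample(n, size = round(0.75 * n))  # assumed meaning of prop = 0.75
train <- iris[train_idx, ]
test <- iris[-train_idx, ]

# Fit a multinomial model on the training cells.
fit <- multinom(Species ~ ., data = train, trace = FALSE)

# Project held-out cells: predicted identity and per-class probabilities.
pred_class <- predict(fit, newdata = test, type = "class")
pred_prob <- predict(fit, newdata = test, type = "probs")

# Fraction of held-out cells assigned to their true class.
accuracy <- mean(pred_class == test$Species)
print(accuracy)
head(pred_prob)
```

Here `pred_class` and `pred_prob` play roles loosely analogous to `out$ids` and `out$fit.prob`.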
We can also measure the degree of matching between the clusters of two datasets. For example, you could use the top 100 ranked markers we obtained (using Seurat) from the Chromium and MARS-Seq datasets. You can load the files directly from the data folder in this repository.
```{r,eval=TRUE}
load(file="data/gene_cl.ref_Chromium.RData")
load(file="data/gene_cl_MARS-Seq.RData")
ms <- matchSCore2(gene_cl.ref = gene_cl.ref,gene_cl.obs = gene_cl,ylab = "Chromium",xlab = "MARS-Seq")
ms$ggplot
```
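
One simple way to picture a matching score between two marker lists is the Jaccard index (shared genes divided by all genes in either set). The sketch below uses it purely as an assumption for illustration; this README does not state which similarity measure `matchSCore2` computes, and the toy marker lists are hypothetical.

```r
# Jaccard index between two gene sets.
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# Hypothetical reference and observed marker lists.
ref <- list(B = c("CD79A", "MS4A1", "CD74"), T = c("CD3D", "CD3E", "IL7R"))
obs <- list(c1 = c("CD3D", "CD3E", "CCR7"), c2 = c("CD79A", "MS4A1", "CD79B"))

# Matrix of pairwise scores: rows are reference groups, columns observed clusters.
overlap <- sapply(obs, function(o) sapply(ref, function(r) jaccard(r, o)))
print(overlap)

# Best-matching reference group for each observed cluster.
best_match <- rownames(overlap)[apply(overlap, 2, which.max)]
names(best_match) <- colnames(overlap)
print(best_match)
```

Each observed cluster is assigned the reference group with which it shares the largest fraction of markers.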
Please note that the matchSCore2 project is released with a Contributor Code of Conduct.
By contributing to this project, you agree to abide by its terms.
MIT © Elisabetta Mereu