sctree: Tree based analysis of single rna-seq clusters

title: "scTree: An R package to generate antibody-compatible classifiers from single-cell sequencing data" tags: - R - Bioinformatics - Single cell - Flow Cytometry authors: - name: J. Sebastian Paez orcid: 0000-0002-0065-1474 affiliation: "1, 2" - name: Michael K. Wendt orcid: 0000-0002-3665-7413 affiliation: "1, 2" - name: Nadia Atallah Lanman orcid: 0000-0002-1819-7070 affiliation: "1, 3" affiliations: - name: Purdue University, Center for Cancer Research index: 1 - name: Purdue University, Department of Medicinal Chemistry and Molecular Pharmacology index: 2 - name: Purdue University, Department of Comparative Pathobiology index: 3 date: 14 January 2020 bibliography: bibliography.bib

Summary

Single-cell RNA sequencing (scRNA-seq) is now a commonly used technique to measure the transcriptome of populations of cells. Clustering heterogeneous cells based on these transcriptomes enables identification of cell populations [@Trapnell2014;@Butler2018]. There are multiple methods available to identify "marker" genes that differ between these populations [@Butler2018;@Love2014;@Robinson2009]. However, there are usually too many genes in these lists to directly suggest an experimental follow-up strategy for selecting them from a bulk population (e.g. via FACS [@RN1]). Here we present scTree, a tool that aims to provide biologists using the R programming language and scRNA-seq analysis programs a minimal set of genes that can be used in downstream experiments. The package is free, open source and available though GitHub at github.com/jspaezp/sctree

Implementation and results

The underlying model behind scTree is a combination of random forest for variable selection and a classification tree; having this model as a classifier relies on the fact that classification trees are analogous to many approaches in biology such as the gating strategy employed in flow cytometry or Fluorescence assisted cell sorting (FACS) experiments. In flow cytometry and FACS experiments, populations are identified and sorted based on expression levels of distinct markers that entail the identity or state of the chosen population. Usually such experiments use only relative levels of marker expression, using terms such as "High" and "Low" [@Coquery2012;@Robertson2005].

In a similar manner, scTree produces accurate, biologically relevant, and easily interpretable results, which can be used for subsequent subpopulation sorting and biological validation by fitting shallow decision trees analogous to FACS sorting strategies and is able to output these classifiers in a format easily interpretable in a wet-lab setting.

The method to calculate variable importances based on random forests has been previously described, and has been implemented in R by the ranger package [@Altmann2010;@Janitza2018;@Wright2017]. The suggestion of gating strategies is achieved by fitting a classification tree using the implementation provided by the partykit R package [@Zeileis2015].

In order to benchmark the quality of markers, we utilized a recall-based strategy. Briefly, each dataset was split randomly into two sections, a training set with 80% of the cells and a testing set consisting of the 20% remaining. A classifier was trained by selecting the top 5 markers suggested for each cluster by either scTree (Altman method) or by two of the most commonly used marker gene detection methods for scRNA-seq data: t-tests or wilcoxon-tests (as implemented by Seurat v3.0.1).

These classifiers were then used to predict the identity of the testing set and the quality was assesed by comparing the recall, accuracy and precision of the prediction. We were concerned that the forest-based markers would artificially favor scTree, therefore we utilized several classifiers for the markers derived from either scTree, t-tests or wilcoxon-tests. As shown in Figures 1 and 2, bias was not observed, and regardless of the final classification model, the features selected by using scTree provide a comparable accuracy, precision and recall to those acquired using traditional differential expression methods. It is important to note that many of the wrongly assigned labels happen between cell populations that are hard to define in-vivo and are not resolved clusters in the UMAP dimensional reduction, such as macrophage subtypes and between NK and Tc cells.

**Depiction of the classification performance achieved in the Jurkat:293 50:50 dataset.** A number of machine learning algorithms were tested to ensure that scTree performed as well as traditional marker identification approaches, regardless of the classifier used.

As mentioned previously, a main focus in the development of scTree was the biological interepretability of the models. Therefore the models can be expressed as a Garnett file, as shown in Code Section 1, as specified originally in the Garnett manuscript by the Trapell lab [@Pliner2019]. Visualizations are designed to resemble flow cytometry results, as show in Figure 3 and connections with several antibody vendors are provided to query the availability of probes for the genes found to be usefull for classification.

as.garnett(tree_model, rules_keep = "^NK")
# > NK Cells    (n = 88)
# expressed above: CST7 1.636, TYROBP 2.451

Code Section 1. Suggested classification scheme for NK cell cluster of the PBMC dataset. The data depicts how the cluster corresponding to NK cells can be predominantly identified as GNLY High/GZMB High.

**Scatterplot showing the progressive gating that would be used to classify node 11 in the 3K PBMC dataset.** Filtering in each pane is done on the gene presented on the X-axis of the plot and cells kept during that filtering step are highlighted

Despite scTree being originally developed for single cell sequencing, we recognize it could also be used for other high content single-cell workflows, such as CyTOF or data driven multiple-channel flow cytometry.

The provided interface with antibody databases, further enhances the utility of scTree by fulfilling the need to interface in silico models and data with in vitro followup. Therefore, a package interface with common antibody vendors and search engines are provided. This interface is exemplified in Code section 2.

require(sctree)
head(query_biocompare_antibodies("CD11b"))
#>                                   title            vendor
#> 1         Anti-CD11b antibody [EPR1344]             Abcam
#> 2             Anti-CD11b/ITGAM Antibody         BosterBio
#> 3    Anti-CD11b/ITGAM Picoband Antibody         BosterBio
#> 4 Anti-CD11b Rabbit Monoclonal Antibody         BosterBio
#> 5  Monoclonal Antibody to CD11b (human)   MyBioSource.com
#> 6                Anti-CD11b (Mouse) mAb MBL International
#>                                                         specification
#> 1 Applications: WB, IHC-p; Reactivity: Hu, Ms, Rt, Pg, RhMk; Conjugat
#> 2             Applications: Western Blot (WB); Reactivity: Hu, Ms, Rt
#> 3   Applications: WB, FCM, ICC, IHC-fr, IHC-p; Reactivity: Hu, Ms, Rt
#> 4                       Applications: WB, IF, IHC; Reactivity: Hu, Ms
#> 5              Applications: Flow Cytometry (FCM); Reactivity: Human 
#> 6              Applications: Flow Cytometry (FCM); Reactivity: Mouse

Code Section 2. Example of the automated antibody query interface

Additional usage cases and up-to-date code snippets of the common functions can be found in the package documentation website (jspaezp.github.io/sctree/) and the readme file hosted in the github repository (github.com/jspaezp/sctree).

Methods

The filtered raw counts for each dataset were downloaded from the 10x website single cell expression datasets [@tenxgenomics] and were processed by the standard Seurat workflow, as described in the package tutorial [@satijalab]. This process was carried out for the following datasets:

3k PBMC, Peripheral Blood Mononuclear Cells (PBMC)
50\%:50\% Jurkat:293T Cell Mixture, originally published by Wan, H. et al. in 2017

Briefly, each dataset was split into a testing and a training set. For each cluster, each of the different marker identification methodologies was used and the top five markers were selected. These five markers were used to train a prediction model on the training set and the predicitons were carried out on the testing set. These predictions were compared with the assigned cluster identity and performance metrics were calculated.

$$ precision = \frac{True\ Positives}{True\ Positives + False\ Positives} $$

$$ recall = \frac{True\ Positives}{True\ Positives + False\ Negatives} $$

$$ accuracy = \frac{True\ Positives + True\ Negatives}{Total} $$

Acknowledgments

This study was supported by the Computational Genomics Shared Resource at the Purdue University Center for Cancer Research (NIH grant P30 433 CA023168), IU Simon Cancer Center (NIH grant P30 CA082709), and the Walther Cancer Foundation.

References

jspaezp/sctree documentation built on April 30, 2020, 10:36 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

jspaezp/sctree
Tree based analysis of single rna-seq clusters

vignettes/paper/paper.md
In jspaezp/sctree: Tree based analysis of single rna-seq clusters

Summary

Implementation and results

Example Output from the package

Predictor generation

Antibody querying interface

Methods

Testing dataset processing

Description of the benchmarking process

Formulas defining the prediction quality

Acknowledgments

References

R Package Documentation

Browse R Packages

We want your feedback!

jspaezp/sctree Tree based analysis of single rna-seq clusters

vignettes/paper/paper.md In jspaezp/sctree: Tree based analysis of single rna-seq clusters

Summary

Implementation and results

Example Output from the package

Predictor generation

Antibody querying interface

Methods

Testing dataset processing

Description of the benchmarking process

Formulas defining the prediction quality

Acknowledgments

References

R Package Documentation

Browse R Packages

We want your feedback!

jspaezp/sctree
Tree based analysis of single rna-seq clusters

vignettes/paper/paper.md
In jspaezp/sctree: Tree based analysis of single rna-seq clusters