scSubset: an R package to evaluate optimal cell count in single cell transcriptomics
shiny (>= 1.5.0), shinythemes (>= 1.1.2), Seurat (>= 4.0.0), DT (>= 0.17), shinycssloaders (>= 0.3), shinydashboard (>= 0.7.1), shinyjs (>= 2.0.0), shinybusy (>= 0.2.0), aricode (>= 0.1.2), ggplot2 (>= 3.3.1), reshape2 (>= 1.4.4), UpSetR (>= 1.4.0), hdf5r (>= 1.3.2), tidyverse (>= 1.3.0), metap (>= 1.3), shinyWidgets (>= 0.5.4), cowplot (>= 1.0.0), patchwork (>= 1.0.0), shinyalert (>= 2.0.0), multtest (>= 2.42.0), stringr (>= 1.4.0), MAST (>= 1.16.0)
library("devtools")
install_github("rzaied/scSubset")
# Run the application
library(scSubset)
scSubsetGo()
scSubset is designed to help users identify a sufficient number of cells for their scRNA-seq experiments. The package interactively interrogates deposited single-cell datasets and down-samples them into smaller subsets containing 20%, 40%, 60% and 80% of the parent dataset. The clustering projection of each subset is compared to that of the reference using the adjusted Rand index (ARI) and normalized mutual information (NMI) scores. The degree of overlap of marker genes (MGs), differentially expressed genes (DEGs), and conserved marker genes (CMGs) between each subset and the reference dataset is also computed.
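The comparison step can be sketched in base R. This is an illustration only, not scSubset's implementation: the cluster labels are made up, `adjusted_rand_index` is a hypothetical helper, and whether scSubset draws its subsets exactly this way is an assumption. The dependency list suggests the package relies on `aricode` for the ARI and NMI scores in practice.

```r
# Hypothetical helper: adjusted Rand index between two label vectors,
# computed with the pair-counting formula on their contingency table.
adjusted_rand_index <- function(x, y) {
  stopifnot(length(x) == length(y))
  tab <- table(x, y)
  pairs_both <- sum(choose(as.vector(tab), 2))  # pairs grouped together in both
  pairs_x <- sum(choose(rowSums(tab), 2))       # pairs grouped together in x
  pairs_y <- sum(choose(colSums(tab), 2))       # pairs grouped together in y
  expected <- pairs_x * pairs_y / choose(length(x), 2)
  (pairs_both - expected) / ((pairs_x + pairs_y) / 2 - expected)
}

# Toy reference: 100 cells assigned to 4 clusters (made-up labels).
set.seed(42)
cells <- paste0("cell_", seq_len(100))
reference <- setNames(rep(1:4, each = 25), cells)

# Down-sample to 20%, 40%, 60% and 80% of the cells (assumed: simple random sampling).
fractions <- c(0.2, 0.4, 0.6, 0.8)
subsets <- lapply(fractions, function(f) sample(cells, size = round(f * length(cells))))

# Stand-in for re-clustering each subset: the labels are only renamed,
# so the partition is unchanged and the ARI against the reference is 1.
ari <- sapply(subsets, function(ids) {
  reclustered <- letters[reference[ids]]
  adjusted_rand_index(reference[ids], reclustered)
})
```

With real data, `reclustered` would hold the cluster assignments obtained by clustering the subset on its own, and the score would drop below 1 as the subset loses resolution.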
Use case 1: Bob is an early-career researcher who wants to propose a new study based on scRNA-seq lung data for his fellowship. Single-cell experiments are expensive and Bob’s budget is limited. He is also required to provide details of the number of biological replicates and the sequencing costs. Bob decides to use a publicly available scRNA-seq dataset to check the lowest number of cells he can sequence while still retaining the population of interest. He applies scSubset and finds that sequencing just 50% of the cells would be sufficient. This allows Bob to increase the number of replicates to obtain more robust results.
Use case 2: Alice is working on clinical samples from a large cohort study. Before she can proceed with sequencing over 100 patient samples, Alice completes a preliminary study. While the results look promising, the full experiment is turning out to cost much more than anticipated, so Alice wonders whether she can reduce the number of sequenced cells. She tries scSubset to check whether she can obtain the same biological insight with fewer cells. Sadly, it turns out that in this case Alice would still need to use 100% of the cells.
A dataset of 10K peripheral blood mononuclear cells (PBMCs) from a healthy donor, obtained from 10X Genomics, was used for this example. At a resolution of 0.3, scSubset analysis shows that using 60% of the dataset reduces the adjusted Rand index (ARI) and normalized mutual information (NMI) scores from 1.0 to about 0.8 and 0.85, respectively.
The top 10 marker genes (MGs) from each cluster in the full dataset were compared with the MGs of each subset, and an UpSet plot was used to show the degree of overlap. In this dataset, 70 of the 150 MGs from the reference dataset were resolved across all subsets. Twenty MGs were unique to the reference dataset and another 20 were shared only among the 40%, 60% and 80% subsets and the full dataset.
The 60% and 80% subsets share the same number of MGs with the reference. However, the identities of the shared MGs differ. scSubset provides summary statistics that allow users to identify subsets that sufficiently resolve genes of biological interest.
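The distinction between the number of shared markers and their identities can be illustrated with plain set operations. The marker sets below are made up for illustration and are not taken from the PBMC analysis; the variable names are hypothetical.

```r
# Made-up marker sets; real sets would come from the marker tables
# of the reference dataset and of each subset.
reference_markers <- c("CD3E", "CD14", "MS4A1", "NKG7", "LYZ")
subset_markers <- list(
  "60%" = c("CD3E", "CD14", "MS4A1", "NKG7", "GNLY"),
  "80%" = c("CD3E", "CD14", "MS4A1", "LYZ", "PPBP")
)

# Markers shared with the reference, and the percent overlap per subset.
shared <- lapply(subset_markers, intersect, reference_markers)
percent_overlap <- sapply(shared, function(g) 100 * length(g) / length(reference_markers))

# Reference markers that each subset fails to recover.
missed <- lapply(subset_markers, function(g) setdiff(reference_markers, g))
```

In this toy case both subsets recover four of the five reference markers, yet each misses a different gene, so the choice between them depends on which genes matter for the study.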
Considering the slight improvement in ARI/NMI scores when increasing the dataset size from 60% to 80%, and the percent overlap of MGs, we reasoned that 60% coverage would have been sufficient for the 10K PBMC dataset. Such an allocation could have saved ~£1700 in sequencing costs. scSubset estimates the sequencing cost from user input and displays it in a summary table, as shown below:
Number of cells | Number of clusters | % Overlapping markers | ARI | NMI | Sequencing cost (£)
------------ | ------------- | ------------- | ------------- | ------------- | -------------
2102 (20%) | 9 | 60.00 | 0.602 | 0.701 | 840.8
4204 (40%) | 11 | 73.33 | 0.790 | 0.825 | 1,681.6
6306 (60%) | 12 | 80.00 | 0.799 | 0.857 | 2,522.4
8408 (80%) | 14 | 80.00 | 0.919 | 0.916 | 3,363.2
10510 | 15 | 100.00 | 1.000 | 1.000 | 4,204.0
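The cost figures can be checked with a few lines of R. The values are copied from the summary table; the per-cell price is inferred from them rather than stated by the authors, and the variable names are illustrative.

```r
# Cell counts and sequencing costs copied from the summary table.
cells <- c("20%" = 2102, "40%" = 4204, "60%" = 6306, "80%" = 8408, "100%" = 10510)
cost  <- c(840.8, 1681.6, 2522.4, 3363.2, 4204.0)

# The tabulated costs are consistent with a flat price of 0.40 per cell.
cost_per_cell <- unique(round(cost / cells, 2))

# Saving from sequencing 60% of the cells instead of the full dataset.
saving <- cost[5] - cost[3]
```

The difference between the full dataset and the 60% subset is 1,681.6, which matches the ~£1700 saving quoted in the text.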
1) Subsets resulting in a high marginal increase of the NMI/ARI scores.
2) Subsets whose identified conserved marker genes and/or differentially expressed genes have a high degree of overlap with the full dataset.
3) Subsets that can sufficiently resolve genes of specific biological interest in a given dataset.
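One way to look at the marginal increase in NMI/ARI scores is to take the difference between consecutive subset sizes. The sketch below uses the scores from the PBMC summary table; it is a reading aid, not a function provided by scSubset.

```r
# ARI and NMI values copied from the summary table (20% to 100% of cells).
ari <- c("20%" = 0.602, "40%" = 0.790, "60%" = 0.799, "80%" = 0.919, "100%" = 1.000)
nmi <- c("20%" = 0.701, "40%" = 0.825, "60%" = 0.857, "80%" = 0.916, "100%" = 1.000)

# Gain in each score when moving up to the next subset size.
marginal_gain <- rbind(ARI = diff(ari), NMI = diff(nmi))
print(round(marginal_gain, 3))
```

Each column shows how much the scores improve when stepping up to that subset size, which can be weighed against the extra sequencing cost of the step.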
Users can upload single or paired datasets. For each dataset, either an .h5 file or the matrix.mtx, genes.tsv/features.tsv and barcodes.tsv files output by a Cell Ranger run are accepted. The pattern that identifies mitochondrial genes should be specified, e.g. "^MT-" for human datasets and "^mt-" for mouse datasets. The desired number of genes to resolve from the reference dataset (default: top 10) and the clustering resolution (default: 0.5) should also be selected. For a single dataset of 10K cells, the analysis can take ~20 minutes; the same is true for an integrated dataset of 10K cells. Computing conserved marker genes for integrated datasets is optional and can add roughly an extra hour to the analysis.
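The mitochondrial pattern is a regular expression matched against gene names, so it is worth checking against your own feature list before starting a run. The gene names below are a made-up example, and how scSubset applies the pattern internally is not shown here.

```r
# Example feature names (a mix of human-style and mouse-style symbols).
genes <- c("MT-CO1", "MT-ND1", "ACTB", "MTOR", "mt-Co1", "mt-Nd1", "Actb")

# "^MT-" anchors the match at the start of the name and is case sensitive,
# so it selects human mitochondrial genes but not MTOR or mouse symbols.
human_mito <- grep("^MT-", genes, value = TRUE)
mouse_mito <- grep("^mt-", genes, value = TRUE)
```

Because the match is case sensitive, using the human pattern on a mouse dataset would find no mitochondrial genes at all.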
This project is licensed under the MIT License.