'scrnabench' provides tools for the creation of reference and metamorphic datasets for benchmarking of scRNA-seq data analysis methods. This R package focuses on metamorphic stability of cluster analysis and data integration tools.
BiocManager is required to install scrnabench. Execute the following commands:
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("batchelor")
If you do not have devtools installed, run this command:
install.packages("devtools")
In the R terminal, execute the following command to install the package:
devtools::install_github("NWhitener/scrnabench")
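
Before running the installers, it can help to check which prerequisites are already present. The sketch below is one way to do that in base R; `missing_packages()` is a hypothetical helper written for this example and is not part of scrnabench.

```r
# Hypothetical helper (not part of scrnabench): return the packages in `pkgs`
# whose namespaces cannot be loaded from the current R library.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Prerequisites named in the installation steps above
to_install <- missing_packages(c("BiocManager", "devtools"))
if (length(to_install) > 0) {
  message("Still needed: ", paste(to_install, collapse = ", "))
} else {
  message("BiocManager and devtools are already installed.")
}
```

Only the packages reported as missing then need an `install.packages()` call.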
To use the package:
library(scrnabench)
To download the raw data, use the download_data() function, which downloads 48 raw reference datasets from Zenodo as a single R data file called 'gene_counts.RData'. To choose where the file is saved, provide a path to the desired location, for example, path = "/Users/Downloads". The function downloads the raw datasets and automatically loads them into scrnabench, so it only needs to be run once.
datasets = download_data(path = "/Users/Downloads")
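
Because the download only needs to happen once, a script can check for the file before deciding what to do. This is a minimal sketch that assumes 'gene_counts.RData' is written directly into the folder passed as `path`; `needs_download()` is a hypothetical helper, not a scrnabench function.

```r
# Hypothetical helper: TRUE if 'gene_counts.RData' is not yet present in `dir`.
# Assumption: the file is saved directly inside the folder given as `path`.
needs_download <- function(dir) {
  !file.exists(file.path(dir, "gene_counts.RData"))
}

data_dir <- "/Users/Downloads"
if (needs_download(data_dir)) {
  message("Raw data not found; call download_data(path = data_dir).")
} else {
  message("Raw data already present; call load_data(demo = FALSE, path = data_dir).")
}
```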
Below are some of the solutions to common issues that may arise when installing or using the package.
This issue may occur during the installation. To fix this issue, execute the following commands:
remotes::install_github('satijalab/seurat-wrappers')
devtools::install_github("NWhitener/scrnabench")
This issue may occur when you call the GitHub API more than 5,000 times within a 60-minute window, even if you are authenticated. To fix this issue, see: post.
This issue may occur when R does not have a large enough max vector heap size. To fix this issue, execute the following commands:
library(usethis)
usethis::edit_r_environ()
This will open the R environment file. Add the following line to request a larger memory limit (for example, 100 Gb), then restart your R session.
R_MAX_VSIZE=100Gb
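
After restarting, you may want to confirm that the new session picked up the setting. The sketch below reads the variable with base R; `describe_max_vsize()` is a hypothetical helper introduced for this example.

```r
# Hypothetical helper: report the R_MAX_VSIZE value visible to this session.
describe_max_vsize <- function() {
  value <- Sys.getenv("R_MAX_VSIZE", unset = "")
  if (nzchar(value)) {
    paste("R_MAX_VSIZE is set to", value)
  } else {
    "R_MAX_VSIZE is not set; R is using its default limit."
  }
}

message(describe_max_vsize())
```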
If you have already downloaded the raw data to a directory on your local computer or would like to use the demo dataset, call the load_data() function. Provide the path of the downloaded data file. This will load the dataset for benchmarking. This function should be used anytime the data needs to be loaded.
#Load the demo dataset
datasets = load_data(demo = TRUE, path = "/Users/Downloads")
#Load the full dataset
datasets = load_data(demo = FALSE, path = "/Users/Downloads")
The demo file is smaller than the full dataset, comprising 2 raw reference datasets, and can be used for rapid testing and experimentation.
Below we have provided sample workflows to demonstrate the functionality of the scrnabench package.
The clustering workflow performs the baseline cluster analysis pipeline. The following steps are performed: filtering, annotation, data transformation, Highly Variable Gene (HVG) selection, scaling, dimensionality reduction, and clustering. The package supports 2 baseline clustering methods, Kmeans clustering and Seurat's Phenograph clustering.
For example, to cluster the demo datasets into 10 clusters using the Kmeans algorithm, the following commands can be used:
#Clustering Workflow
datasets = load_data(demo = TRUE, path = "/Users/Downloads")
dataList = extract_datasets(datasets)
dataList = run_clustering_workflow(dataList, method = 'kmeans', transformationType = 'log', seed = 1, numberClusters = 10)
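
Kmeans starts from randomly chosen centers, which is why a seed is supplied. The sketch below uses base R's `kmeans()` on simulated data to show that a fixed seed makes the result repeatable. It assumes the `seed` argument of the workflow serves the same purpose; `run_kmeans()` is a hypothetical wrapper, not a scrnabench function.

```r
# Simulated data: 100 points in 2 dimensions
set.seed(42)
x <- matrix(rnorm(200), ncol = 2)

# Hypothetical wrapper: fix the random seed, then run base R kmeans
run_kmeans <- function(x, k, seed) {
  set.seed(seed)
  kmeans(x, centers = k)$cluster
}

first_run  <- run_kmeans(x, k = 3, seed = 1)
second_run <- run_kmeans(x, k = 3, seed = 1)

# The same seed gives the same cluster assignments
same_result <- identical(first_run, second_run)
print(same_result)
```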
To cluster the demo datasets using Seurat's Phenograph, do the following:
#Clustering Workflow
datasets = load_data(demo = TRUE, path = "/Users/Downloads")
dataList = extract_datasets(datasets)
dataList = run_clustering_workflow(dataList, method = 'seurat', transformationType = 'log', seed = 1)
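
To make the listed steps concrete, here is a minimal base R sketch of a comparable pipeline on a small simulated count matrix. It is not the package's implementation: the simulated data, the thresholds (3 cells, 50 HVGs, 10 principal components, 2 clusters), and the use of PCA for dimensionality reduction are all assumptions chosen for illustration, and the annotation step is omitted.

```r
set.seed(1)
n_genes <- 200
n_cells <- 60

# Simulated raw counts (genes x cells); cells 31-60 express genes 1-20 more highly
counts <- matrix(rpois(n_genes * n_cells, lambda = 2), nrow = n_genes,
                 dimnames = list(paste0("gene", seq_len(n_genes)),
                                 paste0("cell", seq_len(n_cells))))
counts[1:20, 31:60] <- counts[1:20, 31:60] + rpois(20 * 30, lambda = 10)

# Filtering: keep genes detected in at least 3 cells
keep <- rowSums(counts > 0) >= 3
counts <- counts[keep, , drop = FALSE]

# Data transformation: library-size normalization followed by log(1 + x)
size_factors <- colSums(counts) / mean(colSums(counts))
log_norm <- log1p(sweep(counts, 2, size_factors, "/"))

# HVG selection: the 50 genes with the largest variance
gene_var <- apply(log_norm, 1, var)
hvg <- names(sort(gene_var, decreasing = TRUE))[1:50]

# Scaling: center and scale each selected gene (result is cells x genes)
scaled_cells <- scale(t(log_norm[hvg, ]))

# Dimensionality reduction: first 10 principal components
pca <- prcomp(scaled_cells, center = FALSE, rank. = 10)

# Clustering: Kmeans on the reduced data
set.seed(1)
clusters <- kmeans(pca$x, centers = 2, nstart = 10)$cluster
print(table(clusters))
```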
The metamorphic benchmarking workflow performs metamorphic cluster analysis. It perturbs the raw datasets using up to 6 metamorphic relations, then completes preprocessing and cluster analysis on the perturbed data. See protocols for details about metamorphic testing. The benchmarking report summarizes how clustering changes between the reference and metamorphic datasets.
For example, to cluster the data into 10 clusters using the Kmeans algorithm, the following commands can be used:
#Metamorphic Testing Workflow
datasets = load_data(demo = FALSE, path = "/Users/Downloads")
dataList = extract_datasets(datasets)
MetamorphicReport = run_metamorphic_test_workflow(dataList, method = 'kmeans', transformation = "log", metamorphicTests = c(6,5,4,3,2,1))
To cluster the data using Seurat's Phenograph, do the following:
```
datasets = load_data(demo = FALSE, path = "/Users/Downloads")
dataList = extract_datasets(datasets)
MetamorphicReport = run_metamorphic_test_workflow(dataList, method = 'seurat', transformation = "log", metamorphicTests = c(6,5,4,3,2,1))
```
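
The general idea of a metamorphic test can be sketched in base R. The example below perturbs a dataset by shuffling the order of its cells, re-clusters it, and checks whether the grouping of cells is unchanged. Cell-order permutation is used here only as an illustrative relation; this document does not list the six relations the package implements. `same_partition()` is a hypothetical helper.

```r
# Hypothetical helper: TRUE if two label vectors describe the same grouping,
# even when the cluster numbers themselves differ.
same_partition <- function(a, b) {
  tab <- table(a, b)
  all(rowSums(tab > 0) == 1) && all(colSums(tab > 0) == 1)
}

# Simulated reference data: two well-separated groups of 50 cells
set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 10), ncol = 2))
rownames(x) <- paste0("cell", seq_len(nrow(x)))

# Cluster the reference dataset
set.seed(1)
reference <- kmeans(x, centers = 2, nstart = 10)$cluster

# Metamorphic dataset: same cells, shuffled order
x_perm <- x[sample(nrow(x)), ]
set.seed(1)
metamorphic <- kmeans(x_perm, centers = 2, nstart = 10)$cluster

# Compare the two clusterings cell by cell, matching on cell names
stable <- same_partition(reference[rownames(x)], metamorphic[rownames(x)])
print(stable)
```

A method is stable under a relation when the comparison holds; a report can then count how often it fails across datasets and relations.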
### Harmony Integration
The Harmony integration workflow performs data integration using the [harmony](https://github.com/immunogenomics/harmony) package. The workflow steps include: extraction of common genes, dataset merging, filtering, annotation, data transformation, HVG selection, scaling, harmonization, dimensionality reduction, and clustering. Two baseline clustering algorithms are supported: Kmeans and Seurat's Phenograph.
For example, to integrate the data and complete Kmeans clustering, the following command can be used:
datasets = load_data(demo = FALSE, path = "/Users/Downloads")
dataList = extract_datasets(datasets)
dataList = run_harmony_integration_workflow(dataList, method = 'kmeans', seed = 1, numberClusters = 10)
To integrate the data and complete Seurat's Phenograph, do the following:
datasets = load_data(demo = FALSE, path = "/Users/Downloads")
dataList = extract_datasets(datasets)
dataList = run_harmony_integration_workflow(dataList, method = 'seurat', seed = 1)
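
The first two steps of the integration workflows, extraction of common genes and dataset merging, can be illustrated with plain matrices. This is a sketch under the assumption that each dataset is a genes-by-cells count matrix with gene names as row names; `make_counts()` and the toy gene names are made up for the example.

```r
# Hypothetical helper: simulate a small genes x cells count matrix
make_counts <- function(genes, n_cells, prefix) {
  matrix(rpois(length(genes) * n_cells, lambda = 3), nrow = length(genes),
         dimnames = list(genes, paste0(prefix, "_cell", seq_len(n_cells))))
}

set.seed(1)
toy_list <- list(
  A = make_counts(c("g1", "g2", "g3", "g4"), 5, "A"),
  B = make_counts(c("g2", "g3", "g4", "g5"), 4, "B"),
  C = make_counts(c("g3", "g4", "g6"), 3, "C")
)

# Extraction of common genes: genes present in every dataset
common_genes <- Reduce(intersect, lapply(toy_list, rownames))
print(common_genes)  # "g3" "g4"

# Dataset merging: restrict each matrix to the common genes, then bind the cells
merged <- do.call(cbind, lapply(toy_list, function(m) m[common_genes, , drop = FALSE]))

# Remember which dataset each cell came from; integration methods need this label
batch <- rep(names(toy_list), times = vapply(toy_list, ncol, integer(1)))
print(dim(merged))   # 2 genes x 12 cells
```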
### fastMNN Integration
The fastMNN integration workflow completes the data integration pipeline using the [fastMNN](https://github.com/satijalab/seurat-wrappers) functions from SeuratWrappers. The steps include: extraction of common genes, dataset merging, filtering, annotation, data transformation, HVG selection, integration via fastMNN, dimensionality reduction, and clustering. The clustering method can be Kmeans or Seurat's Phenograph.
To integrate the data and complete Kmeans clustering, the following command can be used:
datasets = load_data(demo = FALSE, path = "/Users/Downloads")
dataList = extract_datasets(datasets)
dataList = run_fastmnn_integration_workflow(dataList, method = 'kmeans', seed = 1, numberClusters = 10)
To integrate the data and complete Seurat's Phenograph, do the following:
datasets = load_data(demo = FALSE, path = "/Users/Downloads")
dataList = extract_datasets(datasets)
dataList = run_fastmnn_integration_workflow(dataList, method = 'seurat', seed = 1)
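
For intuition about the "MNN" in fastMNN: two cells from different batches form a mutual nearest neighbor pair when each is among the other's k nearest neighbors in the other batch. The base R sketch below finds such pairs on toy data. It is a simplified illustration of that idea only, not the algorithm used by fastMNN, and `find_mnn_pairs()` is a hypothetical helper.

```r
# Hypothetical helper: find mutual nearest neighbor pairs between two batches.
# `x` and `y` are cells x features matrices; assumes each has at least two cells.
find_mnn_pairs <- function(x, y, k = 1) {
  nx <- nrow(x)
  ny <- nrow(y)
  # Distances between every cell in x (rows) and every cell in y (columns)
  d <- as.matrix(dist(rbind(x, y)))[seq_len(nx), nx + seq_len(ny), drop = FALSE]
  # TRUE where the y cell is among the k nearest y cells of the x cell
  y_near_x <- t(apply(d, 1, function(r) rank(r, ties.method = "first") <= k))
  # TRUE where the x cell is among the k nearest x cells of the y cell
  x_near_y <- apply(d, 2, function(col) rank(col, ties.method = "first") <= k)
  pairs <- unname(which(y_near_x & x_near_y, arr.ind = TRUE))
  colnames(pairs) <- c("x_cell", "y_cell")
  pairs
}

# Toy batches with one feature each
batch_x <- matrix(c(0, 5, 10), ncol = 1)
batch_y <- matrix(c(0.5, 9), ncol = 1)

pairs <- find_mnn_pairs(batch_x, batch_y, k = 1)
print(pairs)
```

Here the first cell of `batch_x` pairs with the first cell of `batch_y`, and the third cell of `batch_x` pairs with the second cell of `batch_y`. The middle cell of `batch_x` has no mutual partner.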
### CCA Integration
The CCA integration workflow completes data integration using Seurat's CCA pipeline. Each dataset is preprocessed separately, then the datasets are integrated via CCA, followed by scaling, dimensionality reduction, and clustering. Preprocessing consists of filtering, annotation, data transformation, and HVG selection. Two baseline clustering algorithms are supported: Kmeans and Seurat's Phenograph.
To integrate the data and complete Kmeans clustering, the following command can be used:
datasets = load_data(demo = FALSE, path = "/Users/Downloads")
dataList = extract_datasets(datasets)
dataList = run_cca_integration_workflow(dataList, method = 'kmeans', seed = 1, numberClusters = 10)
To integrate the data and complete Seurat's Phenograph, do the following:
datasets = load_data(demo = FALSE, path = "/Users/Downloads")
dataList = extract_datasets(datasets)
dataList = run_cca_integration_workflow(dataList, method = 'seurat', seed = 1)
### sctransform Integration
The sctransform integration workflow completes the Seurat sctransform pipeline. This includes filtering, annotation, integration via sctransform, scaling, dimensionality reduction, and clustering. Two baseline clustering algorithms are supported: Kmeans and Seurat's Phenograph.
To integrate the data and complete Kmeans clustering, the following command can be used:
datasets = load_data(demo = FALSE, path = "/Users/Downloads")
dataList = extract_datasets(datasets)
dataList = run_sctransform_integration_workflow(dataList, method = 'kmeans', seed = 1, numberClusters = 10)
To integrate the data and complete Seurat's Phenograph, do the following:
datasets = load_data(demo = FALSE, path = "/Users/Downloads")
dataList = extract_datasets(datasets)
dataList = run_sctransform_integration_workflow(dataList, method = 'seurat', seed = 1)
### Publications
scrnabench has been used in the following publications:
```
@inproceedings{zhao2023ensemble,
  title={An Ensemble Machine Learning Approach for Benchmarking and Selection of scRNA-seq Integration Methods},
  author={Zhao, Konghao and Bhandari, Sapan and Whitener, Nathan P and Grayson, Jason M and Khuri, Natalia},
  booktitle={Proceedings of the 14th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics},
  pages={1--10},
  year={2023}
}

@article{jpm13020183,
  author={Zhao, Konghao and Grayson, Jason M. and Khuri, Natalia},
  title={Multi-Objective Genetic Algorithm for Cluster Analysis of Single-Cell Transcriptomes},
  journal={Journal of Personalized Medicine},
  volume={13},
  year={2023},
  number={2},
  article-number={183},
  url={https://www.mdpi.com/2075-4426/13/2/183},
  pubmedid={36836417},
  issn={2075-4426},
  doi={10.3390/jpm13020183}
}
```