knitr::opts_chunk$set( collapse = TRUE, comment = "# " )
Evaluation or comparison of Bayesian models over tens of thousands of data points (cells) would be useful but very time consuming. The true utility of scRATE lies in its functions that facilitate the throughput on a cluster computing resource and enable the model selection across entire data set, typically containing thousands of genes. For this you will have to first pull singularity (or docker) container:
```{bash eval=FALSE} $ singularity pull --name scRATE.sif shub://churchill-lab/scRATE
You can launch a shell with the singularity container followed by an R session on an interactive session: ```{bash eval=FALSE} $ singularity shell scRATE.sif Singularity scRATE.sif:~> R
R version 3.6.3 (2020-02-29) -- "Holding the Windsock" Copyright (C) 2020 The R Foundation for Statistical Computing Platform: x86_64-pc-linux-gnu (64-bit) R is free software and comes with ABSOLUTELY NO WARRANTY. You are welcome to redistribute it under certain conditions. Type 'license()' or 'licence()' for distribution details. R is a collaborative project with many contributors. Type 'contributors()' for more information and 'citation()' on how to cite R or R packages in publications. Type 'demo()' for some demos, 'help()' for on-line help, or 'help.start()' for an HTML browser interface to help. Type 'q()' to quit R. >
Once an R session launches,
library(scRATE)
Loading required package: rstanarm Loading required package: Rcpp rstanarm (Version 2.19.3, packaged: 2020-02-11 05:16:41 UTC) - Do not expect the default priors to remain the same in future rstanarm versions. Thus, R scripts should specify priors explicitly, even if they are just the defaults. - For execution on a local, multicore CPU with excess RAM we recommend calling options(mc.cores = parallel::detectCores()) - bayesplot theme set to bayesplot::theme_default() * Does _not_ affect other ggplot2 plots * See ?bayesplot_theme_set for details on theme setting Loading required package: brms Loading 'brms' package (version 2.11.1). Useful instructions can be found by typing help('brms'). A more detailed introduction to the package is available through vignette('brms_overview'). Attaching package: 'brms' The following objects are masked from 'package:rstanarm': dirichlet, exponential, get_y, lasso, loo_R2, ngrps
To use scRATE's cluster computing support, UMI count matrix (gene x cell) and some minimal meta-data should be encoded in loom file format (See scRATE::create_loom
function). We use loom format files since it is supported in both R
and python
and take only the best of two worlds downstream. In scRATE
, we also provide handy features for processing model selection results and store them in the same loom file.
library(loomR)
Loading required package: R6 Loading required package: hdf5r
The idea is simple: chunk the data into many small gene expression data matrices and run the analysis on a cluster system in parallel.
prepare_job_array(loomfile = 'DC-like_cells.loom', num_chunks = 1000, outdir='.', dryrun = FALSE)
[prepare_job_array] Counts from main layer will be loaded. [prepare_job_array] No covariates will be used. [prepare_job_array] 5476 genes (between Gene 1 and 27998) will be processed. [prepare_job_array] Chunk 1 to 1000 (out of 1000) will be created. [prepare_job_array] Created input file: ./_chunk.00001 [prepare_job_array] Created input file: ./_chunk.00002 [prepare_job_array] Created input file: ./_chunk.00003 [prepare_job_array] Created input file: ./_chunk.00004 ... [prepare_job_array] Created input file: ./_chunk.00999 [prepare_job_array] Created input file: ./_chunk.01000 [prepare_job_array] Generated bash script, ./run_subjobs.sh.pbs, for submitting array jobs. Modify the file if needed.
This command generates .RData files that contains UMI count matrix for subset of genes and accompanying meta-data in addition to two bash scripts run_subjobs.sh.slurm
and run_subjobs.sh.pbs
: one for slurm and the other for pbs/torque cluster system. These are scripts to submit a job array, for slurm for example,
```{bash eval=FALSE}
module load singularity
ARRAY_ID=printf %05d $SLURM_ARRAY_TASK_ID
singularity run --app Rscript ${CONTAINER} ${RFILE} _chunk.${ARRAY_ID} _scrate.${ARRAY_ID} ${CORES} ${SEED}
To submit this job array, you `sbatch` it with environment variables: `CONTAINER` (the name of scRATE singularity container file we pulled from singularity hub), `RFILE` (an R script file that specifies tasks to execute using scRATE functions), `CORES` (the number of cores to use), and `SEED` (seed number for random number generation). Depending on the R script you created, you will have to modify the arguments for it in those bash scripts. ```{bash eval=FALSE} $ sbatch --export=CONTAINER=scRATE.sif,RFILE=run_model_comparison.R,CORES=4,SEED=1004 run_subjobs.sh.slurm
This will execute run_model_comparison.R
for every data chunk we made with prepare_job_array
function above. This R script executes scRATE::run_model_comparison
function for genes in every data chunk and store the model fit parameters and model selection results in an RDS format file for each data chunk.
#!/usr/bin/env Rscript args = commandArgs(trailingOnly=TRUE) if (length(args)!=4) { stop("Four arguments must be supplied (<cntfile> <outfile> <cores> <seed>).n", call.=FALSE) } library(scRATE) cntfile <- args[1] outfile <- args[2] nCores <- as.integer(args[3]) seed <- as.integer(args[4]) run_model_comparison(cntfile, nCores=nCores, seed=seed, adapt_delta=0.95, outfile=outfile)
Once the job array is successfully finished, we collate_results
of the model fittings into a single file for later downstream analyses, e.g. to perform_model_selection
.
collate_results(loo_dir='.', loo_outfile='DC-like_cells.scrate.rds')
Based on the model fitting results, we select models that best fits each gene. First, we need to load the model parameters and elpd_loo results for all the genes.
results <- readRDS('DC-like_cells.scrate.rds') head(names(results))
[1] "Mrpl15" "Lypla1" "Tcea1" "Atp6v1h" [5] "Rb1cc1" "4732440D04Rik"
Then we perform_model_selection
for each gene to determine the best model at different stringency level (0 SE, 1 SE, 2 SE, etc.) of calling more complex models. Our primary interest were whether a gene fits best with either of zero-inflated models (ZIP or ZINB) over P or NB in respect to the expected log predictive density. If the original UMI matrix file (loom format) is passed to it, the model selection results will be stored in the file as a gene attribute.
model_call <- list() model_call[['0SE']] <- perform_model_selection(fit_list=results, margin=0.0, loomfile='DC-like_cells.loom', attr_name='ModelCall_0SE', verbose=TRUE)
5810 Poisson genes 2859 Negative Binomial genes 809 Zero-Inflated Poisson genes 389 Zero-Inflated Neg. Binomial genes 9867 Total Adding: ModelCall_0SE
model_call[['1SE']] <- perform_model_selection(fit_list=results, margin=1.0, loomfile='DC-like_cells.loom', attr_name='ModelCall_1SE', verbose=TRUE)
9003 Poisson genes 790 Negative Binomial genes 62 Zero-Inflated Poisson genes 12 Zero-Inflated Neg. Binomial genes 9867 Total Adding: ModelCall_1SE
model_call[['2SE']] <- perform_model_selection(fit_list=results, margin=2.0, loomfile='DC-like_cells.loom', attr_name='ModelCall_2SE', verbose=TRUE)
9737 Poisson genes 124 Negative Binomial genes 6 Zero-Inflated Poisson genes 0 Zero-Inflated Neg. Binomial genes 9867 Total Adding: ModelCall_1SE
All set!
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.