In churchill-lab/scRATE: A Bayesian model selection approach to test zero inflation from scRNA-seq data

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "# "
)

Evaluation or comparison of Bayesian models over tens of thousands of data points (cells) would be useful but very time consuming. The true utility of scRATE lies in its functions that facilitate the throughput on a cluster computing resource and enable the model selection across entire data set, typically containing thousands of genes. For this you will have to first pull singularity (or docker) container:

```{bash eval=FALSE} $ singularity pull --name scRATE.sif shub://churchill-lab/scRATE

You can launch a shell with the singularity container followed by an R session on an interactive session:

```{bash eval=FALSE}
$ singularity shell scRATE.sif
Singularity scRATE.sif:~> R

R version 3.6.3 (2020-02-29) -- "Holding the Windsock"
Copyright (C) 2020 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

>

Once an R session launches,

library(scRATE)

Loading required package: rstanarm
Loading required package: Rcpp
rstanarm (Version 2.19.3, packaged: 2020-02-11 05:16:41 UTC)
- Do not expect the default priors to remain the same in future rstanarm versions.
Thus, R scripts should specify priors explicitly, even if they are just the defaults.
- For execution on a local, multicore CPU with excess RAM we recommend calling
options(mc.cores = parallel::detectCores())
- bayesplot theme set to bayesplot::theme_default()
   * Does _not_ affect other ggplot2 plots
   * See ?bayesplot_theme_set for details on theme setting
Loading required package: brms
Loading 'brms' package (version 2.11.1). Useful instructions
can be found by typing help('brms'). A more detailed introduction
to the package is available through vignette('brms_overview').

Attaching package: 'brms'
The following objects are masked from 'package:rstanarm':

    dirichlet, exponential, get_y, lasso, loo_R2, ngrps

To use scRATE's cluster computing support, UMI count matrix (gene x cell) and some minimal meta-data should be encoded in loom file format (See scRATE::create_loom function). We use loom format files since it is supported in both R and python and take only the best of two worlds downstream. In scRATE, we also provide handy features for processing model selection results and store them in the same loom file.

library(loomR)

Loading required package: R6
Loading required package: hdf5r

The idea is simple: chunk the data into many small gene expression data matrices and run the analysis on a cluster system in parallel.

prepare_job_array(loomfile = 'DC-like_cells.loom', num_chunks = 1000, outdir='.', dryrun = FALSE)

[prepare_job_array] Counts from main layer will be loaded.
[prepare_job_array] No covariates will be used.
[prepare_job_array] 5476 genes (between Gene 1 and 27998) will be processed.
[prepare_job_array] Chunk 1 to 1000 (out of 1000) will be created.
[prepare_job_array] Created input file: ./_chunk.00001
[prepare_job_array] Created input file: ./_chunk.00002
[prepare_job_array] Created input file: ./_chunk.00003
[prepare_job_array] Created input file: ./_chunk.00004
...
[prepare_job_array] Created input file: ./_chunk.00999
[prepare_job_array] Created input file: ./_chunk.01000
[prepare_job_array] Generated bash script, ./run_subjobs.sh.pbs, for submitting array jobs. Modify the file if needed.

This command generates .RData files that contains UMI count matrix for subset of genes and accompanying meta-data in addition to two bash scripts run_subjobs.sh.slurm and run_subjobs.sh.pbs: one for slurm and the other for pbs/torque cluster system. These are scripts to submit a job array, for slurm for example,

```{bash eval=FALSE}

!/bin/bash

SBATCH --qos=batch

SBATCH --partition=compute

SBATCH --job-name=scRATE

SBATCH --nodes=1

SBATCH --ntasks=1

SBATCH --cpus-per-task=4

SBATCH --mem=16gb

SBATCH --time=23:59:59

SBATCH --array=1-6

module load singularity

Run your R script that uses scRATE singularity container

ARRAY_ID=printf %05d $SLURM_ARRAY_TASK_ID singularity run --app Rscript ${CONTAINER} ${RFILE} _chunk.${ARRAY_ID} _scrate.${ARRAY_ID} ${CORES} ${SEED}

To submit this job array, you `sbatch` it with environment variables: `CONTAINER` (the name of scRATE singularity container file we pulled from singularity hub), `RFILE` (an R script file that specifies tasks to execute using scRATE functions), `CORES` (the number of cores to use), and `SEED` (seed number for random number generation). Depending on the R script you created, you will have to modify the arguments for it in those bash scripts.

```{bash eval=FALSE}
$ sbatch --export=CONTAINER=scRATE.sif,RFILE=run_model_comparison.R,CORES=4,SEED=1004 run_subjobs.sh.slurm

This will execute run_model_comparison.R for every data chunk we made with prepare_job_array function above. This R script executes scRATE::run_model_comparison function for genes in every data chunk and store the model fit parameters and model selection results in an RDS format file for each data chunk.

#!/usr/bin/env Rscript
args = commandArgs(trailingOnly=TRUE)
if (length(args)!=4) {
  stop("Four arguments must be supplied (<cntfile> <outfile> <cores> <seed>).n", call.=FALSE)
}

library(scRATE)

cntfile <- args[1]
outfile <- args[2]
nCores <- as.integer(args[3])
seed <- as.integer(args[4])

run_model_comparison(cntfile, nCores=nCores, seed=seed, adapt_delta=0.95, outfile=outfile)

Once the job array is successfully finished, we collate_results of the model fittings into a single file for later downstream analyses, e.g. to perform_model_selection.

collate_results(loo_dir='.', loo_outfile='DC-like_cells.scrate.rds')

Based on the model fitting results, we select models that best fits each gene. First, we need to load the model parameters and elpd_loo results for all the genes.

results <- readRDS('DC-like_cells.scrate.rds')
head(names(results))

[1] "Mrpl15"        "Lypla1"        "Tcea1"         "Atp6v1h"      
[5] "Rb1cc1"        "4732440D04Rik"

Then we perform_model_selection for each gene to determine the best model at different stringency level (0 SE, 1 SE, 2 SE, etc.) of calling more complex models. Our primary interest were whether a gene fits best with either of zero-inflated models (ZIP or ZINB) over P or NB in respect to the expected log predictive density. If the original UMI matrix file (loom format) is passed to it, the model selection results will be stored in the file as a gene attribute.

model_call <- list()
model_call[['0SE']] <- perform_model_selection(fit_list=results, margin=0.0, loomfile='DC-like_cells.loom', attr_name='ModelCall_0SE', verbose=TRUE)

 5810 Poisson genes
 2859 Negative Binomial genes
  809 Zero-Inflated Poisson genes
  389 Zero-Inflated Neg. Binomial genes
 9867 Total
Adding: ModelCall_0SE

model_call[['1SE']] <- perform_model_selection(fit_list=results, margin=1.0, loomfile='DC-like_cells.loom', attr_name='ModelCall_1SE', verbose=TRUE)

 9003 Poisson genes
  790 Negative Binomial genes
   62 Zero-Inflated Poisson genes
   12 Zero-Inflated Neg. Binomial genes
 9867 Total
Adding: ModelCall_1SE

model_call[['2SE']] <- perform_model_selection(fit_list=results, margin=2.0, loomfile='DC-like_cells.loom', attr_name='ModelCall_2SE', verbose=TRUE)

 9737 Poisson genes
  124 Negative Binomial genes
    6 Zero-Inflated Poisson genes
    0 Zero-Inflated Neg. Binomial genes
 9867 Total
Adding: ModelCall_1SE

All set!

churchill-lab/scRATE documentation built on Aug. 12, 2020, 2:52 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com