knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "README-"
)

malacoda

The goal of malacoda is to enable Bayesian analysis of high-throughput genomic assays like massively parallel reporter assays (MPRA) and CRISPR screens.

It uses a negative-binomial-based Bayesian model shown in the Kruschke diagram below. This model offers numerous advantages over traditional null hypothesis significance testing based methods:

Other features include:

knitr::include_graphics('man/figures/kruschke_latex.png')
# Not sure why this is still a little blurry...

Cite

The manuscript associated with this work was published in PLOS Computational Biology on July 21, 2020. If you use this software, please cite this source:

Ghazi, A. R., Kong, X., Chen, E. S., Edelstein, L. C., & Shaw, C. A. (2020). Bayesian modelling of high-throughput sequencing assays with malacoda. PLOS Computational Biology, 16(7), 1–18. https://doi.org/10.1371/journal.pcbi.1007504

The HTML versions of the supplements for the manuscript are available in this repository's man/supplements/ directory.

Example

This is a basic example which shows you how to fit the simplest form of the model. The input needs to provide the MPRA counts for each barcode of each allele of each variant with a column for each sequencing sample. This example only shows 8 rows, but realistic datasets will have thousands.

|variant_id |allele |barcode | DNA1| DNA2| RNA1| RNA2| RNA3| RNA4| |:----------------|:------|:------------|----:|----:|----:|----:|----:|----:| |1_205247315_2-3 |ref |CGATATAGTTAC | 19| 19| 3| 0| 39| 4| |1_205247315_2-3 |ref |ACGCATTCACCT | 56| 51| 38| 19| 38| 46| |1_205247315_2-3 |alt |TAGTATCAACTA | 76| 50| 14| 44| 11| 20| |1_205247315_2-3 |alt |GTCGTCAGCGAT | 106| 69| 80| 32| 163| 57| |10_101274365_1-3 |ref |ATAAGCCGTCCG | 54| 37| 43| 2| 12| 24| |10_101274365_1-3 |ref |CCGACGGCGTTG | 61| 35| 18| 7| 19| 10| |10_101274365_1-3 |alt |GATACGAACACA | 42| 34| 29| 48| 194| 129| |10_101274365_1-3 |alt |TAACGATCGGTC | 33| 25| 14| 9| 11| 409|

The specific format requirements for the input can be found on the help page for ?fit_mpra_model. The code below will fit the basic malacoda model using a dataset that comes with the package:

library(malacoda)
marg_prior = fit_marg_prior(umpra_example)
fit_mpra_model(mpra_data = umpra_example,
               out_dir = '/path/to/outputs/',
               priors = marg_prior,
               n_cores = getOption('mc.cores', 2L),
               tot_samp = 1000,
               n_chains = 3,
               vb_pass = TRUE,
               save_nonfunctional = TRUE)

This will fit the model to each input in the assay (using some example variants from Ulirsch et al., Cell, 2016 in the built-in dataset umpra_example) using a marginal prior, save the outputs for each variant at the specified directory, and return a data frame of summary statistics for each variant, including binary calls of functional/non-functional, posterior means on activity levels & transcription shift.

|variant_id | ts_post_mean| ref_post_mean| alt_post_mean|is_functional | hdi_lower| hdi_upper| |:----------------|------------:|-------------:|-------------:|:-------------|---------:|---------:| |22_32880585_2-3 | 1.67| -1.15| 0.52|TRUE | 0.95| 2.32| |15_66068074_1-3 | 0.73| -0.83| -0.10|FALSE | -0.15| 1.56| |12_121167039_1-3 | 0.56| -0.69| -0.13|FALSE | -0.25| 1.26| |16_68092850_2-3 | -0.44| 0.18| -0.26|FALSE | -1.15| 0.40| |8_42424746_1-2 | 0.40| -0.33| 0.07|FALSE | -0.12| 0.87| |19_33204544_1-3 | -0.27| -0.15| -0.42|FALSE | -0.91| 0.44|

More sophisticated analyses that use annotations to create informative priors for higher sensitivity are described in the MPRA Analysis vignette accessible with vignette('mpra_vignette', package = 'malacoda'). Other features like annotation checking and traditional NHST analysis are also explained in the vignette. Each function provided by the package have extensive help documentation that should help elucidate what they do and how to use them e.g. ?fit_mpra_model.

Installation

malacoda is intended for use on Mac and Linux. Windows may work aside from parallelization, however we do not intend to support Windows.

It's best to have the most up-to-date version of R (3.6.0 as of May 2 2019).

The first step is to install rstan and Rcpp. The following command will usually suffice to do this, if not you can find more in-depth installation instructions on the rstan documentation. You should have root access.

# Sys.setenv(MAKEFLAGS = "-j4") # This compilation flag can help speed up
# installation on multi-threaded machines. This command uses 4 threads.
install.packages(c('Rcpp', 'rstan'), dependencies = TRUE, type = 'source')

Once malacoda is accepted up on CRAN it will be installable with:

# install.packages('malacoda')

You can install the development version of malacoda from github with:

# install.packages("devtools")
devtools::install_github("andrewGhazi/malacoda")

This should install the dependencies (which are mostly tidyverse packages), compile the malacoda Stan models, and install the package.

count_barcodes() requires the FASTX-Toolkit (installation instructions provided at that link) and sed which comes with most Unix-like operating systems. It can also optionally take advantage of the FreeBarcodes package for barcode error-correction.

Example output

In addition to the summary statistics table output above, the sampler outputs for each variant are saved in the user-defined output directory. These are stanfit objects, thus they can be visualized using all the tools provided in packages like bayesplot.

malacoda also provides several plotting function of its own. mpra_tile_plot can help to visualize the raw MPRA counts (and some summary metrics):

knitr::include_graphics('man/figures/tile_example.png')

And malacoda_intervals() can help to visualize the posterior from the malacoda model fit:

knitr::include_graphics('man/figures/interval_example.png')

Other visualization functions are available for annotation checking. These help visualize the improvement induced by the use of informative conditional priors.

Help

Upcoming Features

Contact

The massively parallel functionalization field is still rapidly changing so there's a lot of experimental structures out there that are not accounted for in this package yet. If you're interested in these methods, feel free to contact me with some example data and I'll be happy to take a look to see if I can adapt the existing models!

Please contact me through Github DM or my BCM email address if you use the package or have feature requests / comments.



andrewGhazi/malacoda documentation built on Aug. 2, 2020, 12:54 a.m.