knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  echo = FALSE
)
library(dplyr)

This vignette presents the MPRA analysis functions present in malacoda. These features include:

malacoda Installation

malacoda is intended for use on Mac and Linux. Windows may work aside from parallelization, however we do not intend to support Windows.

It's best to have the most up-to-date version of R (3.6.0 as of May 2 2019).

The first step is to install rstan and Rcpp. The following command will usually suffice to do this, if not you can find more in-depth installation instructions on the rstan documentation.

install.packages(c('Rcpp', 'rstan'), dependencies = TRUE, type = 'source')

Once malacoda is accepted by CRAN it will be installable with:

# install.packages('malacoda')

You can install the development version of malacoda from github with:

# install.packages("devtools")
devtools::install_github("andrewGhazi/malacoda")

This should install the dependencies (which are mostly tidyverse packages), compile the malacoda Stan models, and install the package.

Barcode counting

The count_barcodes function automates the multi-step process of converting the MPRA barcode-to-variant-allele design to a per-sample count data frame.

A barcode-allele data frame looks like the following, with one column bc_id giving a unique barcode ID and another column barcode giving the raw barcode sequences as character strings:

|bc_id |barcode | |:-------------|:--------------| |rs12345_ref_1 |ATCGTATAGGTAGC | |rs12345_ref_2 |TCATCGGTCCTTAC | |rs12345_ref_3 |CATCTCTGTTGTCA | |rs12345_ref_4 |TAGGTGCGACGGAA | |rs12345_ref_5 |GCAACAGAGCTTCA | |rs12345_ref_6 |CGCAGGATGGTAGA |

The following command will take the MPRA library design in my_design_df and the FASTQs in /path/to/fastqs/ and strip out the barcodes located between bc_start and bc_end of each read. It then discards barcodes where any base qualities are below quality_cutoff and counts the number of occurrences of each barcode. It finally returns a data frame with a row for each barcode and appropriately-named additional columns giving the counts of each barcode in each fastq.

count_barcodes(barcode_allele_df = my_design_df,
               fastq_dir = '/path/to/fastqs/',
               temp_dir = '/path/with/space/',
               bc_start = 1,
               bc_end = 14,
               quality_cutoff = 30,
               n_cores = 1,
               keep_temp = FALSE)

The parameter names should be evocative of their purpose; the help documentation gives more information on each.

This function can also utilize MPRA libraries designed using the freebarcodes feature included in mpradesigntools. Once freebarcodes is installed, setting decode_barcode_set to the path to the appropriate barcode set (e.g. "/usr/local/src/freebarcodes/barcodes/barcodes14-1.txt") will count barcodes directly observed AND decode barcode reads containing sequencing errors. In the author's experience, this yields about 5% more data for free (although you have to have used mpradesigntools!).

High-throughput sequencing assay Negative Binomial model

The main purpose of this package is to provide a fully Bayesian model for count data coming from high-throughput, NGS-based assays. Massively parallel reporter assays (MPRA) are the motivating use-case, but methods are also provided for other assays with similar experimental structure/outputs.

The data outputs from a MPRA can be laid out simply in a data frame as shown below, with columns showing variant ID's, allele indicators, barcode sequence, and sequencing samples. A row for each barcode gives the number of times the given barcode was counted in each sequencing sample. The table below shows some example data from Ulirsch et al., Cell 2016 that is included with the package as the data object umpra_example:

load('../data/umpra_example.RData') 

umpra_example %>% 
  dplyr::select_all(~gsub('K562_|CTRL_', '', .)) %>% 
  dplyr::group_by(variant_id, allele) %>% 
  dplyr::do(head(., 3)) %>% 
  .[1:12,1:8] %>% 
  knitr::kable()

MPRA assays typically have 10 to 100 barcodes per allele for dozens to tens of thousands of variants.

The software included in this software models the counts in each cell of this data frame as coming from a negative binomial distribution. The mean parameter of the negative binomial is first scaled according to sequencing depth. For counts from RNA samples, the mean parameter is also scaled according to the mean of the barcode's DNA input distribution. The remaining variability of the mean parameter for the RNA counts is ascribed to the allele associated with the barcode. The Kruschke diagram below illustrates the model structure graphically:

knitr::include_graphics('../man/figures/kruschke_latex.png')

The formulas on the right show the parts of the model relevant to computing the posterior on the Transcription Shift, TS. The index s refers to samples, and index bc refers to barcodes of a given allele.

Essentially, each count measurement has it's own negative binomial distribution, but information is shared across samples and barcodes.

Prior estimation

The package provides methods for estimating three types of priors: marginal, grouped, and conditional. The following cartoon demonstrates the underlying idea.

knitr::include_graphics('../man/figures/prior_types.png')

In each case, a single dot represents a maximum-likelihood estimate of a given parameter (e.g. a mean of a negative binomial distribution) for a single allele of one variant. The black line(s) represent the empirical prior on $\theta$ that one would derive using the given estimation method.

Marginal priors

The priors for the parameters negative binomial distribution are gamma distributions. These are estimated from the maximum likelihood estimates of means/dispersions observed in the MPRA assay globally, across all variants. This ensures that the priors for one allele are on a reasonable scale in the context of one experiment; the user doesn't need to know on a priori how to set these distributions.

In malacoda, this is done with fit_marg_prior() like so:

marg_prior = fit_marg_prior(mpra_data,
                            rep_cutoff = .15,
                            plot_rep_cutoff = TRUE,
                            n_cores = 1)

This prior fitting includes a pre-filtering step that involves identifying which barcodes are sufficiently well-represented in the DNA samples. It also renders a plot of the DNA barcode representation with the input representation cutoff (defaulting to the 15th percentile) indicated by a dotted line. Users can and should tune this cutoff according to their assay with get_sample_depths() get_well_represented().

For a typical MPRA assay where a majority of variants assayed have practically 0 effect on transcription shift, the marginal prior ends up being fairly non-informative. For both alleles, most of the prior density ends up around the average output of the assay, but the prior ascribes a small amount of density to extreme values.

Conditional priors

The marginal prior is relatively un-informative. malacoda provides methods for conditional density estimation on informative genomic annotations as a means to increase statistical efficiency while avoiding false positives.

This example (included with the package as u_deepsea shows DeepSea's predicted K562 DNase hypersensitivity for the variants in Ulirsch, 2016:

load('../data/u_deepsea.RData')

knitr::kable(head(u_deepsea))

While this example shows only a single annotation, the methods included with this package can utilize an arbitrary number of annotations. A user could also include ENCODE transcription factor binding profiles, additional DeepSea predictions, etc. Whatever set of continuous annotations they think can provide information about their MPRA assay's results.

The function fit_cond_prior estimates conditional priors for each variant with these annotations. It does this in the following way:

  1. Normalize the scale of the annotations.
  2. Compute a pairwise distance matrix between the assay's variants in euclidean "annotation space".
  3. Take the minimum non-zero distance as the initial minimal weighting kernel width.
  4. For each variant,
    1. Assign a weight to all other variants in the assay according to a t-distribution kernel characterized by the initial weight.
    2. If fewer than 30 variants don't meaningfully contribute to the prior, increase the kernel width by 1.3x and re-set the weights. (The 30 and 1.3x are adjustable tuning parameters but these default settings work well in practice).
  5. Perform weighted density estimation for each variant to get the priors.

This function can be called like so:

cond_prior = fit_cond_prior(mpra_data = umpra_example,
                            annotations = example_deepsea,
                            n_cores = 1,
                            rep_cutoff = .15, 
                            plot_rep_cutoff = TRUE,
                            min_neighbors = 30)

This process gets better the more variants are included in the analysis. The more variants included, the more finely the annotation space is explored and the more precisely the nearest variants will give clear indication of the prior belief. Tuning min_neighbors provides a bias-variance tradeoff; decreasing it will increase the most proximal neighbors, but increase the variance of the prior estimation process. Increasing min_neighbors will make the conditional prior estimation more stable, but as min_neighbors grows the conditional prior becomes less informative and gradually approaches the marginal.

In practice, this conditional prior can provide substantial improvements. Here's an example showing the normalized counts for one (strongly functional) variant in Ulirsch 2016 with the conditional and marginal priors:

knitr::include_graphics('../man/figures/cond_prior_improvement.png')

There are only 100 variants included with the example data for the sake of keeping the size of the package small, however published MPRA's have tested tens of thousands of variants (Tewhey et al., Cell, 2016).

Grouped priors

The fit_grouped_prior function uses a similar syntax as above to fit a group-wise prior. Rather than continuous annotations, the annotations here are discrete. It makes a few checks to ensure there are enough variants per group then simply calls fit_marg_prior on each group.

Fitting the model

The model can be fit with a call to fit_mpra_model():

example_result = fit_mpra_model(mpra_data, 
                                out_dir = '/path/to/output/directory/',
                                priors = cond_prior, # or = marg_prior
                                n_cores = 1)

This will:

  1. Run a variational Stan model that fits the model described above for each variant.
  2. If the variational result looks promising, it will run the standard Stan MCMC sampler on the model.
  3. Save the stanfit object for the variant as <variant_id>.RData in the output directory.
  4. Compute some summary statistics for each variant, including a 95% transcription shift highest density interval and (based on that) a binary call for variant function (the is_functional column).
  5. Compile the results into a data frame.
load('../data/example_result.RData')

example_result %>% 
  select(-note, -ts_rope_mass) %>% 
  head %>% 
  knitr::kable(col.names = c('variant_id', 'TS Posterior Mean', 'Ref Activity', 'Alt Activity', 'is_functional', 'HDI lower', 'HDI upper'))

Filtering the result table like so will give the set of hits from the analysis:

 example_result %>% 
  dplyr::filter(is_functional) %>% 
  select(-note, -ts_rope_mass) %>% 
  head %>% 
  knitr::kable(col.names = c('variant_id', 'TS Posterior Mean', 'Ref Activity', 'Alt Activity', 'is_functional', 'HDI lower', 'HDI upper'))

The MCMC parameters are adjustable (see ?fit_mpra_model).

A region of practical equivalence (ROPE) parameter ts_rope can be set to receive p-value-esque output in order to get some sense of how the result compares to a "null" model. Defining a ROPE requires experiment-specific statistical knowledge, so the default is intentionally fairly liberal at +/- log(3/2). This means that any posterior belief that the alternate allele causes less than a +/- 50% shift is deemed "practically equivalent" to no shift at all. Ideally users should analyze the assay's noise and use some reasonable justification to set this argument themselves. Statisticians seeking to use this feature can find more information on ROPE's in Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan by John Kruschke.

fit_mpra_model has several other parameters that should be inspected by the user.

Traditional Analysis

The package also provides functions for traditional t-test based analyses. See ?compute_activities and ?run_activity_tests. These functions compute the "Activity", $log(mRNA/DNA)$, after depth-correction for each count measurement. A simple t-test can be used to assess the statistical significance of the difference in mean activity. This method has several disadvantages:

Additional details on the statistical properties of these methods can be found in the supplement of Ghazi et al., Bioinformatics, 2018. They work, but not amazingly well.

Output visualization

The stanfit objects output by the MCMC sampler in fit_mpra_model are written to the user-supplied output directory. This makes them amenable for use with all of the plotting functions in packages like bayesplot.

malacoda also provides a number of plotting functions for visualizing the outputs.

malacoda intervals

The function malacoda_intervals() can be used to easily visualize the posterior intervals for all model parameters for the fit of a given variant. It also includes the posteriors on the derived quantities activity and transcription shift.

knitr::include_graphics('../man/figures/mi_example.png')

Posterior beeswarms

Posterior beeswarms via posterior_beeswarm overlay the activity point estimates (log(RNA/DNA) by barcode and sequencing sample from compute_activities) with the posteriors on mean activity (from fit_mpra_model). Setting color_by_sample = TRUE results in a very colorful plot that makes it easy to see if the posterior aligns with the point estimates and to ensure that none of the samples vary systematically.

knitr::include_graphics('../man/figures/posterior_beeswarm_example.png')

Conditional prior visualization

It's natural to ask how much the conditional prior improves inference over the marginal prior. One metric to get at this is the "prior ratio": for a given parameter, we compute the density of the maximum likelihood estimate under the conditional relative to the density under the marginal.

knitr::include_graphics(path = '../man/figures/cond_hexs.png')

It's worth noting that the fraction of estimates above the dotted line will not approach 1 with increasingly informative annotations. The maximum likelihood estimates are themselves noisy, so we a priori expect that sometimes the estimate will be wrong in the direction favored by the marginal prior. Increasingly better annotation sources will instead make the fraction above the line approach some hard-to-define limit imposed by the noise in the assay.

The example below shows a similar plot with a conditional prior trained on random noise. As can be seen, the points are NOT systematically above the dotted line.

knitr::include_graphics(path = '../man/figures/noise_hexs.png')

Sample correlations

A simple QC visualization, this plot shows the pairwise count correlations of barcodes across the different samples. RNA samples should correlate highly with RNA samples, and DNA with DNA. However if RNA samples correlate highly with DNA samples, that can indicate DNA contamination of the RNA samples.

knitr::include_graphics(path = '../man/figures/mpra_corr.png')

CRISPR analysis

The model structure shown in the Kruschke diagram works for NGS assays beyond MPRA. We have extended the analysis method to a traditional type of CRISPR screen that looks for the effect of CRISPR-induced gene knockout on cell survival.

Survival/Dropout model

We have implemented a model for the most basic type of CRISPR screen, usable with fit_dropout_model. This model analyzes groups of gRNAs targeting pre-specified genes, looking for significantly altered abundance in input/output sequencing pools.

Future directions

Improved CRISPR support

We are developing improved support for CRISPR-focused methods with different experimental structures and support for grouped/conditional priors.

Optimization / speed

While MCMC has a reputation for slow computation, the slowest functionality in this package is typically fitting large conditional priors. The conditional prior fitting process can involve optimizing multiple gamma distributions for thousands of weighted points thousands of times, so it can take a long time on laptops for large MPRA's -- upwards of a minute per variant.

The actual sampling process is quite fast for the amount of computation involved. The Stan models are automatically compiled to C++, and under the hood Stan implements a No-U-Turn-Sampler which very efficiently explores parameter space. With parallelization on an Amazon EC2 instance with a large number of cores, it's possible to analyze the full Ulirsch, 2016 dataset in less than an hour for a few dollars. MPRAs are big, slow, expensive experiments. They're worth the comparatively small, fast, and cheap computation costs!



andrewGhazi/malacoda documentation built on Aug. 2, 2020, 12:54 a.m.