fit_mpra_model: Fit a Bayesian MPRA model
In andrewGhazi/malacoda: Bayesian Analysis of High-Throughput Genomic Assays

Description Usage Arguments Details Value Note Examples

This function fits a negative-binomial based Bayesian model to MPRA data. Optional annotations can be included to allow for more informative conditional priors.

fit_mpra_model(
  mpra_data,
  annotations = NULL,
  group_df = NULL,
  out_dir = NULL,
  save_nonfunctional = FALSE,
  priors = NULL,
  n_cores = 1,
  n_chains = 4,
  tot_samp = 2000,
  n_warmup = 200,
  vb_pass = TRUE,
  vb_prob = 0.8,
  ts_hdi_prob = 0.95,
  ts_rope = c(-0.405, 0.405),
  rep_cutoff = 0.15,
  adaptive_precision = TRUE,
  verbose = TRUE
)

`mpra_data`	a data frame of MPRA data with 1 column called variant_id, an allele column, a barcode column, and additional columns per sequencing sample. Each row is for a single barcode.
`annotations`	an optional data frame of annotations with identical variant_ids and an arbitrary number of functional annotations. If omitted, the prior for a given variant is influenced by all other variants in the assay equally.
`group_df`	an optional data frame giving group identity by variant_id in mpra_data
`out_dir`	path to output directory
`save_nonfunctional`	logical indicating whether or not to save the sampler results for variants identified as non-functional
`priors`	optional objects provided by either fit_marg_prior() or fit_cond_prior.
`n_cores`	number of cores across which to parallelize variant MPRA samplers
`n_chains`	number of MCMC chains to run in each sampler
`tot_samp`	total number of MCMC draws to take, spread evenly across chains
`n_warmup`	total number of warmup draws to take from each MCMC chain
`vb_pass`	logical indicating whether to use a variational first pass
`vb_prob`	numeric 0 - 1 indicating probability mass to use as a TS HDI for identifying "promising" candidates for MCMC followup
`ts_hdi_prob`	probability mass to include in the highest density interval on transcription shift to call MPRA-functional variants.
`ts_rope`	length 2 numeric vector describing the boundaries of the transcription shift region of practical equivalence (ROPE), defaulting to +/- log(3/2)
`rep_cutoff`	a representation cutoff quantile (0 to 1)
`adaptive_precision`	logical indicating whether to adaptively adjust the length of the posterior MCMC chain for borderline functional variants
`verbose`	logical indicating whether to print messages

mpra_data must contain the following groups of columns:

variant_id
allele - either 'ref' or 'alt'
barcode - a unique index sequence for that row (ideally the same barcode used in the assay)
at least one column of MPRA counts whose column name(s) matches 'DNA'
at least one column of MPRA counts whose column name(s) matches 'RNA'

annotations must contain the same variant_id's used in mpra_data. Additional columns are used as informative predictors: when estimating the priors for one variant, other variants with similar annotations will be upweighted in the prior-fitting process.

If priors is provided, any annotations input will be ignored. This can be useful when you want to fit models again without having to spend time re-fitting the priors.

Sampler results will be saved to out_dir. By default, only the sampler results for variants identified as MPRA-functional will be saved. This behavior can be changed by setting save_nonfunctional to TRUE.

We've set the sampler parameters (n_chains to n_warmup) to values that work reasonably well at reasonable speeds for typical MPRA data on typical hardware. Final analyses and/or models fit to larger MPRA experiments will likely want to increase n_chains and tot_samp considerably to ensure precise convergence.

vb_pass indicates whether to use a first pass variational check to see if a given variant is worth running the MCMC sampler. It does this by checking if a 40 excludes 0. This speeds up posterior evaluation considerably, but gives approximate results. If vb_pass is set to FALSE, all variants get MCMC.

ts_rope can be used to define a "Region Of Practical Equivalence" for transcription shift. This is some small-ish region around 0 where observed posterior samples are "practically equivalent" to 0. The output column ts_rope_mass returns the fraction of transcription shift posterior samples that fall within the defined ROPE along with the usual model outputs. If this fraction is small, one can say that there is very little posterior belief that the variant's transcription shift is practically equivalent to 0. The user must be cognizant of defining the region in accordance with observed noise and effect size levels. Note that the output ROPE fractions are NOT p-values.

Barcodes below the rep_cutoff quantile of representation in the DNA pools are discarded.

adaptive_precision indicates whether to adaptively increase the length of the MCMC chains for borderline functional variants. The edges of a 95 one edge of a variant's HDI interval is close to 0, this setting will tell the sampler to double the length of the MCMC chain. Using the default 95 HDI, this means that if a 92.5 the sampler will double the length of the MCMC chain. This argument should be turned off if tot_samp is already set to a high value.

a data frame with a row for each variant_id that specifies the posterior mean TS, upper and lower HDI bounds, a binary call of functional or non-functional, and other appropriate outputs. The output column is_functional is defined by the TS HDI excluding 0.

Sampler results for individual variants will be saved to the specified out_dir as they can be several megabytes each. The table of summary statistics that this function returns will also be saved into this directory into an object called "analysis_res.RData".

# This example fits the malacoda model on 3 variants with too-short MCMC chains

example_variants = c("1_205247315_2-3", "10_101274365_1-3", "10_45966422_2-3")

examples_to_evaluate = umpra_example[umpra_example$variant_id %in% example_variants,]

# tot_samp should be set to >50,000 to ensure the posterior chains converge
example_result = fit_mpra_model(mpra_data = examples_to_evaluate,
 priors = marg_prior_example,
 vb_pass = FALSE,
 tot_samp = 20,
 n_warmup = 10,
 adaptive_precision = FALSE)

print(example_result)