Simulate_bakRData: Simulating nucleotide recoding data

View source: R/Simulations.R

Simulate_bakRDataR Documentation

Simulating nucleotide recoding data

Description

Simulate_bakRData simulates a bakRData object. It's output also includes the simulated values of all kinetic parameters of interest. Only the number of genes (ngene) has to be set by the user, but an extensive list of additional parameters can be adjusted.

Usage

Simulate_bakRData(
  ngene,
  num_conds = 2L,
  nreps = 3L,
  eff_sd = 0.75,
  eff_mean = 0,
  fn_mean = 0,
  fn_sd = 1,
  kslog_c = 0.8,
  kslog_sd = 0.95,
  tl = 60,
  p_new = 0.05,
  p_old = 0.001,
  read_lengths = 200L,
  p_do = 0,
  noise_deg_a = -0.3,
  noise_deg_b = -1.5,
  noise_synth = 0.1,
  sd_rep = 0.05,
  low_L2FC_ks = -1,
  high_L2FC_ks = 1,
  num_kd_DE = c(0L, as.integer(rep(round(as.integer(ngene)/2), times =
    as.integer(num_conds) - 1))),
  num_ks_DE = rep(0L, times = as.integer(num_conds)),
  scale_factor = 150,
  sim_read_counts = TRUE,
  a1 = 5,
  a0 = 0.01,
  nreads = 50L,
  alpha = 25,
  beta = 75,
  STL = FALSE,
  STL_len = 40,
  lprob_U_sd = 0,
  lp_sd = 0
)

Arguments

ngene

Number of genes to simulate data for

num_conds

Number of experimental conditions (including the reference condition) to simulate

nreps

Number of replicates to simulate

eff_sd

Effect size; more specifically, the standard deviation of the normal distribution from which non-zero changes in logit(fraction new) are pulled from.

eff_mean

Effect size mean; mean of normal distribution from which non-zero changes in logit(fraction new) are pulled from. Note, setting this to 0 does not mean that some of the significant effect sizes will be 0, as any exact integer is impossible to draw from a continuous random number generator. Setting this to 0 just means that there is symmetric stabilization and destabilization

fn_mean

Mean of fraction news of simulated transcripts in reference condition. The logit(fraction) of RNA from each transcript that is metabolically labeled (new) is drawn from a normal distribution with this mean

fn_sd

Standard deviation of fraction news of simulated transcripts in reference condition. The logit(fraction) of RNA from each transcript that is metabolically labeled (new) is drawn from a normal distribution with this sd

kslog_c

Synthesis rate constants will be drawn from a lognormal distribution with meanlog = kslog_c - mean(log(kd_mean)) where kd_mean is determined from the fraction new simulated for each gene as well as the label time (tl).

kslog_sd

Synthesis rate lognormal standard deviation; see kslog_c documentation for details

tl

metabolic label feed time

p_new

metabolic label (e.g., s4U) induced mutation rate. Can be a vector of length num_conds

p_old

background mutation rate

read_lengths

Total read length for each sequencing read (e.g., PE100 reads correspond to read_lengths = 200)

p_do

Rate at which metabolic label containing reads are lost due to dropout; must be between 0 and 1

noise_deg_a

Slope of trend relating log10(standardized read counts) to log(replicate variability)

noise_deg_b

Intercept of trend relating log10(standardized read counts) to log(replicate variability)

noise_synth

Homoskedastic variability of L2FC(ksyn)

sd_rep

Variance of lognormal distribution from which replicate variability is drawn

low_L2FC_ks

Most negative L2FC(ksyn) that can be simulated

high_L2FC_ks

Most positive L2FC(ksyn) that can be simulated

num_kd_DE

Vector where each element represents the number of genes that show a significant change in stability relative to the reference. 1st entry must be 0 by definition (since relative to the reference the reference sample is unchanged)

num_ks_DE

Same as num_kd_DE but for significant changes in synthesis rates.

scale_factor

Factor relating RNA concentration (in arbitrary units) to average number of read counts

sim_read_counts

Logical; if TRUE, read counts are simulated as coming from a heterodisperse negative binomial distribution

a1

Heterodispersion 1/reads dependence parameter

a0

High read depth limit of negative binomial dispersion parameter

nreads

Number of reads simulated if sim_read_counts is FALSE

alpha

shape1 parameter of the beta distribution from which U-contents (probability that a nucleotide in a read from a transcript is a U) are drawn for each gene.

beta

shape2 parameter of the beta distribution from which U-contents (probability that a nucleotide in a read from a transcript is a U) are drawn for each gene.

STL

logical; if TRUE, simulation is of STL-seq rather than a standard TL-seq experiment. The two big changes are that a short read length is required (< 60 nt) and that every read for a particular feature will have the same number of Us. Only one read length is simulated for simplicity.

STL_len

Average length of simulated STL-seq length. Since Pol II typically pauses about 20-60 bases from the promoter, this should be around 40

lprob_U_sd

Standard deviation of the logit(probability nt is a U) for each sequencing read. The number of Us in a sequencing read are drawn from a binomial distribution with prob drawn from a logit-Normal distribution with this logit-sd.

lp_sd

Standard deviation of logit(probability a U is mutated) for each U. The number of mutations in a given read is the sum of nU Bernoulli random variables, where nU is the number of Us, and p is drawn from a logit-normal distribution with lp_sd standard deviation on logit scale.

Details

Simulate_bakRData simulates a bakRData object using a realistic generative model with many adjustable parameters. Average RNA kinetic parameters are drawn from biologically inspired distributions. Replicate variability is simulated by drawing a feature's fraction new in a given replicate from a logit-Normal distribution with a heteroskedastic variance term with average magnitude given by the chosen read count vs. variance relationship. For each replicate, a feature's ksyn is drawn from a homoskedastic lognormal distribution. Read counts can either be set to the same value for all simulated features or can be simulated according to a heterodisperse negative binomial distribution. The latter is the default

The number of Us in each sequencing read is drawn from a binomial distribution with number of trials equal to the read length and probability of each nucleotide being a U drawn from a beta distribution. Each read is assigned to the new or old population according to a Bernoulli distribution with p = fraction new. The number of mutations in each read are then drawn from one of two binomial distributions; if the read is assigned to the population of new RNA, the number of mutations are drawn from a binomial distribution with number of trials equal to the number of Us and probability of mutation = p_new; if the read is assigned to the population of old RNA, the number of mutations is instead drawn from a binomial distribution with the same number of trials but with the probability of mutation = p_old. p_new must be greater than p_old because mutations in new RNA arise from both background mutations that occur with probability p_old as well as metabolic label induced mutations

Simulated read counts should be treated as if they are spike-in and RPKM normalized, so the same scale factor of 1 can be applied to each sample when comparing the sequencing reads (e.g., if you are performing differential expression analysis).

Function to simulate a bakRData object according to a realistic generative model

Value

A list containing a simulated bakRData object as well as a list of simulated kinetic parameters of interest. The contents of the latter list are:

  • Effect_sim; Dataframe meant to mimic formatting of Effect_df that are part of bakRFit(StanFit = TRUE), bakRFit(HybridFit = TRUE) and bakRFit(bakRData object) output.

  • Fn_mean_sim; Dataframe meant to mimic formatting of Regularized_ests that is part of bakRFit(bakRData object) output. Contains information about the true fraction new simulated in each condition (the mean of the normal distribution from which replicate fraction news are simulated)

  • Fn_rep_sim; Dataframe meant to mimic formatting of Fn_Estimates that is part of bakRFit(bakRData object) output. Contains information about the fraction new simulated for each feature in each replicate of each condition.

  • L2FC_ks_mean; The true L2FC(ksyn) for each feature in each experimental condition. The i-th column corresponds to the L2FC(ksyn) when comparing the i-th condition to the reference condition (defined as the 1st condition) so the 1st column is always all 0s

  • RNA_conc; The average number of normalized read counts expected for each feature in each sample.

Examples


# 2 replicate, 2 experimental condition, 1000 gene simulation
sim_2reps <- Simulate_bakRData(ngene = 1000, nreps = 2)

# 3 replicate, 2 experimental condition, 1000 gene simulation
# with 100 instances of differential degradation kinetics
sim_3reps <- Simulate_bakRData(ngene = 1000, num_kd_DE = c(0, 100))

# 2 replicates, 3 experimental condition, 1000 gene simulation
# with 100 instances of differential degradation kinetics in the 1st
# condition and no instances of differential degradation kinetics in the
# 2nd condition
sim_3es <- Simulate_bakRData(ngene = 1000,
                             nreps = 2,
                             num_conds = 3,
                             num_kd_DE = c(0, 100, 0))



bakR documentation built on June 22, 2024, 6:55 p.m.