#' Simulating nucleotide recoding data
#'
#' \code{Simulate_bakRData} simulates a `bakRData` object. Its output also includes the simulated
#' values of all kinetic parameters of interest. Only the number of genes (\code{ngene}) has to be set by the
#' user, but an extensive list of additional parameters can be adjusted.
#'
#' \code{Simulate_bakRData} simulates a `bakRData` object using a realistic generative model with many
#' adjustable parameters. Average RNA kinetic parameters are drawn from biologically inspired
#' distributions. Replicate variability is simulated by drawing a feature's
#' fraction new in a given replicate from a logit-Normal distribution with a heteroskedastic
#' variance term with average magnitude given by the chosen read count vs. variance relationship.
#' For each replicate, a feature's ksyn is drawn from a homoskedastic lognormal distribution. Read counts
#' can either be set to the same value for all simulated features or can be simulated according to
#' a heterodisperse negative binomial distribution. The latter is the default.
#'
#' The number of Us in each sequencing read is drawn from a binomial distribution with number of trials
#' equal to the read length and probability of each nucleotide being a U drawn from a beta distribution. Each read is assigned to the
#' new or old population according to a Bernoulli distribution with p = fraction new. The number of
#' mutations in each read are then drawn from one of two binomial distributions; if the read is assigned to the
#' population of new RNA, the number of mutations are drawn from a binomial distribution with number of trials equal
#' to the number of Us and probability of mutation = \code{p_new}; if the read is assigned to the population of old RNA,
#' the number of mutations is instead drawn from a binomial distribution with the same number of trials but with the probability
#' of mutation = \code{p_old}. \code{p_new} must be greater than \code{p_old} because mutations in new RNA
#' arise from both background mutations that occur with probability \code{p_old} as well as metabolic label induced mutations.
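#'
#' As a concrete, purely illustrative example using the default parameter values: with
#' \code{read_lengths = 200} and an average U-content of 0.25 (the mean of the default
#' beta distribution, alpha/(alpha + beta) = 25/100), a typical read contains about 50 Us.
#' A read drawn from the new RNA population (\code{p_new} = 0.05) then averages ~2.5 mutations,
#' while a read from the old population (\code{p_old} = 0.001) averages ~0.05 mutations;
#' this difference is what makes the two populations statistically distinguishable.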
#'
#' Simulated read counts should be treated as if they are spike-in and RPKM normalized, so the same scale factor of 1 can be applied
#' to each sample when comparing the sequencing reads (e.g., if you are performing differential expression analysis).
#'
#' Function to simulate a `bakRData` object according to a realistic generative model
#' @param ngene Number of genes to simulate data for
#' @param num_conds Number of experimental conditions (including the reference condition) to simulate
#' @param nreps Number of replicates to simulate
#' @param eff_sd Effect size; more specifically, the standard deviation of the normal distribution from which non-zero
#' changes in logit(fraction new) are drawn.
#' @param eff_mean Effect size mean; mean of the normal distribution from which non-zero changes in logit(fraction new) are drawn.
#' Note that setting this to 0 does not mean that some of the significant effect sizes will be 0, as drawing any exact value
#' from a continuous distribution has probability 0. Setting this to 0 just means that stabilization and destabilization are equally likely.
#' @param fn_mean Mean of fraction news of simulated transcripts in reference condition. The logit(fraction) of RNA from each transcript that is
#' metabolically labeled (new) is drawn from a normal distribution with this mean
#' @param fn_sd Standard deviation of fraction news of simulated transcripts in reference condition. The logit(fraction) of RNA
#' from each transcript that is metabolically labeled (new) is drawn from a normal distribution with this sd
#' @param kslog_c Synthesis rate constants will be drawn from a lognormal distribution with meanlog = \code{kslog_c} - mean(log(kd_mean)) where kd_mean
#' is determined from the fraction new simulated for each gene as well as the label time (\code{tl}).
#' @param kslog_sd Synthesis rate lognormal standard deviation; see kslog_c documentation for details
#' @param tl metabolic label feed time
#' @param p_new metabolic label (e.g., s4U) induced mutation rate. Can be a vector of length num_conds
#' @param p_old background mutation rate
#' @param read_lengths Total read length for each sequencing read (e.g., PE100 reads correspond to read_lengths = 200)
#' @param p_do Rate at which metabolic label containing reads are lost due to dropout; must be between 0 and 1
#' @param noise_deg_a Slope of trend relating log10(standardized read counts) to log(replicate variability)
#' @param noise_deg_b Intercept of trend relating log10(standardized read counts) to log(replicate variability)
#' @param noise_synth Homoskedastic variability of L2FC(ksyn)
#' @param sd_rep Standard deviation (sdlog parameter) of the lognormal distribution from which replicate variability is drawn
#' @param low_L2FC_ks Most negative L2FC(ksyn) that can be simulated
#' @param high_L2FC_ks Most positive L2FC(ksyn) that can be simulated
#' @param num_kd_DE Vector where each element represents the number of genes that show a significant change in stability relative
#' to the reference. 1st entry must be 0 by definition (since relative to the reference the reference sample is unchanged)
#' @param num_ks_DE Same as num_kd_DE but for significant changes in synthesis rates.
#' @param scale_factor Factor relating RNA concentration (in arbitrary units) to average number of read counts
#' @param sim_read_counts Logical; if TRUE, read counts are simulated as coming from a heterodisperse negative binomial distribution
#' @param a1 Heterodispersion 1/reads dependence parameter
#' @param a0 High read depth limit of negative binomial dispersion parameter
#' @param nreads Number of reads simulated if sim_read_counts is FALSE
#' @param alpha shape1 parameter of the beta distribution from which U-contents (probability that a nucleotide in a read from a transcript is a U) are
#' drawn for each gene.
#' @param beta shape2 parameter of the beta distribution from which U-contents (probability that a nucleotide in a read from a transcript is a U) are
#' drawn for each gene.
#' @param STL logical; if TRUE, simulation is of STL-seq rather than a standard TL-seq experiment. The two big changes are that a short read length is required
#' (< 60 nt) and that every read for a particular feature will have the same number of Us. Only one read length is simulated for simplicity.
#' @param STL_len Average length of simulated STL-seq reads. Since Pol II typically pauses about 20-60 bases
#' from the promoter, this should be around 40.
#' @param lprob_U_sd Standard deviation of the logit(probability nt is a U) for each sequencing read. The number of Us in a
#' sequencing read are drawn from a binomial distribution with prob drawn from a logit-Normal distribution with this logit-sd.
#' @param lp_sd Standard deviation of logit(probability a U is mutated) for each U. The number of mutations in a given read is the sum of
#' nU Bernoulli random variables, where nU is the number of Us, and p is drawn from a logit-normal distribution with lp_sd standard deviation
#' on logit scale.
#' @importFrom magrittr %>%
#' @import data.table
#' @return A list containing a simulated `bakRData` object as well as a list of simulated kinetic parameters of interest.
#' The contents of the latter list are:
#' \itemize{
#' \item Effect_sim; Dataframe meant to mimic formatting of Effect_df that is part of \code{bakRFit(StanFit = TRUE)}, \code{bakRFit(HybridFit = TRUE)} and \code{bakRFit(bakRData object)} output.
#' \item Fn_mean_sim; Dataframe meant to mimic formatting of Regularized_ests that is part of \code{bakRFit(bakRData object)} output. Contains information
#' about the true fraction new simulated in each condition (the mean of the normal distribution from which replicate fraction news are simulated)
#' \item Fn_rep_sim; Dataframe meant to mimic formatting of Fn_Estimates that is part of \code{bakRFit(bakRData object)} output. Contains information
#' about the fraction new simulated for each feature in each replicate of each condition.
#' \item L2FC_ks_mean; The true L2FC(ksyn) for each feature in each experimental condition. The i-th column corresponds to the L2FC(ksyn) when comparing
#' the i-th condition to the reference condition (defined as the 1st condition) so the 1st column is always all 0s
#' \item RNA_conc; The average number of normalized read counts expected for each feature in each sample.
#' }
#' @examples
#' \donttest{
#' # 2 replicate, 2 experimental condition, 1000 gene simulation
#' sim_2reps <- Simulate_bakRData(ngene = 1000, nreps = 2)
#'
#' # 3 replicate, 2 experimental condition, 1000 gene simulation
#' # with 100 instances of differential degradation kinetics
#' sim_3reps <- Simulate_bakRData(ngene = 1000, num_kd_DE = c(0, 100))
#'
#' # 2 replicates, 3 experimental condition, 1000 gene simulation
#' # with 100 instances of differential degradation kinetics in the 1st
#' # condition and no instances of differential degradation kinetics in the
#' # 2nd condition
#' sim_3es <- Simulate_bakRData(ngene = 1000,
#' nreps = 2,
#' num_conds = 3,
#' num_kd_DE = c(0, 100, 0))
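#'
#' # 1000 gene STL-seq-like simulation (illustrative settings; STL_len = 40
#' # reflects the documented typical Pol II pause distance)
#' sim_STL <- Simulate_bakRData(ngene = 1000,
#' STL = TRUE,
#' STL_len = 40)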
#'
#' }
#' @export
#'
Simulate_bakRData <- function(ngene, num_conds = 2L, nreps = 3L, eff_sd = 0.75, eff_mean = 0,
fn_mean = 0, fn_sd = 1, kslog_c = 0.8, kslog_sd = 0.95,
tl = 60, p_new = 0.05, p_old = 0.001, read_lengths = 200L,
p_do = 0, noise_deg_a = -0.3, noise_deg_b = -1.5, noise_synth = 0.1, sd_rep = 0.05,
low_L2FC_ks = -1, high_L2FC_ks = 1,
num_kd_DE = c(0L, as.integer(rep(round(as.integer(ngene)/2), times = as.integer(num_conds)-1))),
num_ks_DE = rep(0L, times = as.integer(num_conds)),
scale_factor = 150,
sim_read_counts = TRUE, a1 = 5, a0 = 0.01,
nreads = 50L, alpha = 25, beta = 75,
STL = FALSE, STL_len = 40,
lprob_U_sd = 0, lp_sd = 0){
# Bind variables locally to resolve devtools::check() Notes
S <- R <- MT <- FN <- TC <- nT <- TP <- NULL
`.` <- list
#### Parameter checks
# STL
if(!is.logical(STL)){
stop("STL must be logical (TRUE or FALSE)")
}
# STL_len
if(STL){
if(STL_len < 25){
stop("STL_len is less than 25. Must be between 25 and 55.")
}else if(STL_len > 55){
stop("STL_len is greater than 55. Must be between 25 and 55.")
}
}
# alpha parameter of beta distribution
if(!is.numeric(alpha)){
stop("alpha must be numeric!")
}else if(alpha <= 0){
stop("alpha must be > 0")
}
if(!is.numeric(beta)){
stop("beta must be numeric!")
}else if(beta <= 0){
stop("beta must be > 0")
}
if(beta < 1 & alpha < 1){
warning("beta and alpha are both less than 1. This will lead to an unusual bimodality in the distribution of transcript U-contents")
}
if(alpha/(alpha + beta) < 0.1){
warning("alpha and beta are such that the average U-content is less than 0.1. That is unreasonably low and may make many simulated transcripts unanalyzable.")
}else if(alpha/(alpha + beta) > 0.75){
warning("alpha and beta are such that the average U-content is > 0.75. I wish U-contents were this high, your simulation may not reflect real data though.")
}
# Number of genes (features) to simulate
if(!is.numeric(ngene)){
stop("ngene must be numeric")
}else if(!is.integer(ngene)){
ngene <- as.integer(ngene)
}
if(ngene < 1){
stop("ngene must be > 0; it represents the number of features to simulate")
}
# Number of experimental conditions
if(!is.numeric(num_conds)){
stop("num_conds must be numeric")
}else if(!is.integer(num_conds)){
num_conds <- as.integer(num_conds)
}
if(num_conds < 1){
stop("num_conds must be > 0; it represents the number of experimental conditions")
}else if(num_conds == 1){
warning("You are only simulating a reference condition with no experimental conditions.")
}
# Number of replicates
if(!is.numeric(nreps)){
stop("nreps must be numeric")
}else if(!is.integer(nreps)){
nreps <- as.integer(nreps)
}
if(nreps < 1){
stop("nreps must be > 0; it represents the number of replicates")
}else if(nreps == 1){
warning("You are only simulating a single replicate; All statistical models implemented in bakR require > 1 replicate.")
}
# Effect size distribution standard deviation
if(!is.numeric(eff_sd)){
stop("eff_sd must be numeric")
}else if(eff_sd < 0){
stop("eff_sd must be >= 0; it will be the sd parameter of a call to rnorm when simulating effect sizes")
}else if (eff_sd > 4){
warning("You are simulating an unusually large eff_sd (> 4)")
}
# Fraction new distribution sd
if(!is.numeric(fn_sd)){
stop("fn_sd must be numeric")
}else if(fn_sd < 0){
stop("fn_sd must be >= 0; it will be the sd parameter of a call to rnorm when simulating reference fraction news")
}else if (fn_sd > 2){
warning("You are simulating an unusually large fn_sd (> 2). This will lead to lots of extreme fraction news (close to 0 and 1).")
}
# Label time
if(!is.numeric(tl)){
stop("tl must be numeric")
}else if(tl <= 0){
stop("tl must be > 0; it represents the label time in minutes.")
}
# s4U mutation rate
if(!all(is.numeric(p_new))){
stop("p_new must be numeric")
}else if(!all(p_new > 0)){
stop("p_new must be > 0; it represents the mutation rate of s4U labeled transcripts")
}else if(!all(p_new <= 1)){
stop("p_new must be <= 1; it represents the mutation rate of s4U labeled transcripts")
}else if(!all(p_new < 0.2)){
warning("You have simulated an unusually large s4U induced mutation rate (>= 0.2)")
}
# Checked separately so the warning branch above cannot mask this error
if(!all((p_new - p_old) > 0)){
stop("All p_new must be > p_old, since the background mutation rate (p_old) should always be less than the labeled mutation rate (p_new)")
}
# Background mutation rate
if(!all(is.numeric(p_old))){
stop("p_old must be numeric")
}else if(!all(p_old >= 0)){
stop("p_old must be >= 0; it represents the background mutation rate")
}else if(!all(p_old <= 1)){
stop("p_old must be <= 1; it represents the background mutation rate")
}else if(!all(p_old < 0.01)){
warning("You have simulated an unusually large background mutation rate (>= 0.01)")
}
# Read length
if(!all(is.numeric(read_lengths))){
stop("read_lengths must be numeric")
}else if(!all(is.integer(read_lengths))){
read_lengths <- as.integer(read_lengths)
}
if(!all(read_lengths > 0)){
stop("read_lengths must be > 0; it represents the total length of sequencing reads")
}else if(!all(read_lengths >= 20)){
warning("You have simulated an unusually short read length (< 20 nucleotides)")
}
# Dropout probability
if(!all(is.numeric(p_do))){
stop("p_do must be numeric")
}else if(!all(p_do <= 1)){
stop("p_do must be <= 1; it represents the fraction of s4U containing RNA lost during library prep")
}else if(!all(p_do >= 0)){
stop("p_do must be >= 0; it represents the fraction of s4U containing RNA lost during library prep")
}else if(!all(p_do == 0)){
warning("You are simulating dropout. Simulate_bakRData only properly simulates the effect of dropout on fraction news. Simulate_relative_bakRData implements a more realistic dropout simulation that also accurately affects read counts.")
}
# Heteroskedastic Slope
if(!is.numeric(noise_deg_a)){
stop("noise_deg_a must be numeric")
}else if(noise_deg_a > 0){
stop("noise_deg_a must be <= 0; it represents the slope of the log10(read depth) vs. log(replicate variability) trend, which is negative in real data")
}else if(noise_deg_a == 0){
warning("You are simulating fraction new homoskedasticity, which is not reflective of actual nucleotide recoding data")
}
# Synthesis variability
if(!is.numeric(noise_synth)){
stop("noise_synth must be numeric")
}else if(noise_synth < 0){
stop("noise_synth must be >= 0; it represents the homoskedastic variability in the synthesis rate")
}
# Replicate variability variability
if(!is.numeric(sd_rep)){
stop("sd_rep must be numeric")
}else if(sd_rep < 0){
stop("sd_rep must be >= 0; it represents the sdlog parameter of the log-normal distribution from which replicate variabilities are simulated")
}else if(sd_rep == 0){
warning("You are simulating no variability in the replicate variability. This can cause minor convergence issues in the completely hierarchical models implemented by bakR.")
}
# L2FC(ksyn) Bounds
if(!all(is.numeric(c(high_L2FC_ks, low_L2FC_ks)))){
stop("high_L2FC_ks and low_L2FC_ks must be numeric")
}else if(high_L2FC_ks < low_L2FC_ks){
stop("high_L2FC_ks must be greater than low_L2FC_ks; they represent the upper and lower bound of a uniform distribution respectively")
}
# Number of differentially stabilized features
if(!all(is.numeric(num_kd_DE))){
stop("All elements of num_kd_DE must be numeric")
}else if(!all(is.integer(num_kd_DE))){
num_kd_DE <- as.integer(num_kd_DE)
}
if (length(num_kd_DE) > num_conds){
stop("num_kd_DE has too many elements; it should be a vector of length == num_conds")
}else if (length(num_kd_DE) < num_conds){
stop("num_kd_DE has too few elements; it should be a vector of length == num_conds")
}else if(num_kd_DE[1] != 0){
stop("The 1st element of num_kd_DE must equal 0 by definition, as it represents the number of features differentially stabilized in the reference condition relative to the reference condition")
}else if(!all(num_kd_DE <= ngene)){
stop("Not all elements of num_kd_DE are less than or equal to the total number of simulated features")
}else if(!all(num_kd_DE >= 0)){
stop("Not all elements of num_kd_DE are >= 0")
}
# Number of differentially synthesized features
if(!all(is.numeric(num_ks_DE))){
stop("All elements of num_ks_DE must be numeric")
}else if(!all(is.integer(num_ks_DE))){
num_ks_DE <- as.integer(num_ks_DE)
}
if(length(num_ks_DE) > num_conds){
stop("num_ks_DE has too many elements; it should be a vector of length == num_conds")
}else if (length(num_ks_DE) < num_conds){
stop("num_ks_DE has too few elements; it should be a vector of length == num_conds")
}else if(num_ks_DE[1] != 0){
stop("The 1st element of num_ks_DE must equal 0 by definition, as it represents the number of features differentially synthesized in the reference condition relative to the reference condition")
}else if(!all(num_ks_DE <= ngene)){
stop("Not all elements of num_ks_DE are less than or equal to the total number of simulated features")
}else if(!all(num_ks_DE >= 0)){
stop("Not all elements of num_ks_DE are >= 0")
}
# Scale factor
if(!is.numeric(scale_factor)){
stop("scale_factor must be numeric")
}else if(scale_factor <=0){
stop("scale_factor must be > 0; it represents the factor by which the RNA concentration is multiplied to yield the average number of sequencing reads")
}else if(scale_factor < 10){
warning("You are simulating an unusually low scale_factor (< 10); small scale factors will lead to low read counts.")
}
# Simulate read count Boolean
if(!is.logical(sim_read_counts)){
stop("sim_read_counts must be a logical (TRUE or FALSE)")
}else if(!sim_read_counts){
# nreads is only used when read counts are NOT simulated
if(!is.numeric(nreads)){
stop("nreads must be numeric")
}else if(!is.integer(nreads)){
nreads <- as.integer(nreads)
}
if(nreads <= 0){
stop("nreads must be > 0; it represents the number of sequencing reads to be simulated for all features")
}
}
# Heterodispersion parameters
if(!all(is.numeric(c(a1, a0)))){
stop("a1 and a0 must be numeric")
}else if(a1 < 0){
stop("a1 must be >= 0; it relates the average read count to the negative binomial dispersion parameter")
}else if(a0 <= 0){
stop("a0 must be > 0; it represents the high read count limit of the negative binomial dispersion parameter")
}
# Parameters without numeric bounds
if(!is.numeric(noise_deg_b)){
stop("noise_deg_b must be numeric")
}
if(!is.numeric(eff_mean)){
stop("eff_mean must be numeric")
}
if(length(p_new) == 1){
p_new <- rep(p_new, times=num_conds)
}else if(length(p_new) != num_conds){
stop("p_new must be of length 1 or length == num_conds")
}
if(length(p_old) == 1){
p_old <- rep(p_old, times=num_conds)
}else if(length(p_old) != num_conds){
stop("p_old must be of length 1 or length == num_conds")
}
if(length(read_lengths) == 1){
read_lengths <- rep(read_lengths, times=num_conds)
}else if(length(read_lengths) != num_conds){
stop("read_lengths must be of length 1 or length == num_conds")
}
if(length(p_do) == 1){
p_do <- rep(p_do, times=num_conds)
}else if(length(p_do) != num_conds){
stop("p_do must be of length 1 or length == num_conds")
}
#### Simulation function from bakR
# Average number of Us in reads from each simulated feature
U_cont <- stats::rbeta(ngene, alpha, beta)
# Define helper functions:
logit <- function(x) log(x/(1-x))
inv_logit <- function(x) exp(x)/(1+exp(x))
#Initialize matrices
fn <- rep(0, times=ngene*nreps*num_conds)
dim(fn) <- c(ngene, num_conds, nreps)
Counts <- fn
kd <- fn
ks <- fn
#Initialize vectors of mean values for each gene and condition
fn_mean <- inv_logit(stats::rnorm(n=ngene, mean=fn_mean, sd=fn_sd))
kd_mean <- -log(1-fn_mean)/tl
ks_mean <- stats::rlnorm(n=ngene, meanlog = kslog_c + log(kd_mean), sdlog = kslog_sd)
effect_mean <- rep(0, times = ngene*num_conds)
dim(effect_mean) <- c(ngene, num_conds)
L2FC_ks_mean <- effect_mean
L2FC_kd_mean <- effect_mean
# Simulate true kinetic parameter values
for(i in 1:ngene){
#Make sure the user didn't input the wrong
#number of significant genes
if (i == 1 ){
if (length(num_kd_DE) > num_conds){
stop("num_kd_DE has too many elements")
} else if (length(num_kd_DE) < num_conds){
stop("num_kd_DE has too few elements")
}
if (length(num_ks_DE) > num_conds){
stop("num_ks_DE has too many elements")
} else if (length(num_ks_DE) < num_conds){
stop("num_ks_DE has too few elements")
}
}
# Simulate effect sizes (differences in kinetic parameters)
# effect_mean = difference in logit(fraction new)s
# L2FC_ks_mean = difference in L2FC(ksyn)
for(j in 1:num_conds){
if(j == 1){
effect_mean[i,1] <- 0
L2FC_ks_mean[i,1] <- 0
}else{
if(i < (ngene-num_kd_DE[j] + 1)){
effect_mean[i,j] <- 0
}else{
effect_mean[i,j] <- stats::rnorm(n=1, mean=eff_mean, sd=eff_sd)
}
if(i < (ngene-num_ks_DE[j] + 1)){
L2FC_ks_mean[i,j] <- 0
}else{
if (stats::runif(1) < 0.5){
L2FC_ks_mean[i,j] <- stats::runif(n=1, min=low_L2FC_ks, max=high_L2FC_ks)
}else{
L2FC_ks_mean[i,j] <- stats::runif(n=1, min=-high_L2FC_ks, max=-low_L2FC_ks)
}
}
}
}
}
# Difference in L2FC(kdeg)
L2FC_kd_mean <- log2(log(1 - inv_logit(logit(fn_mean) + effect_mean))/log(1- fn_mean))
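# Sketch of where the line above comes from (no extra computation is performed):
# under the exponential decay model, fn = 1 - exp(-kdeg * tl), so
# kdeg = -log(1 - fn) / tl, and therefore
# L2FC(kdeg) = log2( log(1 - fn_treated) / log(1 - fn_reference) ),
# where logit(fn_treated) = logit(fn_reference) + the simulated effect size.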
# RNA concentration implied by the simulated kinetics (steady-state ksyn/kdeg),
# computed outside the branch below because it is used downstream (and returned)
# even when sim_read_counts is FALSE
RNA_conc <- (ks_mean*2^(L2FC_ks_mean))/(kd_mean*2^(L2FC_kd_mean))
#Simulate read counts
if (sim_read_counts){
for(i in 1:ngene){
for(j in 1:num_conds){
for(k in 1:nreps){
# DESeq2 model of heterodisperse read counts
Counts[i, j, k] <- stats::rnbinom(n=1, size=1/((a1/(scale_factor*RNA_conc[i,j])) + a0), mu = scale_factor*RNA_conc[i,j])
if(Counts[i, j, k] < 5){
Counts[i, j, k] <- Counts[i, j, k] + stats::rpois(n=1, lambda = 2) + 1
}
}
}
}
} else{
Counts <- rep(nreads, times= ngene*num_conds*nreps)
dim(Counts) <- c(ngene, num_conds, nreps)
}
standard_RNA <- matrix(0, nrow = ngene, ncol = num_conds)
mean_RNA <- rep(0, times = num_conds)
sd_RNA <- rep(0, times = num_conds)
# Simulate kdeg and ksyn for each feature and replicate.
# replicate variability is simulated here
for(j in 1:num_conds){
mean_RNA[j] <- mean(log10(RNA_conc[,j]*scale_factor))
sd_RNA[j] <- stats::sd(log10(RNA_conc[,j]*scale_factor))
for(i in 1:ngene){
standard_RNA[i,j] <- (log10(RNA_conc[i,j]*scale_factor) - mean_RNA[j])/sd_RNA[j]
for(k in 1:nreps){
fn[i, j, k] <- inv_logit(stats::rnorm(1, mean=(logit(fn_mean[i]) + effect_mean[i,j]), sd = stats::rlnorm(1, noise_deg_a*standard_RNA[i,j] + noise_deg_b, sd_rep )))
ks[i,j,k] <- exp(stats::rnorm(1, mean=log((2^L2FC_ks_mean[i,j])*ks_mean[i]), sd=noise_synth))
}
}
}
# kdeg
kd <- -log(1 - fn)/tl
l <- ngene
# Per-replicate, per-condition dropout rates (column j = user-supplied p_do for condition j)
p_do <- matrix(rep(p_do, each = nreps), nrow = nreps, ncol = num_conds)
fn_real <- fn
# Simulate dropout
for(j in 1:num_conds){
for(k in 1:nreps){
fn_real[,j,k] <- (fn[,j,k]*(1-p_do[k,j]))/(1 - p_do[k,j]*(1 - fn[,j,k]))
Counts[,j,k] <- Counts[,j,k] - Counts[,j,k]*fn[,j,k]*p_do[k,j]
}
}
## Parameter names from old function
nmir <- l
fn_s4U <- fn_real
p_new_real_tc <- p_new
p_old_real_tc <- p_old
read_length <- read_lengths
nreads <- Counts
nsamp <- (nreps*num_conds) + num_conds
ctl <- c(rep(1, times=nreps*num_conds), rep(0, times=num_conds))
mt <- c(rep(1:num_conds, each=nreps),seq(from=1,to=num_conds,by=1))
replicate <- c(rep(seq(from=1, to=nreps), times=num_conds), rep(1, times=num_conds))
### Simulate for each sequencing read, whether it is new or old
# Read counts
Reads_vect <- as.vector(Counts)
# True average fraction news
fn_vect <- as.vector(fn_real)
# Total number of reads
tot_reads <- sum(Reads_vect)
# Numerical IDs for feature, experimental condition, replicate, and sample
Gene_ID <- rep(1:l, times = num_conds*nreps)
Exp_ID <- rep(rep(1:num_conds, each = l), times = nreps)
Rep_ID <- rep(1:nreps, each = l*num_conds)
Samp_ID <- rep(rep(seq(from = 1, to = num_conds*nreps, by = nreps), times = nreps) + rep(0:(nreps-1), each = num_conds), each = l)
# Avg. number of Us in sequencing reads from each feature
U_contents <- U_cont[Gene_ID]
# For simulating STL-seq, assume a single pause site. This means that the U-content
# is identical for all reads of a given length. To easily simulate this, use U-content
# to calculate the average number of Us, and add a bit of Poisson noise to the number of Us
# to simulate variation about the average due to the polymerase pausing at slightly
# different exact positions (i.e., it may go a bit past or end up a bit short of the average
# pause location, and thus incorporate a couple extra or a couple fewer Us).
if(STL){
nU <- abs(rep(round(U_contents*STL_len), times = Reads_vect) + sign(unlist(purrr::map(Reads_vect, stats::runif, min = -0.1, max = 0.1)))*unlist(purrr::map(Reads_vect, stats::rpois, lambda = 0.5)))
}else{
if(lprob_U_sd == 0){
nU <- unlist(purrr::pmap(list(n = Reads_vect, size = read_length[Exp_ID],
prob = U_contents),
stats::rbinom))
}else{
# Need function to draw each prob from a logit-normal for binomial
lnorm_binom <- function(n, size, lprob_mean, lprob_sd){
stats::rbinom(n = n,
size = size,
prob = inv_logit(stats::rnorm(n = n, mean = lprob_mean,
sd = lprob_sd)))
}
nU <- unlist(purrr::pmap(list(n = Reads_vect, size = read_length[Exp_ID], lprob_mean = logit(U_contents),
lprob_sd = rep(lprob_U_sd, times = length(Reads_vect))),
lnorm_binom))
}
}
# 1 = new read; 0 = old read
newness <- unlist(purrr::pmap(list(n = Reads_vect, size = 1, p = fn_vect), stats::rbinom))
Gene_ID <- rep(Gene_ID, times = Reads_vect)
Exp_ID <- rep(Exp_ID, times = Reads_vect)
Rep_ID <- rep(Rep_ID, times = Reads_vect)
Samp_ID <- rep(Samp_ID, times = Reads_vect)
# Simulate number of mutations
if(lp_sd == 0){
nmut <- stats::rbinom(n = length(nU), size = nU,
prob = newness*p_new_real_tc[Exp_ID] + (1-newness)*p_old_real_tc[Exp_ID])
}else{
logit_bernoulli_sum <- function(n, lp_mean, lp_sd){
sum(stats::rbinom(n = n, size = 1,
p = inv_logit(stats::rnorm(n = n,
mean = lp_mean,
sd = lp_sd))))
}
nmut <- unlist(purrr::pmap(list(n = nU,
lp_mean = logit(newness*p_new_real_tc[Exp_ID] + (1-newness)*p_old_real_tc[Exp_ID]),
lp_sd = lp_sd),
logit_bernoulli_sum))
}
# Create simulated cB file
cB <- dplyr::tibble(S = Samp_ID,
R = Rep_ID,
MT = Exp_ID,
FN = Gene_ID,
TC = nmut,
nT = nU,
TP = rep(1, times = length(Samp_ID)))
### Simulate -s4U control data
Gene_ID <- rep(1:l, times = num_conds*nreps)
Exp_ID <- rep(rep(1:num_conds, each = l), times = nreps)
Rep_ID <- rep(1:nreps, each = l*num_conds)
Samp_ID <- rep(rep(seq(from = 1, to = num_conds*nreps, by = nreps), times = nreps) + rep(0:(nreps-1), each = num_conds), each = l)
U_contents <- U_cont[Gene_ID]
ctl_data <- which(Rep_ID == 1)
Reads_ctl <- Reads_vect[ctl_data]
Gene_ctl <- Gene_ID[ctl_data]
Exp_ctl <- Exp_ID[ctl_data]
Rep_ctl <- Rep_ID[ctl_data]
Samp_ctl <- rep((max(Samp_ID) + 1):(max(Samp_ID) + num_conds), each = l )
U_ctl <- U_contents[ctl_data]
nU <- unlist(purrr::pmap(list(n = Reads_ctl, size = read_length[Exp_ctl], prob = U_ctl),
stats::rbinom))
Gene_ctl <- rep(Gene_ctl, times = Reads_ctl)
Exp_ctl <- rep(Exp_ctl, times = Reads_ctl)
Rep_ctl <- rep(Rep_ctl, times = Reads_ctl)
Samp_ctl <- rep(Samp_ctl, times = Reads_ctl)
nmut <- stats::rbinom(n = length(nU), size = nU, prob = p_old_real_tc[Exp_ctl])
cB_ctl <- data.frame(S = Samp_ctl,
R = Rep_ctl,
MT = Exp_ctl,
FN = Gene_ctl,
TC = nmut,
nT = nU,
TP = rep(0, times = length(Samp_ctl)))
cB_final <- dplyr::bind_rows(list(cB, cB_ctl))
# Extract simulated cB and summarise data
cB_sim_1 <- data.table::setDT(cB_final)
cols <- colnames(cB_sim_1)
cB_sim_1 <- cB_sim_1[,.N, by = .(S, R, MT, FN, TC, nT, TP)]
colnames(cB_sim_1) <- c(cols, "n")
cB_sim_1 <- cB_sim_1[order(cB_sim_1$S),]
# Define XF column
cB_sim_1$XF <- cB_sim_1$FN
# Map sample list to experimental characteristics
# type = whether s4U labeled (1) or not (0)
# mut = experimental condition ID
# rep = replicate ID
samp_list <- unique(cB_sim_1$S)
type_list <- rep(0, times=length(samp_list))
mut_list <- rep(0, times = length(samp_list))
rep_list <- rep(0, times = length(samp_list))
count <- 1
for(i in samp_list){
type_list[count] <- unique(cB_sim_1$TP[cB_sim_1$S == i])
rep_list[count] <- unique(cB_sim_1$R[cB_sim_1$S == i])
mut_list[count] <- unique(cB_sim_1$MT[cB_sim_1$S == i])
count <- count + 1
}
colnames(cB_sim_1) <- c("sample", "R", "MT", "FN", "TC", "nT", "TP", "n", "XF")
# metadata data frame for bakR
metadf <- data.frame(tl = type_list*tl, Exp_ID = as.integer(mut_list))
rownames(metadf) <- unique(cB_sim_1$sample)
cB_sim_1$sample <- as.character(cB_sim_1$sample)
# Make bakRData object
bakRData <- bakR::bakRData(cB_sim_1, metadf)
## Create data frames storing simulated truths
fn_vect <- c()
L2FC_kd_vect <- c()
effect_vect <- c()
fn_mean_vect <- c()
for(j in 1:num_conds){
if(j > 1){
L2FC_kd_vect <- c(L2FC_kd_vect, L2FC_kd_mean[,j])
effect_vect <- c(effect_vect, effect_mean[,j])
}
fn_mean_vect <- c(fn_mean_vect, logit(fn_mean) + effect_mean[,j])
for(i in 1:ngene){
fn_vect <- c(fn_vect, fn_real[i,j,])
}
}
# Make data frames similar to bakRFit outputs in terms of ordering
Fn_rep_sim <- data.frame(Feature_ID = rep(rep(1:ngene, each = nreps), times = num_conds),
Replicate = rep(1:nreps, times = ngene*num_conds),
Exp_ID = rep(1:num_conds, each = ngene*nreps),
Logit_fn = logit(fn_vect),
fn = fn_vect)
Fn_mean_sim <- data.frame(Feature_ID = rep(1:ngene, times = num_conds),
Exp_ID = rep(1:num_conds, each = ngene),
Avg_logit_fn = fn_mean_vect,
Avg_fn = inv_logit(fn_mean_vect))
if(num_conds > 1){
Effect_sim <- data.frame(Feature_ID = rep(1:ngene, times = (num_conds-1)),
Exp_ID = rep(2:num_conds, each = ngene),
L2FC_kdeg = L2FC_kd_vect,
effect = effect_vect)
Effect_sim <- Effect_sim[order(Effect_sim$Feature_ID, Effect_sim$Exp_ID),]
## Order dataframes as they are in fit output
Fn_rep_sim <- Fn_rep_sim[order(Fn_rep_sim$Feature_ID, Fn_rep_sim$Exp_ID, Fn_rep_sim$Replicate),]
Fn_mean_sim <- Fn_mean_sim[order(Fn_mean_sim$Feature_ID, Fn_mean_sim$Exp_ID),]
sim_data <- list(bakRData = bakRData,
sim_list = list(Effect_sim = Effect_sim,
Fn_mean_sim = Fn_mean_sim,
Fn_rep_sim = Fn_rep_sim,
L2FC_ks_mean = L2FC_ks_mean,
RNA_conc = RNA_conc*scale_factor,
Counts = Counts) )
}else{
## Order dataframes as they are in fit output
Fn_rep_sim <- Fn_rep_sim[order(Fn_rep_sim$Feature_ID, Fn_rep_sim$Exp_ID, Fn_rep_sim$Replicate),]
Fn_mean_sim <- Fn_mean_sim[order(Fn_mean_sim$Feature_ID, Fn_mean_sim$Exp_ID),]
sim_data <- list(bakRData = bakRData,
sim_list = list(Fn_mean_sim = Fn_mean_sim,
Fn_rep_sim = Fn_rep_sim,
L2FC_ks_mean = L2FC_ks_mean,
RNA_conc = RNA_conc*scale_factor,
Counts = Counts) )
}
return(sim_data)
}
#' Simulating nucleotide recoding data with relative count data
#'
#' \code{Simulate_relative_bakRData} simulates a `bakRData` object. Its output also includes the simulated
#' values of all kinetic parameters of interest.
#'
#' The main difference between \code{Simulate_relative_bakRData}
#' and \code{Simulate_bakRData} is that the former requires both the number of
#' genes (\code{ngene}) and the total number of reads (\code{depth}) to be set.
#' In the latter, only the number of genes is set, and the number of reads for each
#' gene is simulated so that no matter how many genes are simulated, the number of
#' reads given default parameters is reflective of what is seen in 20,000,000 read
#' human RNA-seq libraries. The benefit of \code{Simulate_relative_bakRData} is that it is
#' easier to test the impact of depth on model performance. This can theoretically
#' be done by changing the synthesis rate constant parameters in \code{Simulate_bakRData},
#' but the relationship between these parameters and sequencing depth is unintuitive. The
#' benefit of \code{Simulate_bakRData} is that fewer genes can be simulated
#' while still yielding reasonable per-gene coverage without figuring out what the
#' total depth in the small gene subset should be. This is nice for testing bakR and
#' other analysis tools on small datasets. \code{Simulate_relative_bakRData} is a more
#' realistic simulation that better accounts for the relative nature of RNA-seq read
#' counts (i.e., expected number of reads from a given feature is related to proportion of RNA molecules
#' coming from that feature).
#'
#' Another difference between \code{Simulate_relative_bakRData} and \code{Simulate_bakRData}
#' is that \code{Simulate_relative_bakRData} uses the label time and simulated degradation
#' rate constants to infer the fraction new, whereas \code{Simulate_bakRData} uses simulated
#' fraction news and the label time to infer the degradation rate constants. Thus,
#' \code{Simulate_relative_bakRData} is preferable for assessing the impact of label
#' time on model performance (since it will have a realistic impact on the fraction new,
#' and the distribution of fraction news has a major impact on model performance).
#' Similarly, \code{Simulate_bakRData} is preferable for directly assessing the impact of
#' fraction news on model performance, without having to think about how both the label
#' time and the simulated degradation rate constant distribution determine the fraction news.
#'
#' If investigating dropout, only \code{Simulate_relative_bakRData} should be used, as the
#' accurate simulation of read counts as being a function of the relative abundance of
#' each RNA feature is crucial to accurately simulate dropout.
#'
#'
#' Function to simulate a `bakRData` object according to a realistic generative model
#' @param ngene Number of genes to simulate data for
#' @param depth Total number of reads to simulate
#' @param num_conds Number of experimental conditions (including the reference condition) to simulate
#' @param nreps Number of replicates to simulate
#' @param eff_sd Effect size; more specifically, the standard deviation of the normal distribution from which non-zero
#' changes in logit(fraction new) are drawn.
#' @param eff_mean Effect size mean; mean of the normal distribution from which non-zero changes in logit(fraction new) are drawn.
#' Note, setting this to 0 does not mean that some of the significant effect sizes will be exactly 0, as drawing any exact value
#' from a continuous distribution has probability 0. Setting this to 0 just means that stabilization and destabilization are simulated symmetrically.
#' @param kdlog_mean Degradation rate constants will be drawn from lognormal distribution with this logmean
#' @param kdlog_sd Degradation rate constants will be drawn from lognormal distribution with this logsd
#' @param kslog_mean Synthesis rate constants will be drawn from a lognormal distribution with this logmean
#' @param kslog_sd Synthesis rate constants will be drawn from a lognormal distribution with this logsd
#' @param tl metabolic label feed time
#' @param p_new metabolic label (e.g., s4U) induced mutation rate. Can be a vector of length num_conds
#' @param p_old background mutation rate
#' @param read_lengths Total read length for each sequencing read (e.g., PE100 reads correspond to read_lengths = 200)
#' @param p_do Rate at which metabolic label containing reads are lost due to dropout; must be between 0 and 1
#' @param noise_deg_a Slope of trend relating log10(standardized read counts) to log(replicate variability)
#' @param noise_deg_b Intercept of trend relating log10(standardized read counts) to log(replicate variability)
#' @param noise_synth Homoskedastic variability of L2FC(ksyn)
#' @param sd_rep Standard deviation (sdlog) of the lognormal distribution from which replicate variability is drawn
#' @param low_L2FC_ks Most negative L2FC(ksyn) that can be simulated
#' @param high_L2FC_ks Most positive L2FC(ksyn) that can be simulated
#' @param num_kd_DE Vector where each element represents the number of genes that show a significant change in stability relative
#' to the reference. 1st entry must be 0 by definition (since the reference is unchanged relative to itself)
#' @param num_ks_DE Same as num_kd_DE but for significant changes in synthesis rates.
#' @param sim_read_counts Logical; if TRUE, read counts are simulated as coming from a heterodisperse negative binomial distribution
#' @param a1 Heterodispersion 1/reads dependence parameter
#' @param a0 High read depth limit of negative binomial dispersion parameter
#' @param nreads Number of reads simulated if sim_read_counts is FALSE
#' @param alpha shape1 parameter of the beta distribution from which U-contents (probability that a nucleotide in a read from a transcript is a U) are
#' drawn for each gene.
#' @param beta shape2 parameter of the beta distribution from which U-contents (probability that a nucleotide in a read from a transcript is a U) are
#' drawn for each gene.
#' @param STL logical; if TRUE, simulation is of STL-seq rather than a standard TL-seq experiment. The two big changes are that a short read length is required
#' (< 60 nt) and that every read for a particular feature will have the same number of Us. Only one read length is simulated for simplicity.
#' @param STL_len Average length of simulated STL-seq reads. Since Pol II typically pauses about 20-60 bases
#' from the promoter, this should be around 40
#' @param lprob_U_sd Standard deviation of the logit(probability nt is a U) for each sequencing read. The number of Us in a
#' sequencing read are drawn from a binomial distribution with prob drawn from a logit-Normal distribution with this logit-sd.
#' @param lp_sd Standard deviation of logit(probability a U is mutated) for each U. The number of mutations in a given read is the sum of
#' nU Bernoulli random variables, where nU is the number of Us, and p is drawn from a logit-normal distribution with lp_sd standard deviation
#' on logit scale.
#' @importFrom magrittr %>%
#' @import data.table
#' @return A list containing a simulated `bakRData` object as well as a list of simulated kinetic parameters of interest.
#' The contents of the latter list are:
#' \itemize{
#' \item Effect_sim; Dataframe meant to mimic formatting of Effect_df that is part of \code{bakRFit(StanFit = TRUE)}, \code{bakRFit(HybridFit = TRUE)} and \code{bakRFit(bakRData object)} output.
#' \item Fn_mean_sim; Dataframe meant to mimic formatting of Regularized_ests that is part of \code{bakRFit(bakRData object)} output. Contains information
#' about the true fraction new simulated in each condition (the mean of the normal distribution from which replicate fraction news are simulated)
#' \item Fn_rep_sim; Dataframe meant to mimic formatting of Fn_Estimates that is part of \code{bakRFit(bakRData object)} output. Contains information
#' about the fraction new simulated for each feature in each replicate of each condition.
#' \item L2FC_ks_mean; The true L2FC(ksyn) for each feature in each experimental condition. The i-th column corresponds to the L2FC(ksyn) when comparing
#' the i-th condition to the reference condition (defined as the 1st condition) so the 1st column is always all 0s
#' \item RNA_conc; The average number of normalized read counts expected for each feature in each sample.
#' }
#' @examples
#' \donttest{
#' # 2 replicate, 2 experimental condition, 1000 gene simulation
#' sim_2reps <- Simulate_relative_bakRData(ngene = 1000, depth = 100000,
#' nreps = 2)
#'
#' # 3 replicate, 2 experimental condition, 1000 gene simulation
#' # with 100 instances of differential degradation kinetics
#' sim_3reps <- Simulate_relative_bakRData(ngene = 1000, depth = 100000,
#' num_kd_DE = c(0, 100))
#'
#' # 2 replicates, 3 experimental condition, 1000 gene simulation
#' # with 100 instances of differential degradation kinetics in the 1st
#' # condition and no instances of differential degradation kinetics in the
#' # 2nd condition
#' sim_3es <- Simulate_relative_bakRData(ngene = 1000, depth = 100000,
#' nreps = 2,
#' num_conds = 3,
#' num_kd_DE = c(0, 100, 0))
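#'
#' # The Details note that only Simulate_relative_bakRData should be used
#' # when investigating dropout. An illustrative simulation with 30% of
#' # metabolic label containing reads lost to dropout (the p_do value here
#' # is an example, not a recommendation):
#' sim_do <- Simulate_relative_bakRData(ngene = 1000, depth = 100000,
#' nreps = 2, p_do = 0.3)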
#'
#' }
#' @export
#'
Simulate_relative_bakRData <- function(ngene, depth, num_conds = 2L, nreps = 3L, eff_sd = 0.75, eff_mean = 0,
kdlog_mean = -1.8, kdlog_sd = 0.65, kslog_mean = 1, kslog_sd = 0.65,
tl = 2, p_new = 0.05, p_old = 0.001, read_lengths = 200L,
p_do = 0, noise_deg_a = -0.3, noise_deg_b = -1.5, noise_synth = 0.1, sd_rep = 0.05,
low_L2FC_ks = -1, high_L2FC_ks = 1,
num_kd_DE = c(0L, as.integer(rep(round(as.integer(ngene)/2), times = as.integer(num_conds)-1))),
num_ks_DE = rep(0L, times = as.integer(num_conds)),
sim_read_counts = TRUE, a1 = 5, a0 = 0.01,
nreads = 50L, alpha = 25, beta = 75,
STL = FALSE, STL_len = 40,
lprob_U_sd = 0, lp_sd = 0){
# Bind variables locally to resolve devtools::check() Notes
S <- R <- MT <- FN <- TC <- nT <- TP <- NULL
`.` <- list
#### Parameter checks
# STL
if(!is.logical(STL)){
stop("STL must be logical (TRUE or FALSE)")
}
# STL_len
if(STL){
if(STL_len < 25){
stop("STL_len is less than 25. Must be between 25 and 55.")
}else if(STL_len > 55){
stop("STL_len is greater than 55. Must be between 25 and 55.")
}
}
# alpha parameter of beta distribution
if(!is.numeric(alpha)){
stop("alpha must be numeric!")
}else if(alpha <= 0){
stop("alpha must be > 0")
}
if(!is.numeric(beta)){
stop("beta must be numeric!")
}else if(beta <= 0){
stop("beta must be > 0")
}
if(beta < 1 & alpha < 1){
warning("beta and alpha are both less than 1. This will lead to an unusual bimodality in the distribution of transcript U-contents")
}
if(alpha/(alpha + beta) < 0.1){
warning("alpha and beta are such that the average U-content is less than 0.1. That is unreasonably low and may make many simulated transcripts unanalyzable.")
}else if(alpha/(alpha + beta) > 0.75){
warning("alpha and beta are such that the average U-content is > 0.75. I wish U-contents were this high, your simulation may not reflect real data though.")
}
# Number of genes (features) to simulate
if(!is.numeric(ngene)){
stop("ngene must be numeric")
}else if(!is.integer(ngene)){
ngene <- as.integer(ngene)
}
if(ngene < 1){
stop("ngene must be > 0; it represents the number of features to simulate")
}
# Number of experimental conditions
if(!is.numeric(num_conds)){
stop("num_conds must be numeric")
}else if(!is.integer(num_conds)){
num_conds <- as.integer(num_conds)
}
if(num_conds < 1){
stop("num_conds must be > 0; it represents the number of experimental conditions")
}else if(num_conds == 1){
warning("You are only simulating a reference condition with no experimental conditions.")
}
# Number of replicates
if(!is.numeric(nreps)){
stop("nreps must be numeric")
}else if(!is.integer(nreps)){
nreps <- as.integer(nreps)
}
if(nreps < 1){
stop("nreps must be > 0; it represents the number of replicates")
}else if(nreps == 1){
warning("You are only simulating a single replicate; All statistical models implemented in bakR require > 1 replicate.")
}
# Effect size distribution standard deviation
if(!is.numeric(eff_sd)){
stop("eff_sd must be numeric")
}else if(eff_sd < 0){
stop("eff_sd must be >= 0; it will be the sd parameter of a call to rnorm when simulating effect sizes")
}else if (eff_sd > 4){
warning("You are simulating an unusually large eff_sd (> 4)")
}
# Label time
if(!is.numeric(tl)){
stop("tl must be numeric")
}else if(tl <= 0){
stop("tl must be > 0; it represents the label time in minutes.")
}
# s4U mutation rate
if(!all(is.numeric(p_new))){
stop("p_new must be numeric")
}else if(!all(p_new > 0)){
stop("p_new must be > 0; it represents the mutation rate of s4U labeled transcripts")
}else if(!all(p_new <= 1)){
stop("p_new must be <= 1; it represents the mutation rate of s4U labeled transcripts")
}else if(!all((p_new - p_old) > 0)){
# Check p_new > p_old before the magnitude warning so this stop cannot be skipped
stop("All p_new must be > p_old, since the background mutation rate (p_old) should always be less than the labeled mutation rate (p_new)")
}else if(!all(p_new < 0.2)){
warning("You have simulated an unusually large s4U induced mutation rate (>= 0.2)")
}
# Background mutation rate
if(!all(is.numeric(p_old))){
stop("p_old must be numeric")
}else if(!all(p_old >= 0)){
stop("p_old must be >= 0; it represents the background mutation rate")
}else if(!all(p_old <= 1)){
stop("p_old must be <= 1; it represents the background mutation rate")
}else if(!all(p_old < 0.01)){
warning("You have simulated an unusually large background mutation rate (>= 0.01)")
}
# Read length
if(!all(is.numeric(read_lengths))){
stop("read_lengths must be numeric")
}else if(!all(is.integer(read_lengths))){
read_lengths <- as.integer(read_lengths)
}
if(!all(read_lengths > 0)){
stop("read_lengths must be > 0; it represents the total length of sequencing reads")
}else if(!all(read_lengths >= 20)){
warning("You have simulated an unusually short read length (< 20 nucleotides)")
}
# Dropout probability
if(!all(is.numeric(p_do))){
stop("p_do must be numeric")
}else if(!all(p_do <= 1)){
stop("p_do must be <= 1; it represents the fraction of s4U containing RNA lost during library prep")
}else if(!all(p_do >= 0)){
stop("p_do must be >= 0; it represents the fraction of s4U containing RNA lost during library prep")
}
# Heteroskedastic Slope
if(!is.numeric(noise_deg_a)){
stop("noise_deg_a must be numeric")
}else if(noise_deg_a > 0){
stop("noise_deg_a must be <= 0; it represents the slope of the log10(read depth) vs. log(replicate variability) trend")
}else if(noise_deg_a == 0){
warning("You are simulating fraction new homoskedasticity, which is not reflective of actual nucleotide recoding data")
}
# Synthesis variability
if(!is.numeric(noise_synth)){
stop("noise_synth must be numeric")
}else if(noise_synth < 0){
stop("noise_synth must be >= 0; it represents the homoskedastic variability in the synthesis rate")
}
# Replicate variability variability
if(!is.numeric(sd_rep)){
stop("sd_rep must be numeric")
}else if(sd_rep < 0){
stop("sd_rep must be >= 0; it represents the sdlog parameter of the log-normal distribution from which replicate variabilites are simulated")
}else if(sd_rep == 0){
warning("You are simulating no variability in the replicate variability. This can cause minor convergence issues in the completely hierarchical
models implemented by bakR.")
}
# L2FC(ksyn) Bounds
if(!all(is.numeric(c(high_L2FC_ks, low_L2FC_ks)))){
stop("high_L2FC_ks and low_L2FC_ks must be numeric")
}else if(high_L2FC_ks < low_L2FC_ks){
stop("high_L2FC_ks must be greater than low_L2FC_ks; they represent the upper and lower bound of a uniform distribution respectively")
}
# Number of differentially stabilized features
if(!all(is.numeric(num_kd_DE))){
stop("All elements of num_kd_DE must be numeric")
}else if(!all(is.integer(num_kd_DE))){
num_kd_DE <- as.integer(num_kd_DE)
}
if (length(num_kd_DE) > num_conds){
stop("num_kd_DE has too many elements; it should be a vector of length == num_conds")
}else if (length(num_kd_DE) < num_conds){
stop("num_kd_DE has too few elements; it should be a vector of length == num_conds")
}else if(num_kd_DE[1] != 0){
stop("The 1st element of num_kd_DE must equal 0 by definition, as it represents the number of features differentially stabilized in the reference
condition relative to the reference condition")
}else if(!all(num_kd_DE <= ngene)){
stop("Not all elements of num_kd_DE are less than or equal to the total number of simulated features")
}else if(!all(num_kd_DE >= 0)){
stop("Not all elements of num_kd_DE are >= 0")
}
# Number of differentially synthesized features
if(!all(is.numeric(num_ks_DE))){
stop("All elements of num_ks_DE must be numeric")
}else if(!all(is.integer(num_ks_DE))){
num_ks_DE <- as.integer(num_ks_DE)
}
if(length(num_ks_DE) > num_conds){
stop("num_ks_DE has too many elements; it should be a vector of length == num_conds")
}else if (length(num_ks_DE) < num_conds){
stop("num_ks_DE has too few elements; it should be a vector of length == num_conds")
}else if(num_ks_DE[1] != 0){
stop("The 1st element of num_ks_DE must equal 0 by definition, as it represents the number of features differentially synthesized in the reference
condition relative to the reference condition")
}else if(!all(num_ks_DE <= ngene)){
stop("Not all elements of num_ks_DE are less than or equal to the total number of simulated features")
}else if(!all(num_ks_DE >= 0)){
stop("Not all elements of num_ks_DE are >= 0")
}
# Simulate read count Boolean
if(!is.logical(sim_read_counts)){
stop("sim_read_counts must be a logical (TRUE or FALSE)")
}else if(!sim_read_counts){
# nreads is only used when read counts are not simulated
if(!is.integer(nreads)){
stop("nreads must be an integer")
}else if(nreads <= 0){
stop("nreads must be > 0; it represents the number of sequencing reads to be simulated for all features")
}
}
# Heterodispersion parameters
if(!all(is.numeric(c(a1, a0)))){
stop("a1 and a0 must be numeric")
}else if(a1 < 0){
stop("a1 must be >= 0; it relates the average read count to the negative binomial dispersion parameter")
}else if(a0 <= 0){
stop("a0 must be > 0; it represents the high read count limit of the negative binomial dispersion parameter")
}
# Parameters without numeric bounds
if(!is.numeric(noise_deg_b)){
stop("noise_deg_b must be numeric")
}
if(!is.numeric(eff_mean)){
stop("eff_mean must be numeric")
}
if(length(p_new) == 1){
p_new <- rep(p_new, times=num_conds)
}else if(length(p_new) != num_conds){
stop("p_new must be of length 1 or length == num_conds")
}
if(length(p_old) == 1){
p_old <- rep(p_old, times=num_conds)
}else if(length(p_old) != num_conds){
stop("p_old must be of length 1 or length == num_conds")
}
if(length(read_lengths) == 1){
read_lengths <- rep(read_lengths, times=num_conds)
}else if(length(read_lengths) != num_conds){
stop("read_lengths must be of length 1 or length == num_conds")
}
if(length(p_do) == 1){
p_do <- rep(p_do, times=num_conds)
}else if(length(p_do) != num_conds){
stop("p_do must be of length 1 or length == num_conds")
}
#### Simulation function from bakR
# Average number of Us in reads from each simulated feature
U_cont <- stats::rbeta(ngene, alpha, beta)
# Define helper functions:
logit <- function(x) log(x/(1-x))
inv_logit <- function(x) exp(x)/(1+exp(x))
#Initialize matrices
fn <- rep(0, times=ngene*nreps*num_conds)
dim(fn) <- c(ngene, num_conds, nreps)
Counts <- fn
# 1 replicate of -s4U control data (could in theory make number of replicates a parameter)
Counts_ctl <- rep(0, times = ngene*num_conds)
dim(Counts_ctl) <- c(ngene, num_conds)
kd <- fn
ks <- fn
#Initialize vectors of mean values for each gene and condition
kd_mean <- stats::rlnorm(n = ngene, meanlog = kdlog_mean, sdlog = kdlog_sd)
ks_mean <- stats::rlnorm(n=ngene, meanlog = kslog_mean , sdlog = kslog_sd)
effect_mean <- rep(0, times = ngene*num_conds)
dim(effect_mean) <- c(ngene, num_conds)
L2FC_ks_mean <- effect_mean
L2FC_kd_mean <- effect_mean
# Simulate true kinetic parameter values
for(i in 1:ngene){
#Make sure the user didn't input the wrong
#number of significant genes
if (i == 1 ){
if (length(num_kd_DE) > num_conds){
stop("num_kd_DE has too many elements")
} else if (length(num_kd_DE) < num_conds){
stop("num_kd_DE has too few elements")
}
if (length(num_ks_DE) > num_conds){
stop("num_ks_DE has too many elements")
} else if (length(num_ks_DE) < num_conds){
stop("num_ks_DE has too few elements")
}
}
# Simulate effect sizes (differences in kinetic parameters)
# effect_mean = difference in logit(fraction new)s
# L2FC_ks_mean = difference in L2FC(ksyn)
for(j in 1:num_conds){
if(j == 1){
effect_mean[i,1] <- 0
L2FC_ks_mean[i,1] <- 0
}else{
if(i < (ngene-num_kd_DE[j] + 1)){
effect_mean[i,j] <- 0
}else{
effect_mean[i,j] <- stats::rnorm(n=1, mean=eff_mean, sd=eff_sd)
}
if(i < (ngene-num_ks_DE[j] + 1)){
L2FC_ks_mean[i,j] <- 0
}else{
if (stats::runif(1) < 0.5){
L2FC_ks_mean[i,j] <- stats::runif(n=1, min=low_L2FC_ks, max=high_L2FC_ks)
}else{
L2FC_ks_mean[i,j] <- stats::runif(n=1, min=-high_L2FC_ks, max=-low_L2FC_ks)
}
}
}
}
}
# Average WT fraction new
fn_mean <- 1 - exp(-kd_mean*tl)
# No 1s or 0s
fn_mean <- ifelse(fn_mean > 0.999, 0.999, fn_mean)
fn_mean <- ifelse(fn_mean < 0.001, 0.001, fn_mean)
# Difference in L2FC(kdeg): since fn = 1 - exp(-kdeg*tl), kdeg = -log(1 - fn)/tl, so
# L2FC(kdeg) = log2(log(1 - fn_cond)/log(1 - fn_ref)), with logit(fn_cond) = logit(fn_ref) + effect
L2FC_kd_mean <- log2(log(1 - inv_logit(logit(fn_mean) + effect_mean))/log(1- fn_mean))
# RNA concentration assuming steady-state
L2FC_tot_mean <- L2FC_ks_mean - L2FC_kd_mean
RNA_conc <- (ks_mean*2^(L2FC_ks_mean))/(kd_mean*2^(L2FC_kd_mean))
# Simulate dropout
fn_mean_matrix <- inv_logit(logit(fn_mean) + effect_mean)
RNA_conc_s4U <- RNA_conc*fn_mean_matrix*(1-p_do) + RNA_conc*(1-fn_mean_matrix)
RNA_conc_ctl <- RNA_conc
# Relative amount of each RNA species
rel_RNA_c_s4U <- RNA_conc_s4U/matrix(colSums(RNA_conc_s4U), nrow = ngene, ncol = num_conds, byrow = TRUE)
rel_RNA_c_ctl <- RNA_conc/matrix(colSums(RNA_conc_ctl), nrow = ngene, ncol = num_conds, byrow = TRUE)
# Simulate read counts using the user-specified heterodispersion parameters (a1, a0)
for(i in 1:ngene){
for(j in 1:num_conds){
## -s4U control data
Counts_ctl[i, j] <- stats::rnbinom(n=1, size=1/((a1/(depth*rel_RNA_c_ctl[i,j])) + a0), mu = depth*rel_RNA_c_ctl[i,j])
if(Counts_ctl[i, j] < 5){
Counts_ctl[i, j] <- Counts_ctl[i, j] + stats::rpois(n=1, lambda = 2) + 1
}
for(k in 1:nreps){
## +s4U data
# DESeq2 model of heterodisperse read counts
Counts[i, j, k] <- stats::rnbinom(n=1, size=1/((a1/(depth*rel_RNA_c_s4U[i,j])) + a0), mu = depth*rel_RNA_c_s4U[i,j])
if(Counts[i, j, k] < 5){
Counts[i, j, k] <- Counts[i, j, k] + stats::rpois(n=1, lambda = 2) + 1
}
}
}
}
standard_RNA <- matrix(0, nrow = ngene, ncol = num_conds)
mean_RNA <- rep(0, times = num_conds)
sd_RNA <- rep(0, times = num_conds)
# Simulate kdeg and ksyn for each feature and replicate.
# replicate variability is simulated here
for(j in 1:num_conds){
mean_RNA[j] <- mean(log10(rel_RNA_c_s4U[,j]*depth))
sd_RNA[j] <- stats::sd(log10(rel_RNA_c_s4U[,j]*depth))
for(i in 1:ngene){
standard_RNA[i,j] <- (log10(rel_RNA_c_s4U[i,j]*depth) - mean_RNA[j])/sd_RNA[j]
for(k in 1:nreps){
fn[i, j, k] <- inv_logit(stats::rnorm(1, mean=(logit(fn_mean[i]) + effect_mean[i,j]), sd = stats::rlnorm(1, noise_deg_a*standard_RNA[i,j] + noise_deg_b, sd_rep )))
ks[i,j,k] <- exp(stats::rnorm(1, mean=log((2^L2FC_ks_mean[i,j])*ks_mean[i]), sd=noise_synth))
}
}
}
# kdeg
kd <- -log(1 - fn)/tl
l <- ngene
# Simulate dropout
fn_real <- (fn*(1-p_do))/((fn*(1-p_do)) + (1-fn))
## Parameter names from old function
nmir <- l
fn_s4U <- fn_real
p_new_real_tc <- p_new
p_old_real_tc <- p_old
read_length <- read_lengths
nreads <- Counts
nsamp <- (nreps*num_conds) + num_conds
ctl <- c(rep(1, times=nreps*num_conds), rep(0, times=num_conds))
mt <- c(rep(1:num_conds, each=nreps),seq(from=1,to=num_conds,by=1))
replicate <- c(rep(seq(from=1, to=nreps), times=num_conds), rep(1, times=num_conds))
### Simulate for each sequencing read, whether it is new or old
# Read counts
Reads_vect <- as.vector(Counts)
# True average fraction news
fn_vect <- as.vector(fn_real)
# Total number of reads
tot_reads <- sum(Reads_vect)
# Numerical IDs for feature, experimental condition, replicate, and sample
Gene_ID <- rep(1:l, times = num_conds*nreps)
Exp_ID <- rep(rep(1:num_conds, each = l), times = nreps)
Rep_ID <- rep(1:nreps, each = l*num_conds)
Samp_ID <- rep(rep(seq(from = 1, to = num_conds*nreps, by = nreps), times = nreps) + rep(0:(nreps-1), each = num_conds), each = l)
# Avg. number of Us in sequencing reads from each feature
U_contents <- U_cont[Gene_ID]
# For simulating STL-seq, assume a single pause site. This means that the U-content
# is identical for all reads of a given length. To easily simulate this, use U-content
# to calculate the average number of Us, and add a bit of Poisson noise to the number of Us
# to simulate variation about the average due to the polymerase pausing at slightly
# different exact positions (i.e., it may go a bit past or end up a bit short of the average
# pause location, and thus incorporate a couple extra or a couple fewer Us).
if(STL){
nU <- abs(rep(round(U_contents*STL_len), times = Reads_vect) + sign(unlist(purrr::map(Reads_vect, stats::runif, min = -0.1, max = 0.1)))*unlist(purrr::map(Reads_vect, stats::rpois, lambda = 0.5)))
}else{
if(lprob_U_sd == 0){
nU <- unlist(purrr::pmap(list(n = Reads_vect, size = read_length[Exp_ID],
prob = U_contents),
stats::rbinom))
}else{
# Need function to draw each prob from a logit-normal for binomial
lnorm_binom <- function(n, size, lprob_mean, lprob_sd){
stats::rbinom(n = n,
size = size,
prob = inv_logit(stats::rnorm(n = n, mean = lprob_mean,
sd = lprob_sd)))
}
nU <- unlist(purrr::pmap(list(n = Reads_vect, size = read_length[Exp_ID], lprob_mean = logit(U_contents),
lprob_sd = rep(lprob_U_sd, times = length(Reads_vect))),
lnorm_binom))
}
}
# 1 = new read; 0 = old read
newness <- unlist(purrr::pmap(list(n = Reads_vect, size = 1, prob = fn_vect), stats::rbinom))
Gene_ID <- rep(Gene_ID, times = Reads_vect)
Exp_ID <- rep(Exp_ID, times = Reads_vect)
Rep_ID <- rep(Rep_ID, times = Reads_vect)
Samp_ID <- rep(Samp_ID, times = Reads_vect)
# Simulate number of mutations
if(lp_sd == 0){
nmut <- stats::rbinom(n = length(nU), size = nU,
prob = newness*p_new_real_tc[Exp_ID] + (1-newness)*p_old_real_tc[Exp_ID])
}else{
logit_bernoulli_sum <- function(n, lp_mean, lp_sd){
sum(stats::rbinom(n = n, size = 1,
prob = inv_logit(stats::rnorm(n = n,
mean = lp_mean,
sd = lp_sd))))
}
nmut <- unlist(purrr::pmap(list(n = nU,
lp_mean = logit(newness*p_new_real_tc[Exp_ID] + (1-newness)*p_old_real_tc[Exp_ID]),
lp_sd = lp_sd),
logit_bernoulli_sum))
}
# Create simulated cB file
cB <- dplyr::tibble(S = Samp_ID,
R = Rep_ID,
MT = Exp_ID,
FN = Gene_ID,
TC = nmut,
nT = nU,
TP = rep(1, times = length(Samp_ID)))
### Simulate -s4U control data
Gene_ID <- rep(1:l, times = num_conds*nreps)
Exp_ID <- rep(rep(1:num_conds, each = l), times = nreps)
Rep_ID <- rep(1:nreps, each = l*num_conds)
Samp_ID <- rep(rep(seq(from = 1, to = num_conds*nreps, by = nreps), times = nreps) + rep(0:(nreps-1), each = num_conds), each = l)
U_contents <- U_cont[Gene_ID]
ctl_data <- which(Rep_ID == 1)
Reads_ctl <- as.vector(Counts_ctl)
Gene_ctl <- Gene_ID[ctl_data]
Exp_ctl <- Exp_ID[ctl_data]
Rep_ctl <- Rep_ID[ctl_data]
Samp_ctl <- rep((max(Samp_ID) + 1):(max(Samp_ID) + num_conds), each = l )
U_ctl <- U_contents[ctl_data]
nU <- unlist(purrr::pmap(list(n = Reads_ctl, size = read_length[Exp_ctl], prob = U_ctl),
stats::rbinom))
Gene_ctl <- rep(Gene_ctl, times = Reads_ctl)
Exp_ctl <- rep(Exp_ctl, times = Reads_ctl)
Rep_ctl <- rep(Rep_ctl, times = Reads_ctl)
Samp_ctl <- rep(Samp_ctl, times = Reads_ctl)
nmut <- stats::rbinom(n = length(nU), size = nU, prob = p_old_real_tc[Exp_ctl])
cB_ctl <- data.frame(S = Samp_ctl,
R = Rep_ctl,
MT = Exp_ctl,
FN = Gene_ctl,
TC = nmut,
nT = nU,
TP = rep(0, times = length(Samp_ctl)))
cB_final <- dplyr::bind_rows(list(cB, cB_ctl))
# Extract simulated cB and summarise data
cB_sim_1 <- data.table::setDT(cB_final)
cols <- colnames(cB_sim_1)
cB_sim_1 <- cB_sim_1[,.N, by = .(S, R, MT, FN, TC, nT, TP)]
colnames(cB_sim_1) <- c(cols, "n")
cB_sim_1 <- cB_sim_1[order(cB_sim_1$S),]
# Define XF column
cB_sim_1$XF <- cB_sim_1$FN
# Map sample list to experimental characteristics
# type = whether s4U labeled (1) or not (0)
# mut = experimental condition ID
# rep = replicate ID
samp_list <- unique(cB_sim_1$S)
type_list <- rep(0, times=length(samp_list))
mut_list <- rep(0, times = length(samp_list))
rep_list <- rep(0, times = length(samp_list))
count <- 1
for(i in samp_list){
type_list[count] <- unique(cB_sim_1$TP[cB_sim_1$S == i])
rep_list[count] <- unique(cB_sim_1$R[cB_sim_1$S == i])
mut_list[count] <- unique(cB_sim_1$MT[cB_sim_1$S == i])
count <- count + 1
}
colnames(cB_sim_1) <- c("sample", "R", "MT", "FN", "TC", "nT", "TP", "n", "XF")
# metadata data frame for bakR
metadf <- data.frame(tl = type_list*tl, Exp_ID = as.integer(mut_list))
rownames(metadf) <- unique(cB_sim_1$sample)
cB_sim_1$sample <- as.character(cB_sim_1$sample)
# Make bakRData object
bakRData <- bakR::bakRData(cB_sim_1, metadf)
## Create data frames storing simulated truths
fn_vect <- c()
L2FC_kd_vect <- c()
effect_vect <- c()
fn_mean_vect <- c()
for(j in 1:num_conds){
if(j > 1){
L2FC_kd_vect <- c(L2FC_kd_vect, L2FC_kd_mean[,j])
effect_vect <- c(effect_vect, effect_mean[,j])
}
fn_mean_vect <- c(fn_mean_vect, logit(fn_mean) + effect_mean[,j])
for(i in 1:ngene){
fn_vect <- c(fn_vect, fn[i,j,])
}
}
# Make data frames similar to bakRFit outputs in terms of ordering
Fn_rep_sim <- data.frame(Feature_ID = rep(rep(1:ngene, each = nreps), times = num_conds),
Replicate = rep(1:nreps, times = ngene*num_conds),
Exp_ID = rep(1:num_conds, each = ngene*nreps),
Logit_fn = logit(fn_vect),
fn = fn_vect)
Fn_mean_sim <- data.frame(Feature_ID = rep(1:ngene, times = num_conds),
Exp_ID = rep(1:num_conds, each = ngene),
Avg_logit_fn = fn_mean_vect,
Avg_fn = inv_logit(fn_mean_vect))
if(num_conds > 1){
Effect_sim <- data.frame(Feature_ID = rep(1:ngene, times = (num_conds-1)),
Exp_ID = rep(2:num_conds, each = ngene),
L2FC_kdeg = L2FC_kd_vect,
effect = effect_vect)
Effect_sim <- Effect_sim[order(Effect_sim$Feature_ID, Effect_sim$Exp_ID),]
## Order dataframes as they are in fit output
Fn_rep_sim <- Fn_rep_sim[order(Fn_rep_sim$Feature_ID, Fn_rep_sim$Exp_ID, Fn_rep_sim$Replicate),]
Fn_mean_sim <- Fn_mean_sim[order(Fn_mean_sim$Feature_ID, Fn_mean_sim$Exp_ID),]
sim_data <- list(bakRData = bakRData,
sim_list = list(Effect_sim = Effect_sim,
Fn_mean_sim = Fn_mean_sim,
Fn_rep_sim = Fn_rep_sim,
L2FC_ks_mean = L2FC_ks_mean,
RNA_conc_s4U = RNA_conc_s4U,
RNA_conc_ctl = RNA_conc_ctl,
rel_RNA_c_s4U = rel_RNA_c_s4U,
rel_RNA_c_ctl = rel_RNA_c_ctl,
Counts = Counts) )
}else{
## Order dataframes as they are in fit output
Fn_rep_sim <- Fn_rep_sim[order(Fn_rep_sim$Feature_ID, Fn_rep_sim$Exp_ID, Fn_rep_sim$Replicate),]
Fn_mean_sim <- Fn_mean_sim[order(Fn_mean_sim$Feature_ID, Fn_mean_sim$Exp_ID),]
sim_data <- list(bakRData = bakRData,
sim_list = list(Fn_mean_sim = Fn_mean_sim,
Fn_rep_sim = Fn_rep_sim,
L2FC_ks_mean = L2FC_ks_mean,
RNA_conc_s4U = RNA_conc_s4U,
RNA_conc_ctl = RNA_conc_ctl,
rel_RNA_c_s4U = rel_RNA_c_s4U,
rel_RNA_c_ctl = rel_RNA_c_ctl,
Counts = Counts) )
}
return(sim_data)
}