Nothing
#' Fit 'Stan' models to nucleotide recoding RNA-seq data analysis
#'
#' \code{TL_stan} is an internal function to analyze nucleotide recoding RNA-seq data with a fully
#' Bayesian hierarchical model implemented in the PPL `Stan`. \code{TL_stan} estimates
#' kinetic parameters and differences in kinetic parameters between experimental
#' conditions. When assessing differences, a single reference sample is compared to
#' each collection of experimental samples provided.
#'
#' Two implementations of a similar model can be fit with TL_stan: a complete nucleotide recoding RNA-seq
#' analysis and a hybrid analysis that takes as input results from \code{fast_analysis}.
#' In the complete analysis (referred to in the bakR publication as the MCMC implementation),
#' U-to-C mutations are modeled as coming from a Poisson distribution
#' with rate parameter adjusted by the empirical U-content of each feature analyzed. Features
#' represent whatever the user defined them to be when constructing the bakR data object.
#' Typical feature categories are genes, exons, etc. Hierarchical modeling is used to pool data
#' across replicates and across features. More specifically, replicate data for the
#' same feature are partially pooled to estimate feature-specific mean fraction news and uncertainties.
#' Feature means are partially pooled to estimate dataset-wide mean fraction news and standard deviations.
#' The replicate variability for each feature is also partially pooled to determine a condition-wide
#' heteroskedastic relationship between read depths and replicate variability. Partial pooling
#' reduces subjectivity when determining priors by allowing the model to determine what priors make sense
#' given the data. Partial pooling also regularizes estimates, reducing estimate variability and thus increasing
#' estimate accuracy. This is particularly important for replicate variability estimates, which often rely
#' on only a few replicates of data per feature and thus are typically highly unstable.
#'
#' The hybrid analysis (referred to in the bakR publication as the Hybrid implementation)
#' inherits the hierarchical modeling structure of the complete analysis, but reduces computational
#' burden by foregoing per-replicate-and-feature fraction new estimation and uncertainty quantification. Instead,
#' the hybrid analysis takes as data fraction new estimates and approximate uncertainties from \code{fast_analysis}.
#' Runtimes of the hybrid analysis are thus often an order of magnitude shorter than with the complete analysis, but
#' loses some accuracy by relying on point estimates and uncertainty quantification that is only valid in the
#' limit of large dataset sizes (where the dataset size for the per-replicate-and-feature fraction new estimate is the raw number
#' of sequencing reads mapping to the feature in that replicate).
#'
#' Users also have the option to save or discard the `Stan` fit object. Fit objects can be exceedingly large (> 10 GB) for most
#' nucleotide recoding RNA-seq datasets. Therefore, if you don't want to store such a large object, a summary object will be saved instead,
#' which greatly reduces the size of the output (~ 10-50 MB) while still retaining much of the important information. In addition,
#' the output of \code{TL_stan} provides the estimates and uncertainties for key parameters (L2FC(kdeg), kdeg, and fraction new)
#' that will likely be of most interest. That being said, there are some analyses that are only possible if the original fit object
#' is saved. For example, the fit object will contain all of the samples from the posterior collected during model fitting. Thus,
#' new parameters (e.g., L2FC(kdeg)'s comparing two experimental samples) not naturally generated by the model can be estimated
#' post-hoc. Still, there are often approximate estimates that can be obtained for such parameters that don't rely on the full
#' fit object. One analysis that is impossible without the original fit object is generating model diagnostic plots. These include
#' trace plots (to show mixing and efficient parameter space exploration of the Markov chains), pairs plots (to show correlations
#' between parameters and where any divergences occurred), and other visualizations that can help users assess how well a model
#' ran. Because the models implemented by \code{TL_stan} are extensively validated, it is less likely that such diagnostics will be helpful,
#' but often anomalies on your data can lead to poor model convergence, in which case assessing model diagnostics can help you
#' identify the source of problems in your data. Summary statistics describing how well the model was able to estimate each parameter
#' (n_eff and rhat) are provided in the fit summaries, but can often obscure some of the nuanced details of model fitting.
#'
#'
#' @param data_list List to pass to 'Stan' of form given by \code{cBprocess}
#' @param Hybrid_Fit Logical; if TRUE, Hybrid 'Stan' model that takes as data output of \code{fast_analysis} is run.
#' @param NSS Logical; if TRUE, models that directly compare logit(fn)s are used to avoid steady-state assumption
#' @param keep_fit Logical; if TRUE, 'Stan' fit object is included in output; typically large file so default FALSE.
#' @param chains Number of Markov chains to sample from. The default is to only run a single chain. Typical NR-seq datasets
#' yield very memory intensive analyses, but running a single chain should decrease this burden. For reference, running
#' the MCMC implementation (Hybrid_Fit = FALSE) with 3 chains on an NR-seq dataset with 3 replicates of 2 experimental conditions
#' with around 20 million raw (unmapped) reads per sample requires over 100 GB of RAM. With a single chain, this burden drops to
#' around 20 GB. Due to memory demands and time constraints (runtimes for the MCMC implementation border will likely be around 1-2 days)
#' means that these models should usually be run in a specialized High Performance Computing (HPC) system.
#' @param ... Arguments passed to \code{rstan::sampling} (e.g. iter, warmup, etc.).
#' @return A list of objects:
#' \itemize{
#' \item Effects_df; dataframe with estimates of the effect size (change in logit(fn)) comparing each experimental condition to the
#' reference sample for each feature. This dataframe also includes p-values obtained from a moderated t-test. The columns of this
#' dataframe are:
#' \itemize{
#' \item Feature_ID; Numerical ID of feature
#' \item Exp_ID; Numerical ID for experimental condition (Exp_ID from metadf)
#' \item L2FC_kdeg; L2FC(kdeg) posterior mean
#' \item L2FC_kd_sd; L2FC(kdeg) posterior sd
#' \item effect; identical to L2FC_kdeg (kept for symmetry with MLE fit output)
#' \item se; identical to L2FC_kd_sd (kept for symmetry with MLE fit output)
#' \item XF; Feature name
#' \item pval; p value obtained from effect and se + z-test
#' \item padj; p value adjusted for multiple testing using Benjamini-Hochberg procedure
#' }
#' \item Kdeg_df; dataframe with estimates of the kdeg (RNA degradation rate constant) for each feature, averaged across replicate data.
#' The columns of this dataframe are:
#' \itemize{
#' \item Feature_ID; Numerical ID of feature
#' \item Exp_ID; Numerical ID for experimental condition
#' \item kdeg; Degradation rate constant posterior mean
#' \item kdeg_sd; Degradation rate constant posterior standard deviation
#' \item log_kdeg; Log of degradation rate constant posterior mean (as of version 1.0.0)
#' \item log_kdeg_sd; Log of degradation rate constant posterior standard deviation (as of version 1.0.0)
#' \item XF; Original feature name
#' }
#' \item Fn_Estimates; dataframe with estimates of the logit(fraction new) for each feature in each replicate.
#' The columns of this dataframe are:
#' \itemize{
#' \item Feature_ID; Numerical ID for feature
#' \item Exp_ID; Numerical ID for experimental condition (Exp_ID from metadf)
#' \item Replicate; Numerical ID for replicate
#' \item logit_fn; Logit(fraction new) posterior mean
#' \item logit_fn_se; Logit(fraction new) posterior standard deviation
#' \item sample; Sample name
#' \item XF; Original feature name
#' }
#' \item Fit_Summary; only outputted if keep_fit == FALSE. Summary of 'Stan' fit object with each row corresponding to a particular
#' parameter. All posterior point descriptions are descriptions of the marginal posterior distribution for the parameter in that row.
#' For example, the posterior mean is the average value for the parameter when averaging over all other parameter values.
#' The columns of this dataframe are:
#' \itemize{
#' \item mean; Posterior mean for the parameter given by the row name
#' \item se_mean; Standard error of the posterior mean; essentially how confident the model is that what it estimates to be the
#' posterior mean is what the posterior mean actually is. This will depend on the number of chains run on the number of iterations
#' each chain is run for.
#' \item sd; Posterior standard deviation
#' \item 2.5%; 2.5th percentile of the posterior distribution. 2.5% of the posterior mass is below this point
#' \item 25%; 25th percentile of the posterior distribution
#' \item 50%; 50th percentile of the posterior distribution
#' \item 75%; 75th percentile of the posterior distribution
#' \item 97.5%; 97.5th percentile of the posterior distribution
#' \item n_eff; Effective sample size. The larger this is the better, though it should preferably be around the total number of
#' iterations (iter x chains). Small values of this could represent poor model convergence
#' \item Rhat; Describes how well separate Markov chains mixed. This is preferably as close to 1 as possible, and values higher
#' than 1 could represent poor model convergence
#' }
#' \item Stan_Fit; only outputted if keep_fit == TRUE. This is the full 'Stan' fit object, an R6 object of class `stanfit`
#' \item Mutation_Rates; data frame with information about mutation rate estimates. Has the same columns as Fit_Summary.
#' Each row corresponds to either a background mutation rate (log_lambda_o) or an s4U induced mutation rate (log_lambda_n),
#' denoted in the parameter column. The bracketed portion of the parameter name will contain two numbers. The first corresponds
#' to the Exp_ID and the second corresponds to the Replicate_ID. For example, if the parameter name is log_lambda_o\[1,2\] then
#' that row corresponds to the background mutation rate in the second replicate of experimental condition one. A final point to
#' mention is that the estimates are on a log(avg. # of mutations) scale. So a log_lambda_n of 1 means that on average, there
#' are an estimated 2.72 (exp(1)) mutations in reads from new RNA (i.e., RNA synthesized during s4U labeling).
#' }
#'
#'
TL_stan <- function(data_list, Hybrid_Fit = FALSE, keep_fit = FALSE, NSS = FALSE,
chains = 1, ...) {
### Error catching
## Check Hybrid_Fit
if(!is.logical(Hybrid_Fit)){
stop("Hybrid_Fit must be logical (TRUE or FALSE)")
}
## Check keep_fit
if(!is.logical(keep_fit)){
stop("keep_fit must be logical (TRUE or FALSE)")
}
## Check NSS
if(!is.logical(NSS)){
stop("NSS must be logical (TRUE or FALSE)")
}
## Check chains
if(!is.numeric(chains)){
stop("chains must be numeric")
}else if(!is.integer(chains)){
chains <- as.integer(chains)
}
if(chains < 1){
stop("chains must be >= 1")
}
# If NSS TRUE, runs version of models that does not assume steady-state
# relationship between fraction new and degradation rate constan
if(Hybrid_Fit){ # Run Hybrid implementation
if(NSS){
fit <- rstan::sampling(stanmodels$Hybrid_NSS, data = data_list, chains = chains, ...)
}else{
fit <- rstan::sampling(stanmodels$Hybrid, data = data_list, chains = chains, ...)
}
}else{ # Run MCMC implementation
if(NSS){
fit <- rstan::sampling(stanmodels$MCMC_NSS, data = data_list, chains = chains, ...)
}else{
fit <- rstan::sampling(stanmodels$MCMC2, data = data_list, chains = chains, ...)
}
}
# Extract logit(fraction new) replicate estimates
fn_summary <- rstan::summary(fit, pars = "mu_rep_logit_fn", probs = c(0.5))$summary
fn_summary <- fn_summary[, c("50%","mean", "sd")]
fn_summary <- as.data.frame(fn_summary)
colnames(fn_summary) <- c("median", "logit_fn", "sd")
# Get number of features from data
ngs <- data_list$NF
# Extract kdeg to get number of conditions (could get from nMT but need kdeg df anyway)
MT_summary <- rstan::summary(fit, pars = "kd", probs = c(0.5))$summary
MT_summary <- MT_summary[, c("50%","mean", "sd")]
MT_summary <- as.data.frame(MT_summary)
# Extract log(kdeg) estimates
lkd_summary <- rstan::summary(fit, pars = "log_kd", probs = c(0.5))$summary
lkd_summary <- lkd_summary[, c("50%","mean", "sd")]
lkd_summary <- as.data.frame(lkd_summary)
# Number of non-reference experimental conditions
nconds <- data_list$nMT - 1
# Get number of replicates
# If some conditions have more replicates than others, this will be the
# largest number of replicates.
reps <- data_list$nrep
nreps <- rep(reps, times=nconds)
# Extract L2FC_kdeg and case as data frame
L2FC_summary <- rstan::summary(fit, pars = "L2FC_kd", probs = c(0.5))$summary
L2FC_summary <- L2FC_summary[, c("50%","mean", "sd")]
L2FC_df <- as.data.frame(L2FC_summary)
# Effect sizes data frame setup
F_ID <- rep(seq(from=1, to=ngs), each=nconds) # Feature number vector
Exp_ID <- rep(seq(from=2, to=nconds+1), times=ngs) # Experimental condition vector
L2FC_kdeg <- L2FC_df$mean # L2FC(kdeg) vector
L2FC_kdeg_sd <- L2FC_df$sd # L2FC(kdeg) sd vector
# Fn data frame setup
F_ID_fn <- rep(1:ngs, each = (nconds+1)*reps)
Exp_ID_fn <- rep(rep(1:(nconds+1), each = reps), times = ngs)
R_ID_fn <- rep(1:reps, times = (nconds+1)*ngs)
logit_fn <- fn_summary$logit_fn
fn_se <- fn_summary$sd
rm(fn_summary)
# Kdeg data frame setup
F_ID_kd <- rep(seq(from=1, to=ngs), each=(nconds+1))
Exp_ID_kd <- rep(seq(from=1, to=(nconds+1)), times=ngs)
kdeg <- MT_summary$mean # kdeg vector
kdeg_sd <- MT_summary$sd # kdeg sd vector
log_kdeg <- lkd_summary$mean # log(kdeg) vector
log_kdeg_sd <- lkd_summary$sd # log(kdeg) sd vector
rm(MT_summary)
rm(lkd_summary)
Effects_df <- data.frame(F_ID, Exp_ID, L2FC_kdeg, L2FC_kdeg_sd, L2FC_kdeg, L2FC_kdeg_sd)
Kdeg_df <- data.frame(F_ID_kd, Exp_ID_kd, kdeg, kdeg_sd, log_kdeg, log_kdeg_sd)
Fn_df <- data.frame(F_ID_fn, Exp_ID_fn, R_ID_fn, logit_fn, fn_se)
colnames(Effects_df) <- c("Feature_ID", "Exp_ID", "L2FC_kdeg", "L2FC_kd_sd", "effect", "se")
colnames(Kdeg_df) <- c("Feature_ID", "Exp_ID", "kdeg", "kdeg_sd", "log_kdeg", "log_kdeg_sd")
colnames(Fn_df) <- c("Feature_ID", "Exp_ID", "Replicate", "logit_fn", "logit_fn_se")
Fn_df <- merge(Fn_df, data_list$sample_lookup, by.x = c("Exp_ID", "Replicate"), by.y = c("mut", "reps"))
Fn_df <- Fn_df[order(Fn_df$Feature_ID, Fn_df$Exp_ID, Fn_df$Replicate),]
## Add original gene names
sdf <- data_list$sdf[,c("XF", "fnum")] %>% dplyr::distinct()
Fn_df <- merge(Fn_df, sdf, by.x = c("Feature_ID"), by.y = "fnum")
Kdeg_df <- merge(Kdeg_df, sdf, by.x = c("Feature_ID"), by.y = "fnum")
Effects_df <- merge(Effects_df, sdf, by.x = c("Feature_ID"), by.y = "fnum")
## Remove imputed replicate estimates
# If the number of replicates is less in some experimental conditions than
# others, Stan model will impute the "missing" replicate data. These imputed values
# should be removed post-hoc to avoid user confusion.
r_vect <- data_list$nrep_vect
rep_actual <- unlist(lapply(r_vect, function(x) seq(1, x)))
mut_actual <- rep(1:(nconds+1), times = r_vect)
truedf <- data.frame(Replicate = rep_actual,
Exp_ID = mut_actual)
Fn_df <- dplyr::right_join(Fn_df, truedf, by = c("Exp_ID", "Replicate"))
# Create final Fit objects
if(keep_fit == FALSE){
fit_summary <- as.data.frame(rstan::summary(fit)$summary)
fit_summary$parameter <- rownames(fit_summary)
if(Hybrid_Fit){
mutrates <- data_list$mutrates
}else{
mutrates <- fit_summary[grep("log_lambda", rownames(fit_summary)), ]
mutrates$parameter <- rownames(mutrates)
mutrates <- dplyr::as_tibble(mutrates) }
out <- list(dplyr::as_tibble(Effects_df), dplyr::as_tibble(Kdeg_df), dplyr::as_tibble(Fn_df), dplyr::as_tibble(fit_summary), mutrates)
names(out) <- c("Effects_df", "Kdeg_df", "Fn_Estimates","Fit_Summary", "Mutation_Rates")
}else{
if(Hybrid_Fit){
mutrates <- data_list$mutrates
}else{
fit_summary <- as.data.frame(rstan::summary(fit)$summary)
mutrates <- fit_summary[grep("log_lambda", rownames(fit_summary)), ]
mutrates$parameter <- rownames(mutrates)
mutrates <- dplyr::as_tibble(mutrates)
}
out <- list(dplyr::as_tibble(Effects_df), dplyr::as_tibble(Kdeg_df), dplyr::as_tibble(Fn_df), fit, mutrates)
names(out) <- c("Effects_df", "Kdeg_df", "Fn_Estimates","Stan_fit", "Mutation_Rates")
}
return(out)
}
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.