R/TL_stan.R

#' Fit 'Stan' models to nucleotide recoding RNA-seq data
#'
#' \code{TL_stan} is an internal function to analyze nucleotide recoding RNA-seq data with a fully
#' Bayesian hierarchical model implemented in the probabilistic programming language 'Stan'. \code{TL_stan} estimates
#' kinetic parameters and differences in kinetic parameters between experimental
#' conditions. When assessing differences, a single reference sample is compared to
#' each collection of experimental samples provided.
#'
#' Two implementations of a similar model can be fit with \code{TL_stan}: a complete nucleotide recoding RNA-seq
#' analysis and a hybrid analysis that takes as input results from \code{fast_analysis}.
#' In the complete analysis (referred to in the bakR publication as the MCMC implementation),
#' U-to-C mutations are modeled as coming from a Poisson distribution
#' with rate parameter adjusted by the empirical U-content of each feature analyzed. Features
#' represent whatever the user defined them to be when constructing the bakR data object.
#' Typical feature categories are genes, exons, etc. Hierarchical modeling is used to pool data
#' across replicates and across features. More specifically, replicate data for the
#' same feature are partially pooled to estimate feature-specific mean fraction news and uncertainties.
#' Feature means are partially pooled to estimate dataset-wide mean fraction news and standard deviations.
#' The replicate variability for each feature is also partially pooled to determine a condition-wide
#' heteroskedastic relationship between read depths and replicate variability. Partial pooling
#' reduces subjectivity when determining priors by allowing the model to determine what priors make sense
#' given the data. Partial pooling also regularizes estimates, reducing estimate variability and thus increasing
#' estimate accuracy. This is particularly important for replicate variability estimates, which often rely
#' on only a few replicates of data per feature and thus are typically highly unstable.
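#'
#' As a rough illustration of the observation model, here is a simplified sketch in R of how
#' mutation counts might be simulated (this is illustrative only, not the exact 'Stan' likelihood,
#' and all numbers are made up):
#'
#' \preformatted{
#' n_reads <- 1000  # reads mapping to one feature in one replicate
#' U <- 25          # Us per read (empirical U-content)
#' fn <- 0.4        # fraction new
#' p_new <- 0.05    # mutation rate per U in reads from new RNA
#' p_old <- 0.001   # background mutation rate per U
#'
#' is_new <- rbinom(n_reads, 1, fn)
#' muts <- rpois(n_reads, lambda = U * ifelse(is_new == 1, p_new, p_old))
#' }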
#'
#' The hybrid analysis (referred to in the bakR publication as the Hybrid implementation)
#' inherits the hierarchical modeling structure of the complete analysis, but reduces computational
#' burden by foregoing per-replicate-and-feature fraction new estimation and uncertainty quantification. Instead,
#' the hybrid analysis takes as data fraction new estimates and approximate uncertainties from \code{fast_analysis}.
#' Runtimes of the hybrid analysis are thus often an order of magnitude shorter than those of the complete analysis, but
#' the hybrid analysis loses some accuracy by relying on point estimates and uncertainty quantification that are only valid in the
#' limit of large dataset sizes (where the dataset size for the per-replicate-and-feature fraction new estimate is the raw number
#' of sequencing reads mapping to the feature in that replicate).
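#'
#' Schematically, the hybrid model treats each replicate's logit(fraction new) estimate from
#' \code{fast_analysis} as a noisy observation of the true value (a simplified sketch of the
#' data model, per feature and replicate):
#'
#' \preformatted{
#' logit_fn_estimate ~ Normal(true_logit_fn, logit_fn_se)
#' }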
#'
#' Users also have the option to save or discard the 'Stan' fit object. Fit objects can be exceedingly large (> 10 GB) for most
#' nucleotide recoding RNA-seq datasets. Therefore, if you don't want to store such a large object, a summary object will be saved instead,
#' which greatly reduces the size of the output (~ 10-50 MB) while still retaining much of the important information. In addition,
#' the output of \code{TL_stan} provides the estimates and uncertainties for key parameters (L2FC(kdeg), kdeg, and fraction new)
#' that will likely be of most interest. That being said, there are some analyses that are only possible if the original fit object
#' is saved. For example, the fit object will contain all of the samples from the posterior collected during model fitting. Thus,
#' new parameters (e.g., L2FC(kdeg)'s comparing two experimental samples) not naturally generated by the model can be estimated
#' post-hoc. Still, there are often approximate estimates that can be obtained for such parameters that don't rely on the full
#' fit object. One analysis that is impossible without the original fit object is generating model diagnostic plots. These include
#' trace plots (to show mixing and efficient parameter space exploration of the Markov chains), pairs plots (to show correlations
#' between parameters and where any divergences occurred), and other visualizations that can help users assess how well a model
#' ran. Because the models implemented by \code{TL_stan} are extensively validated, it is less likely that such diagnostics will be helpful,
#' but anomalies in your data can lead to poor model convergence, in which case assessing model diagnostics can help you
#' identify the source of problems in your data. Summary statistics describing how well the model was able to estimate each parameter
#' (n_eff and rhat) are provided in the fit summaries, but these can obscure some of the nuanced details of model fitting.
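#'
#' If the fit object is saved (keep_fit = TRUE), such diagnostics can be generated with standard
#' 'rstan' tooling. A minimal sketch (the bracketed parameter indices are illustrative; inspect
#' the fit object for the exact parameter names):
#'
#' \preformatted{
#' Fit <- TL_stan(data_list, keep_fit = TRUE)
#'
#' # Trace plot to assess mixing of the Markov chains
#' rstan::traceplot(Fit$Stan_fit, pars = "L2FC_kd[1,1]")
#'
#' # Pairs plot to assess parameter correlations and locate divergences
#' pairs(Fit$Stan_fit, pars = c("L2FC_kd[1,1]", "kd[1,1]"))
#' }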
#'
#'
#' @param data_list List to pass to 'Stan' of form given by \code{cBprocess}
#' @param Hybrid_Fit Logical; if TRUE, the Hybrid 'Stan' model that takes as data the output of \code{fast_analysis} is run.
#' @param NSS Logical; if TRUE, models that directly compare logit(fn)s are used to avoid steady-state assumption
#' @param keep_fit Logical; if TRUE, 'Stan' fit object is included in output; typically large file so default FALSE.
#' @param chains Number of Markov chains to sample from. The default is to only run a single chain. Typical NR-seq datasets
#' yield very memory-intensive analyses, but running a single chain should decrease this burden. For reference, running
#' the MCMC implementation (Hybrid_Fit = FALSE) with 3 chains on an NR-seq dataset with 3 replicates of 2 experimental conditions
#' with around 20 million raw (unmapped) reads per sample requires over 100 GB of RAM. With a single chain, this burden drops to
#' around 20 GB. Due to these memory demands and time constraints (runtimes for the MCMC implementation will likely be around 1-2 days),
#' these models should usually be run on a specialized High Performance Computing (HPC) system.
#' @param ... Arguments passed to \code{rstan::sampling} (e.g. iter, warmup, etc.).
#' @return A list of objects:
#' \itemize{
#'  \item Effects_df; dataframe with estimates of the effect size (change in logit(fn)) comparing each experimental condition to the
#'  reference sample for each feature. This dataframe also includes p-values obtained from a z-test on the effect size and its standard error. The columns of this
#'  dataframe are:
#'  \itemize{
#'   \item Feature_ID; Numerical ID of feature
#'   \item Exp_ID; Numerical ID for experimental condition (Exp_ID from metadf)
#'   \item L2FC_kdeg; L2FC(kdeg) posterior mean
#'   \item L2FC_kd_sd; L2FC(kdeg) posterior sd
#'   \item effect; identical to L2FC_kdeg (kept for symmetry with MLE fit output)
#'   \item se; identical to L2FC_kd_sd (kept for symmetry with MLE fit output)
#'   \item XF; Feature name
#'   \item pval; p-value obtained from a z-test using effect and se
#'   \item padj; p-value adjusted for multiple testing using the Benjamini-Hochberg procedure
#'  }
#'  \item Kdeg_df; dataframe with estimates of the kdeg (RNA degradation rate constant) for each feature, averaged across replicate data.
#'  The columns of this dataframe are:
#'  \itemize{
#'   \item Feature_ID; Numerical ID of feature
#'   \item Exp_ID; Numerical ID for experimental condition
#'   \item kdeg; Degradation rate constant posterior mean
#'   \item kdeg_sd; Degradation rate constant posterior standard deviation
#'   \item log_kdeg; Log of degradation rate constant posterior mean (as of version 1.0.0)
#'   \item log_kdeg_sd; Log of degradation rate constant posterior standard deviation (as of version 1.0.0)
#'   \item XF; Original feature name
#'  }
#'  \item Fn_Estimates; dataframe with estimates of the logit(fraction new) for each feature in each replicate.
#'  The columns of this dataframe are:
#'  \itemize{
#'   \item Feature_ID; Numerical ID for feature
#'   \item Exp_ID; Numerical ID for experimental condition (Exp_ID from metadf)
#'   \item Replicate; Numerical ID for replicate
#'   \item logit_fn; Logit(fraction new) posterior mean
#'   \item logit_fn_se; Logit(fraction new) posterior standard deviation
#'   \item sample; Sample name
#'   \item XF; Original feature name
#'  }
#'  \item Fit_Summary; only output if keep_fit == FALSE. Summary of the 'Stan' fit object with each row corresponding to a particular
#'  parameter. All posterior point estimates describe the marginal posterior distribution for the parameter in that row.
#'  For example, the posterior mean is the average value for the parameter when averaging over all other parameter values.
#'  The columns of this dataframe are:
#'  \itemize{
#'   \item mean; Posterior mean for the parameter given by the row name
#'   \item se_mean; Standard error of the posterior mean; essentially how confident the model is that what it estimates to be the
#'   posterior mean is what the posterior mean actually is. This will depend on the number of chains run and the number of iterations
#'   each chain is run for.
#'   \item sd; Posterior standard deviation
#'   \item 2.5%; 2.5th percentile of the posterior distribution. 2.5% of the posterior mass is below this point
#'   \item 25%; 25th percentile of the posterior distribution
#'   \item 50%; 50th percentile of the posterior distribution
#'   \item 75%; 75th percentile of the posterior distribution
#'   \item 97.5%; 97.5th percentile of the posterior distribution
#'   \item n_eff; Effective sample size. The larger this is, the better, and it should preferably be close to the total number of
#'   post-warmup iterations ((iter - warmup) x chains). Small values could represent poor model convergence
#'   \item Rhat; Describes how well separate Markov chains mixed. This is preferably as close to 1 as possible, and values higher
#'   than 1 could represent poor model convergence
#'  }
#'  \item Stan_fit; only output if keep_fit == TRUE. This is the full 'Stan' fit object, an S4 object of class \code{stanfit}
#'  \item Mutation_Rates; data frame with information about mutation rate estimates. Has the same columns as Fit_Summary.
#'  Each row corresponds to either a background mutation rate (log_lambda_o) or an s4U induced mutation rate (log_lambda_n),
#'  denoted in the parameter column. The bracketed portion of the parameter name will contain two numbers. The first corresponds
#'  to the Exp_ID and the second corresponds to the Replicate_ID. For example, if the parameter name is log_lambda_o\[1,2\] then
#'  that row corresponds to the background mutation rate in the second replicate of experimental condition one. A final point to
#'  mention is that the estimates are on a log(avg. # of mutations) scale. So a log_lambda_n of 1 means that on average, there
#'  are an estimated 2.72 (exp(1)) mutations in reads from new RNA (i.e., RNA synthesized during s4U labeling).
#' }
#'
#'
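#' @examples
#' \dontrun{
#' # Minimal usage sketch; data_list must have the form produced by cBprocess().
#' # iter and warmup are passed through ... to rstan::sampling().
#' Fit <- TL_stan(data_list = data_list,
#'                Hybrid_Fit = FALSE,
#'                keep_fit = FALSE,
#'                chains = 1,
#'                iter = 2000, warmup = 1000)
#' head(Fit$Effects_df)
#' }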
TL_stan <- function(data_list, Hybrid_Fit = FALSE, keep_fit = FALSE, NSS = FALSE,
                    chains = 1, ...) {

  ### Error catching

  ## Check Hybrid_Fit
  if(!is.logical(Hybrid_Fit)){
    stop("Hybrid_Fit must be logical (TRUE or FALSE)")
  }

  ## Check keep_fit
  if(!is.logical(keep_fit)){
    stop("keep_fit must be logical (TRUE or FALSE)")
  }

  ## Check NSS
  if(!is.logical(NSS)){
    stop("NSS must be logical (TRUE or FALSE)")
  }

  ## Check chains
  if(!is.numeric(chains)){
    stop("chains must be numeric")
  }else if(!is.integer(chains)){
    # as.integer() truncates non-integer input (e.g., 2.7 becomes 2)
    chains <- as.integer(chains)
  }

  if(chains < 1){
    stop("chains must be >= 1")
  }



  # If NSS is TRUE, run versions of the models that do not assume a steady-state
  # relationship between fraction new and degradation rate constant
  if(Hybrid_Fit){ # Run Hybrid implementation
    if(NSS){
      fit <- rstan::sampling(stanmodels$Hybrid_NSS, data = data_list, chains = chains, ...)

    }else{
      fit <- rstan::sampling(stanmodels$Hybrid, data = data_list, chains = chains, ...)

    }
  }else{ # Run MCMC implementation
    if(NSS){
      fit <- rstan::sampling(stanmodels$MCMC_NSS, data = data_list, chains = chains, ...)
    }else{
      fit <- rstan::sampling(stanmodels$MCMC2, data = data_list, chains = chains, ...)

    }

  }

  # Extract logit(fraction new) replicate estimates
  fn_summary <- rstan::summary(fit, pars = "mu_rep_logit_fn", probs = c(0.5))$summary

  fn_summary <- fn_summary[, c("50%","mean", "sd")]

  fn_summary <- as.data.frame(fn_summary)

  colnames(fn_summary) <- c("median", "logit_fn", "sd")


  # Get number of features from data
  ngs <- data_list$NF

  # Extract kdeg to get number of conditions (could get from nMT but need kdeg df anyway)
  MT_summary <- rstan::summary(fit, pars = "kd", probs = c(0.5))$summary

  MT_summary <- MT_summary[, c("50%","mean", "sd")]

  MT_summary <- as.data.frame(MT_summary)
  
  # Extract log(kdeg) estimates
  lkd_summary <- rstan::summary(fit, pars = "log_kd", probs = c(0.5))$summary
  
  lkd_summary <- lkd_summary[, c("50%","mean", "sd")]
  
  lkd_summary <- as.data.frame(lkd_summary)

  # Number of non-reference experimental conditions
  nconds <- data_list$nMT - 1


  # Get number of replicates
    # If some conditions have more replicates than others, this will be the
    # largest number of replicates.
  reps <- data_list$nrep

  nreps <- rep(reps, times=nconds) # (not currently used downstream)

  # Extract L2FC_kdeg and case as data frame
  L2FC_summary <- rstan::summary(fit, pars = "L2FC_kd", probs = c(0.5))$summary

  L2FC_summary <- L2FC_summary[, c("50%","mean", "sd")]

  L2FC_df <- as.data.frame(L2FC_summary)

  # Effect sizes data frame setup
  F_ID <- rep(seq(from=1, to=ngs), each=nconds) # Feature number vector
  Exp_ID <- rep(seq(from=2, to=nconds+1), times=ngs) # Experimental condition vector
  L2FC_kdeg <- L2FC_df$mean # L2FC(kdeg) vector
  L2FC_kdeg_sd <- L2FC_df$sd # L2FC(kdeg) sd vector

  # Fn data frame setup
  F_ID_fn <- rep(1:ngs, each = (nconds+1)*reps)
  Exp_ID_fn <- rep(rep(1:(nconds+1), each = reps), times = ngs)
  R_ID_fn <- rep(1:reps, times = (nconds+1)*ngs)
  logit_fn <- fn_summary$logit_fn
  fn_se <- fn_summary$sd
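  # The index vectors above assume the flattened mu_rep_logit_fn entries from
  # rstan::summary() are ordered with feature varying slowest, then
  # experimental condition, then replicate.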

  rm(fn_summary)

  # Kdeg data frame setup
  F_ID_kd <- rep(seq(from=1, to=ngs), each=(nconds+1))
  Exp_ID_kd <- rep(seq(from=1, to=(nconds+1)), times=ngs)
  kdeg <- MT_summary$mean # kdeg vector
  kdeg_sd <- MT_summary$sd # kdeg sd vector
  log_kdeg <- lkd_summary$mean # log(kdeg) vector
  log_kdeg_sd <- lkd_summary$sd # log(kdeg) sd vector

  rm(MT_summary)
  rm(lkd_summary)

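  # The repeated L2FC vectors below become the effect and se columns,
  # kept for symmetry with the MLE fit output (see the colnames() call below).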
  Effects_df <- data.frame(F_ID, Exp_ID, L2FC_kdeg, L2FC_kdeg_sd, L2FC_kdeg, L2FC_kdeg_sd)
  Kdeg_df <- data.frame(F_ID_kd, Exp_ID_kd, kdeg, kdeg_sd, log_kdeg, log_kdeg_sd)
  Fn_df <- data.frame(F_ID_fn, Exp_ID_fn, R_ID_fn, logit_fn, fn_se)

  colnames(Effects_df) <- c("Feature_ID", "Exp_ID", "L2FC_kdeg", "L2FC_kd_sd", "effect", "se")
  colnames(Kdeg_df) <- c("Feature_ID", "Exp_ID", "kdeg", "kdeg_sd", "log_kdeg", "log_kdeg_sd")
  colnames(Fn_df) <- c("Feature_ID", "Exp_ID", "Replicate", "logit_fn", "logit_fn_se")

  Fn_df <- merge(Fn_df, data_list$sample_lookup, by.x = c("Exp_ID", "Replicate"), by.y = c("mut", "reps"))

  Fn_df <- Fn_df[order(Fn_df$Feature_ID, Fn_df$Exp_ID, Fn_df$Replicate),]

  ## Add original gene names
  sdf <- data_list$sdf[,c("XF", "fnum")] %>% dplyr::distinct()

  Fn_df <- merge(Fn_df, sdf, by.x = c("Feature_ID"), by.y = "fnum")

  Kdeg_df <- merge(Kdeg_df, sdf, by.x = c("Feature_ID"), by.y = "fnum")

  Effects_df <- merge(Effects_df, sdf, by.x = c("Feature_ID"), by.y = "fnum")

  ## Remove imputed replicate estimates
    # If the number of replicates is less in some experimental conditions than
    # others, Stan model will impute the "missing" replicate data. These imputed values
    # should be removed post-hoc to avoid user confusion.
  r_vect <- data_list$nrep_vect
  rep_actual <- unlist(lapply(r_vect, function(x) seq(1, x)))
  mut_actual <- rep(1:(nconds+1), times = r_vect)

  truedf <- data.frame(Replicate = rep_actual,
                       Exp_ID = mut_actual)
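
  # For example, with nrep_vect = c(3, 2), truedf has (Replicate, Exp_ID) rows
  # (1,1), (2,1), (3,1), (1,2), (2,2), so the right_join below drops the
  # imputed third replicate of experimental condition 2.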

  Fn_df <- dplyr::right_join(Fn_df, truedf, by = c("Exp_ID", "Replicate"))


  # Create final Fit objects
  if(keep_fit == FALSE){
    fit_summary <- as.data.frame(rstan::summary(fit)$summary)

    fit_summary$parameter <- rownames(fit_summary)

    if(Hybrid_Fit){
      mutrates <- data_list$mutrates
    }else{
      mutrates <- fit_summary[grep("log_lambda", rownames(fit_summary)), ]
      mutrates$parameter <- rownames(mutrates)
      mutrates <- dplyr::as_tibble(mutrates)
    }

    out <- list(dplyr::as_tibble(Effects_df), dplyr::as_tibble(Kdeg_df), dplyr::as_tibble(Fn_df), dplyr::as_tibble(fit_summary), mutrates)
    names(out) <- c("Effects_df", "Kdeg_df", "Fn_Estimates","Fit_Summary", "Mutation_Rates")
  }else{


    if(Hybrid_Fit){
      mutrates <- data_list$mutrates
    }else{
      fit_summary <- as.data.frame(rstan::summary(fit)$summary)
      mutrates <- fit_summary[grep("log_lambda", rownames(fit_summary)), ]
      mutrates$parameter <- rownames(mutrates)
      mutrates <- dplyr::as_tibble(mutrates)
    }

    out <- list(dplyr::as_tibble(Effects_df), dplyr::as_tibble(Kdeg_df), dplyr::as_tibble(Fn_df), fit, mutrates)
    names(out) <- c("Effects_df", "Kdeg_df", "Fn_Estimates","Stan_fit", "Mutation_Rates")
  }


  return(out)
}
