# R/Fast_analysis.R

#' Efficiently analyze nucleotide recoding data
#'
#' \code{fast_analysis} analyzes nucleotide recoding data with maximum likelihood estimation
#' implemented by \code{stats::optim} combined with analytic solutions to simple Bayesian models to perform
#' approximate partial pooling. Output includes kinetic parameter estimates in each replicate, kinetic parameter estimates
#' averaged across replicates, and log-2 fold changes in the degradation rate constant (L2FC(kdeg)).
#' Averaging takes into account uncertainties estimated using the Fisher Information and estimates
#' are regularized using analytic solutions of fully Bayesian models. The result is that kdegs are
#' shrunk towards population means and that uncertainties are shrunk towards a mean-variance trend estimated as part of the analysis.
#'
#' Unless the user supplies estimates for pnew and pold, the first step of \code{fast_analysis} is to estimate the background (pold)
#' and metabolic label (which we will refer to as s4U for simplicity, though bakR is compatible with other metabolic labels such as s6G)
#' induced (pnew) mutation rates. Several pnew and pold estimation strategies are implemented in bakR. For pnew estimation,
#' the two strategies are likelihood maximization of a binomial mixture model (default) and sampling from the full posterior of
#' a U-content adjusted Poisson mixture model with HMC (when \code{StanRateEst} is set to TRUE in \code{bakRFit}).
#' 
#' The default pnew estimation strategy involves combining the mutational data for all features into sample-wide mutational
#' data vectors (# of T-to-C conversions, # of Ts, and # of reads with each combination of counts). These data vectors are then
#' downsampled (to prevent float overflow) and used to maximize the likelihood of a two-component binomial mixture model. The
#' two components correspond to reads from old and new RNA, and the three estimated parameters are the fraction of 
#' all reads that are new (nuisance parameter in this case), and the old and new read mutation rates.
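#'
#' The sketch below illustrates this idea with \code{stats::dbinom} in place of the explicit factorial
#' form used internally (a minimal, hypothetical illustration; \code{TC}, \code{nT}, and \code{n} stand
#' for the pooled conversion counts, T counts, and read counts described above):
#' \preformatted{
#' inv_logit <- function(x) exp(x)/(1 + exp(x))
#'
#' # param = logit of (rate 1, rate 2, fraction of reads from component 1);
#' # negative log-likelihood of the two-component binomial mixture,
#' # weighted by the number of reads n with each (TC, nT) pair
#' mixture_nll <- function(param, TC, nT, n){
#'   p1 <- inv_logit(param[1])
#'   p2 <- inv_logit(param[2])
#'   theta <- inv_logit(param[3])
#'   -sum(n * log(theta * dbinom(TC, nT, p1) + (1 - theta) * dbinom(TC, nT, p2)))
#' }
#'
#' fit <- stats::optim(c(-7, -2, 0), mixture_nll, TC = TC, nT = nT, n = n,
#'                     method = "L-BFGS-B", lower = c(-9, -9, -9), upper = c(0, 0, 9))
#'
#' pnew_est <- inv_logit(max(fit$par[1:2]))  # larger rate = new read mutation rate
#' pold_est <- inv_logit(min(fit$par[1:2]))  # smaller rate = old read mutation rate
#' }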
#' 
#' The alternative strategy involves running a fully Bayesian implementation of a similar mixture model using Stan, a 
#' probabilistic programming language that bakR makes use of in several functions. This strategy can yield more accurate
#' mutation rate estimates when the label times are much shorter or longer than the average half-lives of the sequenced RNA 
#' (i.e., the fraction news are mostly close to 0 or 1, respectively). To improve the efficiency of this approach, only a small
#' subset of RNA features are analyzed, the number of which is set by the \code{RateEst_size} parameter in \code{bakRFit}. By default,
#' this number is set to 30, which on the average NR-seq dataset yields a several minute runtime for mutation rate estimation.
#' 
#' Estimation of pold can be performed with three strategies: the same two strategies discussed for pnew estimation, and a third
#' strategy that relies on the presence of -s4U control data. If -s4U control data is present, the default pold estimation
#' strategy is to use the average mutation rate in reads from all -s4U control datasets as the global pold estimate. Thus,
#' a single pold estimate is used for all samples. The likelihood maximization strategy can be used by
#' setting \code{no_ctl} to TRUE, and this strategy becomes the default strategy if no -s4U data is present.
#' In addition, as of version 1.0.0 of bakR (released late June of 2023), users can decide to estimate
#' pold for each +s4U sample independently by setting \code{multi_pold} to TRUE. In this case, independent -s4U datasets can no longer be used for 
#' mutation rate estimation purposes, and thus the strategies for pold estimation are identical to the
#' set of pnew estimation strategies.
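#'
#' For instance, when -s4U control data is used, the global pold estimate reduces to a read-weighted
#' average mutation rate over all control reads (a minimal sketch over the internal per-read-group data,
#' where n is the number of reads, TC the T-to-C conversions per read, and nT the number of Ts per read):
#' \preformatted{
#' ctl <- df[df$type == 0, ]  # -s4U control rows only
#' pold <- sum(ctl$n * ctl$TC) / sum(ctl$n * ctl$nT)
#' }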
#'
#' Once mutation rates are estimated, fraction news for each feature in each sample are estimated. The approach utilized is MLE
#' using the L-BFGS-B algorithm implemented in \code{stats::optim}. The assumed likelihood function is derived from a Poisson mixture
#' model with rates adjusted according to each feature's empirical U-content (the average number of Us present in sequencing reads mapping
#' to that feature in a particular sample). Fraction new estimates are then converted to degradation rate constant estimates using
#' a solution to a simple ordinary differential equation model of RNA metabolism.
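#'
#' Concretely, under this model the fraction new and the degradation rate constant are related as
#' follows (a minimal sketch; \code{tl} denotes the metabolic label time, and the pulse-chase variant
#' uses -log(fn)/tl instead):
#' \preformatted{
#' # fraction new expected after labeling for time tl with degradation rate constant kdeg
#' fn_from_kdeg <- function(kdeg, tl) 1 - exp(-kdeg * tl)
#'
#' # inversion used to convert a fraction new estimate into a kdeg estimate
#' kdeg_from_fn <- function(fn, tl) -log(1 - fn) / tl
#'
#' kdeg_from_fn(fn = 0.5, tl = 2)  # ~0.347 per unit time
#' }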
#'
#' Once fraction new and kdegs are estimated, the uncertainty in these parameters is estimated using the Fisher Information. In the limit of
#' large datasets, the variance of the MLE is inversely proportional to the Fisher Information evaluated at the MLE. Mixture models are
#' typically singular, meaning that the Fisher information matrix is not positive definite and asymptotic results for the variance
#' do not necessarily hold. As the mutation rates are estimated a priori and fixed to be > 0, these problems are eliminated. In addition, when assessing
#' the uncertainty of replicate fraction new estimates, the size of the dataset is the raw number of sequencing reads that map to a
#' particular feature. This number is often large (>100) which increases the validity of invoking asymptotics. 
#' As of version 1.0.0, users can opt for an alternative uncertainty estimation strategy by setting 
#' \code{Fisher} to FALSE. This strategy makes use of the standard error for the estimator of a binomial random
#' variable's rate-of-success parameter. If we could uniquely identify new and old reads, then the 
#' variance in our estimate for the fraction of reads that is new would be fn*(1-fn)/n. This uncertainty 
#' estimate will typically underestimate fraction new replicate uncertainties. We showed in the bakR
#' paper, though, that the Fisher information strategy often significantly overestimates the uncertainties
#' of low coverage or extreme fraction new features. Therefore, this more bullish, anticonservative 
#' uncertainty quantification can be useful on datasets with low mutation rates, extreme label times,
#' or low sequencing depth. We have found that false discovery rates are still well controlled when using
#' this alternative uncertainty quantification strategy.
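#'
#' For intuition, propagating the binomial variance to the logit scale with the delta method gives
#' (a minimal sketch; the internal implementation additionally stabilizes these estimates with an
#' empirically estimated beta prior):
#' \preformatted{
#' # n = number of reads, fn = estimated fraction new
#' se_fn <- function(fn, n) sqrt(fn * (1 - fn) / n)
#'
#' # delta method: d/dfn logit(fn) = 1 / (fn * (1 - fn))
#' se_logit_fn <- function(fn, n) se_fn(fn, n) / (fn * (1 - fn))
#'
#' se_logit_fn(fn = 0.2, n = 100)  # = 0.25
#' }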
#'
#' With kdegs and their uncertainties estimated, replicate estimates are pooled and regularized. There are two key steps in this
#' downstream analysis. 1st, the uncertainty for each feature is used to fit a linear ln(uncertainty) vs. log10(read depth) trend,
#' and uncertainties for individual features are shrunk towards the regression line. The uncertainty for each feature is a combination of the
#' Fisher Information asymptotic uncertainty as well as the amount of variability seen between estimates. Regularization of uncertainty
#' estimates is performed using the analytic results of a Normal distribution likelihood with known mean and unknown variance and conjugate
#' priors. The prior parameters are estimated from the regression and amount of variability about the regression line. The strength of
#' regularization can be tuned by adjusting the \code{prior_weight} parameter, with larger numbers yielding stronger shrinkage towards
#' the regression line. The 2nd step is to regularize the average kdeg estimates. This is done using the analytic results of a
#' Normal distribution likelihood model with unknown mean and known variance and conjugate priors. The prior parameters are estimated from the
#' population wide kdeg distribution (using its mean and standard deviation as the mean and standard deviation of the normal prior).
#' In the 1st step, the known mean is assumed to be the average kdeg, averaged across replicates and weighted by the number of reads
#' mapping to the feature in each replicate. In the 2nd step, the known variance is assumed to be that obtained following regularization
#' of the uncertainty estimates.
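#'
#' As a concrete illustration of the 2nd step, pooling replicate log(kdeg) estimates with a conjugate
#' normal prior amounts to a precision-weighted average (a minimal sketch of the conjugate update, not
#' the exact internal weighting; \code{prior_mean} and \code{prior_sd} play the role of the
#' population-wide mean and standard deviation described above):
#' \preformatted{
#' # x = replicate log(kdeg) estimates, s = their (regularized) standard errors
#' shrink_log_kdeg <- function(x, s, prior_mean, prior_sd){
#'   prec <- c(1 / prior_sd^2, 1 / s^2)  # prior precision followed by data precisions
#'   post_mean <- sum(prec * c(prior_mean, x)) / sum(prec)
#'   post_sd <- sqrt(1 / sum(prec))
#'   c(log_kdeg_post = post_mean, sd_post = post_sd)
#' }
#'
#' shrink_log_kdeg(x = c(-1.2, -0.9), s = c(0.3, 0.4), prior_mean = -1.5, prior_sd = 1)
#' }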
#'
#' Effect sizes (changes in kdeg) are obtained as the difference in log(kdeg) means between the reference and experimental
#' sample(s), and the log(kdeg)s are assumed to be independent so that the variance of the effect size is the sum of the
#' log(kdeg) variances. P-values assessing the significance of the effect size are obtained using a moderated t-test with number
#' of degrees of freedom determined from the uncertainty regression hyperparameters and are adjusted for multiple testing using the Benjamini-
#' Hochberg procedure to control false discovery rates (FDRs).
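#'
#' In code, the comparison boils down to something like the following (a minimal sketch of the logic,
#' not the exact internal formula; \code{log_kdeg_ref}/\code{log_kdeg_exp} and \code{se_ref}/\code{se_exp}
#' stand for regularized log(kdeg) means and uncertainties, \code{nreps_total} for the total number of
#' replicates across the two conditions, and \code{df_extra} for the additional degrees of freedom
#' contributed by the uncertainty regression hyperparameters):
#' \preformatted{
#' effect <- log_kdeg_exp - log_kdeg_ref   # difference in log(kdeg) means
#' se <- sqrt(se_exp^2 + se_ref^2)         # independence assumption
#' L2FC_kdeg <- effect / log(2)
#'
#' # moderated t-test against the null |effect| <= null_cutoff, then BH adjustment
#' pval <- 2 * stats::pt(-(abs(effect) - null_cutoff) / se, df = nreps_total - 2 + df_extra)
#' padj <- stats::p.adjust(pval, method = "BH")
#' }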
#'
#' In some cases, the assumed ODE model of RNA metabolism will not accurately model the dynamics of a biological system being analyzed.
#' In these cases, it is best to compare logit(fraction new)s directly rather than converting fraction new to log(kdeg).
#' This analysis strategy is implemented when \code{NSS} is set to TRUE. Comparing logit(fraction new) is only valid
#' if a single metabolic label time has been used for all samples. For example, if a label time of 1 hour was used for NR-seq
#' data from WT cells and a 2 hour label time was used in KO cells, this comparison is no longer valid as differences in
#' logit(fraction new) could stem from differences in kinetics or label times.
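#'
#' The label time dependence follows directly from the model: with identical kinetics but different
#' label times, the logit(fraction new)s differ even though nothing biological has changed. For example:
#' \preformatted{
#' kdeg <- 0.5                    # same degradation rate constant in both samples
#' fn_1hr <- 1 - exp(-kdeg * 1)   # ~0.39 with a 1 hour label time
#' fn_2hr <- 1 - exp(-kdeg * 2)   # ~0.63 with a 2 hour label time
#'
#' log(fn_1hr / (1 - fn_1hr))     # ~ -0.43
#' log(fn_2hr / (1 - fn_2hr))     # ~  0.54
#' }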
#'
#'
#' @param df Dataframe in form provided by cB_to_Fast
#' @param pnew Labeled read mutation rate; default of NULL means that the model estimates the rate from s4U fed data. If pnew is provided by the user, it must be a vector
#' of length == number of s4U fed samples. The 1st element corresponds to the s4U induced mutation rate estimate for the 1st replicate of the 1st
#' experimental condition; the 2nd element corresponds to the s4U induced mutation rate estimate for the 2nd replicate of the 1st experimental condition,
#' etc.
#' @param pold Unlabeled read mutation rate; default of NULL means that the model estimates the rate from no-s4U fed data
#' @param no_ctl Logical; if TRUE, then -s4U control is not used for background mutation rate estimation
#' @param read_cut Minimum number of reads for a given feature-sample combo to be used for mut rate estimates
#' @param features_cut Number of features to estimate sample specific mutation rate with
#' @param nbin Number of bins for mean-variance relationship estimation. If NULL, max of 10 or (number of logit(fn) estimates)/100 is used
#' @param prior_weight Determines extent to which logit(fn) variance is regularized to the mean-variance regression line
#' @param MLE Logical; if TRUE then replicate logit(fn) is estimated using maximum likelihood; if FALSE more conservative Bayesian hypothesis testing is used
#' @param ztest Logical; if TRUE, then a z-test is used for p-value calculation rather than the more conservative moderated t-test. 
#' @param lower Lower bound for MLE with L-BFGS-B algorithm
#' @param upper Upper bound for MLE with L-BFGS-B algorithm
#' @param se_max Uncertainty assigned to transcripts with estimates at the bounds set by \code{lower} and \code{upper}. This prevents downstream errors due to
#' abnormally high standard errors for transcripts with extreme kinetics
#' @param mut_reg If MLE has instabilities, empirical mut rate will be used to estimate fn, multiplying pnew by 1+mut_reg and pold by 1-mut_reg to regularize fn
#' @param p_mean Mean of normal distribution used as prior penalty in MLE of logit(fn)
#' @param p_sd Standard deviation of normal distribution used as prior penalty in MLE of logit(fn)
#' @param StanRate Logical; if TRUE, a simple 'Stan' model is used to estimate mutation rates for fast_analysis; this may add a couple minutes
#' to the runtime of the analysis.
#' @param Stan_data List; if StanRate is TRUE, then this is the data passed to the 'Stan' model to estimate mutation rates. If using the \code{bakRFit}
#' wrapper of \code{fast_analysis}, then this is created automatically.
#' @param null_cutoff bakR will test the null hypothesis of |effect size| < |null_cutoff|
#' @param NSS Logical; if TRUE, logit(fn)s are compared rather than log(kdeg) so as to avoid steady-state assumption.
#' @param Chase Logical; Set to TRUE if analyzing a pulse-chase experiment. If TRUE, kdeg = -ln(fn)/tl where fn is the fraction of
#' reads that are s4U (more properly referred to as the fraction old in the context of a pulse-chase experiment)
#' @param BDA_model Logical; if TRUE, variance is regularized with scaled inverse chi-squared model. Otherwise a log-normal
#' model is used.
#' @param multi_pold Logical; if TRUE, pold is estimated for each sample rather than use a global pold estimate.
#' @param Long Logical; if TRUE, long read optimized fraction new estimation strategy is used.
#' @param kmeans Logical; if TRUE, kmeans clustering on read-specific mutation rates is used to estimate pnews and pold.
#' @param Fisher Logical; if TRUE, Fisher information is used to estimate logit(fn) uncertainty. Else, a less conservative binomial model is used, which
#' can be preferable in instances where the Fisher information strategy often drastically overestimates uncertainty
#' (i.e., low coverage or low pnew).
#' @return List with dataframes providing information about replicate-specific and pooled analysis results. The output includes:
#' \itemize{
#'  \item Fn_Estimates; dataframe with estimates for the fraction new and fraction new uncertainty for each feature in each replicate.
#'  The columns of this dataframe are:
#'  \itemize{
#'   \item Feature_ID; Numerical ID of feature
#'   \item Exp_ID; Numerical ID for experimental condition (Exp_ID from metadf)
#'   \item Replicate; Numerical ID for replicate
#'   \item logit_fn; logit(fraction new) estimate, unregularized
#'   \item logit_fn_se; logit(fraction new) uncertainty, unregularized and obtained from Fisher Information
#'   \item nreads; Number of reads mapping to the feature in the sample for which the estimates were obtained
#'   \item log_kdeg; log of degradation rate constant (kdeg) estimate, unregularized
#'   \item kdeg; degradation rate constant (kdeg) estimate
#'   \item log_kd_se; log(kdeg) uncertainty, unregularized and obtained from Fisher Information
#'   \item sample; Sample name
#'   \item XF; Original feature name
#'  }
#'  \item Regularized_ests; dataframe with average fraction new and kdeg estimates, averaged across the replicates and regularized
#'  using priors informed by the entire dataset. The columns of this dataframe are:
#'  \itemize{
#'   \item Feature_ID; Numerical ID of feature
#'   \item Exp_ID; Numerical ID for experimental condition (Exp_ID from metadf)
#'   \item avg_log_kdeg; Weighted average of log(kdeg) from each replicate, weighted by sample and feature-specific read depth
#'   \item sd_log_kdeg; Standard deviation of the log(kdeg) estimates
#'   \item nreads; Total number of reads mapping to the feature in that condition
#'   \item sdp; Prior standard deviation for fraction new estimate regularization
#'   \item theta_o; Prior mean for fraction new estimate regularization
#'   \item sd_post; Posterior uncertainty
#'   \item log_kdeg_post; Posterior mean for log(kdeg) estimate
#'   \item kdeg; exp(log_kdeg_post)
#'   \item kdeg_sd; kdeg uncertainty
#'   \item XF; Original feature name
#'  }
#'  \item Effects_df; dataframe with estimates of the effect size (change in logit(fn)) comparing each experimental condition to the
#'  reference sample for each feature. This dataframe also includes p-values obtained from a moderated t-test. The columns of this
#'  dataframe are:
#'  \itemize{
#'   \item Feature_ID; Numerical ID of feature
#'   \item Exp_ID; Numerical ID for experimental condition (Exp_ID from metadf)
#'   \item L2FC(kdeg); Log2 fold change (L2FC) kdeg estimate or change in logit(fn) if NSS TRUE
#'   \item effect; effect size (change in log(kdeg), or change in logit(fn) if NSS is TRUE)
#'   \item se; Uncertainty in L2FC_kdeg
#'   \item pval; P-value obtained using effect, se, and a moderated t-test (or a z-test if ztest is TRUE)
#'   \item padj; pval adjusted for multiple testing using Benjamini-Hochberg procedure
#'   \item XF; Original feature name
#'  }
#'  \item Mut_rates; list of two elements. The 1st element is a dataframe of s4U induced mutation rate estimates, where the mut column
#'  represents the experimental ID and the rep column represents the replicate ID. The 2nd element is the single background mutation
#'  rate estimate used
#'  \item Hyper_Parameters; vector of two elements, named a and b. These are the hyperparameters estimated from the uncertainties for each
#'  feature, and represent the two parameters of a Scaled Inverse Chi-Square distribution. Importantly, a is the number of additional
#'  degrees of freedom provided by the sharing of uncertainty information across the dataset, to be used in the moderated t-test.
#'  \item Mean_Variance_lms; linear model objects obtained from the uncertainty vs. read count regression model. One model is run for each Exp_ID
#' }
#' @importFrom magrittr %>%
#' @examples
#' \donttest{
#'
#' # Simulate small dataset
#' sim <- Simulate_bakRData(300, nreps = 2)
#'
#' # Fit fast model to get fast_df
#' Fit <- bakRFit(sim$bakRData)
#'
#' # Fit fast model with fast_analysis
#' Fast_Fit <- fast_analysis(Fit$Data_lists$Fast_df)
#' }
#' @export
fast_analysis <- function(df, pnew = NULL, pold = NULL, no_ctl = FALSE,
                          read_cut = 50, features_cut = 50,
                          nbin = NULL, prior_weight = 2,
                          MLE = TRUE, ztest = FALSE,
                          lower = -7, upper = 7,
                          se_max = 2.5,
                          mut_reg = 0.1,
                          p_mean = 0,
                          p_sd = 1,
                          StanRate = FALSE,
                          Stan_data = NULL,
                          null_cutoff = 0,
                          NSS = FALSE,
                          Chase = FALSE,
                          BDA_model = FALSE,
                          multi_pold = FALSE,
                          Long = FALSE,
                          kmeans = FALSE, Fisher = TRUE){


  # Bind variables locally to resolve devtools::check() Notes
  nT <- n <- TC <- mut <- reps <- fnum <- New_prob <- Old_prob <- NULL
  News <- Fn_rep_est <- lam_n <- lam_o <- logit_fn_rep <- totTC <- NULL
  totU <- tot_mut <- totUs <- avg_mut <- logit_fn_rep.x <- logit_fn_rep.y <- NULL
  kd_rep_est <- U_cont <- Exp_l_fn <- Fisher_lkd_num <- Fisher_lkd_den <- NULL
  Inv_Fisher_Logit_1 <- Inv_Fisher_Logit_2 <- Inv_Fisher_Logit_3 <- NULL
  tot_n <- Fisher_kdeg <- Fisher_Logit <- Logit_fn_se <- bin_ID <- kd_sd_log <- NULL
  intercept <- slope <- log_kd_rep_est <- avg_log_kd <- sd_log_kd <- NULL
  sd_post <- sdp <- theta_o <- log_kd_post <- effect_size <- effect_std_error <- NULL
  n_new <- nreads <- logit_fn_se <- log_kd_se <- NULL
  global_mean <- global_var <- alpha_p <- beta_p <- NULL
  
  


  # Check input validity -------------------------------------------------------

  ## Check df
  if (!all(c("sample", "XF", "TC", "nT", "n", "fnum", "type", "mut", "reps", "tl") %in% names(df))) {
    stop("`df` must contain `sample`, `XF`, `TC`, `nT`, `n`, `fnum`, `type`, `mut`, `reps`, and `tl` columns")
  }

  ## Check pold
  if(length(pold) > 1 & !multi_pold){
    stop("pold must be NULL or length 1 if multi_pold is TRUE")
  }

  if(!is.null(pold)){
    if(!is.numeric(pold)){
      stop("pold must be numeric")
    }else if(any(pold < 0)){
      stop("pold must be >= 0")
    }else if(any(pold > 1)){
      stop("pold must be <= 1")
    }
  }

  ## Extract info about # of replicates in each condition
  nMT <- max(df$mut)

  # nreps must be vector where each element i is number of replicates
  # in Exp_ID = i
  nreps <- rep(0, times = nMT)
  for(i in 1:nMT){
    nreps[i] <- max(df$reps[df$mut == i & df$type == 1])
  }


  ## Check pnew
  if(!is.null(pnew)){
    if(!all(is.numeric(pnew))){
      stop("All elements of pnew must be numeric")
    }else if((length(pnew) != sum(nreps)) & (length(pnew) != 1) ){
      stop("pnew must be a vector of length == number of s4U fed samples in dataset, or length 1.")
    }else if(!all(pnew > 0)){
      stop("All elements of pnew must be > 0")
    }else if(!all(pnew <=1)){
      stop("All elements of pnew must be <= 1")
    }
  }
  
  if(!is.null(pold)){
    if(!all(is.numeric(pold))){
      stop("All elements of pold must be numeric")
    }else if(((length(pold) != sum(nreps)) & (length(pold) != 1)) & multi_pold ){
      stop("pold must be a vector of length == number of s4U fed samples in dataset, or length 1.")
    }else if(!all(pold >= 0)){
      stop("All elements of pold must be >= 0")
    }else if(!all(pold <1)){
      stop("All elements of pold must be < 1")
    }
  }

  ## Check no_ctl
  if(!is.logical(no_ctl)){
    stop("no_ctl must be logical (TRUE or FALSE)")
  }
  

  ## Check read_cut
  if(!is.numeric(read_cut)){
    stop("read_cut must be numeric")
  }else if(read_cut < 0){
    stop("read_cut must be >= 0")
  }

  ## Check features_cut
  if(!is.numeric(features_cut)){
    stop("features_cut must be numeric")
  }else if(!is.integer(features_cut)){
    features_cut <- as.integer(features_cut)
  }

  if(features_cut <= 0){
    stop("features_cut must be > 0")
  }

  ## Check nbin
  if(!is.null(nbin)){
    if(!is.numeric(nbin)){
      stop("nbin must be numeric")
    }else if(!is.integer(nbin)){
      nbin <- as.integer(nbin)
    }

    if(nbin <= 0){
      stop("nbin must be > 0 and is preferably greater than or equal to 10")
    }
  }
  
  ### Check ztest
  if(!is.logical(ztest)){
    stop("ztest must be logical (TRUE or FALSE)")
  }


  ## Check prior_weight
  if(!is.numeric(prior_weight)){
    stop("prior_weight must be numeric")
  }else if(prior_weight < 1){
    stop("prior_weight must be >= 1. The larger the value, the more each 
    feature's fraction new uncertainty estimate is shrunk towards the
    log10(read counts) vs. log(uncertainty) regression line.")
  }


  ## Check MLE
  if(!is.logical(MLE)){
    stop("MLE must be logical (TRUE or FALSE)")
  }

  if(!MLE){
    stop("Bayesian hypothesis estimation strategy has been removed, set MLE to TRUE.")
  }

  ## Check lower and upper
  if(!all(is.numeric(c(lower, upper)))){
    stop("lower and upper must be numeric")
  }else if(upper < lower){
    stop("upper must be > lower. Upper and lower represent the upper and lower 
         bounds used by stats::optim for MLE")
  }

  ## Check mut_reg
  if(!is.numeric(mut_reg)){
    stop("mut_reg must be numeric")
  }else if(mut_reg < 0){
    stop("mut_reg must be >= 0")
  }else if(mut_reg > 0.5){
    stop("mut_reg must be <= 0.5")
  }

  ## Check p_mean
  if(!is.numeric(p_mean)){
    stop("p_mean must be numeric")
  }else if(abs(p_mean) > 2){
    warning("p_mean is far from 0. This might bias estimates considerably; 
            tread lightly.")
  }

  ## Check p_sd
  if(!is.numeric(p_sd)){
    stop("p_sd must be numeric")
  }else if(p_sd <= 0){
    stop("p_sd must be greater than 0")
  }else if(p_sd < 0.5){
    warning("p_sd is pretty small. This might bias estimates considerably; 
            tread lightly.")
  }

  ## Check StanRate
  if(!is.logical(StanRate)){
    stop("StanRate must be logical (TRUE or FALSE)")
  }

  ## Check null_cutoff
  if(!is.numeric(null_cutoff)){
    stop("null_cutoff must be numeric")
  }else if(null_cutoff < 0){
    stop("null_cutoff must be 0 or positive")
  }else if(null_cutoff > 2){
    warning("You are testing against a null hypothesis |L2FC(kdeg)| greater 
            than 2; this might be too conservative")
  }


  ## Check Chase
  if(!is.logical(Chase)){
    stop("Chase must be logical (TRUE or FALSE)")
  }

  ## Check Fisher
  if(!is.logical(Fisher)){
    stop("Fisher must be logical (TRUE or FALSE")
  }

  # ESTIMATE MUTATION RATES ----------------------------------------------------

  # Helper functions that I will use on multiple occasions
  logit <- function(x) log(x/(1-x))
  inv_logit <- function(x) exp(x)/(1+exp(x))


  if(Long | kmeans){
    
    message("Estimating pnew with kmeans clustering")
    
    # Make sure package for clustering is installed
    if (!requireNamespace("Ckmeans.1d.dp", quietly = TRUE))
      stop("The kmeans estimation strategy requires the 'Ckmeans.1d.dp' package, 
           which cannot be found. Please install 'Ckmeans.1d.dp' using 
           'install.packages('Ckmeans.1d.dp')'.")

    # Filter out -s4U data; won't use that here
    df <- df[df$type == 1,]

    # Find all combos of mut and reps
    id_dict <- df %>%
      dplyr::select(mut, reps) %>%
      dplyr::distinct()

    New_data_estimate <- data.frame(pnew = 0, pold = 0, mut = id_dict$mut,
                                    reps = id_dict$reps)
    # Loop through rows of id_dict
    for(i in 1:nrow(id_dict)){
      rows <- which(df$mut == id_dict$mut[i] & df$reps == id_dict$reps[i])
      rates <- rep(df$TC[rows]/df$nT[rows], times = df$n[rows])

      means <- Ckmeans.1d.dp::Ckmeans.1d.dp(rates, k = 2)$centers


      New_data_estimate$pnew[i] <- max(means)
      New_data_estimate$pold[i] <- min(means)

    }


    
    
    if(multi_pold){
      New_data_estimate <- New_data_estimate[,c("pnew", "pold", "mut", "reps")]
      
      Old_data_estimate <- New_data_estimate[,c("pold", "mut", "reps")]
      New_data_estimate <- New_data_estimate[,c("pnew", "mut", "reps")]

    }else{
      
      if((sum(df$type == 0) > 0) & !no_ctl & !multi_pold){
        message("Estimating unlabeled mutation rate with -s4U data")
        
        #Old mutation rate estimation
        Mut_data <- df
        
        Old_data <- Mut_data[Mut_data$type == 0, ]
        
        Old_data <- Old_data %>% dplyr::ungroup() %>%
          dplyr::summarise(mutrate = sum(n*TC)/sum(n*nT))
        
        pold <- Old_data$mutrate
      }else{
        # average out pold
        pold <- mean(New_data_estimate$pold)
        New_data_estimate <- New_data_estimate[,c("pnew", "mut", "reps")]
        
      }

      
    }


  }else{
    ### Old mutation rate estimation
    if((is.null(pnew) | is.null(pold)) & StanRate ){ # use Stan


      mut_fit <- rstan::sampling(stanmodels$Mutrate_est, data = Stan_data, chains = 1)
    }

    ## Make data frame of pnew estimates
    if(is.null(pnew)){

      # Get estimates from Stan fit summary
      if(StanRate){
        
        message("Estimating pnew with Stan output")

        # Even if replicates are imbalanced (more replicates of a given experimental condition
        # than another), the number of mutation rates Stan will estimate = max(nreps)*nMT
        # nreps = vector containing number of replicates for ith experimental condition
        # nMT = number of experimental conditions
        # Therefore, have to remove imputed mutation rates

        nrep_mut <- max(nreps)

        U_df <- df[(df$type == 1) & (df$XF %in% unique(Stan_data$sdf$XF)),] %>% dplyr::group_by(mut, reps) %>%
          dplyr::summarise(avg_T = sum(nT*n)/sum(n))

        U_df <- U_df[order(U_df$mut, U_df$reps),]

        # In theory this may have too many entries; some could be imputed
        # So going to need to associate a replicate/experimental ID with
        # each row and remove those that are not real
        # Especially important since I am dividing by U_df$avg_T
        pnew <- exp(as.data.frame(rstan::summary(mut_fit, pars = "log_lambda_n")$summary)$mean)

        rep_theory <- rep(seq(from = 1, to = nrep_mut), times = nMT)
        mut_theory <- rep(seq(from = 1, to = nMT), each = nrep_mut)

        rep_actual <- unlist(lapply(nreps, function(x) seq(1, x)))
        mut_actual <- rep(1:nMT, times = nreps)

        pnewdf <- data.frame(pnew = pnew,
                             R = rep_theory,
                             E = mut_theory)

        truedf <- data.frame(R = rep_actual,
                             E = mut_actual)


        ## Get final pnew estimate vector
        pnewdf <- dplyr::right_join(pnewdf, truedf, by = c("R", "E"))

        pnew <- pnewdf$pnew/U_df$avg_T


        if(!is.null(pold)){
          rm(mut_fit)
        }


        New_data_estimate <- data.frame(mut_actual, rep_actual, pnew)
        colnames(New_data_estimate) <- c("mut", "reps", "pnew")



      }else{ # Use binomial mixture model to estimate the new read mutation rate

        message("Estimating pnew with likelihood maximization")

        # Binomial mixture likelihood
        mixture_lik <- function(param, TC, nT, n){

          logl <- sum(n*log(inv_logit(param[3])*(factorial(nT)/(factorial(nT-TC)*factorial(TC)))*(inv_logit(param[1])^TC)*((1 -inv_logit(param[1]))^(nT-TC)) +  (1-inv_logit(param[3]))*(factorial(nT)/(factorial(nT-TC)*factorial(TC)))*(inv_logit(param[2])^TC)*((1 - inv_logit(param[2]))^(nT-TC)) ) )

          return(-logl)

        }

        # Remove unlabeled controls
        df_pnew <- df[df$type == 1,]


        # Summarize out genes
        df_pnew <- df_pnew %>%
          dplyr::group_by(TC, nT, mut, reps) %>%
          dplyr::summarise(n = sum(n))

        # Check to make sure the likelihood is computable
        df_check <- df_pnew %>%
          dplyr::mutate(fn = mixture_lik(param = c(-7, -2, 0),
                                         TC = TC,
                                         nT = nT,
                                         n = n))

        if(sum(!is.finite(df_check$fn)) > 0){
          stop("The binomial log-likelihood is not computable for some of your 
               data, meaning the default mutation rate estimation strategy won't 
               work. Rerun bakRFit with StanRateEst set to TRUE to remedy this problem.")
        }

        rm(df_check)


        low_ps <- c(-9, -9, -9)
        high_ps <- c(0, 0, 9)


        # Fit mixture model to each sample
        df_pnew <- df_pnew %>%
          dplyr::group_by(mut, reps) %>%
          dplyr::summarise(pnew = inv_logit(max(stats::optim(par=c(-7, -2, 0), mixture_lik, TC = TC, nT = nT, n = n, method = "L-BFGS-B", lower = low_ps, upper = high_ps)$par[1:2])) )

        New_data_estimate <- df_pnew


      }


    }else{ # Construct pmut data frame from User input

      message("Using provided pnew estimates")
      
      if(length(pnew) == 1){ # replicate the single pnew provided

        pnew_vect <- rep(pnew, times = sum(nreps) )

        ## Compatible with unbalanced replicates
        rep_vect <- unlist(lapply(nreps, function(x) seq(1, x)))

        mut_vect <- rep(1:nMT, times = nreps)

        New_data_estimate <- data.frame(mut_vect, rep_vect, pnew_vect)
        colnames(New_data_estimate) <- c("mut", "reps", "pnew")

      } else{ # use vector of pnews provided

        ## Compatible with unbalanced replicates
        rep_vect <- unlist(lapply(nreps, function(x) seq(1, x)))

        mut_vect <- rep(1:nMT, times = nreps)
        New_data_estimate <- data.frame(mut_vect, rep_vect, pnew)
        colnames(New_data_estimate) <- c("mut", "reps", "pnew")
      }
    }

    ### Estimate mutation rate in old reads
    if(is.null(pold)){
      
      if((sum(df$type == 0) > 0) & !no_ctl & !multi_pold & !StanRate){
        message("Estimating unlabeled mutation rate with -s4U data")
        
        #Old mutation rate estimation
        Mut_data <- df
        
        Old_data <- Mut_data[Mut_data$type == 0, ]
        
        Old_data <- Old_data %>% dplyr::ungroup() %>%
          dplyr::summarise(mutrate = sum(n*TC)/sum(n*nT))
        
        pold <- Old_data$mutrate
        
      }else if(StanRate){ # Use Stan to estimate rates

        message("Estimating unlabeled mutation rate with Stan output")
        
        ## Extract estimate of mutation rate in old reads from Stan fit
        nrep_mut <- max(df$reps)

        # Calculate U-content
        U_df <- df[(df$type == 1) & (df$XF %in% unique(Stan_data$sdf$XF)),] %>% dplyr::group_by(mut, reps) %>%
          dplyr::summarise(avg_T = sum(nT*n)/sum(n))

        U_df <- U_df[order(U_df$mut, U_df$reps),]

        # Vector of pold estimates for each sample
        pold <- exp(as.data.frame(rstan::summary(mut_fit, pars = "log_lambda_o")$summary)$mean)

        # Balanced replicate vectors
        # For matching Stan estimates to a replicate and experimental condition ID
        rep_theory <- rep(seq(from = 1, to = nrep_mut), times = nMT)
        mut_theory <- rep(seq(from = 1, to = nMT), each = nrep_mut)

        # Actual replicate vectors
        # Will be same as rep_theory and mut_theory if there are the same
        # number of replicates in each experimental condition
        rep_actual <- unlist(lapply(nreps, function(x) seq(1, x)))
        mut_actual <- rep(1:nMT, times = nreps)

        polddf <- data.frame(pold = pold,
                             R = rep_theory,
                             E = mut_theory)

        truedf <- data.frame(R = rep_actual,
                             E = mut_actual)


        # Filter out imputed data
        polddf <- dplyr::right_join(polddf, truedf, by = c("R", "E"))
        
        pold <- polddf$pold/U_df$avg_T
      
        
        rm(mut_fit)
        rm(U_df)
        
        Old_data_estimate <- data.frame(mut_actual, rep_actual, pold)
        colnames(Old_data_estimate) <- c("mut", "reps", "pold")
        
        

        if(!multi_pold){
          # Final estimate of mutation rate in old reads
          pold <- mean(Old_data_estimate$pold)
          

        }
       
      }else{

          message("Estimating unlabeled mutation rate with likelihood maximization")

          # Binomial mixture likelihood
          mixture_lik <- function(param, TC, nT, n){

            logl <- sum(n*log(inv_logit(param[3])*(factorial(nT)/(factorial(nT-TC)*factorial(TC)))*(inv_logit(param[1])^TC)*((1 -inv_logit(param[1]))^(nT-TC)) +  (1-inv_logit(param[3]))*(factorial(nT)/(factorial(nT-TC)*factorial(TC)))*(inv_logit(param[2])^TC)*((1 - inv_logit(param[2]))^(nT-TC)) ) )

            return(-logl)

          }

          # Remove unlabeled controls
          df_pold <- df[df$type == 1,]


          # Summarize out genes
          df_pold <- df_pold %>%
            dplyr::group_by(TC, nT, mut, reps) %>%
            dplyr::summarise(n = sum(n))


          # Check to make sure the likelihood is computable
          df_check <- df_pold %>%
            dplyr::mutate(fn = mixture_lik(param = c(-7, -2, 0),
                                           TC = TC,
                                           nT = nT,
                                           n = n))

          if(sum(!is.finite(df_check$fn)) > 0){
            stop("The binomial log-likelihood is not computable for some of your 
                 data, meaning the default mutation rate estimation strategy 
                 won't work. Rerun bakRFit with StanRateEst set to TRUE to remedy 
                 this problem.")
          }

          rm(df_check)


          low_ps <- c(-9, -9, -9)
          high_ps <- c(0, 0, 9)


          if(multi_pold){
            # Fit mixture model to each sample
            df_pold <- df_pold %>%
              dplyr::group_by(mut,reps) %>%
              dplyr::summarise(pold = inv_logit(min(stats::optim(par=c(-7, -2, 0), mixture_lik, TC = TC, nT = nT, n = n, method = "L-BFGS-B", lower = low_ps, upper = high_ps)$par[1:2])) )
            
            
            Old_data_estimate <- df_pold
            
          }else{
            # Fit mixture model to each sample
            df_pold <- df_pold %>%
              dplyr::group_by(mut,reps) %>%
              dplyr::summarise(pold = inv_logit(min(stats::optim(par=c(-7, -2, 0), mixture_lik, TC = TC, nT = nT, n = n, method = "L-BFGS-B", lower = low_ps, upper = high_ps)$par[1:2])) ) %>%
              dplyr::ungroup() %>%
              dplyr::summarise(pold = mean(pold))
            
            
            
            pold <- df_pold$pold

          }



        }
    }else{
      
      if(multi_pold){
        message("Using provided pold estimates")
        
        if(length(pold) == 1){ # replicate the single pold provided
          
          pold_vect <- rep(pold, times = sum(nreps) )
          
          ## Compatible with unbalanced replicates
          rep_vect <- unlist(lapply(nreps, function(x) seq(1, x)))
          
          mut_vect <- rep(1:nMT, times = nreps)
          
          Old_data_estimate <- data.frame(mut_vect, rep_vect, pold_vect)
          colnames(Old_data_estimate) <- c("mut", "reps", "pold")
          
        } else{ # use vector of polds provided
          
          ## Compatible with unbalanced replicates
          rep_vect <- unlist(lapply(nreps, function(x) seq(1, x)))
          
          mut_vect <- rep(1:nMT, times = nreps)
          Old_data_estimate <- data.frame(mut_vect, rep_vect, pold)
          colnames(Old_data_estimate) <- c("mut", "reps", "pold")
        }
      }else{
        message("Using provided pold estimate")
      
      }


    }
  }

  if(!multi_pold){
    New_data_estimate$pold <- pold
    message(paste0(c("Estimated pnews and polds for each sample are:", utils::capture.output(New_data_estimate)), collapse = "\n"))

  }else{
    New_data_estimate <- dplyr::inner_join(New_data_estimate, Old_data_estimate, by = c("reps", "mut"))
    message(paste0(c("Estimated pnews and polds for each sample are:", utils::capture.output(New_data_estimate)), collapse = "\n"))
    
  }


  if(!all(New_data_estimate$pnew - New_data_estimate$pold > 0)){
    stop("All pnew must be > pold; did you input an unusually large pold?")
  }

  # Mutation rate estimates
  pmuts_list <- New_data_estimate



  # PREP DATA FOR ANALYSES -----------------------------------------------------
    # i) Filter out any -s4U control samples
    # ii) Make lookup table mapping label times to experimental conditions
    # iii) Make lookup table mapping sample names to experimental conditions and replicate IDs
    # iv) Make lookup table mapping feature number to feature name

  Mut_data <- df

  # Filter out -s4U control samples
  Mut_data <- Mut_data[Mut_data$type == 1,]

  tl_df <- Mut_data %>% dplyr::select(mut, tl) %>%
    dplyr::distinct()

  # Label time lookup table
  tl_df <- tl_df[order(tl_df$mut),]

  tl <- tl_df$tl

  ngene <- max(Mut_data$fnum)
  num_conds <- max(Mut_data$mut)

  nreps <- rep(0, times = num_conds)
  for(i in 1:num_conds){
    nreps[i] <- max(Mut_data$reps[Mut_data$mut == i & Mut_data$type == 1])
  }

  # Sample characteristics lookup table
  sample_lookup <- Mut_data[, c("sample", "mut", "reps")] %>% dplyr::distinct()

  # Feature lookup table
  feature_lookup <- Mut_data[,c("fnum", "XF")] %>% dplyr::distinct()



  # ESTIMATE FRACTION NEW ------------------------------------------------------


  ## Estimate fraction new in each replicate using U-content adjusted Poisson model
  message("Estimating fraction labeled")

  Mut_data <- dplyr::left_join(Mut_data, New_data_estimate, by = c("mut", "reps"))

  if(Long){

    # Calculate variance with delta approximation using beta distribution
    var_calc <- function(alpha, beta){

      EX <- alpha/(alpha + beta)
      VX <- (alpha*beta)/(((alpha + beta)^2)*(alpha + beta + 1))


      var1 <- (((1/EX) + 1/(1 - EX))^2)*VX
      var2 <- -(-1/(4*(EX^2)) + 1/(4*((1-EX)^2)))^2
      var3 <- VX^2

      totvar <- var1 - var2*var3

      return(totvar)
    }

    # convert logit(fn) se -> log(kdeg) se
    logit_to_log <- function(lfn, varx){

      ## composition strategy
      #se <- sqrt(((((1 + exp(lfn))^-2)*varx)/(-log(1 - inv_logit(lfn))))*varx)

      ## full monte
      part1 <- 1/(log( (exp(-lfn))/(1 + exp(-lfn)))*(1 + exp(-lfn)))
      se <- sqrt((part1^2)*varx)

      return(se)

    }


    # 1) Add new read identifier
    cutoff <- function(pnew, pold){
      return(((logit(pnew) + logit(pold))/2 + logit((pnew + pold)/2))/2 )
    }
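    # Note: this threshold lives on the logit(mutation rate) scale; it averages the
    # midpoint of logit(pnew) and logit(pold) with the logit of the mean of the two
    # rates. Reads whose logit(TC/nT) exceeds it are classified as new below.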


    Mut_data <- Mut_data %>%
      dplyr::mutate(new = ifelse(logit(TC/nT) > cutoff(pnew, pold), TRUE, FALSE))

    # 2) Calculate number of new reads and total number of reads in each replicate for each feature
    Mut_data <- Mut_data %>%
      dplyr::group_by(fnum, mut, reps) %>%
      dplyr::summarise(nreads = sum(n),
                       n_new = sum(new))


    # 3) Calculate fraction new and uncertainty
    Mut_data_est <- Mut_data %>%
      dplyr::mutate(Fn_rep_est = (n_new + 1)/(nreads + 1),
                    logit_fn_rep = logit(Fn_rep_est),
                    logit_fn_se = sqrt(var_calc(n_new + 1, nreads + 1)),
                    kd_rep_est = -log(1 - Fn_rep_est)/tl[mut],
                    log_kd_rep_est = log(kd_rep_est),
                    log_kd_se = logit_to_log(logit_fn_rep, logit_fn_se^2))



  }else{ # MLE

    # Likelihood function for mixture model
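    # (U-content adjusted Poisson mixture: lam_n and lam_o are the expected numbers of
    # T-to-C conversions per read from new and old RNA, i.e., mutation rate times average
    # U-content; the dnorm term implements the normal prior penalty on logit(fn) set by
    # p_mean and p_sd.)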
    mixed_lik <- function(lam_n, lam_o, TC, n, logit_fn){
      logl <- sum(n*log(inv_logit(logit_fn)*(lam_n^TC)*exp(-lam_n) + (1-inv_logit(logit_fn))*(lam_o^TC)*exp(-lam_o) )) + log(stats::dnorm(logit_fn, mean = p_mean, sd = p_sd))
      return(-logl)
    }

    # Calculate logit(fn) MLE
    Mut_data_est <- Mut_data %>% dplyr::ungroup() %>% dplyr::mutate(lam_n = pnew*nT, lam_o = pold*nT) %>%
      dplyr::group_by(fnum, mut, reps, TC) %>%
      dplyr::summarise(lam_n = sum(lam_n*n)/sum(n), lam_o = sum(lam_o*n)/sum(n),
                       n = sum(n), .groups = "keep") %>%
      dplyr::ungroup() %>%
      dplyr::group_by(fnum, mut, reps) %>%
      dplyr::summarise(logit_fn_rep = stats::optim(0, mixed_lik, TC = TC, n = n, lam_n = sum(lam_n*n)/sum(n), lam_o = sum(lam_o*n)/sum(n), method = "L-BFGS-B", lower = lower, upper = upper)$par, nreads =sum(n), .groups = "keep") %>%
      dplyr::mutate(logit_fn_rep = ifelse(logit_fn_rep == lower, stats::runif(1, lower-0.2, lower), ifelse(logit_fn_rep == upper, stats::runif(1, upper, upper+0.2), logit_fn_rep))) %>%
      dplyr::ungroup()

    ## Look for numerical instabilities
    instab_df <- Mut_data_est %>% dplyr::filter(abs(logit_fn_rep) > upper )

    fnum_instab <- instab_df$fnum
    mut_instab <- instab_df$mut
    reps_instab <- instab_df$reps

    # Replace unstable estimates with estimate + jitter so as to not underestimate replicate variability
    instab_est <- dplyr::right_join(Mut_data, instab_df, by = c("fnum", "mut", "reps")) %>%
      dplyr::ungroup() %>%
      dplyr::mutate(totTC = TC*n, totU = nT*n) %>%
      dplyr::group_by(fnum, mut, reps) %>%
      dplyr::summarise(tot_mut = sum(totTC), totUs = sum(totU), pnew = mean(pnew), pold = mean(pold)) %>% dplyr::ungroup() %>%
      dplyr::mutate(avg_mut = tot_mut/totUs,
                    Fn_rep_est = (avg_mut - (1-mut_reg)*pold)/((1+mut_reg)*pnew - (1-mut_reg)*pold)) %>%
      dplyr::mutate(Fn_rep_est = ifelse(Fn_rep_est > 1, stats::runif(1, min = inv_logit(upper-0.1), max = 1) , ifelse(Fn_rep_est < 0, stats::runif(1, min = 0, max = inv_logit(lower+0.1)), Fn_rep_est ) )) %>%
      dplyr::mutate(logit_fn_rep = logit(Fn_rep_est)) %>%
      dplyr::select(fnum, mut, reps, logit_fn_rep)

    # Generate estimates on other useful scales

    if(Chase){
      Mut_data_est <- dplyr::left_join(Mut_data_est, instab_est, by = c("fnum", "mut", "reps")) %>%
        dplyr::mutate(logit_fn_rep = ifelse(abs(logit_fn_rep.x) > upper, -logit_fn_rep.y, -logit_fn_rep.x)) %>%
        dplyr::select(fnum, mut, reps, nreads, logit_fn_rep) %>%
        dplyr::mutate(Fn_rep_est = inv_logit(logit_fn_rep)) %>%
        dplyr::mutate(kd_rep_est = -log(1 - Fn_rep_est)/tl[mut])
    }else{
      Mut_data_est <- dplyr::left_join(Mut_data_est, instab_est, by = c("fnum", "mut", "reps")) %>%
        dplyr::mutate(logit_fn_rep = ifelse(abs(logit_fn_rep.x) > upper, logit_fn_rep.y, logit_fn_rep.x)) %>%
        dplyr::select(fnum, mut, reps, nreads, logit_fn_rep) %>%
        dplyr::mutate(Fn_rep_est = inv_logit(logit_fn_rep)) %>%
        dplyr::mutate(kd_rep_est = -log(1 - Fn_rep_est)/tl[mut])
    }

    message("Estimating per replicate uncertainties")

    Mut_data <- dplyr::left_join(Mut_data, Mut_data_est[, c("kd_rep_est" ,"logit_fn_rep", "fnum", "mut", "reps")], by = c("fnum", "mut", "reps"))

    ## Estimate Fisher Info and uncertainties
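    # For each feature/sample, the observed Fisher information for log(kdeg) and logit(fn)
    # is computed analytically from the U-content adjusted Poisson mixture likelihood,
    # accumulated over reads, and inverted to obtain asymptotic standard errors, which are
    # capped at se_max to avoid extreme values near the estimation boundaries.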
    if(Fisher){
      if(Chase){
        Mut_data <- Mut_data %>% dplyr::ungroup() %>%
          dplyr::group_by(fnum, mut, reps, TC, pnew, pold, logit_fn_rep, kd_rep_est) %>%
          dplyr::summarise(U_cont = sum(nT*n)/sum(n), n = sum(n), .groups = "keep") %>%
          dplyr::mutate(Exp_l_fn = exp(-logit_fn_rep)) %>%
          dplyr::mutate(Inv_Fisher_Logit_3 = 1/(((pnew/pold)^TC)*exp(-(U_cont)*(pnew - pold)) - 1 )) %>%
          dplyr::mutate(Inv_Fisher_Logit_1 = 1 + Exp_l_fn ) %>%
          dplyr::mutate(Inv_Fisher_Logit_2 = ((1 + Exp_l_fn)^2)/Exp_l_fn) %>%
          dplyr::mutate(Fisher_lkd_num = tl[mut]*kd_rep_est*( ((pnew*U_cont)^TC)*exp(-pnew*U_cont) - ( ((pold*U_cont)^TC)*exp(-pold*U_cont) ) ) ) %>%
          dplyr::mutate(Fisher_lkd_den = (exp(kd_rep_est*tl[mut]) - 1)*((pnew*U_cont)^TC)*exp(-pnew*U_cont) + ((pold*U_cont)^TC)*exp(-pold*U_cont) ) %>%
          dplyr::ungroup() %>%
          dplyr::group_by(fnum, mut, reps) %>%
          dplyr::summarise(Fisher_kdeg = sum(n*((Fisher_lkd_num/Fisher_lkd_den)^2))/sum(n),
                           Fisher_Logit = sum(n/((Inv_Fisher_Logit_1 + Inv_Fisher_Logit_2*Inv_Fisher_Logit_3)^2))/sum(n),
                           tot_n = sum(n)) %>% #,
          #Fisher_fn = sum(n*((Fisher_fn_num/Fisher_fn_den)^2)), tot_n = sum(n)) %>%
          dplyr::mutate(log_kd_se = 1/sqrt(tot_n*Fisher_kdeg),
                        Logit_fn_se = 1/sqrt(tot_n*Fisher_Logit)) %>%
          dplyr::mutate(log_kd_se = ifelse(log_kd_se > se_max, se_max, log_kd_se),
                        Logit_fn_se = ifelse(Logit_fn_se > se_max, se_max, Logit_fn_se))#, Fn_se = 1/sqrt(tot_n*Fisher_fn))
        
      }else{
        Mut_data <- Mut_data %>% dplyr::ungroup() %>%
          dplyr::group_by(fnum, mut, reps, TC, pnew, pold, logit_fn_rep, kd_rep_est) %>%
          dplyr::summarise(U_cont = sum(nT*n)/sum(n), n = sum(n), .groups = "keep") %>%
          dplyr::mutate(Exp_l_fn = exp(logit_fn_rep)) %>%
          dplyr::mutate(Inv_Fisher_Logit_3 = 1/(((pnew/pold)^TC)*exp(-(U_cont)*(pnew - pold)) - 1 )) %>%
          dplyr::mutate(Inv_Fisher_Logit_1 = 1 + Exp_l_fn ) %>%
          dplyr::mutate(Inv_Fisher_Logit_2 = ((1 + Exp_l_fn)^2)/Exp_l_fn) %>%
          dplyr::mutate(Fisher_lkd_num = tl[mut]*kd_rep_est*( ((pnew*U_cont)^TC)*exp(-pnew*U_cont) - ( ((pold*U_cont)^TC)*exp(-pold*U_cont) ) ) ) %>%
          dplyr::mutate(Fisher_lkd_den = (exp(kd_rep_est*tl[mut]) - 1)*((pnew*U_cont)^TC)*exp(-pnew*U_cont) + ((pold*U_cont)^TC)*exp(-pold*U_cont) ) %>%
          dplyr::ungroup() %>%
          dplyr::group_by(fnum, mut, reps) %>%
          dplyr::summarise(Fisher_kdeg = sum(n*((Fisher_lkd_num/Fisher_lkd_den)^2))/sum(n),
                           Fisher_Logit = sum(n/((Inv_Fisher_Logit_1 + Inv_Fisher_Logit_2*Inv_Fisher_Logit_3)^2))/sum(n),
                           tot_n = sum(n)) %>% #,
          #Fisher_fn = sum(n*((Fisher_fn_num/Fisher_fn_den)^2)), tot_n = sum(n)) %>%
          dplyr::mutate(log_kd_se = 1/sqrt(tot_n*Fisher_kdeg),
                        Logit_fn_se = 1/sqrt(tot_n*Fisher_Logit)) %>%
          dplyr::mutate(log_kd_se = ifelse(log_kd_se > se_max, se_max, log_kd_se),
                        Logit_fn_se = ifelse(Logit_fn_se > se_max, se_max, Logit_fn_se))#, Fn_se = 1/sqrt(tot_n*Fisher_fn))
        
      }
      
      Mut_data_est$logit_fn_se <- Mut_data$Logit_fn_se
      Mut_data_est$log_kd_se <- Mut_data$log_kd_se
      
    }else{
      ## Procedure:
      # 1) Estimate beta distribution prior empirically
      # 2) Determine beta posterior analytically
      # 3) Use beta sd as uncertainty
      
      ## Estimate beta prior
      fn_means <- Mut_data_est %>%
        dplyr::group_by(mut, reps) %>%
        dplyr::summarise(global_mean = mean(Fn_rep_est),
                         global_var = stats::var(Fn_rep_est))
      
      priors <- fn_means %>%
        dplyr::mutate(alpha_p = global_mean*(((global_mean*(1-global_mean))/global_var) - 1),
                      beta_p = alpha_p*(1 - global_mean)/global_mean)
      
      ## Add priors
      Mut_data_est <- dplyr::inner_join(Mut_data_est, priors, by = c("mut", "reps"))
      
      ## Estimate logit uncertainty
      lfn_calc <- function(alpha, beta){
        
        EX <- alpha/(alpha + beta)
        VX <- (alpha*beta)/(((alpha + beta)^2)*(alpha + beta + 1))
        
        
        totvar <- (((1/EX) + 1/(1 - EX))^2)*VX
        
        return(totvar)
      }
      
      ## Estimate log(kdeg) uncertainty
      lkdeg_calc <- function(alpha, beta){
        
        EX <- alpha/(alpha + beta)
        VX <- (alpha*beta)/(((alpha + beta)^2)*(alpha + beta + 1))
        
        
        totvar <- (( 1/(log(1-EX)*(1-EX)) )^2)*VX
        
        return(totvar)
      }
      
      Mut_data_est <- Mut_data_est %>%
        dplyr::mutate(logit_fn_se = sqrt(lfn_calc(alpha_p + nreads*Fn_rep_est, nreads + beta_p)),
                      log_kd_se = sqrt(lkdeg_calc(alpha_p + nreads*Fn_rep_est, nreads + beta_p)))
    }
    



    Mut_data_est <- Mut_data_est %>% dplyr::mutate(log_kd_rep_est = log(kd_rep_est))


  }
  
  logit_fn <- Mut_data_est$logit_fn_rep

  # Report numerical instabilities from maximum likelihood estimation
  if(!(all(logit_fn < upper) & all(logit_fn > lower))){
    num_unstable <- sum(logit_fn >= upper) + sum(logit_fn <= lower)
    tot_ests <- length(logit_fn)
    prcnt_unstable <- round((num_unstable/tot_ests)*100,3)
    warning(paste0(num_unstable, " out of " , tot_ests, " (", prcnt_unstable,"%)", " logit(fn) estimates are at the upper or lower bounds set. These likely represent features with limited data or extremely stable/unstable features. If the number of boundary estimates is concerning, try increasing the magnitude of upper and lower."))
  }


  # ESTIMATE VARIANCE VS. READ COUNT TREND -------------------------------------

  
  out <- avg_and_regularize(Mut_data_est, nreps, sample_lookup, feature_lookup,
                            nbin = nbin, NSS = NSS, 
                            BDA_model = BDA_model, null_cutoff = null_cutoff,
                            Mutrates = pmuts_list, ztest = ztest)
  
  return(out)

}

#' Efficiently average replicates of nucleotide recoding data and regularize
#'
#' \code{avg_and_regularize} pools and regularizes replicate estimates of kinetic parameters. There are two key steps in this
#' downstream analysis. 1st, the uncertainty for each feature is used to fit a linear ln(uncertainty) vs. log10(read depth) trend,
#' and uncertainties for individual features are shrunk towards the regression line. The uncertainty for each feature is a combination of the
#' Fisher Information asymptotic uncertainty as well as the amount of variability seen between estimates. Regularization of uncertainty
#' estimates is performed using the analytic results of a Normal distribution likelihood with known mean and unknown variance and conjugate
#' priors. The prior parameters are estimated from the regression and amount of variability about the regression line. The strength of
#' regularization can be tuned by adjusting the \code{prior_weight} parameter, with larger numbers yielding stronger shrinkage towards
#' the regression line. The 2nd step is to regularize the average kdeg estimates. This is done using the analytic results of a
#' Normal distribution likelihood model with unknown mean and known variance and conjugate priors. The prior parameters are estimated from the
#' population wide kdeg distribution (using its mean and standard deviation as the mean and standard deviation of the normal prior).
#' In the 1st step, the known mean is assumed to be the average kdeg, averaged across replicates and weighted by the number of reads
#' mapping to the feature in each replicate. In the 2nd step, the known variance is assumed to be that obtained following regularization
#' of the uncertainty estimates.
#'
#' Effect sizes (changes in kdeg) are obtained as the difference in log(kdeg) means between the reference and experimental
#' sample(s), and the log(kdeg)s are assumed to be independent so that the variance of the effect size is the sum of the
#' log(kdeg) variances. P-values assessing the significance of the effect size are obtained using a moderated t-test with number
#' of degrees of freedom determined from the uncertainty regression hyperparameters and are adjusted for multiple testing using the Benjamini-
#' Hochberg procedure to control false discovery rates (FDRs).
#'
#' In some cases, the assumed ODE model of RNA metabolism will not accurately model the dynamics of a biological system being analyzed.
#' In these cases, it is best to compare logit(fraction new)s directly rather than converting fraction new to log(kdeg).
#' This analysis strategy is implemented when \code{NSS} is set to TRUE. Comparing logit(fraction new) is only valid
#' if a single metabolic label time has been used for all samples. For example, if a label time of 1 hour was used for NR-seq
#' data from WT cells and a 2 hour label time was used in KO cells, this comparison is no longer valid as differences in
#' logit(fraction new) could stem from differences in kinetics or label times.
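#'
#' For reference, the \code{Mut_data_est} input that \code{avg_and_regularize} expects can be sketched
#' as follows (a hypothetical toy example with one feature, two replicates, and one experimental
#' condition; the required columns are documented under the \code{Mut_data_est} argument below):
#' \preformatted{
#' fn <- c(0.40, 0.45)  # toy fraction new estimates for one feature, two replicates
#' tl <- 1              # label time
#' Mut_data_est <- data.frame(
#'   fnum = 1, mut = 1, reps = c(1, 2), nreads = c(250, 300),
#'   logit_fn_rep = log(fn / (1 - fn)),
#'   logit_fn_se = c(0.25, 0.22),
#'   kd_rep_est = -log(1 - fn) / tl
#' )
#' Mut_data_est$log_kd_rep_est <- log(Mut_data_est$kd_rep_est)
#' Mut_data_est$log_kd_se <- c(0.30, 0.28)
#' }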
#'
#'
#' @param Mut_data_est Dataframe with fraction new estimation information. Required columns are: 
#' \itemize{
#'  \item fnum; numerical ID of feature
#'  \item reps; numerical ID of replicate
#'  \item mut; numerical ID of experimental condition (Exp_ID)
#'  \item logit_fn_rep; logit(fn) estimate
#'  \item kd_rep_est; kdeg estimate
#'  \item log_kd_rep_est; log(kdeg) estimate
#'  \item logit_fn_se; logit(fn) estimate uncertainty
#'  \item log_kd_se; log(kdeg) estimate uncertainty
#' }
#' @param nbin Number of bins for mean-variance relationship estimation. If NULL, max of 10 or (number of logit(fn) estimates)/100 is used
#' @param nreps Vector of number of replicates in each experimental condition
#' @param sample_lookup Dictionary mapping sample names to various experimental details
#' @param feature_lookup Dictionary mapping feature IDs to original feature names
#' @param null_cutoff bakR will test the null hypothesis of |effect size| < |null_cutoff|
#' @param NSS Logical; if TRUE, logit(fn)s are compared rather than log(kdeg) so as to avoid steady-state assumption.
#' @param Chase Logical; Set to TRUE if analyzing a pulse-chase experiment. If TRUE, kdeg = -ln(fn)/tl where fn is the fraction of
#' reads that are s4U (more properly referred to as the fraction old in the context of a pulse-chase experiment)
#' @param BDA_model Logical; if TRUE, variance is regularized with scaled inverse chi-squared model. Otherwise a log-normal
#' model is used.
#' @param Mutrates List containing new and old mutation rate estimates
#' @param ztest Logical; if TRUE, then a z-test is used for p-value calculation rather than the more conservative moderated t-test. 
#' @return List with dataframes providing information about replicate-specific and pooled analysis results. The output includes:
#' \itemize{
#'  \item Fn_Estimates; dataframe with estimates for the fraction new and fraction new uncertainty for each feature in each replicate.
#'  The columns of this dataframe are:
#'  \itemize{
#'   \item Feature_ID; Numerical ID of feature
#'   \item Exp_ID; Numerical ID for experimental condition (Exp_ID from metadf)
#'   \item Replicate; Numerical ID for replicate
#'   \item logit_fn; logit(fraction new) estimate, unregularized
#'   \item logit_fn_se; logit(fraction new) uncertainty, unregularized and obtained from Fisher Information
#'   \item nreads; Number of reads mapping to the feature in the sample for which the estimates were obtained
#'   \item log_kdeg; log of degradation rate constant (kdeg) estimate, unregularized
#'   \item kdeg; degradation rate constant (kdeg) estimate
#'   \item log_kd_se; log(kdeg) uncertainty, unregularized and obtained from Fisher Information
#'   \item sample; Sample name
#'   \item XF; Original feature name
#'  }
#'  \item Regularized_ests; dataframe with average fraction new and kdeg estimates, averaged across the replicates and regularized
#'  using priors informed by the entire dataset. The columns of this dataframe are:
#'  \itemize{
#'   \item Feature_ID; Numerical ID of feature
#'   \item Exp_ID; Numerical ID for experimental condition (Exp_ID from metadf)
#'   \item avg_log_kdeg; Weighted average of log(kdeg) from each replicate, weighted by sample and feature-specific read depth
#'   \item sd_log_kdeg; Standard deviation of the log(kdeg) estimates
#'   \item nreads; Total number of reads mapping to the feature in that condition
#'   \item sdp; Prior standard deviation for fraction new estimate regularization
#'   \item theta_o; Prior mean for fraction new estimate regularization
#'   \item sd_post; Posterior uncertainty
#'   \item log_kdeg_post; Posterior mean for log(kdeg) estimate
#'   \item kdeg; exp(log_kdeg_post)
#'   \item kdeg_sd; kdeg uncertainty
#'   \item XF; Original feature name
#'  }
#'  \item Effects_df; dataframe with estimates of the effect size (change in logit(fn)) comparing each experimental condition to the
#'  reference sample for each feature. This dataframe also includes p-values obtained from a moderated t-test. The columns of this
#'  dataframe are:
#'  \itemize{
#'   \item Feature_ID; Numerical ID of feature
#'   \item Exp_ID; Numerical ID for experimental condition (Exp_ID from metadf)
#'   \item L2FC_kdeg; Log2 fold change (L2FC) in kdeg, or change in logit(fn) if NSS is TRUE
#'   \item effect; Effect size: the change in log(kdeg) (natural log scale), or in logit(fn) if NSS is TRUE, relative to the reference condition
#'   \item se; Standard error of the effect size
#'   \item pval; P-value obtained from a moderated t-test (or a z-test if ztest is TRUE) using effect, se, and null_cutoff
#'   \item padj; pval adjusted for multiple testing using Benjamini-Hochberg procedure
#'   \item XF; Original feature name
#'  }
#'  \item Mut_rates; list of two elements. The 1st element is a dataframe of s4U induced mutation rate estimates, where the mut column
#'  represents the experimental ID and the rep column represents the replicate ID. The 2nd element is the single background mutation
#'  rate estimate used
#'  \item Hyper_Parameters; vector of two elements, named a and b. These are the hyperparameters estimated from the uncertainties for each
#'  feature, and represent the two parameters of a Scaled Inverse Chi-Square distribution. Importantly, a is the number of additional
#'  degrees of freedom provided by the sharing of uncertainty information across the dataset, to be used in the moderated t-test.
#'  \item Mean_Variance_lms; list of intercept and slope estimates from the uncertainty vs. read count regression, one per Exp_ID
#'  \item Mean_Variance_data; dataframe of binned read counts and average uncertainties used to fit the mean-variance trend
#' }
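#' @examples
#' \dontrun{
#' # Minimal sketch with simulated toy inputs; all values below are illustrative
#' # only, and the toy object names are hypothetical. In a typical workflow this
#' # function is called internally (e.g., by fast_analysis) rather than directly.
#' set.seed(42)
#' ngene <- 50
#' nreps <- c(2, 2)
#' tl <- 1
#' toy <- expand.grid(fnum = 1:ngene, mut = 1:2, reps = 1:2)
#' toy$nreads <- stats::rpois(nrow(toy), 500)
#' kdeg <- rep(stats::rlnorm(ngene, meanlog = -1, sdlog = 0.5), times = 4)
#' fn <- 1 - exp(-kdeg * tl)
#' toy$logit_fn_rep <- log(fn/(1 - fn)) + stats::rnorm(nrow(toy), 0, 0.1)
#' toy$kd_rep_est <- -log(1 - stats::plogis(toy$logit_fn_rep))/tl
#' toy$log_kd_rep_est <- log(toy$kd_rep_est)
#' toy$logit_fn_se <- 1/sqrt(toy$nreads)
#' toy$log_kd_se <- 1/sqrt(toy$nreads)
#' sample_lookup <- expand.grid(mut = 1:2, reps = 1:2)
#' sample_lookup$sample <- paste0("sample", 1:4)
#' feature_lookup <- data.frame(fnum = 1:ngene, XF = paste0("gene", 1:ngene))
#' fit <- avg_and_regularize(toy, nreps, sample_lookup, feature_lookup)
#' head(fit$Effects_df)
#' }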
#' @importFrom magrittr %>%
avg_and_regularize <- function(Mut_data_est, nreps, sample_lookup, feature_lookup,
                               nbin = NULL, NSS = FALSE, Chase = FALSE,
                               BDA_model = FALSE, null_cutoff = 0,
                               Mutrates = NULL, ztest = FALSE){


  ### Check Mut_data_est
  expected_cols <- c("nreads", "fnum", "reps",
                              "mut", "logit_fn_rep", "kd_rep_est", "log_kd_rep_est", 
                              "logit_fn_se", "log_kd_se")
  
  if(!is.data.frame(Mut_data_est)){
    stop("Mut_data_est must be a data frame!")
  }
  
  if(!all(expected_cols %in% colnames(Mut_data_est))){
    stop("Mut_data_est is lacking some necessary columns. See documentation (?avg_and_regularize())
         for list of necessary columns")
  }
  
  ### Check nreps
  if(!is.numeric(nreps)){
    stop("nreps must be a numeric vector of the number of replicates in each Exp_ID!")
  }

  ### Check sample_lookup
  if(!is.data.frame(sample_lookup)){
    stop("sample_lookup must be a data frame!")
  }
  
  expected_cols <- c("sample", "mut", "reps")
  if(!all(expected_cols %in% colnames(sample_lookup))){
    stop("sample_lookup is lacking necessary columns. It should contain three columns: 1) sample, the name of the sample; 2) mut, the Exp_ID; and 3) reps, the numerical replicate ID")
  }
  
  ### Check feature_lookup
  if(!is.data.frame(feature_lookup)){
    stop("feature_lookup must be a data frame!")
  }
  
  expected_cols <- c("XF", "fnum")
  if(!all(expected_cols %in% colnames(feature_lookup))){
    stop("feature_lookup is lacking necessary columns. It should contain at least two columns: 1) XF, the name of each feature analyzed, and 2) fnum, the numerical ID of the corresponding feature")
  }
  
  ## Check nbin
  if(!is.null(nbin)){
    if(!is.numeric(nbin)){
      stop("nbin must be numeric")
    }else if(!is.integer(nbin)){
      nbin <- as.integer(nbin)
    }
    
    if(nbin <= 0){
      stop("nbin must be > 0 and is preferably greater than or equal to 10")
    }
  }
  
  ## Check NSS
  if(!is.logical(NSS)){
    stop("NSS must be logical (TRUE or FALSE)")
  }
  
  ### Check Chase
  if(!is.logical(Chase)){
    stop("Chase must be logical (TRUE or FALSE)")
  }
  
  ### Check BDA_model
  if(!is.logical(BDA_model)){
    stop("BDA_model must be logical (TRUE or FALSE)")
  }
  
  ### Check ztest
  if(!is.logical(ztest)){
    stop("ztest must be logical (TRUE or FALSE)")
  }
  
  ## Check null_cutoff
  if(!is.numeric(null_cutoff)){
    stop("null_cutoff must be numeric")
  }else if(null_cutoff < 0){
    stop("null_cutoff must be 0 or positive")
  }else if(null_cutoff > 2){
    warning("You are testing against a null hypothesis |L2FC(kdeg)| greater than 2; this might be too conservative")
  }
  
  
  # Bind variables locally to resolve devtools::check() Notes
  fnum <- mut <- logit_fn_rep <- bin_ID <- kd_sd_log <- intercept <- NULL
  slope <- log_kd_rep_est <- avg_log_kd <- sd_log_kd <- sd_post <- NULL
  sdp <- theta_o <- log_kd_post <- effect_size <- effect_size_std_error <- NULL
  effect_std_error <- NULL
  
  # Helper functions that I will use on multiple occasions
  logit <- function(x) log(x/(1-x))
  inv_logit <- function(x) exp(x)/(1+exp(x))
  
  
  # ESTIMATE VARIANCE VS. READ COUNT TREND -------------------------------------
  
  ngene <- max(Mut_data_est$fnum)
  nMT <- max(Mut_data_est$mut)
  
  ## Now affiliate each fnum, mut with a bin Id based on read counts,
  ## bin data by bin_ID and average log10(reads) and log(sd(logit_fn))
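  ## If nbin is not supplied, it defaults to roughly one bin per 100
  ## logit(fn)/log(kdeg) estimates (ngene * total replicates / 100),
  ## floored at 10 bins and capped at 3000 bins.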
  
  if(is.null(nbin)){
    nbin <- max(c(round(ngene*sum(nreps)/100), 10))
    nbin <- min(c(nbin, 3000))
  }
  
  message("Estimating read count-variance relationship")
  
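  ## The NSS and non-NSS branches below perform the same binning and trend
  ## estimation; they differ only in which replicate-level estimates are used:
  ## logit(fn) and logit_fn_se when NSS is TRUE, log(kdeg) and log_kd_se otherwise.
  ## In each branch, features are binned by read count, the average log(uncertainty)
  ## is regressed on the average log10(read count) per bin, and positive slopes
  ## (uncertainty increasing with coverage) are treated as spurious and replaced
  ## with a flat (intercept-only) fit.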
  
  if(NSS){
    Binned_data <- Mut_data_est %>% dplyr::group_by(fnum, mut) %>%
      dplyr::summarise(nreads = sum(nreads), kd_sd_log = log(sqrt(1/sum(1/((stats::sd(logit_fn_rep)^2) +logit_fn_se^2 ) ) ) )) %>%
      dplyr::ungroup() %>%
      dplyr::mutate(bin_ID = as.numeric(Hmisc::cut2(nreads, g = nbin))) %>% dplyr::group_by(bin_ID, mut) %>%
      dplyr::summarise(avg_reads = mean(log10(nreads)), avg_sd = mean(kd_sd_log))
    
    ## Estimate read count vs. replicate variability trend
    
    lm_list <- vector("list", length = nMT)
    lm_var <- lm_list
    
    # One linear model for each experimental condition
    for(i in 1:nMT){
      heterosked_lm <- stats::lm(avg_sd ~ avg_reads, data = Binned_data[Binned_data$mut == i,] )
      h_int <- summary(heterosked_lm)$coefficients[1,1]
      h_slope <- summary(heterosked_lm)$coefficients[2,1]
      
      if(h_slope > 0){
        h_slope <- 0
        h_int <- mean(Binned_data$avg_sd[Binned_data$mut == i])
      }
      
      lm_list[[i]] <- c(h_int, h_slope)
      
      lm_var[[i]] <- stats::var(stats::residuals(heterosked_lm))
    }
    
    # Put linear model fit extrapolation into convenient data frame
    true_vars <-  Mut_data_est %>% dplyr::group_by(fnum, mut) %>%
      dplyr::summarise(nreads = sum(nreads), kd_sd_log = log(sqrt(1/sum(1/((stats::sd(logit_fn_rep)^2) + logit_fn_se^2 ) ) ) )) %>%
      dplyr::ungroup() %>% dplyr::group_by(fnum, mut) %>% dplyr::mutate(slope = lm_list[[mut]][2], intercept = lm_list[[mut]][1]) %>%
      dplyr::group_by(mut) %>%
      dplyr::summarise(true_var = stats::var(kd_sd_log - (intercept + slope*log10(nreads) ) ),
                       true_nat_var = stats::var(exp(kd_sd_log)^2 - exp((intercept + slope*log10(nreads) ))^2 ))
    
    log_kd <- as.vector(Mut_data_est$log_kd_rep_est)
    logit_fn <- as.vector(Mut_data_est$logit_fn_rep)
    
  }else{
    Binned_data <- Mut_data_est %>% dplyr::group_by(fnum, mut) %>%
      dplyr::summarise(nreads = sum(nreads), kd_sd_log = log(sqrt(1/sum(1/((stats::sd(log_kd_rep_est)^2) + log_kd_se^2 ) ) ) )) %>%
      dplyr::ungroup() %>%
      dplyr::mutate(bin_ID = as.numeric(Hmisc::cut2(nreads, g = nbin))) %>% dplyr::group_by(bin_ID, mut) %>%
      dplyr::summarise(avg_reads = mean(log10(nreads)), avg_sd = mean(kd_sd_log))
    
    ## Estimate read count vs. replicate variability trend
    
    lm_list <- vector("list", length = nMT)
    lm_var <- lm_list
    
    # One linear model for each experimental condition
    for(i in 1:nMT){
      heterosked_lm <- stats::lm(avg_sd ~ avg_reads, data = Binned_data[Binned_data$mut == i,] )
      h_int <- summary(heterosked_lm)$coefficients[1,1]
      h_slope <- summary(heterosked_lm)$coefficients[2,1]
      
      if(h_slope > 0){
        h_slope <- 0
        h_int <- mean(Binned_data$avg_sd[Binned_data$mut == i])
      }
      
      lm_list[[i]] <- c(h_int, h_slope)
      
      lm_var[[i]] <- stats::var(stats::residuals(heterosked_lm))
    }
    
    # Put linear model fit extrapolation into convenient data frame
    true_vars <-  Mut_data_est %>% dplyr::group_by(fnum, mut) %>%
      dplyr::summarise(nreads = sum(nreads), kd_sd_log = log(sqrt(1/sum(1/((stats::sd(log_kd_rep_est)^2) + log_kd_se^2 ) ) ) )) %>%
      dplyr::ungroup() %>% dplyr::group_by(fnum, mut) %>% dplyr::mutate(slope = lm_list[[mut]][2], intercept = lm_list[[mut]][1]) %>%
      dplyr::group_by(mut) %>%
      dplyr::summarise(true_var = stats::var(kd_sd_log - (intercept + slope*log10(nreads) ) ),
                       true_nat_var = stats::var(exp(kd_sd_log)^2 - exp((intercept + slope*log10(nreads) ))^2 ))
    
    log_kd <- as.vector(Mut_data_est$log_kd_rep_est)
    logit_fn <- as.vector(Mut_data_est$logit_fn_rep)
    
  }
  
  
  
  # PREP DATA FOR REGULARIZATION -----------------------------------------------
  
  # Vectors of use
  kd_estimate <- as.vector(Mut_data_est$kd_rep_est)
  log_kd_se <- as.vector(Mut_data_est$log_kd_se)
  logit_fn_se <- as.vector(Mut_data_est$logit_fn_se)
  Replicate <- as.vector(Mut_data_est$reps)
  Condition <- as.vector(Mut_data_est$mut)
  Gene_ID <- as.vector(Mut_data_est$fnum)
  nreads <- as.vector(Mut_data_est$nreads)
  
  rm(Mut_data_est)
  
  # Create new data frame with kinetic parameter estimates and uncertainties
  df_fn <- data.frame(logit_fn, logit_fn_se, Replicate, Condition, Gene_ID, nreads, log_kd, kd_estimate, log_kd_se)
  
  
  # Remove vectors no longer of use
  rm(logit_fn)
  rm(logit_fn_se)
  rm(Replicate)
  rm(Condition)
  rm(Gene_ID)
  rm(nreads)
  
  df_fn <- df_fn[order(df_fn$Gene_ID, df_fn$Condition, df_fn$Replicate),]
  
  
  # Relabel logit(fn) as log(kdeg) for steady-state independent analysis
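  ## Under NSS, the logit(fn) estimates are carried through the downstream
  ## log(kdeg) machinery under the log(kdeg) column names, and kd_estimate
  ## becomes the fraction new (inv_logit of logit(fn)); the columns are renamed
  ## back accordingly when the output is assembled.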
  if(NSS){
    colnames(df_fn) <- c("log_kd", "log_kd_se", "Replicate", "Condition", "Gene_ID", "nreads", "logit_fn", "kd_estimate", "logit_fn_se")
    
    df_fn$kd_estimate <- inv_logit(df_fn$log_kd)
  }
  
  
  
  # AVERAGE REPLICATE DATA AND REGULARIZE WITH INFORMATIVE PRIORS --------------
  
  message("Averaging replicate data and regularizing estimates")
  
  #Average over replicates and estimate hyperparameters
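  # For each feature and condition, the replicate log(kdeg) estimates are averaged,
  # weighting each replicate by the inverse of its uncertainty, and the replicate
  # variability is pooled with the per-replicate uncertainties. The per-condition
  # mean (theta_o) and standard deviation (sdp) of these averages serve as the
  # prior for regularizing the feature-specific averages below.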
  avg_df_fn_bayes <- df_fn %>% dplyr::group_by(Gene_ID, Condition) %>%
    dplyr::summarize(avg_log_kd = stats::weighted.mean(log_kd, 1/log_kd_se),
                     sd_log_kd = sqrt(1/sum(1/((stats::sd(log_kd)^2) + log_kd_se^2 ) ) ),
                     nreads = sum(nreads)) %>% dplyr::ungroup() %>%
    dplyr::group_by(Condition) %>%
    dplyr::mutate(sdp = stats::sd(avg_log_kd)) %>%
    dplyr::mutate(theta_o = mean(avg_log_kd)) %>%
    dplyr::ungroup()
  

  
  # Calculate population averages
  # sdp <- sd(avg_df_fn$avg_logit_fn) # Will be prior sd in regularization of mean
  # theta_o <- mean(avg_df_fn$avg_logit_fn) # Will be prior mean in regularization of mean
  
  var_pop <- mean(avg_df_fn_bayes$sd_log_kd^2) # Will be prior mean in regularization of sd
  var_of_var <- mean(true_vars$true_nat_var) # Will be prior variance in regularization of sd
  
  
  ## Regularize standard deviation estimate
  
  # Estimate hyperpriors with method of moments
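  # For a Scaled-Inverse-Chi-Square(nu, tau^2) prior on the variance, the mean is
  # m = nu*tau^2/(nu - 2) and the variance is v = 2*m^2/(nu - 4). Matching these
  # moments to the population average variance (var_pop) and the variance of the
  # trend residuals (var_of_var) gives nu = 2*m^2/v + 4 and tau^2 = m*(nu - 2)/nu,
  # i.e., a_hyper and b_hyper below.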
  
  a_hyper <- 2*(var_pop^2)/var_of_var + 4 # prior degrees of freedom
  
  b_hyper <- (var_pop*(a_hyper - 2))/a_hyper # prior scale
  
  
  if(BDA_model){ # Not yet working well; inverse chi-squared model from BDA3
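    # Shrink each feature's replicate SD toward the read-count trend prediction,
    # weighting the trend by the prior degrees of freedom (a_hyper), in the spirit
    # of the scaled inverse chi-squared variance model from BDA3; then apply the
    # same conjugate normal update to the mean as in the default branch below.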
    avg_df_fn_bayes <- avg_df_fn_bayes %>% dplyr::group_by(Gene_ID, Condition) %>%
      dplyr::mutate(sd_post = (sd_log_kd*nreps[Condition] + a_hyper*exp(lm_list[[Condition]][1] + lm_list[[Condition]][2]*log10(nreads)))/(a_hyper + nreps[Condition] - 2) ) %>%
      dplyr::mutate(log_kd_post = (avg_log_kd*(nreps[Condition]*(1/(sd_post^2))))/(nreps[Condition]/(sd_post^2) + (1/sdp^2)) + (theta_o*(1/sdp^2))/(nreps[Condition]/(sd_post^2) + (1/sdp^2))) %>%
      dplyr::mutate(kdeg = exp(log_kd_post) ) %>%
      dplyr::mutate(kdeg_sd = sd_post*exp(log_kd_post) ) %>%
      dplyr::ungroup() %>%
      dplyr::group_by(Gene_ID) %>%
      dplyr::mutate(effect_size = log_kd_post - log_kd_post[Condition == 1]) %>%
      dplyr::mutate(effect_std_error = ifelse(Condition == 1, sd_post/sqrt(nreps[1]), sqrt((sd_post[Condition == 1]^2)/nreps[1] + (sd_post^2)/nreps[Condition] ) )) %>%
      dplyr::mutate(L2FC_kdeg = effect_size*log2(exp(1))) %>%
      dplyr::mutate(pval = pmin(1, 2*stats::pnorm((abs(effect_size) - null_cutoff)/effect_std_error, lower.tail = FALSE))) %>%
      dplyr::ungroup()
    
  }else{
    # Regularize estimates with Bayesian models and empirically informed priors
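    # sd_post: precision-weighted combination, on the log scale, of the observed
    # pooled SD (weight: number of replicates over the feature-level residual
    # variance around the trend) and the trend-predicted SD at this feature's
    # read count (weight: inverse residual variance of the binned trend fit).
    # log_kd_post: conjugate normal posterior mean, shrinking the replicate-averaged
    # log(kdeg) toward the condition-wide prior mean (theta_o) in proportion to
    # the relative precisions nreps/sd_post^2 and 1/sdp^2.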
    avg_df_fn_bayes <- avg_df_fn_bayes %>% dplyr::group_by(Gene_ID, Condition) %>%
      dplyr::mutate(sd_post = exp( (log(sd_log_kd)*nreps[Condition]*(1/true_vars$true_var[Condition]) + (1/lm_var[[Condition]])*(lm_list[[Condition]][1] + lm_list[[Condition]][2]*log10(nreads)))/(nreps[Condition]*(1/true_vars$true_var[Condition]) + (1/lm_var[[Condition]])) )) %>%
      dplyr::mutate(log_kd_post = (avg_log_kd*(nreps[Condition]*(1/(sd_post^2))))/(nreps[Condition]/(sd_post^2) + (1/sdp^2)) + (theta_o*(1/sdp^2))/(nreps[Condition]/(sd_post^2) + (1/sdp^2))) %>%
      dplyr::mutate(kdeg = exp(log_kd_post) ) %>%
      dplyr::mutate(kdeg_sd = sd_post*exp(log_kd_post) ) %>%
      dplyr::ungroup() %>%
      dplyr::group_by(Gene_ID) %>%
      dplyr::mutate(effect_size = log_kd_post - log_kd_post[Condition == 1]) %>%
      dplyr::mutate(effect_std_error = ifelse(Condition == 1, sd_post, sqrt(sd_post[Condition == 1]^2 + sd_post^2))) %>%
      dplyr::mutate(L2FC_kdeg = effect_size*log2(exp(1)))
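    # Hypothesis testing: under the default moderated t-test, the degrees of
    # freedom combine the replicate degrees of freedom in each condition with
    # the prior degrees of freedom (a_hyper) gained by sharing variance
    # information across the dataset; setting ztest = TRUE uses a z-test instead.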
    
    if(ztest){
      avg_df_fn_bayes <- avg_df_fn_bayes %>%
        dplyr::mutate(pval = pmin(1, 2*stats::pnorm((abs(effect_size) - null_cutoff)/effect_std_error, lower.tail = FALSE))) %>%
        dplyr::ungroup()
      
    }else{
      
      avg_df_fn_bayes <- avg_df_fn_bayes %>%
        dplyr::mutate(pval = pmin(1, 2*stats::pt((abs(effect_size) - null_cutoff)/effect_std_error, df = 2*(nreps[Condition]-1) + 2*a_hyper, lower.tail = FALSE))) %>%
        dplyr::ungroup()
        
      
    }
    
    
  }
  
  
  
  # STATISTICAL TESTING --------------------------------------------------------
  
  
  message("Assessing statistical significance")
  
  ## Populate various data frames with important fit information
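  # Exp_ID 1 is the reference condition, so effect sizes, standard errors, and
  # p-values are reported only for comparisons of Exp_ID > 1 to the reference.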
  
  effects <- avg_df_fn_bayes$effect_size[avg_df_fn_bayes$Condition > 1]
  ses <- avg_df_fn_bayes$effect_std_error[avg_df_fn_bayes$Condition > 1]
  
  pval <- avg_df_fn_bayes$pval[avg_df_fn_bayes$Condition > 1]
  padj <- stats::p.adjust(pval, method = "BH")
  
  
  # ORGANIZE OUTPUT ------------------------------------------------------------
  
  Genes_effects <- avg_df_fn_bayes$Gene_ID[avg_df_fn_bayes$Condition > 1]
  Condition_effects <- avg_df_fn_bayes$Condition[avg_df_fn_bayes$Condition > 1]
  
  L2FC_kdegs <- avg_df_fn_bayes$L2FC_kdeg[avg_df_fn_bayes$Condition > 1]
  
  Effect_sizes_df <- data.frame(Genes_effects, Condition_effects, L2FC_kdegs, effects, ses, pval, padj)
  
  colnames(Effect_sizes_df) <- c("Feature_ID", "Exp_ID", "L2FC_kdeg", "effect", "se", "pval", "padj")
  
  # Add sample information to output
  df_fn <- merge(df_fn, sample_lookup, by.x = c("Condition", "Replicate"), by.y = c("mut", "reps"))
  
  # Add feature name information to output
  df_fn <- merge(df_fn, feature_lookup, by.x = "Gene_ID", by.y = "fnum")
  avg_df_fn_bayes <- merge(avg_df_fn_bayes, feature_lookup, by.x = "Gene_ID", by.y = "fnum")
  Effect_sizes_df <- merge(Effect_sizes_df, feature_lookup, by.x = "Feature_ID", by.y = "fnum")
  
  # Order output
  df_fn <- df_fn[order(df_fn$Gene_ID, df_fn$Condition, df_fn$Replicate),]
  avg_df_fn_bayes <- avg_df_fn_bayes[order(avg_df_fn_bayes$Gene_ID, avg_df_fn_bayes$Condition),]
  Effect_sizes_df <- Effect_sizes_df[order(Effect_sizes_df$Feature_ID, Effect_sizes_df$Exp_ID),]
  
  
  avg_df_fn_bayes <- avg_df_fn_bayes[,c("Gene_ID", "Condition", "avg_log_kd", "sd_log_kd", "nreads", "sdp", "theta_o", "sd_post",
                                        "log_kd_post", "kdeg", "kdeg_sd", "XF")]
  
  colnames(avg_df_fn_bayes) <- c("Feature_ID", "Exp_ID", "avg_log_kdeg", "sd_log_kdeg", "nreads", "sdp", "theta_o", "sd_post",
                                 "log_kdeg_post", "kdeg", "kdeg_sd", "XF")
  
  if(NSS){
    colnames(df_fn) <- c("Feature_ID", "Exp_ID", "Replicate", "logit_fn", "logit_fn_se", "nreads", "log_kdeg", "fn", "log_kd_se", "sample", "XF")
  }else{
    colnames(df_fn) <- c("Feature_ID", "Exp_ID", "Replicate", "logit_fn", "logit_fn_se", "nreads", "log_kdeg", "kdeg", "log_kd_se", "sample", "XF")
  }
  
  # Convert to tibbles because I like tibbles better
  fast_list <- list(Fn_Estimates = dplyr::as_tibble(df_fn),
                    Regularized_ests = dplyr::as_tibble(avg_df_fn_bayes),
                    Effects_df = dplyr::as_tibble(Effect_sizes_df),
                    Hyper_Parameters = c(a = a_hyper, b = b_hyper),
                    Mean_Variance_lms = lm_list,
                    Mean_Variance_data = dplyr::as_tibble(Binned_data))
  
  # Include mutation rate estimates in the output if they were provided
  if(!is.null(Mutrates)){
    fast_list <- append(fast_list, list(Mut_rates = Mutrates), after = 3)
  }
  
  class(fast_list) <- "FastFit"
  
  message("All done! Run QC_checks() on your bakRFit object to assess the quality of your data and get recommendations for next steps.")
  
  return(fast_list)
  
}
