```r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.dim = c(6, 4)
)
```
Load bakR! Version 1.0.0 or later is required to run all of the code in this vignette and reproduce the results presented. ggplot2 and corrplot (the latter of which is not automatically installed with bakR) will also be used for some figures shown here.
```r
library(bakR)
library(ggplot2)
library(corrplot)

set.seed(123)
```
Version 0.4.0 of bakR introduced a function that takes as input a bakRFit object and assesses a number of aspects of the model fit and raw data:
```r
# Simulate a nucleotide recoding dataset
sim_data <- Simulate_relative_bakRData(1000, depth = 1000000, nreps = 2)
# This will simulate 1000 features, 1,000,000 reads, 2 experimental conditions,
# and 2 replicates for each experimental condition
# See ?Simulate_relative_bakRData for details regarding tunable parameters

# Run the efficient model
Fit <- bakRFit(sim_data$bakRData)

# Quality control checks
QC <- QC_checks(Fit)
```
QC_checks assesses four aspects of your data and bakR's model fit:

1. The raw mutation rates in your +s^4^U and -s^4^U samples
2. The inferred mutation rates of new and old reads
3. The distribution of fraction new estimates
4. The correlation of fraction new estimates between replicates of the same condition
The first thing that QC_checks() assesses is the average mutation rate in all sequencing reads from all analyzed samples. High quality data is characterized by mutation rates that are higher in all +s^4^U samples than in any -s^4^U samples. In addition, to a first-order approximation, the higher the mutation rates in +s^4^U samples and the lower the mutation rates in -s^4^U samples, the better. One output of QC_checks is a convenient plot of raw mutation rates so you can see what these look like for your data:
```r
# Barplots of raw mutation rates
QC$raw_mutrates
```
In this simulated data, you can see that the +s^4^U raw mutation rates are all consistently much higher than the -s^4^U mutation rates. The barplot also provides a quasi-arbitrary guide for what good +s^4^U and -s^4^U mutation rates are.
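For intuition, the raw mutation rate is just the total number of T-to-C mutations divided by the total number of Ts sequenced. The sketch below computes it by hand from a toy count table in the style of bakR's cB data frame; the column names follow that convention, but the values are made up:

```r
# Toy count table (illustrative values only):
# each row is a set of identical reads; TC = T-to-C mutations per read,
# nT = number of Ts per read, n = number of such reads
cB <- data.frame(
  TC = c(0, 1, 3, 0, 2),
  nT = c(40, 38, 42, 35, 41),
  n  = c(100, 20, 5, 80, 10)
)

# Raw mutation rate: total mutations over total Ts, weighting by read count
raw_mutrate <- sum(cB$TC * cB$n) / sum(cB$nT * cB$n)
raw_mutrate  # ~0.0067
```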
The second thing that QC_checks() assesses is the mutation rate of new and old reads, as inferred by bakR. The difference between the inferred and raw mutation rates can be a cause of confusion for those new to NR-seq data. The raw mutation rates are the average proportion of Ts that are mutated in all sequencing reads for a given sample. This means that the raw mutation rates average over mutation rates from both new reads (i.e., reads derived from RNA synthesized during metabolic labeling) and old reads (i.e., reads derived from RNA synthesized prior to metabolic labeling). The inferred mutation rates are the estimated mutation rates of old and new reads separately. Higher inferred mutation rates yield higher raw mutation rates, but they are distinct quantities. In particular, the raw mutation rate depends not only on the new and old read mutation rates, but on what fraction of sequencing reads are new and what fraction are old.
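That dependence can be made concrete: the raw mutation rate is a fraction-new-weighted average of the new-read (pnew) and old-read (pold) mutation rates. A minimal sketch, with illustrative values that are not bakR defaults:

```r
# Raw rate = fn * pnew + (1 - fn) * pold (in expectation)
pnew <- 0.05   # mutation rate in new reads (illustrative)
pold <- 0.002  # mutation rate in old reads (illustrative)
fn   <- 0.4    # fraction of reads that are new (illustrative)

raw_rate <- fn * pnew + (1 - fn) * pold
raw_rate  # 0.0212: a higher fn pulls the raw rate toward pnew
```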
As with the raw mutation rates, higher new read mutation rates and lower old read mutation rates are better. QC_checks() outputs a helpful visualization of these mutation rates:
```r
# Barplots of inferred mutation rates
QC$conversion_rates
```
Once again, this simulated data looks good!
The third thing that QC_checks() assesses is the average estimated fraction new. The ability to accurately infer the degradation kinetics of a particular RNA feature is highly dependent on the relationship between the length of metabolic labeling and the half-life of the feature. More specifically, some really nice work from the Dieterich lab has shown that degradation rate constant estimates are most accurate for features with half-lives close to the metabolic labeling time. If the label time is much longer (shorter) than a feature's half-life, then the feature will be almost completely labeled (unlabeled). The relationship between fraction new and degradation rate constant makes it so that at extreme fraction news (close to 0 or 1), the rate constant estimate is very sensitive to the exact fraction new estimate. This means that fraction news close to 0 or 1 yield more uncertain kinetic parameter estimates.
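That sensitivity is easy to see from the standard one-compartment relationship kdeg = -log(1 - fn)/tl, where tl is the label time. In the sketch below (the label time is an arbitrary illustrative value), the same small shift in fn moves kdeg far more near fn = 1 than near fn = 0.5:

```r
# Degradation rate constant as a function of fraction new
tl <- 2  # hours of metabolic labeling (illustrative)
kdeg <- function(fn) -log(1 - fn) / tl

# Near fn = 0.5, a 0.02 shift in fn changes kdeg modestly...
kdeg(0.52) - kdeg(0.50)
# ...but near fn = 1, the same 0.02 shift changes kdeg drastically
kdeg(0.99) - kdeg(0.97)
```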
QC_checks() will output a message about the average estimated fraction new in each +s^4^U sample. You can also visualize the distribution of fraction new estimates as follows:
```r
# Density plot of logit(fraction new) estimates in each sample
ggplot(Fit$Fast_Fit$Fn_Estimates, aes(x = logit_fn, color = as.factor(sample))) +
  geom_density() +
  theme_classic() +
  scale_color_viridis_d() +
  xlab("logit(fn) estimates") +
  ylab("density")
```
Note, the density plot above shows fraction news on a logit scale. The logit function takes as input numbers bounded between 0 and 1 and outputs an unbounded number. A logit(fraction new) of 0 corresponds to a fraction new of 0.5, and a logit(fraction new) of -2 (2) corresponds to a fraction new of ~0.1 (~0.9). Similar to log-transforming RNA-seq read count data, logit-transforming fraction news can be useful when visualizing data that spans multiple orders of magnitude.
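If you want to move between the two scales yourself, base R already provides the logit and inverse-logit as qlogis() and plogis():

```r
# logit and inverse-logit via base R's logistic distribution functions
qlogis(0.5)   # logit(0.5) = 0
plogis(-2)    # inverse-logit(-2) ~ 0.12
plogis(2)     # inverse-logit(2)  ~ 0.88
```

The logit transform spreads out fraction new values that would otherwise pile up near 0 and 1.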
Finally, QC_checks() assesses the extent to which fraction new estimates correlate between replicates of the same experimental condition. Better correlation means less variable data, which makes it easier to identify differences in kinetic parameters between experimental conditions. QC_checks() will output a message regarding each replicate-to-replicate correlation, and also provides a number of correlation plots that you can inspect:
```r
# Replicate-to-replicate correlation plot
# Numerical indices are ordered as they appear in the QC_checks() output message
# So this is for replicates 1 and 2 of experimental ID 1
QC$correlation_plots[[1]]
```
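Under the hood, this check boils down to correlating logit(fn) estimates for the same features across replicates. A self-contained sketch of that idea, using simulated values rather than bakR output:

```r
# Simulate per-feature logit(fn) values shared by two replicates,
# each observed with independent replicate-level noise
set.seed(42)
true_logit_fn <- rnorm(500, mean = 0, sd = 2)
rep1 <- true_logit_fn + rnorm(500, sd = 0.3)
rep2 <- true_logit_fn + rnorm(500, sd = 0.3)

# High correlation indicates reproducible fraction new estimates
cor(rep1, rep2)
```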
In addition, QC_checks() outputs a correlation matrix to allow for convenient visualization of all sample-to-sample correlations:
```r
# Using a function from the corrplot package
corrplot.mixed(QC$correlation_matrix,
               upper = "square", lower = "number",
               addgrid.col = "black", tl.col = "black")
```
QC_checks() is designed to identify and flag challenges that bakR could run into when analyzing your data. In my experience, the single most important thing to check is the robustness of the new and old read mutation rate estimates (pnew and pold).
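At its core, estimating pnew and pold is a mixture-modeling problem: each read's mutation count is drawn from either a "new" or an "old" binomial distribution. The sketch below fits such a two-component binomial mixture by maximum likelihood with optim(); it uses simulated toy data and is not bakR's actual implementation:

```r
# Simulate mutation counts from a two-component binomial mixture
set.seed(1)
nT <- 40                          # Ts per read (fixed for simplicity)
new <- rbinom(2000, 1, 0.3)       # 30% of reads are new
muts <- rbinom(2000, nT, ifelse(new == 1, 0.05, 0.002))

# Negative log-likelihood; parameters kept on unconstrained (logit) scales
nll <- function(par) {
  pnew <- plogis(par[1]); pold <- plogis(par[2]); fn <- plogis(par[3])
  -sum(log(fn * dbinom(muts, nT, pnew) + (1 - fn) * dbinom(muts, nT, pold)))
}

# Maximize the likelihood; starting values are rough guesses
fit <- optim(c(qlogis(0.04), qlogis(0.01), 0), nll)
plogis(fit$par)  # estimates of (pnew, pold, fn); should be near (0.05, 0.002, 0.3)
```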
By default, bakR fits a simple binomial mixture with the method of maximum likelihood to estimate pnew. If you have -s^4^U data, then bakR uses the average mutation rate from that data as a global pold estimate. Otherwise, the default pold estimation strategy is the same as for pnew. This strategy can go astray if most of the reads in your data are either new or old. Intuitively, if there are very few new (old) reads, it is very difficult for a model to infer what the mutation rate in new (old) reads is. For example, here is an analysis of simulated data where over 98% of reads are old:
```r
# Seed for reproducibility
set.seed(321)

# Simulate a nucleotide recoding dataset
sim_data <- Simulate_bakRData(1000, nreps = 2, fn_mean = -4)
# This will simulate 1000 features, 2 experimental conditions,
# and 2 replicates for each experimental condition
# The average logit(fn) will be -4, which corresponds to an average fn of just under 0.02

# Run the efficient model
# Check the pnew estimates, which should all be around 0.05
Fit <- bakRFit(sim_data$bakRData)
```
While the pnew estimates in one of the experimental conditions are alright, the estimates for the other condition are both massive underestimates. Running QC_checks() provides a suggestion, though:
```r
# Run QC_checks and read the messages
QC <- QC_checks(Fit)
```
Note the message that says "I suggest rerunning bakRFit with FastRerun and StanRateEst = TRUE, particularly if some of the estimated mutation rates are oddly low (< 0.01) in a subset of samples". bakR has a second pnew and pold estimation strategy up its sleeve, accessible via the StanRateEst parameter. Setting this to TRUE causes bakR to use a fully Bayesian approach, with the probabilistic programming language Stan working on the back end to fit a mixture model. This approach will take a bit longer to run, but can provide more accurate pnew and pold estimates:
```r
# Rerun with Stan-based pnew estimation
# This will take a couple of minutes to run
Fit_s <- bakRFit(Fit, FastRerun = TRUE, StanRateEst = TRUE)
```
It's incredibly difficult to get perfect pnew estimates on this dataset, but this strategy certainly does a lot better! You can compare the resulting fraction new estimates in the troublesome samples:
```r
# Simulated ground truth
sim_truth <- sim_data$sim_list

# Features that made it past filtering
XFs <- unique(Fit$Fast_Fit$Effects_df$XF)

# Simulated logit(fraction news) from features making it past filtering
true_fn <- sim_truth$Fn_rep_sim$Logit_fn[sim_truth$Fn_rep_sim$Feature_ID %in% XFs &
                                           sim_truth$Fn_rep_sim$Exp_ID == 2]

# Estimated logit(fraction news) with default pnew estimation
est_fn <- Fit$Fast_Fit$Fn_Estimates$logit_fn[Fit$Fast_Fit$Fn_Estimates$Exp_ID == 2]

# Compare estimates to truth
plot(true_fn, est_fn,
     xlab = "True logit(fn)", ylab = "Estimated logit(fn)",
     main = "Default pnew estimation",
     xlim = c(-8, 6), ylim = c(-8, 6))
abline(0, 1, col = "red")

# Estimated logit(fraction news) with Stan-based pnew estimation
est_fn <- Fit_s$Fast_Fit$Fn_Estimates$logit_fn[Fit_s$Fast_Fit$Fn_Estimates$Exp_ID == 2]

# Compare estimates to truth
plot(true_fn, est_fn,
     xlab = "True logit(fn)", ylab = "Estimated logit(fn)",
     main = "Alternative pnew estimation",
     xlim = c(-8, 6), ylim = c(-8, 6))
abline(0, 1, col = "red")
```
Perhaps you have an alternative strategy for estimating pnew, or you are fairly confident as to what the mutation rate in new reads in your data is. In that case, your estimates can be passed directly to bakR, no questions asked. You can either pass a vector of pnews for each sample (ordered as they appear in the bakRFit pnew/pold estimate message, so ordered by Experimental ID and then Replicate ID), or just a single pnew if you think it is the same in all samples:
```r
# Rerun with a user-provided pnew
Fit_u <- bakRFit(Fit, FastRerun = TRUE, pnew = 0.05)
```
As expected, this improves estimate accuracy, though not much more than setting StanRateEst = TRUE did:
```r
# Estimated logit(fraction news) with user-provided pnew
est_fn <- Fit_u$Fast_Fit$Fn_Estimates$logit_fn[Fit_u$Fast_Fit$Fn_Estimates$Exp_ID == 2]

# Compare estimates to truth
plot(true_fn, est_fn,
     xlab = "True logit(fn)", ylab = "Estimated logit(fn)",
     main = "User provided pnew",
     xlim = c(-8, 6), ylim = c(-8, 6))
abline(0, 1, col = "red")
```
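If you instead want to supply a different pnew for each sample, the vector form might look like the following. The values and their number are hypothetical; in this simulated dataset there happen to be four +s^4^U samples (2 conditions x 2 replicates), ordered by Experimental ID and then Replicate ID:

```r
# Hypothetical per-sample pnew estimates,
# ordered by Experimental ID and then Replicate ID
pnews <- c(0.051,  # Exp ID 1, Rep 1
           0.049,  # Exp ID 1, Rep 2
           0.048,  # Exp ID 2, Rep 1
           0.052)  # Exp ID 2, Rep 2

# Pass the full vector so bakR uses one value per +s4U sample:
# Fit_v <- bakRFit(Fit, FastRerun = TRUE, pnew = pnews)
```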
Sometimes the challenges you face when analyzing NR-seq data with bakR can be overcome with alternative bakR settings. Sometimes, though, experimental optimization can make a much bigger difference. Here are some general suggestions to consider when designing your next NR-seq experiment:
- Check for s^4^U-dependent dropout (e.g., with bakR's CorrectDropout function) and see if there are still major expression differences