knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.dim = c(6, 4) )
bakR version 1.0.0 or later is necessary to run this vignette.
library(bakR) set.seed(123)
bakR has a new simulation strategy as of version 1.0.0 that allows for the accurate simulation of dropout:
# Simulate a nucleotide recoding dataset sim_data <- Simulate_relative_bakRData(1000, depth = 1000000, nreps = 2, p_do = 0.4) # This will simulate 500 features, 500,000 reads, 2 experimental conditions # and 2 replicates for each experimental condition. # 40% dropout is simulated. # See ?Simulate_relative_bakRData for details regarding tunable parameters # Run the efficient model Fit <- bakRFit(sim_data$bakRData)
Dropout will cause bakR to underestimate the true fraction new. In addition, low fraction new features will receive more read counts than they would have if dropout had not existed (and vice versa for high fraction new features). Both of these biases can be corrected for with a single line of code:
# Correct dropout-induced biases Fit_c <- CorrectDropout(Fit) # You can also overwite the existing bakRFit object. # I am creating a separate bakRFit object to make comparisons later in this vignette.
Dropout will lead to a correlation between the fraction new and the difference between +s^4^U and -s^4^U read counts. This trend and the model fit can be visualized as follows:
# Correct dropout-induced biases Vis_DO <- VisualizeDropout(Fit) # Visualize dropout for 1st replicate of reference condition Vis_DO$ExpID_1_Rep_1
We can assess the impact of dropout correction by comparing the known simulated truth to the fraction new estimates, before and after dropout correction.
Before correction:
# Extract simualted ground truths sim_truth <- sim_data$sim_list # Features that made it past filtering XFs <- unique(Fit$Fast_Fit$Effects_df$XF) # Simulated logit(fraction news) from features making it past filtering true_fn <- sim_truth$Fn_rep_sim$Logit_fn[sim_truth$Fn_rep_sim$Feature_ID %in% XFs] # Estimated logit(fraction news) est_fn <- Fit$Fast_Fit$Fn_Estimates$logit_fn # Compare estimate to truth plot(true_fn, est_fn, xlab = "True logit(fn)", ylab = "Estimated logit(fn)") abline(0, 1, col = "red")
After correction:
# Features that made it past filtering XFs <- unique(Fit_c$Fast_Fit$Effects_df$XF) # Simulated logit(fraction news) from features making it past filtering true_fn <- sim_truth$Fn_rep_sim$Logit_fn[sim_truth$Fn_rep_sim$Feature_ID %in% XFs] # Estimated logit(fraction news) est_fn <- Fit_c$Fast_Fit$Fn_Estimates$logit_fn # Compare estimate to truth plot(true_fn, est_fn, xlab = "True logit(fn)", ylab = "Estimated logit(fn)") abline(0, 1, col = "red")
As discussed in the Abstract for this vignette, dropout refers to the loss of metabolically labeled RNA, or reads from said RNA. Dropout correction in the context of bakR is specifically designed to address biases due to the loss of metabolically labeled RNA during library preparation.
bakR's strategy for correcting dropout involves formalizing a model for dropout, using that model to infer a parametric relationship between the biased fraction new estimate and the ratio of +s^4^U to -s^4^U read counts for a feature, and then fitting that model to data with nonlinear least squares. bakR's model of dropout makes several simplifying assumptions to ensure tractability of parameter estimation while still capturing important features of the process:
These assumptions lead to the following expressions for the expected number of sequencing reads coming from a feature (index i) in +s^4^U and -s^4^U NR-seq data:
$$
\begin{align}
\text{E[reads from feature i in +s}^{\text{4}}\text{U sample]} &= \frac{\text{number of molecules from feature i}}{\text{total number of molecules in +s}^{\text{4}}\text{U sample}}\text{R}{\text{+s}^{\text{4}}\text{U}} \
&= \frac{\text{n}{\text{i}}\theta_{\text{i}}(1 - \text{pdo}) + \text{n}_{\text{i}}(1 - \theta_{\text{i}}) }{\sum_{\text{j}=1}^{\text{NF}}\text{n}{\text{j}}\theta_{\text{j}}(1 - \text{pdo}) + \text{n}{\text{j}}(1 - \theta_{\text{j}}) }\text{R}{\text{+s}^{\text{4}}\text{U}} \
\
\
\text{E[reads from feature i in -s}^{\text{4}}\text{U sample]} &= \frac{\text{number of molecules from feature i}}{\text{total number of molecules in -s}^{\text{4}}\text{U sample}}\text{R}{\text{-s}^{\text{4}}\text{U}} \
&= \frac{\text{n}{\text{i}}}{\sum_{\text{j}=1}^{\text{NF}}\text{n}_{\text{j}}}\text{R}{\text{-s}^{\text{4}}\text{U}}
\end{align}
$$
where the parameters are defined as follows:
$$ \begin{align} \text{n}{\text{i}} &= \text{Number of molecules from feature i} \ \theta{\text{i}} &= \text{True fraction new for feature i} \ \text{pdo} &= \text{Probability that a labeled RNA molecule is preferentially lost during library prep} \ \text{NF} &= \text{Total number of RNA features that contribute sequencable molecules} \ \text{R} &= \text{Total number of sequencing reads.} \end{align} $$
Defining "dropout" to be the ratio of the RPM normalized expected read counts yields the following relationship between "dropout", the fraction new, and the rate of dropout:
$$ \begin{align} \text{Dropout} &= \frac{\text{E[reads from feature i in +s}^{\text{4}}\text{U sample]/}\text{R}{\text{+s}^{\text{4}}\text{U}}}{\text{E[reads from feature i in -s}^{\text{4}}\text{U sample]/}\text{R}{\text{-s}^{\text{4}}\text{U}}} \ \ &= \frac{\text{n}{\text{i}}\theta_{\text{i}}(1 - \text{pdo}) + \text{n}{\text{i}}(1 - \theta_{\text{i}}) }{\text{n}_{\text{i}}}\frac{\sum_{\text{j}=1}^{\text{NF}}\text{n}{\text{j}}}{ \sum{\text{j}=1}^{\text{NF}}\text{n}{\text{j}}\theta_{\text{j}}(1 - \text{pdo}) + \text{n}{\text{j}}(1 - \theta_{\text{j}}) } \ \ &= [\theta_{\text{i}}(1 - \text{pdo}) + (1 - \theta_{\text{i}})]\text{scale} \ &= [1 - \theta_{\text{i}}\text{pdo}]*\text{scale} \end{align} $$
where $\text{scale}$ is a constant scale factor that represents the factor difference in the number of sequencable molecules with and without dropout.
Currently, the relationship between the quantification of dropout and $\text{pdo}$ includes $\theta_{\text{i}}$, which is the true, unbiased fraction new. Unfortunately, the presence of dropout means that we do not have access to this quantify. Rather, we estimate a dropout biased fraction of sequencing reads that are new ($\theta_{\text{i}}^{\text{do}}$), which on average will represent an underestimation of $\theta_{\text{i}}$. Therefore, we need to relate the quantity we estimate ($\theta_{\text{i}}^{\text{do}}$) and the parameter we wish to estimate ($\text{pdo}$) to $\theta_{\text{i}}$. In the context of this model, such a relationship can be derived as follows:
$$ \begin{align} \theta_{\text{i}}^{\text{do}} &= \frac{\text{number of new sequencable molecules}}{\text{total number of sequencable molecules}} \ \theta_{\text{i}}^{\text{do}} &= \frac{\text{n}{\text{i}}\theta_{\text{i}}(1 - \text{pdo}) }{\text{n}{\text{i}}\theta_{\text{i}}(1 - \text{pdo}) + \text{n}{\text{i}}(1-\theta_{\text{i}})} \ \theta_{\text{i}}^{\text{do}} &= \frac{\theta_{\text{i}}(1 - \text{pdo}) }{\theta{\text{i}}(1 - \text{pdo}) + (1-\theta_{\text{i}})} \ \theta_{\text{i}}^{\text{do}}[1 - \theta_{\text{i}}\text{pdo}] &= \theta_{\text{i}}(1 - \text{pdo}) \ \theta_{\text{i}}^{\text{do}} &= \theta_{\text{i}}(1 - \text{pdo}) + \theta_{\text{i}}\theta_{\text{i}}^{\text{do}}\text{pdo} \ \frac{\theta_{\text{i}}^{\text{do}}}{(1 - \text{pdo}) + \theta_{\text{i}}^{\text{do}}\text{pdo}} &= \theta_{\text{i}} \end{align} $$
We can then use this relationship to substite $\theta_{\text{i}}$ for a function of $\theta_{\text{i}}^{\text{do}}$ and $\text{pdo}$ in our dropout quantification equation:
$$ \begin{align} \text{Dropout} &= [1 - \theta_{\text{i}}\text{pdo}]\text{scale}\ \text{Dropout} &= \text{scale} - \frac{\text{scale}\text{pdo}\theta_{\text{i}}^{\text{do}}}{(1 - \text{pdo}) + \theta_{\text{i}}^{\text{do}}*\text{pdo}} \end{align} $$
bakR's CorrectDropout
function fits this predicted relationship between the ratio of +s^4^U to -s^4^U reads ($\text{Dropout}$) and the uncorrected fraction new estimates ($\theta_{\text{i}}^{\text{do}}$) to infer $pdo$. Corrected fraction new estimates can then be inferred from the relationship between $\theta_{\text{i}}^{\text{do}}$, $\text{pdo}$, and $\theta_{\text{i}}$. Finally, bakR redistributes read counts according to what the estimated relative proportions of each feature would have been if no dropout had existed. The key insight is that:
$$ \begin{align} \frac{\text{E[number of reads without dropout]}}{\text{E[number of reads with dropout]}} &= \text{Dropout}\ &= \frac{\text{n}{\text{i}}\theta_{\text{i}}(1 - \text{pdo}) + \text{n}{\text{i}}(1 - \theta_{\text{i}}) }{\text{n}_{\text{i}}}\frac{\sum_{\text{j}=1}^{\text{NF}}\text{n}{\text{j}}}{ \sum{\text{j}=1}^{\text{NF}}\text{n}{\text{j}}\theta_{\text{j}}(1 - \text{pdo}) + \text{n}{\text{j}}(1 - \theta_{\text{j}}) } \ &= \frac{\theta_{\text{i}}(1 - \text{pdo}) + (1 - \theta_{\text{i}}) }{1}\frac{\text{N}}{\text{N}\theta_{\text{G}}(1-\text{pdo}) + \text{N}(1 - \theta_{\text{G}})}\ &= \frac{\theta_{\text{i}}(1 - \text{pdo}) + (1 - \theta_{\text{i}}) }{1}\frac{1}{\theta_{\text{G}}*(1 - \text{pdo}) + (1 - \theta_{\text{G}}) } \ \ &= \frac{\text{fraction of feature i molecules remaining after dropout}}{\text{fraction of total molecules remaining after dropout}} \ \end{align} $$
Getting from the 2nd line to the third line involved a bit of algebraic trickery (multiplying by 1) and defining the "global fraction new" ($\theta_{\text{G}}$), which is the fraction of all sequenced molecules that are new (where $\text{N}$ is the total number of sequenced molecules if none are lost due to dropout). $\theta_{\text{G}}$ can be calculated as a weighted average of dropout biased fraction news for each feature, weighted by the uncorrected number of reads each feature has, and then dropout correcting:
$$ \begin{align} \theta_{\text{G}}^{\text{do}} &= \frac{\sum_{\text{j=1}}^{\text{NF}}\theta_{\text{j}}^{\text{do}}\text{nreads}{\text{j}} }{\sum{\text{j=1}}^{\text{NF}}\text{nreads}{\text{j}}} \ \theta{\text{G}} &= \frac{\theta_{\text{G}}^{\text{do}}}{(1 - \text{pdo}) + \theta_{\text{G}}^{\text{do}}\text{pdo}} \end{align} $$
Thus, the dropout corrected read counts can be obtained as follows:
$$ \begin{align} \text{Corrected read count for feature i} &= \text{nreads}{\text{i}}\frac{\theta_{\text{i}}(1 - \text{pdo}) + (1 - \theta{\text{i}})}{\theta_{\text{G}}*(1 - \text{pdo}) + (1 - \theta_{\text{G}})} \end{align} $$
Florian Erhard's group was the first to implement a strategy for dropout correction in an NR-seq analysis tool (their R package grandR). The strategy used by grandR is discussed here. In short, the dropout rate ($\text{pdo}$) is assumed to be related to the Spearman correlation ($\rho$) between the rank of a feature's fraction new(referred to by them as the new-to-total ratio, or NTR) and the log of the dropout metric discussed above (ratio of +s^4^U to -s^4^U read counts) as follows:
$$ \begin{align} \text{pdo} &= \frac{\rho}{\rho + 1} \end{align} $$
We note that this definition is not model derived and sets an artificial upper bound on $\text{pdo}$ of 0.5. The estimated number of sequencing reads from new RNA is then multiplied by a factor ($\text{f}$) defined as follows:
$$ \text{f} = \frac{1}{1 - \text{pdo}} $$
and the corrected fraction new is estimated using the adjusted new RNA read counts. A bit of algebra shows that this relationship between the inferred $\text{pdo}$ and the corrected fraction new is identical to that derived from the model above:
$$ \begin{align} \theta_{\text{i}} &= \frac{\text{f}(\text{number of reads from new RNA})}{\text{f}(\text{number of reads from new RNA}) + (\text{number of reads from old RNA})} \ &= \frac{\text{f}\theta_{\text{i}}^{\text{do}}\text{nreads}{\text{i}}}{\text{f}\theta_{\text{i}}^{\text{do}}\text{nreads}{\text{i}} + (1 - \theta_{\text{i}}^{\text{do}})\text{nreads}{\text{i}}} \ &= \frac{\frac{\theta{\text{i}}^{\text{do}}}{(1 - \text{pdo})}}{\frac{\theta_{\text{i}}^{\text{do}}}{(1 - \text{pdo})} + (1 - \theta_{\text{i}}^{\text{do}})} \ &= \frac{\theta_{\text{i}}^{\text{do}}}{(1 - \text{pdo}) + \theta_{\text{i}}^{\text{do}}\text{pdo}} \end{align} $$
Thus, grandR is implicitly making assumptions similar to those laid out in the model above that is used by bakR.
Notable differences between bakR and grandR's dropout correction approaches are:
One potentially non-obvious step in the derivations above is when determining how to correct read counts. In that section, I had the following set of relationships:
$$ \begin{align} \frac{\text{E[number of reads without dropout]}}{\text{E[number of reads with dropout]}} &= \text{Dropout}\ &= \frac{\text{n}{\text{i}}\theta_{\text{i}}(1 - \text{pdo}) + \text{n}{\text{i}}(1 - \theta_{\text{i}}) }{\text{n}_{\text{i}}}\frac{\sum_{\text{j}=1}^{\text{NF}}\text{n}{\text{j}}}{ \sum{\text{j}=1}^{\text{NF}}\text{n}{\text{j}}\theta_{\text{j}}(1 - \text{pdo}) + \text{n}{\text{j}}(1 - \theta_{\text{j}}) } \ &= \frac{\theta_{\text{i}}(1 - \text{pdo}) + (1 - \theta_{\text{i}}) }{1}\frac{\text{N}}{\text{N}\theta_{\text{G}}(1-\text{pdo}) + \text{N}(1 - \theta_{\text{G}})} \ &=\text{...} \end{align} $$
where:
$$ \begin{align} \text{N} &= \sum_{\text{j=1}}^{\text{NF}}{\text{n}{\text{j}}} \ \theta{\text{G}} &= \frac{\sum_{\text{j=1}}^{\text{NF}}\text{n}{\text{j}}*\theta{\text{i}}}{\sum_{\text{j=1}}^{\text{NF}}\text{n}_{\text{j}}} \end{align} $$
The algebraic trick to get from line 2 to line 3 is to multiply by 1 (or rather $\text{N}$/$\text{N}$):
$$ \begin{align} \sum_{\text{j=1}}^{\text{NF}}\text{n}{\text{i}}\theta_{\text{j}} &= \sum_{\text{j=1}}^{\text{NF}}{\text{n}_{\text{j}}}\frac{\sum{\text{j=1}}^{\text{NF}}\text{n}{\text{i}}\theta_{\text{j}}}{\sum_{\text{j=1}}^{\text{NF}}{\text{n}_{\text{j}}}} \ &= \text{N}\theta{\text{G}} \end{align} $$
Given hypotheses about how dropout arises, you may expect the extent of dropout to be correlated with the U-content of the RNA feature. In fact, this does seem to be the case in real datasets. bakR currently does not attempt to account for this correlation for a few reasons:
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.