View source: R/Fitting_Models.R
bakRFit | R Documentation |
bakRFit
analyzes nucleotide recoding RNA-seq data to estimate
kinetic parameters relating to RNA stability and changes in RNA
stability induced by experimental perturbations. Several statistical
models of varying efficiency and accuracy can be used to fit data.
bakRFit(
obj,
StanFit = FALSE,
HybridFit = FALSE,
high_p = 0.2,
totcut = 50,
totcut_all = 10,
Ucut = 0.25,
AvgU = 4,
FastRerun = FALSE,
FOI = c(),
concat = TRUE,
StanRateEst = FALSE,
RateEst_size = 30,
low_reads = 100,
high_reads = 5e+05,
chains = 1,
NSS = FALSE,
Chase = FALSE,
BDA_model = FALSE,
multi_pold = FALSE,
Long = FALSE,
kmeans = FALSE,
ztest = FALSE,
Fisher = TRUE,
...
)
obj |
|
StanFit |
Logical; if TRUE, then the MCMC implementation is run. Will only be used if |
HybridFit |
Logical; if TRUE, then the Hybrid implementation is run. Will only be used if |
high_p |
Numeric; Any features with a mutation rate (number of mutations / number of Ts in reads) higher than this in any -s4U control samples (i.e., samples that were not treated with s4U) are filtered out |
totcut |
Numeric; Any features with less than this number of sequencing reads in any replicate of all experimental conditions are filtered out |
totcut_all |
Numeric; Any features with less than this number of sequencing reads in any sample are filtered out |
Ucut |
Numeric; All features must have a fraction of reads with 2 or less Us less than this cutoff in all samples |
AvgU |
Numeric; All features must have an average number of Us greater than this cutoff in all samples |
FastRerun |
Logical; only matters if a bakRFit object is passed to |
FOI |
Features of interest; character vector containing names of features to analyze |
concat |
Logical; If TRUE, FOI is concatenated with output of reliableFeatures |
StanRateEst |
Logical; if TRUE, a simple 'Stan' model is used to estimate mutation rates for fast_analysis; this may add a couple minutes to the runtime of the analysis. |
RateEst_size |
Numeric; if StanRateEst is TRUE, then data from RateEst_size genes are used for mutation rate estimation. This can be as low as 1 and should be kept low to ensure maximum efficiency |
low_reads |
Numeric; if StanRateEst is TRUE, then only features with more than low_reads reads in all samples will be used for mutation rate estimation |
high_reads |
Numeric; if StanRateEst is TRUE, then only features with less than high_read reads in all samples will be used for mutation rate estimation. A high read count cutoff is as important as a low read count cutoff in this case because you don't want the fraction labeled of chosen features to be extreme (e.g., close to 0 or 1), and high read count features are likely low fraction new features. |
chains |
Number of Markov chains to sample from. 1 should suffice since these are validated models. Running more chains is generally preferable, but memory constraints can make this unfeasible. |
NSS |
Logical; if TRUE, logit(fn)s are directly compared to avoid assuming steady-state |
Chase |
Logical; Set to TRUE if analyzing a pulse-chase experiment. If TRUE, kdeg = -ln(fn)/tl where fn is the fraction of reads that are s4U (more properly referred to as the fraction old in the context of a pulse-chase experiment). |
BDA_model |
Logical; if TRUE, variance is regularized with scaled inverse chi-squared model. Otherwise a log-normal model is used. |
multi_pold |
Logical; if TRUE, pold is estimated for each sample rather than use a global pold estimate. |
Long |
Logical; if TRUE, long read optimized fraction new estimation strategy is used. |
kmeans |
Logical; if TRUE, kmeans clustering on read-specific mutation rates is used to estimate pnews and pold. |
ztest |
Logical; if TRUE and the MLE implementation is being used, then a z-test will be used for p-value calculation rather than the more conservative moderated t-test. |
Fisher |
Logical; if TRUE, Fisher information is used to estimate logit(fn) uncertainty. Else, a less conservative binomial model is used, which can be preferable in instances where the Fisher information strategy often drastically overestimates uncertainty (i.e., low coverage or low pnew). |
... |
Arguments passed to either |
If bakRFit
is run on a bakRData object, cBprocess
and then fast_analysis
will always be called. The former will generate the processed
data that can be passed to the model fitting functions (fast_analysis
and TL_Stan
). The call to fast_analysis
will generate a list of dataframes
containing information regarding the fast_analysis
fit. fast_analysis
is always
called because its output is required for both Hybrid_fit
and TL_Stan
.
If bakRFit
is run on a bakRFit object, cBprocess
will not be called again,
as the output of cBprocess
will already be contained in the bakRFit object. Similarly,
fast_analysis
will not be called again unless bakRFit is rerun on the bakRData object.
or if FastRerun
is set to TRUE. If you want to generate model fits using different parameters for cBprocess,
you will have to rerun bakRFit
on the bakRData object.
If bakRFit
is run on a bakRFnData object, fn_process
and then avg_and_regularize
will always be called. The former will generate the processed data that can be passed to the model
fitting functions (avg_and_regularize
and TL_Stan
, the latter only with HybridFit = TRUE).
If bakRFit
is run on a bakRFnFit object. fn_process
will not be called again, as
the output of fn_process
will already be contained in the bakRFnFit object. Similary,
avg_and_regularize
will not be called unless bakRFit
is rerun on the bakRData object,
or if FastRerun
is set to TRUE. If you want to generate model fits using different
parameters for fn_process
, you will have to rerun bakRFit
on the bakRData object.
See the documentation for the individual fitting functions for details regarding how they analyze nucleotide recoding data. What follows is a brief overview of how each works
fast_analysis
(referred to as the MLE implementation in the bakR paper)
either estimates mutation rates from + and (if available) - s4U samples or uses mutation rate estimates
provided by the user to perform maximum likelihood estimation (MLE) of the fraction of RNA that is labeled for each
replicate of nucleotide recoding data provided. Uncertainties for each replicate's estimate are approximated using
asymptotic results involving the Fisher Information and assuming known mutation rates. Replicate data
is pooled using an approximation to hierarchical modeling that relies on analytic solutions to simple Bayesian models.
Linear regression is used to estimate the relationship between read depths and replicate variability for uncertainty
estimation regularization, again performed using analytic solutions to Bayesian models.
TL_Stan
with Hybrid_Fit set to TRUE (referred to as the Hybrid implementation in the bakR paper)
takes as input estimates of the logit(fraction new) and uncertainty provided by fast_analysis
.
It then uses 'Stan' on the backend to implement a hierarchical model that pools data across replicates and the dataset
to estimate effect sizes (L2FC(kdeg)) and uncertainties. Replicate variability information is pooled across each experimental
condition to regularize variance estimates using a hierarchical linear regression model.
The default behavior of TL_Stan
(referred to as the MCMC implementation in the bakR paper)
is to use 'Stan' on the back end to implement a U-content exposure adjusted Poisson mixture model
to estimate fraction news from the mutational data. Partial pooling of replicate variability estimates
is performed as with the Hybrid implementation.
bakRFit object with results from statistical modeling and data processing. Objects possibly included are:
Fast_Fit; Always will be present. Output of fast_analysis
Hybrid_Fit; Only present if HybridFit = TRUE. Output of TL_stan
Stan_Fit; Only present if StanFit = TRUE. Output of TL_stan
Data_lists; Always will be present. Output of cBprocess
with Fast and Stan == TRUE
# Simulate data for 1000 genes, 2 replicates, 2 conditions
simdata <- Simulate_bakRData(1000, nreps = 2)
# You always must fit fast implementation before any others
Fit <- bakRFit(simdata$bakRData)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.