fn_process: Curate data in bakRFnData object for statistical modeling
In bakR: Analyze and Compare Nucleotide Recoding RNA Sequencing Datasets

fn_process

R Documentation

Curate data in bakRFnData object for statistical modeling

Description

fn_process creates the data structures necessary to analyze nucleotide recoding RNA-seq data with the MLE and Hybrid implementations in bakRFit. The input to fn_process must be an object of class bakRFnData.

Usage

fn_process(
  obj,
  totcut = 50,
  totcut_all = 10,
  Chase = FALSE,
  FOI = c(),
  concat = TRUE
)

Arguments

`obj`	An object of class bakRFnData
`totcut`	Numeric; transcripts with fewer than this number of sequencing reads in at least one replicate of every experimental condition are filtered out
`totcut_all`	Numeric; transcripts with fewer than this number of sequencing reads in any sample are filtered out
`Chase`	Boolean; if TRUE, a pulse-chase analysis strategy is used
`FOI`	Features of interest; character vector containing names of features to analyze. If `FOI` is non-null and `concat` is TRUE, then all minimally reliable FOIs will be combined with reliable features passing all set filters (`totcut` and `totcut_all`). If `concat` is FALSE, only the minimally reliable FOIs will be kept. A minimally reliable FOI is one that passes filtering with minimally stringent parameters.
`concat`	Boolean; If TRUE, FOI is concatenated with output of reliableFeatures
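
The interaction between `FOI` and `concat` can be sketched with plain set operations. This is only an illustration of the behaviour described above, not bakR's internal code: the feature names are made up, and `reliable` and `minimally_reliable` are hypothetical stand-ins for the sets that the strict and minimally stringent filters would return.

```r
# Hypothetical feature sets (all names are invented for illustration)
reliable           <- c("geneA", "geneB")           # pass totcut and totcut_all
minimally_reliable <- c("geneA", "geneB", "geneC")  # pass the lenient filter
FOI                <- c("geneC", "geneX")           # user-supplied features

# Only FOIs that are at least minimally reliable are retained
foi_kept <- intersect(FOI, minimally_reliable)

# concat = TRUE: reliable features plus the retained FOIs
keep_concat <- union(reliable, foi_kept)

# concat = FALSE: only the retained FOIs
keep_foi_only <- foi_kept

print(keep_concat)
print(keep_foi_only)
```

In this toy case `geneX` is dropped either way because it does not pass even the lenient filter.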

Details

fn_process first filters out features with insufficient read coverage, as determined by totcut and totcut_all. It then creates the data structures needed for analysis with bakRFit and some of the visualization functions (namely plotMA).

The 1st step executed by fn_process is to find the names of features that are deemed "reliable". A reliable feature is one with sufficient read coverage in every single sample (i.e., > totcut_all reads in all samples) and sufficient read coverage in all replicates of at least one experimental condition (i.e., > totcut reads in all replicates for one or more experimental conditions). This is done with a call to reliableFeatures.
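
The two coverage criteria can be illustrated in base R on a toy count matrix. This is a minimal sketch of the rule as worded above, not the reliableFeatures implementation; the matrix, the sample-to-condition map `exp_id`, and the gene names are all invented for the example.

```r
# Toy read-count matrix: rows are features, columns are samples.
counts <- matrix(
  c(120,  90,  15,  30,   # geneA: well covered in condition 1 only
     60,  70,  80,  55,   # geneB: well covered in both conditions
     70,   8,  90,  95,   # geneC: one sample at or below totcut_all
     20,  25,  60,  30),  # geneD: no condition with all replicates > totcut
  nrow = 4, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC", "geneD"),
                  c("s1", "s2", "s3", "s4"))
)

# Hypothetical map from sample to experimental condition (two replicates each)
exp_id <- c(s1 = 1, s2 = 1, s3 = 2, s4 = 2)

totcut     <- 50
totcut_all <- 10

# Criterion 1: more than totcut_all reads in every sample
passes_all <- apply(counts > totcut_all, 1, all)

# Criterion 2: more than totcut reads in all replicates of at least one condition
passes_cond <- sapply(rownames(counts), function(f) {
  any(tapply(counts[f, ] > totcut, exp_id, all))
})

reliable <- rownames(counts)[passes_all & passes_cond]
print(reliable)
```

Here geneC fails the first criterion and geneD fails the second, so only geneA and geneB are kept.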

The 2nd step is to extract only reliableFeatures from the fns dataframe in the bakRFnData object. During this process, a numerical ID is given to each reliableFeature, with the numerical ID corresponding to their order when arranged using dplyr::arrange.
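
A rough base R sketch of this subsetting and ID assignment follows. It assumes a toy `fns`-like data frame with made-up values, and it uses base R `order` and `sort` in place of dplyr::arrange, so tie-breaking and collation details may differ from what bakR actually does.

```r
# Hypothetical set of reliable feature names
reliable <- c("geneC", "geneA", "geneB")

# Toy fraction-new table (values invented for illustration)
fns <- data.frame(
  sample = c("s1", "s1", "s1", "s1", "s2", "s2"),
  XF     = c("geneC", "geneA", "geneZ", "geneB", "geneA", "geneC"),
  fn     = c(0.20, 0.50, 0.90, 0.10, 0.45, 0.25),
  stringsAsFactors = FALSE
)

# Keep only rows belonging to reliable features (geneZ is dropped)
fns_rel <- fns[fns$XF %in% reliable, ]

# Arrange by feature name, then by sample
fns_rel <- fns_rel[order(fns_rel$XF, fns_rel$sample), ]

# Numerical ID = position of the feature name in the sorted unique names
fns_rel$Feature_ID <- match(fns_rel$XF, sort(unique(fns_rel$XF)))

print(fns_rel)
```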

The 3rd step is to prepare data structures that can be passed to fast_analysis and TL_stan (usually accessed via the bakRFit helper function).

Value

Returns a list of objects that can be passed to TL_stan and/or fast_analysis. Those objects are:

Stan_data; A list that can be passed to TL_stan with Hybrid_Fit = TRUE. It consists of metadata as well as the data that Stan will analyze. The data to be analyzed consists of equal-length vectors. The contents of Stan_data are:
- NE; Number of datapoints for 'Stan' to analyze (NE = Number of Elements)
- NF; Number of features in dataset
- TP; Numerical indicator of s4U feed (0 = no s4U feed, 1 = s4U fed)
- FE; Numerical indicator of feature
- num_mut; Number of U-to-C mutations observed in a particular set of reads
- MT; Numerical indicator of experimental condition (Exp_ID from metadf)
- nMT; Number of experimental conditions
- R; Numerical indicator of replicate
- nrep; Number of replicates (maximum across experimental conditions)
- nrep_vect; Vector of number of replicates in each experimental condition
- tl; Vector of label times for each experimental condition
- Avg_Reads; Standardized log10(average read counts) for a particular feature in a particular condition, averaged over replicates
- sdf; Dataframe that maps numerical feature ID to original feature name. Also has read depth information
- sample_lookup; Lookup table relating MT and R to the original sample name

Fn_est; A data frame containing fraction new estimates for +s4U samples:
- sample; Original sample name
- XF; Original feature name
- fn; Fraction new estimate
- n; Number of reads
- Feature_ID; Numerical ID for each feature
- Replicate; Numerical ID for each replicate
- Exp_ID; Numerical ID for each experimental condition
- tl; s4U label time
- logit_fn; logit of fraction new estimate
- kdeg; degradation rate constant estimate
- log_kdeg; log of degradation rate constant estimate
- logit_fn_se; Uncertainty of logit(fraction new) estimate
- log_kd_se; Uncertainty of log(kdeg) estimate

Count_Matrix; A matrix with read count information. Each column represents a sample and each row represents a feature. Each entry is the raw number of read counts mapping to a particular feature in a particular sample. Column names are the corresponding sample names and row names are the corresponding feature names.

Ctl_data; Identical content to Fn_est but for any -s4U data (and thus with fn estimates set to 0). Will be NULL if no -s4U data is present
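
The relationship between the fn, tl, logit_fn, kdeg, and log_kdeg columns of Fn_est can be sketched as follows. This assumes the standard pulse-label relation fn = 1 - exp(-kdeg * tl), which this page does not state explicitly; the values are made up, and a pulse-chase design (Chase = TRUE) may use a different transformation.

```r
# Made-up fraction new estimates and a hypothetical label time
fn <- c(0.10, 0.50, 0.90)
tl <- 2

# Logit of the fraction new
logit_fn <- log(fn / (1 - fn))

# Degradation rate constant, ASSUMING fn = 1 - exp(-kdeg * tl)
kdeg <- -log(1 - fn) / tl

# Log of the degradation rate constant
log_kdeg <- log(kdeg)

print(data.frame(fn, tl, logit_fn, kdeg, log_kdeg))
```

Under this assumption, a feature with fn = 0.5 has kdeg = log(2) / tl, that is, a half-life equal to the label time.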

Examples



# Load cB
data("cB_small")

# Load metadf
data("metadf")

# Create bakRData
bakRData <- bakRData(cB_small, metadf)

# Preprocess data; cBprocess takes a bakRData object,
# whereas fn_process(obj = ...) expects a bakRFnData object
data_for_bakR <- cBprocess(obj = bakRData)

bakR documentation built on June 22, 2024, 6:55 p.m.