In bartongroup/RATS: Relative Abundance of Transcripts

knitr::opts_chunk$set(echo = TRUE)

library(rats)
library(data.table)

Let's set up an example, using RAT's data emulator (used for code testing, not suitable for data simulations).

# Simulate some data.
simdat <- sim_boot_data(clean=TRUE) 
# For convenience let's assign the contents of the list to separate variables
mycond_A <- simdat[[2]]   # Simulated bootstrapped data for one condition.
mycond_B <- simdat[[3]]   # Simulated bootstrapped data for other condition.
myannot <- simdat[[1]]    # Transcript and gene IDs for the above data.

# Call DTU
mydtu <- call_DTU(annot= myannot, verbose= FALSE,
                  boot_data_A= mycond_A, boot_data_B= mycond_B,
                  dprop_thresh=0.1, qboot=TRUE, rboot=FALSE)

Output structure

The output of RATs is a list containing 4 elements:

print( names(mydtu) )

Parameters

Parameters is a list that contains information about the data and the settings for a particular run.

# Parameter list's elements.
print( names(mydtu$Parameters) )

description - (str) Free-text description of the run. It is useful to record data sources, annotation source and version, experimental parameters, etc...
time - (str) Date and time for the run. This does not represent the exact time the run started.
rats_version - (str) The version of RATs.
R_version - (str) The version of R (including OS architecture).
var_name - (str) The value passed to the varname parameter.
cond_A & cond_B - (str) The values passed to the name_A and name_B parameters.
data_type - (str) The format of the input data.
num_replic_A & num_replic_B - (int) The number of samples in each condition.
num_genes - (int) The number of genes in the provided annotation.
num_transc - (int) The number of transcripts in the provided annotation.
tests - (str) The value passed to the testmode parameter.
use_sums - (bool) Sum the replicates instead of averaging them (pre-0.7.0 RATs always summed).
p_thresh - (num) The value passed to the p_thresh parameter.
abund_thresh - (num) The value passed to the abund_thresh parameter.
dprop_thresh - (num) The value passed to the dprop_thresh parameter.
correction - (str) Multiple testing correction method.
abund_scaling - (num) The value(s) passed to the scaling parameter. It will be either a single numeric value or a named vector of values.
quant_boot - (bool) The value passed to the qboots parameter.
quant_reprod_thresh - (num) The value passed to the qrep_thresh parameter, if qboot==TRUE.
quant_bootnum - (int) The value passed to the qbootnum parameter, if qboot==TRUE.
rep_boot - (bool) The value passed to the rboot parameter.
rep_reprod_thresh - (num) The value passed to the rrep_thresh parameter, if rboot==TRUE.
rep_bootnum - (int) The number of replicate bootstrapping iterations (currently M*N, where M and N are the numbers of samples in the two conditions), if rboot==TRUE.
seed - (int) Custom seed for the random generator. NA if none was specified.
reckless - (bool) The value passed to the reckless parameter.
lean - (bool) The value passed to the lean parameter.

Note: If bootstraps are disabled, the bootstrap-related fields may be NA, regardless of supplied values, to reflect the fact that they were not used.

Genes

Genes is a data.table listing results at the gene level. For your convenience, the respective aggregated transcript-level DTU calls are also included here (defined as at least one isoform being called DTU individually).

# Genes table's fields.
print( names(mydtu$Genes) )

The first few columns show the result of each decision step in a boolean manner, allowing easy filtering of the table. The remaining columns list the values based on which these decisions were made. Pseudo-code formulas are shown here to help understand how the different fields interact when making decisions.

parent_id - (str) Gene identifier.
elig - (bool) Eligible for testing. Whether the gene met the pre-filtering criteria. = (elig_transc >= 2).
sig - (bool) Statistically significant. = (pval_corr < Parameters$p_thresh).
elig_fx - (bool) Eligible effect size. Whether at least one of the isoforms meets the effect size criterion. = any(Transcript[parent_id, elig_fx]).
quant_reprod - (bool) Result reproducible across quantification iterations. For positive DTU, = (quant_dtu_freq >= Parameters$quant_reprod_thresh), for non-DTU = (quant_dtu_freq <= 1 - Parameters$quant_reprod_thresh).
rep_reprod - (bool) Result reproducible across replicates. For positive DTU, = (rep_dtu_freq >= Parameters$rep_reprod_thresh), for non-DTU = (rep_dtu_freq <= 1 - Parameters$rep_reprod_thresh).
DTU - (bool) Gene-level Differential Transcript Usage. = (sig & elig_fx & quant_reprod & rep_reprod).
transc_DTU - (bool) Aggregated from Transcripts$DTU: At least one isoform was individually reported as DTU. = (Transcript[ , any(DTU), by=parent_id]).
known_transc - (int) Number of known trascripts for this gene according to the given annotation.
detect_trancs - (int) Number of detected (expressed) transcripts in the given dataset.
elig_transc - (int) Number of eligible transcripts, aggregated from Transcripts$elig.
pval - (num) G test of independence p-value for the isoform ratios of the gene.
pval_corr - (num) Multiple testing corrected p-value. pval_corr= p.adjust(pval, Parameters$correction).
quant_p_median - (num) Median of corrected p-values across quantification bootstraps (not available in lean mode).
quant_p_min - (num) Minimum observed (corrected) p-value across quantification bootstraps (not available in lean mode).
quant_p_max - (num) Maximum observed (corrected) p-value across quantification bootstraps (not available in lean mode).
quant_na_freq - (num) Fraction of quantification iterations in which DTU could not be determined (not meeting pre-filtering criteria).
quant_dtu_freq - (num) Fraction of replicate iterations that support a positive DTU classification.
rep_p_median - (num) Median p-value across replication bootstraps (not available in lean mode).
rep_p_min - (num) Minimum observed p-value across replication bootstraps (not available in lean mode).
rep_p_max - (num) Maximum observed p-value across replication bootstraps (not available in lean mode).
rep_na_freq - (num) Fraction of replication iterations in which DTU could not be determined (not meeting pre-filtering criteria).
rep_dtu_freq - (num) Fraction of replication iterations that support a positive DTU classification.

Note: The fields reporting on the bootstraps will not be shown when bootstrapping is disabled.

Transcripts

Transcripts is a data.table listing results at the transcript level. For your convenience, the respective gene-level DTU calls are also included here.

# Transcripts table's fields.
print( names(mydtu$Transcripts) )

target_id - (str) Transcript identifier.
parent_id - (str) Gene identifier.
elig_xp - (bool) Eligible expression level. = (meanA >= Parameters$abund_thresh | meanB >= Parameters$abund_thresh).
elig - (bool) Eligible for testing (meets the noise threshold and at least one other isoform is expressed). = (elig_xp & totalA != 0 & totalB != 0 & (abundA != totalA | abundB != totalB)).
sig - (bool) Statistically significant. = (pval_corr < Parameters$p_thresh).
elig_fx - (bool) Eligible effect size. Proxy for biological significance. = (Dprop > Parameters$dprop_thresh). *. quant_reprod - (bool) Quantification reproducible. For positive DTU, = (quant_dtu_freq >= Parameters$quant_reprod_thresh), for non-DTU = (quant_dtu_freq <= 1 - Parameters$quant_reprod_thresh).
rep_reprod - (bool) Replication reproducible. For positive DTU, = (rep_dtu_freq >= Parameters$rep_reprod_thresh), for non-DTU = (rep_dtu_freq <= 1 - Parameters$rep_reprod_thresh).
DTU - (bool) The Transcript is Differentially Used. = (sig & elig_fx & quant_reprod & rep_reprod).
gene_DTU - (bool) Expanded from Genes$DTU. Indicates that the gene as a whole shows significant change in isoform ratios. = (Genes[parent_id, DTU]).
abundA and abundB - (num) Either the mean or the sum of the abundances across replicates, depending on what is indicated by Parameters$use_sums. This is the value used for the tests.
stdevA and stdevB - (num) The standard deviation of the abundance across the replicates.
log2FC - (num) log2 of fold-change of transcript abundance: = log2(abundB / abundA). abundA and abundB are not internally normalised for library size, so care must be taken when interpreting log2FC.
totalA and totalB - (num) The total abundance for the gene: totalA= sum(transcripts[parent_id, abundA]).
propA and propB - (num) The proportion of the gene expression owed to this transcript. propA= abundA / totalA.
Dprop - (num) The difference in the proportion of the transcript between the two conditions (effect size). = (probB - propA).
pval - (num) The proportion equality test P-value for the transcript.
pval_corr - (num) Multiple testing corrected p-value. = p.adjust(pval, Parameters$correction).
quant_p_median - (num) Median of corrected p-value across quantification bootstraps (not available in lean mode).
quant_p_min - (num) Minimum observed (corrected) p-value across quantification bootstraps (not available in lean mode).
quant_p_max - (num) Maximum observed (corrected) p-value across quantification bootstraps (not available in lean mode).
quant_Dprop_mean - (num) Mean effect size across quantification bootstraps (not available in lean mode).
quant_Dprop_stdev - (num) Standard deviation of effect size across quantification bootstraps (not available in lean mode).
quant_Dprop_min - (num) Minimum observed effect size across quantification bootstraps (not available in lean mode).
quant_Dprop_max - (num) Maximum observed effect size across quantification bootstraps (not available in lean mode).
quant_na_freq - (num) Fraction of quantification iterations in which DTU could not be determined (not meeting pre-filtering criteria).
quant_dtu_freq - (num) Fraction of quantification iterations that support a positive DTU classification.
rep_p_median - (num) Median of corrected p-value across replication bootstraps (not available in lean mode).
rep_p_min - (num) Minimum observed (corrected) p-value across replication bootstraps (not available in lean mode).
rep_p_max - (num) Maximum observed (corrected) p-value across replication bootstraps (not available in lean mode).
rep_Dprop_mean - (num) Mean effect size across replication bootstraps (not available in lean mode).
rep_Dprop_stdev - (num) Standard deviation of effect size across replication bootstraps (not available in lean mode).
rep_Dprop_min - (num) Minimum observed effect size across replication bootstraps (not available in lean mode).
rep_Dprop_max - (num) Maximum observed effect size across replication bootstraps (not available in lean mode).
rep_na_freq - (num) Fraction of replication iterations in which DTU could not be determined (not meeting pre-filtering criteria).
rep_dtu_freq - (num) Fraction of replication iterations that support a positive DTU classification.

Note: The fields reporting on the bootstraps will not be shown when bootstrapping is disabled.

Abundances

Abundances is a list of two data.tables, one for each condition. Each transcript is represented by a single abundance value per replicate, so if bootstrapped data was used, these values are the means across iterations. If plain abundances were provided as input, then Abundances essentially contains the input data. These abundances are included in the output because they are required for some of RATs' plotting options.

# Elements of Abundances.
print( names(mydtu$Abundances) )

condA - (num) The transcript abundances in the first condition.
condB - (num) The transcript abundances in the second condition.

# Abundance table for first condition.
print( head(mydtu$Abundances[[1]]) )

Quick results

For your convenience, RATs provides a few functions to give you a quick summary of the run. They all follow the same style.

These reports should not be seen as a substitute for the detailed RATs output.

Summary of DTU

The dtu_summary() function lists the total number of genes and transcripts for each of 3 categories:

DTU: There is significant change in isoform ratios (in terms of both effect size and statistical significance).
non-DTU: No significant change.
ineligible: Ineligible for testing. Genes/transcripts with read count below the set threshold, or where the gene has only one known transcript.

# A tally of the outcome.
print( dtu_summary(mydtu) )

The get_dtu_ids() function lists the coresponding identifiers per category. The ID lists obtained are ordered by effect size (Dprop).

# Gene and transcript IDs corresponding to the tally above.
ids <- get_dtu_ids(mydtu)
print( names(ids) )
print( ids )

Summary of isoform switching

Isoform switching is a subset of DTU. Primary isoform switching is often considered the most likely type of DTU to have an effect. The following two functions summarise the extent of isoform switching in the results:

# A tally of genes switching isoform ranks.
print( dtu_switch_summary(mydtu) )

# The gene IDs displaying isoform switching.
ids <- get_switch_ids(mydtu)
print( names(ids) )

Summary of DTU plurality

In case you want to know how many isoforms are affected per gene.

# A tally of genes switching isoform ranks.
print( dtu_plurality_summary(mydtu) )

# The gene IDs displaying isoform switching.
ids <- get_plurality_ids(mydtu)

These give you the number and IDs of genes in which 2, 3, etc... isoforms show DTU.

Contact information

The rats R package was developed within The Barton Group at The University of Dundee by Dr. Kimon Froussios, Dr. Kira Mourão and Dr. Nick Schurch.

To report problems or ask for assistance, please raise a new issue on the project's support forum. Providing a reproducible working example that demonstrates your issue is strongly encouraged to help us understand the problem. Also, be sure to read the vignette(s), and browse/search the support forum before posting a new issue, in case your question is already answered there.

Enjoy!