BBUM_DEcorr: FDR correction of secondary effects and significance calling...

View source: R/BBUM_DEcorr.R

BBUM_DEcorrR Documentation

FDR correction of secondary effects and significance calling on DE data frames

Description

Process DE results data tables. Designed for DESeq2 output but any similar tables with all appropriate columns are applicable. Out of two subsets of data rows, one set ("signal/sample set") with primary effects and one ("background set") without, BBUM_DEcorr models the p values to distinguish secondary effects signal from primary effects signal using the BBUM model. It then calls significant genes within the signal set after correcting for FDR of both null and secondary effects in multiple testing.

Usage

BBUM_DEcorr(
  df.deseq,
  classCol,
  geneName = NULL,
  pBBUM.alpha = 0.05,
  excluded = c(),
  outliers = c(),
  add_starts = list(),
  only_start = FALSE,
  limits = list(),
  auto_outliers = TRUE,
  rthres = 1,
  rtrimmax = 0.05,
  atrimmax = 10,
  two_tailed = FALSE,
  quiet = FALSE
)

Arguments

df.deseq

A DE results data.frame (or subclass). Typically the output DESeqResults object from DESeq2, after running results(). The data.frame could also have been further processed before calling BBUM_DEcorr, as long as the appropriate columns are unchanged. See details.

classCol

A character string of the name of the column that indicates whether a data point row belongs in the signal set or the background set.

geneName

A character string of the name of the column that contains the names of genes, e.g. "FeatureID". Leave as default (NULL) if the column has not been changed (i.e. as row names, or "row" if tidy=T in results() of DESeq2).

pBBUM.alpha

Cutoff level of pBBUM for significance calling.

excluded

A character vector of all gene names that did not meet some user-determined prior filtering criteria (e.g. read cutoff). If excluded is not supplied, all genes with valid p values will be used for analysis.

outliers

A character vector of all genes names that should be regarded as outliers among the background set (on top of automatically identified outliers, if applicable).

add_starts

List of named vectors for additional starts of fitting algorithm beyond the default set.

only_start

Whether the algorithm should only use the given starts (add_starts) to fit.

limits

Named list of custom limits for specific paramters. Parameters not mentioned would be default values.

auto_outliers

Toggle automatic outlier trimming.

rthres

Threshold value of r parameter to trigger a failed r.pass value. For automatic trimming methods in other functions and not meant for use in isolation.

rtrimmax

Maximum fraction of data points allowed to be outliers in the background set of data (to be trimmed).

atrimmax

Maximum absolute number of data points allowed to be outliers in the background set of data (to be trimmed).

two_tailed

Toggle the "two-tailed" case of BBUM correction, if the background assumption is weak and bona fide hits in the background class are relevant. See Details. Default behavior is off.

quiet

Suppress printed messages and warnings.

Details

BBUM_DEcorr can also be used on arbitrary data frames other those generated by DESeq2 (e.g. by EdgeR), as long as three necessary columns are present: data points names (geneName), p values (pvalue), and data set identifier (classCol).

The output data.frame preserves all original rows in order, all original columns in order, and row names if present.

classCol: For example, in the case where positive log fold changes represent the signal class, and negative log fold changes represent the background class, all rows of positive fold change points have classCol == T, and all rows of negative fold change points have classCol == F.

pBBUM represents the expected overall FDR level if the cutoff were set at that particular p value. This is similar to the interpretation of p values corrected through the typical p.adjust(method = "fdr").

pBBUM values are designed for the signal set p values only, Values for the background set are given but not valid as significance testing adjustment, and so should not be used to call any hits. They are provided primarily to compare the equivalent transformation against the signal set to assess the adjustment strategy. The background set should not be considered for hits.

BBUM_DEcorr functions properly only with p values filtered for poor quality data points, e.g. low read counts. Filtering has either been done manually in prior, through DESeq2's independentFiltering in results(), or through the excluded argument. Excluding a point via excluded is equivalent to filtering it out of the data.frame to begin with. Such points tend to have high p values and may disrupt the uniform null distribution.

Outliers (outliers) are included in the graphs from BBUM_plot() and have p values associated. Excluded points (excluded) are removed from analysis altogether.

Default starts for BBUM fitting are implemented. If additional starts should included, or only custom starts should be considered, make use of add_starts and/or only_start arguments.

If more than one start achieved the identical likelihood value, A random start is chosen among them.

Automatic outlier detection relies on the model fitting a value of r > 1. Such a result suggests that a stronger signal (presumably outliers) exists in the background set than in the signal set, which violates the assumptions of the model. This is a conservative strategy. The ideal way to deal with outliers is to identify and handle them before any statistical analyses. For benchmarking of the trimming strategy, see Wang & Bartel, 2022.

Adding too many starts or allowing too much outlier trimming can increase computation time.

If the background assumption is weak, such that a small number of bona fide hits are anticipated and relevant to the hypothesis at hand among the data points designated "background class", the FDR could be made to include the background class. This is akin to a two-tailed test (despite a one-tailed assumption to begin with). This would allow the generation of genuine FDR-corrected p values for the background class points as well. Toggle this using the two_tailed value.

Due to the asymptotic behavior of the function when any p values = 0, any p values < .Machine$double.xmin*10 would be constrained to .Machine$double.xmin*10.

Value

A tibble based on the input results table, with added columns from BBUM correction and significance calling:

  • geneName: Copy of extracted gene names (character)

  • excluded: Whether the point is considered failing the cutoff and excluded in excluded (logical)

  • outlier: Whether the point is considered an outlier (logical)

  • BBUM.class: Whether the data point is considered in the signal set instead of the background set (logical)

  • BBUM.l, BBUM.a, BBUM.th, BBUM.r: Coefficients from BBUM model fit (numeric)

  • pBBUM: BBUM-transformed, FDR-adusted p value (numeric)

  • BBUM.hit: Whether a gene is a significant hit (logical)

  • BBUM.fct: Factor summarizing status as hits, non-hits, or outliers (factor)

Examples

## Not run: 
# Typical default run
# DESeq2
dds = DESeqDataSetFromMatrix(countData = cts, ...) # DESeq2
dds = DESeq(dds)
res = results(dds) %>%
  as.data.frame() %>%
  mutate(FCup = log2FoldChange > 0)

# Call bbum function
res.BBUMcorr = BBUM_DEcorr(
  df.deseq = res,
  classCol = "FCup"
  )

# Or with some customized behavior
res.BBUMcorr = BBUM_DEcorr(
  df.deseq = res,
  classCol = "FCup",
  geneName = "miRNAs",
  pBBUM.alpha = 0.01,
  excluded = c("hsa-miR-124-5p", "hsa-miR-1-3p"),
  outliers = c("hsa-miR-21-5p"),
  add_starts = list(c(lambda = 0.9, a = 0.6, theta = 0.1, r = 0.1))
  )

## End(Not run)


wyppeter/bbum documentation built on Oct. 3, 2023, 3:29 p.m.