bbum: BBUM correction for significance testing of p values

Documented in BBUM_DEcorr

#' FDR correction of secondary effects and significance calling on DE data frames
#'
#' Process DE results data tables. Designed for DESeq2 output but any similar
#'   tables with all appropriate columns are applicable. Out of two subsets of
#'   data rows, one set ("signal/sample set") with primary effects and one
#'   ("background set") without, \code{BBUM_DEcorr} models the p values to
#'   distinguish secondary effects signal from primary effects signal using the
#'   BBUM model. It then calls significant genes within the signal set after
#'   correcting for FDR of **both** null **and** secondary effects in multiple
#'   testing.
#'
#' @param df.deseq A DE results \code{data.frame} (or subclass). Typically the
#'   output  \code{DESeqResults} object from DESeq2, after running
#'   \code{results()}. The \code{data.frame} could also have been further
#'   processed before calling \code{BBUM_DEcorr}, as long as the appropriate
#'   columns are unchanged. See details.
#' @param classCol A \code{character} string of the name of the column that
#'   indicates whether a data point row belongs in the signal set or the
#'   background set.
#' @param geneName A \code{character} string of the name of the column that
#'   contains the names of genes, e.g. \code{"FeatureID"}. Leave as default
#'   (\code{NULL}) if the column has not been changed (i.e. as row names, or
#'   \code{"row"} if \code{tidy=T} in \code{results()} of DESeq2).
#' @param pBBUM.alpha Cutoff level of \code{pBBUM} for significance calling.
#' @param excluded A \code{character} vector of all gene names that did **not**
#'   meet some user-determined prior filtering criteria (e.g. read cutoff). If
#'   \code{excluded} is not supplied, all genes with valid p values will be used
#'   for analysis.
#' @param outliers A \code{character} vector of all genes names that should be
#'   regarded as outliers among the background set (on top of automatically
#'   identified outliers, if applicable).
#' @inheritParams BBUM_corr
#'
#' @details \code{BBUM_DEcorr} can also be used on arbitrary data frames other
#'   those generated by \code{DESeq2} (e.g. by \code{EdgeR}), as long as three
#'   necessary columns are present: data points names (\code{geneName}), p
#'   values (\code{pvalue}), and data set identifier (\code{classCol}).
#' @details The output \code{data.frame} preserves all original rows in order,
#'   all original columns in order, and row names if present.
#' @details \code{classCol}: For example, in the case where positive log fold
#'   changes represent the signal class, and negative log fold changes represent
#'   the background class, all rows of positive fold change points have
#'   \code{classCol == T}, and all rows of negative fold change points have
#'   \code{classCol == F}.
#' @details \code{pBBUM} represents the expected overall FDR level if the cutoff
#'   were set at that particular p value. This is similar to the interpretation
#'   of p values corrected through the typical \code{p.adjust(method = "fdr")}.
#' @details \code{pBBUM} values are designed for the signal set p values only,
#'   Values for the background set are given but not valid as significance
#'   testing adjustment, and so should **not** be used to call any hits. They
#'   are provided primarily to compare the equivalent transformation against the
#'   signal set to assess the adjustment strategy. The background set should
#'   **not** be considered for hits.
#' @details \code{BBUM_DEcorr} functions properly only with p values filtered
#'   for poor quality data points, e.g. low read counts. Filtering has either
#'   been done manually in prior, through DESeq2's \code{independentFiltering}
#'   in \code{results()}, or through the \code{excluded} argument. Excluding a
#'   point via \code{excluded} is equivalent to filtering it out of the
#'   data.frame to begin with. Such points tend to have high p values and
#'   may disrupt the uniform null distribution.
#' @details Outliers (\code{outliers}) are included in the graphs from
#'   \code{BBUM_plot()} and have p values associated. Excluded points
#'   (\code{excluded}) are removed from analysis altogether.
#' @details Default starts for BBUM fitting are implemented. If additional
#'   starts should included, or only custom starts should be considered, make
#'   use of \code{add_starts} and/or \code{only_start} arguments.
#' @details If more than one start achieved the identical likelihood value,
#'   A random start is chosen among them.
#' @details Automatic outlier detection relies on the model fitting a value of
#'   \code{r > 1}. Such a result suggests that a stronger signal (presumably
#'   outliers) exists in the background set than in the signal set, which
#'   violates the assumptions of the model. This is a *conservative* strategy.
#'   The ideal way to deal with outliers is to identify and handle them before
#'   any statistical analyses. For benchmarking of the trimming strategy, see
#'   Wang & Bartel, 2022.
#' @details Adding too many starts or allowing too much outlier trimming can
#'   increase computation time.
#' @details  If the background assumption is weak, such that a small number
#'   of bona fide hits are anticipated and relevant to the hypothesis at
#'   hand among the data points designated "background class", the FDR could be
#'   made to include the background class. This is akin to a two-tailed test
#'   (despite a one-tailed assumption to begin with). This would allow the
#'   generation of genuine FDR-corrected p values for the background class
#'   points as well. Toggle this using the \code{two_tailed} value.
#' @details Due to the asymptotic behavior of the function when any
#'   p values = 0, any p values < \code{.Machine$double.xmin*10} would be
#'   constrained to \code{.Machine$double.xmin*10}.
#'
#' @return A \code{tibble} based on the input results table, with added
#'   columns from BBUM correction and significance calling:
#' * \code{geneName}: Copy of extracted gene names (\code{character})
#' * \code{excluded}: Whether the point is considered failing the cutoff and
#'   excluded in \code{excluded} (\code{logical})
#' * \code{outlier}: Whether the point is considered an outlier (\code{logical})
#' * \code{BBUM.class}: Whether the data point is considered in the signal set
#'   instead of the background set (\code{logical})
#' * \code{BBUM.l}, \code{BBUM.a}, \code{BBUM.th}, \code{BBUM.r}: Coefficients
#'   from BBUM model fit (\code{numeric})
#' * \code{pBBUM}: BBUM-transformed, FDR-adusted p value (\code{numeric})
#' * \code{BBUM.hit}: Whether a gene is a significant hit (\code{logical})
#' * \code{BBUM.fct}: Factor summarizing status as hits, non-hits, or outliers
#'   (\code{factor})
#'
#' @examples
#' \dontrun{
#' # Typical default run
#' # DESeq2
#' dds = DESeqDataSetFromMatrix(countData = cts, ...) # DESeq2
#' dds = DESeq(dds)
#' res = results(dds) %>%
#'   as.data.frame() %>%
#'   mutate(FCup = log2FoldChange > 0)
#'
#' # Call bbum function
#' res.BBUMcorr = BBUM_DEcorr(
#'   df.deseq = res,
#'   classCol = "FCup"
#'   )
#'
#' # Or with some customized behavior
#' res.BBUMcorr = BBUM_DEcorr(
#'   df.deseq = res,
#'   classCol = "FCup",
#'   geneName = "miRNAs",
#'   pBBUM.alpha = 0.01,
#'   excluded = c("hsa-miR-124-5p", "hsa-miR-1-3p"),
#'   outliers = c("hsa-miR-21-5p"),
#'   add_starts = list(c(lambda = 0.9, a = 0.6, theta = 0.1, r = 0.1))
#'   )
#' }
#'
#' @export
BBUM_DEcorr = function(
  df.deseq,
  classCol,
  geneName = NULL,
  pBBUM.alpha = 0.05,
  excluded = c(),
  outliers = c(),
  add_starts = list(), only_start = FALSE,
  limits = list(),
  auto_outliers = TRUE, rthres = 1,
  rtrimmax = 0.05, atrimmax = 10,
  two_tailed = FALSE,
  quiet = FALSE
) {

  # Check, collect, and process data.frame input ----
  if(!quiet){print(">> Starting BBUM correction.")}
  df.res = tibble::as_tibble(df.deseq, rownames = NA, .name_repair = "minimal")

  # First handle whatever the name column is in the provided input, for usage convenience
  if(!is.null(geneName) && !(geneName %in% colnames(df.deseq))){
    stop("Provided geneName col not in input data.frame!")
    }
  restore.rownames = FALSE
  if(is.null(geneName)) {
    if("row" %in% colnames(df.res)){
      # The tidy=T case
      df.res = dplyr::mutate(df.res, geneName = as.character(row))
    } else {
      # Default
      restore.rownames = TRUE
      df.res = tibble::rownames_to_column(df.res, "geneName")
    }
  } else {
    # User-provided
    df.res = dplyr::mutate(df.res, geneName = as.character(!!dplyr::sym(geneName)))
  }

  # Now handle whatever the BBUM class column is in the provided input, for usage convenience
  if(!(classCol %in% colnames(df.res))){ stop("Provided class col not in input data.frame!") }
  df.res = dplyr::mutate(df.res, BBUM.class = !!dplyr::sym(classCol))

  # Process read cutoff and other filtering
  allGeneNames = as.character(dplyr::pull(df.res, geneName))
  meets.cutoff = allGeneNames[!allGeneNames %in% excluded]
  if(length(meets.cutoff) < 10){ stop("Number of genes not excluded is too small!") }

  # Process outlier filter
  outliers = outliers[outliers %in% meets.cutoff]
  outliersN = length(outliers)
  if(!quiet && outliersN > 0){
    print(paste0(">> Outliers manually masked (# = ", outliersN, "):"))
    print(as.character(outliers))
  }

  # Final polishing
  df.dat = df.res %>%
    dplyr::mutate(
      excluded = !(as.character(geneName) %in% meets.cutoff),
      outlier  =   as.character(geneName) %in% outliers
    )

  # Model BBUM on p values ----
  BBUM.out = df.dat %>%
    BBUM_apply(
      add_starts = add_starts,
      only_start = only_start,
      limits = limits,
      pBBUM.alpha = pBBUM.alpha,
      auto_outliers = auto_outliers,
      rthres = rthres,
      rtrimmax = rtrimmax,
      atrimmax = atrimmax,
      two_tailed = two_tailed,
      quiet = quiet
    )

  # Call hits ----
  df.bbum = BBUM.out$data %>%
    dplyr::mutate(
      BBUM.hits = !outlier &
        dplyr::if_else(is.na(pBBUM), FALSE, pBBUM < pBBUM.alpha) &  # pBBUM criterion
        (two_tailed | BBUM.class), # Call primary effect hits
      BBUM.fct = factor(
        dplyr::if_else(
          BBUM.hits,
          "hit",
          dplyr::if_else(outlier,
                         "outlier",
                         "none"
          )
        ), levels = c("none","hit","outlier"))  # summaritive factor for categories
    ) %>%
    dplyr::ungroup()

  # Benchmarking ----
  if(!quiet){
    # Print hits list
    hits = df.bbum %>%
      dplyr::filter(BBUM.hits) %>%
      dplyr::arrange(pBBUM) %>%
      dplyr::pull(geneName) %>%
      unname()
    c_hits = length(hits)
    print(paste0(">> Significant genes (# = ", c_hits, "):"))
    if(c_hits > 0){
      print(as.character(hits))
    }
  }

  # Restore how the input df was ----
  if(restore.rownames){
    df.bbum = df.bbum %>%
      dplyr::mutate(geneName.rowname = geneName) %>%
      tibble::column_to_rownames("geneName.rowname")
  }

  # Done ----
  return(df.bbum)
}