In kkrismer/idr2d: Irreproducible Discovery Rate for Genomic Interactions Data

IDR2D is an extension of the original method IDR [@li2011], which was intended for ChIP-seq peaks (or one-dimensional genomic data). This package applies the method to two-dimensional genomic data, such as interactions between two genomic loci (also called anchors). Genomic interaction data is generated by genome-wide methods such as Hi-C [@pmid20461051], ChIA-PET [@pmid19247990], and HiChIP [@pmid25128017].

knitr::opts_chunk$set(fig.width = 7, fig.height = 7, echo = FALSE,
                      warning = FALSE, message = FALSE, dev = 'png',
                      out.extra = 'style="border-width: 0;"')

Input data

Load example data:

rep1_df <- idr2d:::chiapet$rep1_df
rep2_df <- idr2d:::chiapet$rep2_df

Example data - replicate 1

library(DT)
header <- htmltools::withTags(table(
  class = 'display',
  thead(
    tr(
      th(colspan = 3, "anchor A"),
      th(colspan = 3, "anchor B"),
      th(rowspan = 2, "FDR")
    ),
    tr(
      lapply(rep(c("chr.", "start coordinate", "end coordinate"), 2), th)
    )
  )
))
datatable(rep1_df[seq_len(min(nrow(rep1_df), 1000)), ],
          container = header, rownames = FALSE,
          options = list(searching = FALSE)) %>%
  formatRound("fdr", 3)

Only the first 1000 interactions are shown.

Example data - replicate 2

datatable(rep2_df[seq_len(min(nrow(rep2_df), 1000)), ],
          container = header, rownames = FALSE,
          options = list(searching = FALSE)) %>%
  formatRound("fdr", 3)

Only the first 1000 interactions are shown.

Analysis

Load the package:

library(idr2d)

Estimate IDR:

idr_results <- estimate_idr2d(rep1_df, rep2_df,
                              value_transformation = "log_additive_inverse")
rep1_idr_df <- idr_results$rep1_df

Important to note here is that the appropriate value transformation depends on the semantics of the value column (always the seventh column) in rep1_df and rep2_df. This column is used to establish a ranking between interactions, with highly significant interactions on top of the list and least significant interactions (i.e., most likely noise) at the bottom of the list. The ranking is established by the value column, sorted in descending order. Since our value column contains FDRs (the lower, the more significant), we need to transform the values to comply with the assumption that high values indicate high significance. For p-values and p-value derived measures (like Q values), the log_additive_inverse transformation (-log(x)) is recommended.

Results

# avoid CRAN warnings
rank <- rep_rank <- value <- rep_value <- idr <- NULL

header <- htmltools::withTags(table(
  class = 'display',
  thead(
    tr(
      th("rank in R1"),
      th("rank in R2"),
      th("transformed value in R1"),
      th("transformed value in R2"),
      th("IDR")
    )
  )
))

df <- dplyr::select(rep1_idr_df, rank, rep_rank, value, rep_value, idr)
datatable(df[seq_len(min(nrow(df), 1000)), ], 
          rownames = FALSE,
          options = list(searching = FALSE),
          container = header) %>%
  formatRound(c("value", "rep_value", "idr"), 3)

Only the first 1000 observations are shown.

Summary

summary(idr_results)

Distribution of IDRs

draw_idr_distribution_histogram(rep1_idr_df)

Rank - IDR dependence

draw_rank_idr_scatterplot(rep1_idr_df)

Value - IDR dependence

draw_value_idr_scatterplot(rep1_idr_df)

Additional information

Most of the functionality of the IDR2D package is also offered through the website at https://idr2d.mit.edu.

For a more detailed discussion on IDR2D, please have a look at the IDR2D paper:

IDR2D identifies reproducible genomic interactions
Konstantin Krismer, Yuchun Guo, and David K. Gifford
Nucleic Acids Research, Volume 48, Issue 6, 06 April 2020, Page e31; DOI: https://doi.org/10.1093/nar/gkaa030