TopDom: An Efficient and Deterministic Method for Identifying Topological Domains in Genomes

Documented in TopDom

#' Identify Topological Domains from a Hi-C Contact Matrix
#' 
#' @param data A TopDomData object, or the pathname to a normalized
#' Hi-C contact matrix file as read by [readHiC()], that specify N bins.
#' 
#' @param window.size The number of bins to extend (as a non-negative integer).
#' Recommended range is in {5, ..., 20}.
#' 
#' @param outFile (optional) The filename without extension of the three
#' result files optionally produced. See details below.
#' 
#' @param statFilter (logical) Specifies whether non-significant
#' topological-domain boundaries should be dropped or not.
#' 
#' @param ... Additional arguments passed to [readHiC()].
#' 
#' @param debug If `TRUE`, debug output is produced.
#'
#' @return A named list of class `TopDom` with data.frame elements
#' `binSignal`, `domain`, and `bed`.
#' * The `binSignal` data frame (N-by-7) holds mean contact frequency,
#'   local extreme, and p-value for every bin. The first four columns
#'   represent basic bin information given by matrix file, such as
#'   bin id (`id`), chromosome(`chr`), start coordinate (`from.coord`),
#'   and end coordinate (`to.coord`) for each bin.
#'   The last three columns (`local.ext`, `mean.cf`, and `p-value`) represent
#'   computed values by the TopDom algorithm.
#'   The columns are:
#'   - `id`: Bin ID
#'   - `chr`: Chromosome
#'   - `from.coord`: Start coordinate of bin
#'   - `to.coord`: End coordinate of bin
#'   - `local.ext`:
#'      + `-1`: Local minima.
#'      + `-0.5`: Gap region.
#'      + `0`: General bin.
#'      + `1`: Local maxima.
#'   - `mean.cf`: Average of contact frequencies between lower and upper
#'     regions for bin _i = 1,2,...,N_.
#'   - `p-value`: Computed p-value by Wilcox rank sum test.
#'     See Shin et al. (2016) for more details.
#' 
#' * The `domain` data frame (D-by-7):
#'   Every bin is categorized by basic building block, such as gap, domain,
#'   or boundary.
#'   Each row indicates a basic building block.
#'   The first five columns include the basic information about the block,
#'   'tag' column indicates the class of the building block.
#'   - `id`: Identifier of block
#'   - `chr`: Chromosome
#'   - `from.id`: Start bin index of the block
#'   - `from.coord`: Start coordinate of the block
#'   - `to.id`: End bin index of the block
#'   - `to.coord`: End coordinate of the block
#'   - `tag`: Categorized name of the block. Three possible blocks exists:
#'     + `gap`
#'     + `domain`
#'     + `boundary`
#'   - `size`: size of the block
#'
#' * The `bed` data frame (D-by-4) is a representation of the `domain`
#'   data frame in the
#'   [BED file format](https://genome.ucsc.edu/FAQ/FAQformat.html#format1).
#'   It has four columns:
#'   - `chrom`: The name of the chromosome.
#'   - `chromStart`: The starting position of the feature in the chromosome.
#'      The first base in a chromosome is numbered 0.
#'   - `chromEnd`: The ending position of the feature in the chromosome.
#'      The `chromEnd` base is _not_ included in the feature. For example,
#'      the first 100 bases of a chromosome are defined as `chromStart=0`,
#'      `chromEnd=100`, and span the bases numbered 0-99.
#'   - `name`: Defines the name of the BED line. This label is displayed to
#'      the left of the BED line in the
#'      [UCSC Genome Browser](https://genome.ucsc.edu/cgi-bin/hgGateway)
#'      window when the track is open to full display mode or directly to
#'      the left of the item in pack mode.
#' 
#' If argument `outFile` is non-`NULL`, then the three elements (`binSignal`,
#' `domain`, and `bed`) returned are also written to tab-delimited files
#' with file names \file{<outFile>.binSignal}, \file{<outFile>.domain}, and
#' \file{<outFile>.bed}, respectively.  None of the files have row names,
#' and all but the BED file have column names.
#'
#' @section Windows size:
#' The `window.size` parameter is by design the only tuning parameter in the
#' TopDom method and affects the amount of smoothing applied when calculating
#' the TopDom bin signals.  The binning window extends symmetrically downstream
#' and upstream from the bin such that the bin signal is the average
#' `window.size^2` contact frequencies.
#' For details, see Equation (1) and Figure 1 in Shin et al. (2016).
#' Typically, the number of identified TDs decreases while their average
#' lengths increase as this window-size parameter increases (Figure 2).
#' The default is `window.size = 5` (bins), which is motivated as:
#' "Considering the previously reported minimum TD size (approx. 200 kb)
#' (Dixon et al., 2012) and our bin size of 40 kb, _w_\[indow.size\] = 5 is a
#' reasonable setting" (Shin et al., 2016).
#'
#' @example incl/TopDom.R
#' 
#' @author Hanjun Shin, Harris Lazaris, and Gangqing Hu.
#' \R package, help, and code refactoring by Henrik Bengtsson.
#'
#' @references
#'
#' * Shin et al.,
#'   TopDom: an efficient and deterministic method for identifying
#'   topological domains in genomes,
#'   _Nucleic Acids Research_, 44(7): e70, April 2016.
#'   DOI: 10.1093/nar/gkv1505,
#'   PMCID: [PMC4838359](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4838359/),
#'   PMID: [26704975](https://pubmed.ncbi.nlm.nih.gov/26704975/)
#'
#' * Shin et al., \R script \file{TopDom_v0.0.2.R}, 2017 (originally from
#'   \code{http://zhoulab.usc.edu/TopDom/};
#'   later available on \url{https://github.com/jasminezhoulab/TopDom} via
#'   \url{https://zhoulab.dgsom.ucla.edu/pages/software})
#'
#' * Shin et al., TopDom Manual, 2016-07-08 (original from
#'   \code{http://zhoulab.usc.edu/TopDom/TopDom\%20Manual_v0.0.2.pdf};
#'   later available on \url{https://github.com/jasminezhoulab/TopDom} via
#'   \url{https://zhoulab.dgsom.ucla.edu/pages/software})
#'
#' * Hanjun Shin, Understanding the 3D genome organization in topological
#'   domain level, Doctor of Philosophy Dissertation,
#'   University of Southern California, March 2017,
#'   \url{https://digitallibrary.usc.edu/cdm/ref/collection/p15799coll40/id/347735}
#'
#' * Dixon JR, Selvaraj S, Yue F, Kim A, et al. Topological domains in
#'   mammalian genomes identified by analysis of chromatin interactions.
#'   _Nature_; 485(7398):376-80, April 2012.
#'   DOI: 10.1038/nature11082,
#'   PMCID: [PMC3356448](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3356448/),
#'   PMID: 22495300.
#'
#' @importFrom utils read.table write.table
#' @export
TopDom <- function(data, window.size, outFile = NULL, statFilter = TRUE, ..., debug = getOption("TopDom.debug", FALSE)) {
  window.size <- as.integer(window.size)
  stopifnot(is.numeric(window.size),
            length(window.size) == 1,
            !is.na(window.size),
            window.size >= 0)
  stopifnot(is.logical(debug), length(debug) == 1L, !is.na(debug))

  if (is.character(data)) data <- readHiC(data, ...)
  stopifnot(inherits(data, "TopDomData"))

  bins <- data$bins
  matrix.data <- data$counts
  n_bins <- nrow(bins)

  mean.cf <- rep(0, times = n_bins)
  pvalue <- rep(1.0, times = n_bins)

  ## Gap region (== -0.5) by default
  local.ext <- rep(-0.5, times = n_bins)  ## gap region

  if (debug) {
    mcat("#########################################################################")
    mcat("Step 1 : Generating binSignals by computing bin-level contact frequencies")
    mcat("#########################################################################")
  }
  ptm <- proc.time()
  for (i in seq_len(n_bins)) {
    diamond <- Get.Diamond.Matrix(mat.data = matrix.data, i = i, size = window.size)
    mean.cf[i] <- mean(diamond)
  }
  stop_if_not(length(mean.cf) == n_bins)

  eltm <- proc.time() - ptm
  if (debug) {
    mcat("Step 1 Running Time : ", eltm[3])
    mcat("Step 1 : Done!")

    mcat("#########################################################################")
    mcat("Step 2 : Detect TD boundaries based on binSignals")
    mcat("#########################################################################")
  }

  ptm <- proc.time()
  # gap.idx <- Which.Gap.Region(matrix.data = matrix.data)
  # gap.idx <- Which.Gap.Region2(mean.cf)
  gap.idx <- Which.Gap.Region2(matrix.data = matrix.data, w = window.size)

  proc.regions <- Which.process.region(rmv.idx = gap.idx, n_bins = n_bins, min.size = 3L)
  stop_if_not(!anyNA(proc.regions[["start"]]), all(proc.regions[["start"]] >= 1), all(proc.regions[["start"]] <= n_bins))
  stop_if_not(!anyNA(proc.regions[["end"]]), all(proc.regions[["end"]] >= 1), all(proc.regions[["end"]] <= n_bins))

  # mcat(proc.regions)

  for (i in seq_len(nrow(proc.regions))) {
    start <- proc.regions[i, "start"]
    end   <- proc.regions[i,   "end"]
    if (debug) mcat("Process Region #", i, " from ", start, " to ", end)
    idxs <- start:end
    local.ext[idxs] <- Detect.Local.Extreme(x = mean.cf[idxs])  ## assigns values (-1,0,+1)
  }
  stop_if_not(!anyNA(local.ext), length(local.ext) == n_bins, all(local.ext %in% c(-0.5, -1, 0, +1)))
  rm(list = "idxs")

  eltm <- proc.time() - ptm
  if (debug) {
    mcat("Step 2 Running Time : ", eltm[3])
    mcat("Step 2 : Done!")
  }
  
  if (statFilter) {
    if (debug) {
      mcat("#########################################################################")
      mcat("Step 3 : Statistical Filtering of false positive TD boundaries")
      mcat("#########################################################################")
    }
    
    ptm <- proc.time()
    if (debug) mcat("-- Matrix Scaling....")
    scale.matrix.data <- matrix.data
    for (i in seq_len(2 * window.size)) {
      # diag(scale.matrix.data[, i:n_bins]) <- scale(diag(matrix.data[, i:n_bins]))
      idxs <- seq(from = 1 + (n_bins * i), to = n_bins * n_bins, by = 1 + n_bins)
      scale.matrix.data[idxs] <- scale(matrix.data[idxs])
    }
    rm(list = c("idxs", "matrix.data"))
    
    if (debug) mcat("-- Compute p-values by Wilcox Ranksum Test")
    for (i in seq_len(nrow(proc.regions))) {
      start <- proc.regions[i, "start"]
      end <- proc.regions[i, "end"]

      if (debug) mcat("Process Region #", i, " from ", start, " to ", end)

      idxs <- start:end
      pvalue[idxs] <- Get.Pvalue(matrix.data = scale.matrix.data[idxs, idxs], size = window.size, scale = 1.0)
    }
    rm(list = "idxs")
    stop_if_not(length(pvalue) == n_bins, !anyNA(pvalue))
    if (debug) mcat("-- Done!")

    if (debug) mcat("-- Filtering False Positives")
    ## NOTE: The below duplication is left on purpose until we fully
    ##       understand why it is there in the first place, cf.
    ##       https://github.com/HenrikBengtsson/TopDom/issues/3
#    local.ext[((local.ext == -1.0) | (local.ext == -1.0)) & (pvalue < 0.05)] <- -2.0
#    local.ext[local.ext == -1.0] <-  0.0  ## general bin
#    local.ext[local.ext == -2.0] <- -1.0  ## local minima
    ## "Finally, we filter out local minima with P-values larger than 0.05. [...]"
    ##  (Page 4 in Shin et al. 2016)
    local.ext[local.ext == -1.0 & pvalue >= 0.05] <- 0.0  ## drop non-significant local minima
    
    stop_if_not(!anyNA(local.ext), length(local.ext) == n_bins, all(local.ext %in% c(-0.5, -1, 0, +1)))
    
    if (debug) mcat("-- Done!")

    eltm <- proc.time() - ptm
    if (debug) mcat("Step 3 Running Time : ", eltm[3])
    if (debug) mcat("Step 3 : Done!")
  } else {
    rm(list = "matrix.data")
    pvalue <- 0
  }

  if (debug) {
    mcat("#########################################################################")
    mcat("Step 4 : Convert bins to domains (internal step)")
    mcat("#########################################################################")
  }
  
  domains <- Convert.Bin.To.Domain.TMP(
    bins = bins,
    signal.idx = which(local.ext == -1.0),  ## local minima
    gap.idx = which(local.ext == -0.5),     ## gap region
    pvalues = pvalue,
    pvalue.cut = 0.05
  )

  bins <- cbind(
    bins,
    local.ext = local.ext,
    mean.cf = mean.cf,
    pvalue = pvalue
  )

  bedform <- domains[, c("chr", "from.coord", "to.coord", "tag")]
  colnames(bedform) <- c("chrom", "chromStart", "chromEnd", "name")

  if (!is.null(outFile)) {
    if (debug) {
      mcat("#########################################################################")
      mcat("Writing Files")
      mcat("#########################################################################")
    }

    outBinSignal <- paste0(outFile, ".binSignal")
    if (debug) mcat("binSignal File : ", outBinSignal)
    write.table(bins, file = outBinSignal, quote = FALSE, row.names = FALSE, col.names = TRUE, sep = "\t")

    outDomain <- paste0(outFile, ".domain")
    if (debug) mcat("Domain File : ", outDomain)
    write.table(domains, file = outDomain, quote = FALSE, row.names = FALSE, col.names = TRUE, sep = "\t")

    outBed <- paste0(outFile, ".bed")
    if (debug) mcat("Bed File : ", outBed)
    write.table(bedform, file = outBed, quote = FALSE, row.names = FALSE, col.names = FALSE, sep = "\t")
  }

  if (debug) mcat("Done!")

  if (debug) mcat("Job Complete!")
  
  res <- structure(
    list(binSignal = bins, domain = domains, bed = bedform),
    class = "TopDom"
  )
  attr(res, "parameters") <- list(
    window.size = window.size,
    statFilter = statFilter
  )
  
  res
}

# @fn Get.Diamond.Matrix
#
# @param mat.data N-by-N numeric matrix, where each element indicate contact frequency
#
# @param i (integer) a bin index
#
# @param size (integer) the window size to expand from bin `i`
#
# @return A subset of the `mat.data` matrix.  If `i` == N, then a missing value is returned.
Get.Diamond.Matrix <- function(mat.data, i, size) {
  n_bins <- nrow(mat.data)
  if (i == n_bins) {
    na <- NA_real_
    storage.mode(na) <- storage.mode(mat.data)
    return(na)
  }

  lowerbound <- max(1, i - size + 1)
  upperbound <- min(i + size, n_bins)

  mat.data[lowerbound:i, (i + 1):upperbound]
}

# @fn Which.process.region
# @param rmv.idx : vector of idx, remove index vector
# @param n_bins : total number of bins
# @param min.size : (integer) minimum size of bins
# @retrun : data.frame of proc.regions
Which.process.region <- function(rmv.idx, n_bins, min.size = 3L) {
  gap.idx <- rmv.idx

  proc.regions <- data.frame(start = numeric(0), end = numeric(0), stringsAsFactors = FALSE)
  proc.set <- setdiff(seq_len(n_bins), gap.idx)
  n_proc.set <- length(proc.set)

  i <- 1
  while (i < n_proc.set) {
    start <- proc.set[i]
    j <- i + 1

    while (j <= n_proc.set) {
      if (proc.set[j] - proc.set[j - 1] <= 1) {
        j <- j + 1
      } else {
        proc.regions <- rbind(proc.regions, c(start = start, end = proc.set[j - 1]))
        i <- j
        break
      }
    }

    if (j >= n_proc.set) {
      proc.regions <- rbind(proc.regions, c(start = start, end = proc.set[j - 1]))
      break
    }
  }

  colnames(proc.regions) <- c("start", "end")
  proc.regions <- proc.regions[abs(proc.regions[, "end"] - proc.regions[, "start"]) >= min.size, ]

  proc.regions
}

# @fn Which.Gap.Region
# @breif version 0.0.1 used
# @param matrix.data : n by n matrix
# @return gap index
Which.Gap.Region <- function(matrix.data) {
  n_bins <- nrow(matrix.data)
  gap <- rep(0, times = n_bins)

  i <- 1
  while (i < n_bins) {
    j <- i + 1
    while (j <= n_bins) {
      idxs <- i:j
      if (sum(matrix.data[idxs, idxs]) == 0) {
        gap[idxs] <- -0.5
        j <- j + 1
        # if (j-i > 1) gap[idxs] <- -0.5
        # j <- j+1
      } else {
        break
      }
    }

    i <- j
  }

  idx <- which(gap == -0.5)
  idx
}

# @fn Which.Gap.Region3
# @param matrix.data : n by n matrix
# @return gap index
Which.Gap.Region3 <- function(mean.cf) {
  n_bins <- length(mean.cf)
  gapidx <- which(mean.cf == 0)

  gapidx
}

# @fn Which.Gap.Region2
# @breif version 0.0.2 used
# @param matrix.data : n by n matrix
# @return gap index
Which.Gap.Region2 <- function(matrix.data, w) {
  n_bins <- nrow(matrix.data)
  gap <- rep(0, times = n_bins)

  for (i in seq_len(n_bins)) {
    if (sum(matrix.data[i, max(1, i - w):min(i + w, n_bins)]) == 0) gap[i] <- -0.5
  }

  idx <- which(gap == -0.5)
  idx
}

# @fn Detect.Local.Extreme
# @param x : original signal to find local minima
# @return vector of local extremes, -1 if the index is local minimum, +1 if the index is local maxima, 0 otherwise.
Detect.Local.Extreme <- function(x) {
  n_bins <- length(x)
  ret <- rep(0, times = n_bins)  ## general bin (default)
  x[is.na(x)] <- 0               ## general bin

  if (n_bins <= 3) {
    ret[which.min(x)] <- -1      ## local minima
    ret[which.max(x)] <- +1      ## local maxima
    return(ret)
  }
  
  # Norm##################################################3
  new.point <- Data.Norm(x = seq_len(n_bins), y = x)
  x <- new.point$y
  ##################################################
  cp <- Change.Point(x = seq_len(n_bins), y = x)
  cp <- cp$cp

  ncp <- length(cp)
  if (ncp <= 2) return(ret)
  if (ncp == n_bins) return(ret)
  
  for (i in 2:(ncp-1)) {
    cp_b <- cp[i]
    cp_c <- cp[i-1]

    x_a <- x[cp_b-1]
    x_b <- x[cp_b]
    x_c <- x[cp_b+1]
    
    if (x_b >= x_a && x_b >= x_c) {
      ret[cp_b] <- +1  ## local maxima
    } else if (x_b < x_a && x_b < x_c) {
      ret[cp_b] <- -1  ## local minima
    }

    min.val <- min(x[cp_c], x_b)
    max.val <- max(x[cp_c], x_b)

    x_r <- x[cp_c:cp_b]
    if (min(x_r) < min.val) ret[cp_c-1 + which.min(x_r)] <- -1  ## local minima
    if (max(x_r) > max.val) ret[cp_c-1 + which.max(x_r)] <- +1  ## local maxima
  }

  ret
}

# @fn Data.Norm
# @param x : x axis vector
# @param x : y axis vector
# @return list of normalized x and y
Data.Norm <- function(x, y) {
  ret.x <- rep(0, times = length(x))
  ret.y <- rep(0, times = length(y))

  ret.x[1] <- x[1]
  ret.y[1] <- y[1]

  diff.x <- diff(x)
  diff.y <- diff(y)

  scale.x <- 1 / mean(abs(diff(x)))
  scale.y <- 1 / mean(abs(diff(y)))

  # mcat(scale.x)
  # mcat(scale.y)

  diff.x <- diff.x * scale.x
  diff.y <- diff.y * scale.y
  
  for (i in 2:length(x)) {
    ret.x[i] <- ret.x[i - 1] + diff.x[i - 1]
    ret.y[i] <- ret.y[i - 1] + diff.y[i - 1]
  }

  list(x = ret.x, y = ret.y)
}

# @fn Change.Point
# @param x : x axis vector
# @param x : y axis vector
# @return change point index in x vector,
# Note that the first and the last point will be always change point
Change.Point <- function(x, y) {
  if (length(x) != length(y)) {
    mcat("ERROR : The length of x and y should be the same")
    return(0)
  }

  n_bins <- length(x)
  Fv <- rep(NA_real_, times = n_bins)
  Ev <- rep(NA_real_, times = n_bins)
  cp <- 1

  i <- 1
  Fv[1] <- 0
  while (i < n_bins) {
    j <- i + 1
    Fv[j] <- sqrt((x[j] - x[i]) ^ 2 + (y[j] - y[i]) ^ 2)

    while (j < n_bins) {
      j <- j + 1
      k <- (i + 1):(j - 1)
      dx_ji <- x[j] - x[i]
      dy_ji <- y[j] - y[i]
      A <- sqrt(dx_ji^2 + dy_ji^2)
      Ev_j <- sum(abs(dy_ji*x[k] - dx_ji*y[k] - x[i]*y[j] + x[j]*y[i])) / A
      Ev[j] <-     Ev_j
      Fv[j] <- A - Ev_j

      #################################################
      # Not Original Code
      if (is.na(Fv[j]) || is.na(Fv[j - 1])) {
        j <- j - 1
        cp <- c(cp, j)
        break
      }
      #################################################### 3
      if (Fv[j] < Fv[j - 1]) {
        j <- j - 1
        cp <- c(cp, j)
        break
      }
    }
    i <- j
  }

  cp <- c(cp, n_bins)

  list(cp = cp, objF = Fv, errF = Ev)
}

# @fn Get.Pvalue
# @param matrix.data : matrix
# @param size : size to extend
# @param scale : scale parameter if necessary. deprecated parameter
# @return computed p-value vector
#' @importFrom stats wilcox.test
Get.Pvalue <- function(matrix.data, size, scale = 1.0) {
  n_bins <- nrow(matrix.data)
  pvalue <- rep(1, times = n_bins)

  for (i in seq_len(n_bins - 1)) {
    dia <- as.vector(Get.Diamond.Matrix2(matrix.data, i = i, size = size))
    ups <- as.vector(Get.Upstream.Triangle(matrix.data, i = i, size = size))
    downs <- as.vector(Get.Downstream.Triangle(matrix.data, i = i, size = size))

    wil.test <- wilcox.test(x = dia * scale, y = c(ups, downs), alternative = "less", exact = FALSE)
    pvalue[i] <- wil.test$p.value

    # mcat(i, " = ", wil.test$p.value)
  }

  pvalue[is.na(pvalue)] <- 1
  pvalue
}

# @fn Get.Upstream.Triangle
# @param mat.data : matrix data
# @param i : bin index
# @param size : size of window to extend
# @return upstream triangle matrix
Get.Upstream.Triangle <- function(mat.data, i, size) {
  n_bins <- nrow(mat.data)

  lower <- max(1, i - size)
  idxs <- lower:i
  tmp.mat <- mat.data[idxs, idxs]
  tmp.mat[upper.tri(tmp.mat, diag = FALSE)]
}

# @fn Get.Downstream.Triangle
# @param mat.data : matrix data
# @param i : bin index
# @param size : size of window to extend
# @return downstream triangle matrix
Get.Downstream.Triangle <- function(mat.data, i, size) {
  n_bins <- nrow(mat.data)
  if (i == n_bins) {
    na <- NA_real_
    storage.mode(na) <- storage.mode(mat.data)
    return(na)
  }

  upperbound <- min(i + size, n_bins)
  idxs <- (i + 1):upperbound
  tmp.mat <- mat.data[idxs, idxs]
  tmp.mat[upper.tri(tmp.mat, diag = FALSE)]
}

# @fn Get.Diamond.Matrix2
# @param mat.data : matrix data
# @param i : bin index
# @param size : size of window to extend
# @return diamond matrix
Get.Diamond.Matrix2 <- function(mat.data, i, size) {
  n_bins <- nrow(mat.data)
  na <- NA_real_
  storage.mode(na) <- storage.mode(mat.data)
  new.mat <- matrix(rep(na, times = size * size), nrow = size, ncol = size)

  for (k in seq_len(size)) {
    if (i - (k - 1) >= 1 && i < n_bins) {
      lower <- min(i + 1, n_bins)
      upper <- min(i + size, n_bins)

      new.mat[size - (k - 1), seq_len(upper - lower + 1)] <- mat.data[i - (k - 1), lower:upper]
    }
  }

  new.mat
}

# @fn Convert.Bin.To.Domain
# @param bins : bin information
# @param signal.idx : signal index
# @param signal.idx : gap index
# @param pvalues : pvalue vector
# @param pvalue.cut : pvalue threshold
# @return dataframe storing domain information
Convert.Bin.To.Domain <- function(bins, signal.idx, gap.idx, pvalues = NULL, pvalue.cut = NULL) {
  n_bins <- nrow(bins)
  ret <- data.frame(chr = character(0), from.id = numeric(0), from.coord = numeric(0), to.id = numeric(0), to.coord = numeric(0), tag = character(0), size = numeric(0), stringsAsFactors = FALSE)
  levels(x = ret[, "tag"]) <- c("domain", "gap", "boundary")

  rmv.idx <- setdiff(seq_len(n_bins), gap.idx)
  proc.region <- Which.process.region(rmv.idx, n_bins = n_bins, min.size = 0L)
  from.coord <- bins[proc.region[, "start"], "from.coord"]
  n_procs <- nrow(proc.region)
  zeros <- double(length = n_procs)
  gap <- data.frame(
    chr = rep(bins[1, "chr"], times = n_procs),
    from.id = zeros,
    from.coord = from.coord,
    to.id = zeros,
    to.coord = zeros,
    tag = rep("gap", times = n_procs),
    size = zeros,
    stringsAsFactors = FALSE
  )

  rmv.idx <- union(signal.idx, gap.idx)
  proc.region <- Which.process.region(rmv.idx, n_bins = n_bins, min.size = 0L)
  n_procs <- nrow(proc.region)
  from.coord <- bins[proc.region[, "start"], "from.coord"]
  zeros <- double(length = n_procs)
  domain <- data.frame(
    chr = rep(bins[1, "chr"], times = n_procs),
    from.id = zeros,
    from.coord = from.coord,
    to.id = zeros,
    to.coord = zeros,
    tag = rep("domain", times = n_procs),
    size = zeros,
    stringsAsFactors = FALSE
  )

  rmv.idx <- setdiff(seq_len(n_bins), signal.idx)
  proc.region <- as.data.frame(Which.process.region(rmv.idx, n_bins = n_bins, min.size = 1L))
  n_procs <- nrow(proc.region)
  if (n_procs > 0) {
    from.coord <- bins[proc.region[, "start"] + 1, "from.coord"]
    zeros <- double(length = n_procs)
    boundary <- data.frame(
      chr = rep(bins[1, "chr"], times = n_procs),
      from.id = zeros,
      from.coord = from.coord,
      to.id = zeros,
      to.coord = zeros,
      tag = rep("boundary", times = n_procs),
      size = zeros,
      stringsAsFactors = FALSE
    )
    ret <- rbind(ret, boundary)
  }

  ret <- rbind(gap, domain)
  ret <- ret[order(ret[, 3]), ]

  ret[, "to.coord"] <- c(ret[2:nrow(ret), "from.coord"], bins[n_bins, "to.coord"])
  ret[, "from.id"] <- match(ret[, "from.coord"], table = bins[, "from.coord"])
  ret[, "to.id"] <- match(ret[, "to.coord"], table = bins[, "to.coord"])
  ret[, "size"] <- ret[, "to.coord"] - ret[, "from.coord"]

  if (!is.null(pvalues) && !is.null(pvalue.cut)) {
    for (i in seq_len(nrow(ret))) {
      if (ret[i, "tag"] == "domain") {
        domain.bins.idx <- ret[i, "from.id"]:ret[i, "to.id"]
        p.value.constr <- which(pvalues[domain.bins.idx] < pvalue.cut)

        if (length(domain.bins.idx) == length(p.value.constr)) ret[i, "tag"] <- "boundary"
      }
    }
  }

  new.bdr.set <- stack.bdr <- stack.bdr.empty <- data.frame(
    chr = character(0),
    from.id = numeric(0),
    from.coord = numeric(0),
    to.id = numeric(0),
    to.coord = numeric(0),
    tag = character(0),
    size = numeric(0),
    stringsAsFactors = FALSE
  )

  i <- 1L
  while (i <= nrow(ret)) {
    if (ret[i, "tag"] == "boundary") {
      stack.bdr <- rbind(stack.bdr, ret[i, ])
    } else if (nrow(stack.bdr) > 0) {
      new.bdr <- data.frame(
        chr        = bins[1, "chr"],
        from.id    = min(stack.bdr[, "from.id"]),
        from.coord = min(stack.bdr[, "from.coord"]),
        to.id      = max(stack.bdr[, "to.id"]),
        to.coord   = max(stack.bdr[, "to.coord"]),
        tag        = "boundary",
        size       = max(stack.bdr[, "to.coord"]) - min(stack.bdr[, "from.coord"]),
        stringsAsFactors = FALSE
      )
      new.bdr.set <- rbind(new.bdr.set, new.bdr)
      stack.bdr <- stack.bdr.empty
    }

    i <- i + 1L
  }
  rm(list = c("stack.bdr", "stack.bdr.empty"))
  

  ret <- rbind(ret[ret[, "tag"] != "boundary", ], new.bdr.set)
  ret <- ret[order(ret[, "to.coord"]), ]

  ret
}


# @fn Convert.Bin.To.Domain
# @param bins : bin information
# @param signal.idx : signal index
# @param signal.idx : gap index
# @param pvalues : pvalue vector
# @param pvalue.cut : pvalue threshold
# @return dataframe storing domain information
Convert.Bin.To.Domain.TMP <- function(bins, signal.idx, gap.idx, pvalues = NULL, pvalue.cut = NULL) {
  n_bins <- nrow(bins)
  ret <- data.frame(chr = character(0), from.id = numeric(0), from.coord = numeric(0), to.id = numeric(0), to.coord = numeric(0), tag = character(0), size = numeric(0), stringsAsFactors = FALSE)
  levels(x = ret[, "tag"]) <- c("domain", "gap", "boundary")

  rmv.idx <- setdiff(seq_len(n_bins), gap.idx)
  proc.region <- Which.process.region(rmv.idx, n_bins = n_bins, min.size = 0L)
  from.coord <- bins[proc.region[, "start"], "from.coord"]
  n_procs <- nrow(proc.region)
  zeros <- double(length = n_procs)
  gap <- data.frame(chr = rep(bins[1, "chr"], times = n_procs), from.id = zeros, from.coord = from.coord, to.id = zeros, to.coord = zeros, tag = rep("gap", times = n_procs), size = zeros, stringsAsFactors = FALSE)

  rmv.idx <- union(signal.idx, gap.idx)
  proc.region <- Which.process.region(rmv.idx, n_bins = n_bins, min.size = 0L)
  n_procs <- nrow(proc.region)
  from.coord <- bins[proc.region[, "start"], "from.coord"]
  zeros <- double(length = n_procs)
  domain <- data.frame(chr = rep(bins[1, "chr"], times = n_procs), from.id = zeros, from.coord = from.coord, to.id = zeros, to.coord = zeros, tag = rep("domain", times = n_procs), size = zeros, stringsAsFactors = FALSE)

  rmv.idx <- setdiff(seq_len(n_bins), signal.idx)
  proc.region <- as.data.frame(Which.process.region(rmv.idx, n_bins = n_bins, min.size = 1L))
  n_procs <- nrow(proc.region)
  if (n_procs > 0) {
    from.coord <- bins[proc.region[, "start"] + 1, "from.coord"]
    zeros <- double(length = n_procs)
    boundary <- data.frame(chr = rep(bins[1, "chr"], times = n_procs), from.id = zeros, from.coord = from.coord, to.id = zeros, to.coord = zeros, tag = rep("boundary", times = n_procs), size = zeros, stringsAsFactors = FALSE)
    ret <- rbind(ret, boundary)
  }

  if (nrow(domain) == 0L) {
    ret <- gap
  } else {
    ret <- rbind(gap, domain)
    ret <- ret[order(ret[, 3]), ]

    ## FIXME: Below code assumes nrow(ret) >= 2
    ret[, "to.coord"] <- c(ret[2:nrow(ret), "from.coord"], bins[n_bins, "to.coord"])
    ret[, "from.id"] <- match(ret[, "from.coord"], table = bins[, "from.coord"])
    ret[, "to.id"] <- match(ret[, "to.coord"], table = bins[, "to.coord"])
    ret[, "size"] <- ret[, "to.coord"] - ret[, "from.coord"]

    if (!is.null(pvalues) && !is.null(pvalue.cut)) {
      for (i in seq_len(nrow(ret))) {
        if (ret[i, "tag"] == "domain") {
          domain.bins.idx <- ret[i, "from.id"]:ret[i, "to.id"]
          p.value.constr <- which(pvalues[domain.bins.idx] < pvalue.cut)
  
          if (length(domain.bins.idx) == length(p.value.constr)) ret[i, "tag"] <- "boundary"
        }
      }
    }
  }

  ret
}



#' @importFrom utils str
#' @export
print.TopDom <- function(x, ...) {
  cat(sprintf("%s:\n", class(x)))
  cat("Parameters:\n")
  params <- attr(x, "parameters")
  if (length(params) > 0L) {
    cat(sprintf("- window.size: %d\n", params$window.size))
    cat(sprintf("- statFilter: %s\n", params$statFilter))
  } else {
    cat(" - N/A\n")
  }
  cat("binSignal:\n")
  str(x$binSignal)
  cat("domain:\n")
  str(x$domain)
  cat("bed:\n")
  str(x$bed)
}

#' @export
dim.TopDom <- function(x) {
  dim(x$domain)
}

#' @export
`[.TopDom` <- function(x, i, ...) {
  structure(list(
    binSignal = x$binSignal,
    domain    = x$domain[i, , drop = FALSE],
    bed       = x$bed[i, , drop = FALSE]
  ), class = "TopDom")
}
HenrikBengtsson/TopDom documentation built on April 9, 2023, 2:11 a.m.
rdrr.io home R language documentation Run R code online
CRAN packages Bioconductor packages R-Forge packages GitHub packages
Note that we can't provide technical support on individual packages. You should contact the package authors for that.
HenrikBengtsson/TopDom
An Efficient and Deterministic Method for Identifying Topological Domains in Genomes

R/TopDom.R
In HenrikBengtsson/TopDom: An Efficient and Deterministic Method for Identifying Topological Domains in Genomes

Documented in TopDom

R Package Documentation

Browse R Packages

We want your feedback!

HenrikBengtsson/TopDom An Efficient and Deterministic Method for Identifying Topological Domains in Genomes

R/TopDom.R In HenrikBengtsson/TopDom: An Efficient and Deterministic Method for Identifying Topological Domains in Genomes

Documented in TopDom

R Package Documentation

Browse R Packages

We want your feedback!

HenrikBengtsson/TopDom
An Efficient and Deterministic Method for Identifying Topological Domains in Genomes

R/TopDom.R
In HenrikBengtsson/TopDom: An Efficient and Deterministic Method for Identifying Topological Domains in Genomes