R/dmrseq.R
In dmrseq: Detection and inference of differentially methylated regions from Whole Genome Bisulfite Sequencing

Documented in dmrseq

#' Main function for detecting and evaluating significance of DMRs.
#' 
#' Performs a two-step approach that (1) detects candidate regions, and
#' (2) scores candidate regions with an exchangeable (across the genome)
#' statistic and evaluates statistical significance using a 
#' permuation test on the pooled null distribution of scores.
#' 
#' @param bs bsseq object containing the methylation values as well as the 
#'   phenotype matrix that contains sample level covariates
#' @param testCovariate Character value indicating which variable
#'  (column name) in \code{pData(bs)} to test
#'  for association of methylation levels. 
#'  Can alternatively specify an integer value indicating
#'  which of column of
#'  \code{pData(bs)} to use. This is used to construct the 
#'  design matrix for the test statistic calculation. To run using a 
#'  continuous or categorial covariate with more than two groups, simply pass in
#'  the name of a column in `pData` that contains this covariate. A continuous
#'  covariate is assmued if the data type in the `testCovariate` slot is 
#'  continuous, with the exception of if there are only two unique values 
#'  (then a two group comparison is carried out).
#' @param adjustCovariate an (optional) character value or vector 
#' indicating which variables (column names) in \code{pData(bs)} 
#' will be adjusted for when 
#'  testing for the association of methylation value with the 
#'  \code{testCovariate}. 
#' Can alternatively specify an
#' integer value or vector indicating
#'  which of the columns of \code{pData(bs)} to adjust for.
#'  If not NULL (default), then this is also used to 
#'  construct the design matrix for the test statistic calculation.
#' @param matchCovariate An (optional) character value 
#'  indicating which variable (column name) of \code{pData(bs)} 
#'  will be blocked for when 
#'  constructing the permutations in order to
#'  test for the association of methylation value with the 
#'  \code{testCovariate}, only to be used when \code{testCovariate}
#'  is a two-group factor and the number of permutations possible is less
#'  than 500000.
#'  Alternatively, you can specify an integer value indicating
#'  which column of \code{pData(bs)} to block for.
#'  Blocking means that only permutations with balanced
#'  composition of \code{testCovariate} values will be used (for example if
#'  you have samples from different gender and this is not your covariate of
#'  interest, 
#'  it is recommended to use gender as a matching covariate to avoid one 
#'  of the permutations testing entirely males versus females; this violates
#'  the null hypothesis and will decrease power).
#'  If not NULL (default), then no blocking is performed.
#' @param minInSpan positive integer that represents the minimum number of
#'    CpGs in a smoothing span window if \code{smooth} is TRUE.  
#'    Default value is 30.
#' @param minNumRegion positive integer that represents the minimum number of
#'    CpGs to consider for a candidate region. Default value is 5. 
#'    Minimum value is 3.
#' @param cutoff scalar value that represents the absolute value (or a vector 
#'    of two numbers representing a lower and upper bound) for the cutoff of 
#'    the single CpG coefficient that is used to discover 
#'    candidate regions. Default value is 0.10.
#' @param smooth logical value that indicates whether or not to smooth the 
#'    CpG level signal when discovering candidate regions.
#'    Defaults to TRUE.
#' @param bpSpan a positive integer that represents the length in basepairs
#'    of the smoothing span window if \code{smooth} is TRUE.  Default value is 
#'    1000.
#' @param verbose logical value that indicates whether progress messages
#'    should be printed to stdout. Defaults value is TRUE.
#' @param BPPARAM a \code{BiocParallelParam} object to specify the parallel 
#'    backend. The default 
#'    option is \code{BiocParallel::bpparam()} which will automatically creates
#'    a cluster appropriate for the operating system. 
#' @param maxPerms a positive integer that represents the maximum number 
#'    of permutations that will be used to generate the global null 
#'    distribution of test statistics.  Default value is 10.
#' @param maxGap integer value representing maximum number of basepairs in 
#' between neighboring CpGs to be included in the same DMR. 
#' @param maxGapSmooth integer value representing maximum number of basepairs  
#' in between neighboring CpGs to be included in the same 
#' cluster when performing smoothing (should generally be larger than
#' \code{maxGap})
#' @param stat a character vector indicating the name of the column of the 
#'   output to use as the region-level test statistic. Default value is 'stat'
#'   which is the region level-statistic designed to be comparable across the
#'   genome.
#'   It is not recommended to change this argument, but it can be done for
#'   experimental purposes. Possible values are: 'L' - the number of loci
#'   in the region, 'area' - the sum of the smoothed loci statistics,
#'   'beta' - the effect size of the region, 'stat' - the test statistic for
#'   the region, or 'avg' - the average smoothed loci statistic.
#' @param block logical indicating whether to search for large-scale (low
#'  resolution) blocks of differential methylation (default is FALSE, which
#'  means that local DMRs are desired). If TRUE, the parameters for 
#'  \code{bpSpan}, \code{minInSpan}, and \code{maxGapSmooth} should be adjusted
#'  (increased) accordingly. This setting will also merge
#'  candidate regions that (1) are in the same direction and (2) are less than 
#'  1kb apart with no covered CpGs separating them. The region-level model used 
#'  is also slightly modified - instead of a loci-specific intercept for each 
#'  CpG in theregion, the intercept term is modeled as a natural spline with 
#'  one interior knot per each 10kb of length (up to 10 interior knots).
#' @param blockSize numeric value indicating the minimum number of basepairs 
#'  to be considered a block (only used if \code{block}=TRUE). Default is 
#'  5000 basepairs.
#' @param chrsPerChunk a positive integer value indicating the number of 
#'  chromosomes per chunk. The default is 1, meaning that the data will be 
#'  looped through one chromosome at a time. When pairing up multiple 
#'  chromosomes per chunk, sizes (in terms of numbers of CpGs) will be taken
#'  into consideration to balance the sizes of each chunk.
#' @return a \code{GRanges} object that contains the results of the inference. 
#'    The object contains one row for each candidate region, sorted by q-value
#'    and then chromosome. The standard 
#'    \code{GRanges} chr, start, and end are included, along with at least
#'    7 metadata
#'    columns, in the following order: 
#'    1. L = the number of CpGs contained in the region,
#'    2. area = the sum of the smoothed beta values
#'    3. beta = the coefficient value for the condition difference (there 
#'       will be more than one column here if a multi-group comparison
#'       was performed),
#'    4. stat = the test statistic for the condition difference,
#'    5. pval = the permutation p-value for the significance of the test
#'    statistic, and 
#'    6. qval = the q-value for the test statistic (adjustment
#'    for multiple comparisons to control false discovery rate).
#'    7. index = an \code{IRanges} containing the indices of the region's 
#'       first CpG to last CpG.
#'
#' @keywords inference
#' @importFrom outliers grubbs.test
#' @importFrom bumphunter clusterMaker getSegments
#' @importFrom DelayedMatrixStats colMedians rowMads rowSums2 rowMeans2 rowDiffs
#' @importFrom matrixStats rowRanges
#' @importFrom stats formula anova as.formula
#' 
#' @importClassesFrom bsseq BSseq 
#' @importMethodsFrom bsseq pData seqnames sampleNames start width 
#' 
#' @importFrom grDevices col2rgb colorRampPalette dev.off pdf rgb
#' @importFrom graphics axis layout legend lines mtext par 
#' plot points rect rug text
#' @importFrom methods is
#' @importFrom stats approxfun lm loess median model.matrix p.adjust
#' predict preplot qt quantile rbeta rbinom runif
#' @importFrom utils combn
#' @importFrom BiocParallel bplapply register MulticoreParam bpparam
#' @importFrom splines ns
#'
#' @import bsseq 
#' @import GenomicRanges
#' @import nlme
#' @import annotatr
#' @import ggplot2
#' @import S4Vectors
#' 
#' @export
#' 
#' @examples
#' 
#' # load example data 
#' data(BS.chr21)
#' 
#' # the covariate of interest is the 'CellType' column of pData(BS.chr21)
#' testCovariate <- 'CellType'
#' 
#' # run dmrseq on a subset of the chromosome (10K CpGs)
#' regions <- dmrseq(bs=BS.chr21[240001:250000,],
#'                  cutoff = 0.05,
#'                  testCovariate=testCovariate)
#' 
dmrseq <- function(bs, testCovariate, adjustCovariate = NULL, cutoff = 0.1, 
                   minNumRegion = 5, smooth = TRUE, bpSpan = 1000, 
                   minInSpan = 30, maxGapSmooth = 2500, maxGap = 1000, 
                   verbose = TRUE,  
                   maxPerms = 10, matchCovariate = NULL, 
                   BPPARAM = bpparam(), stat = "stat", 
                   block = FALSE, blockSize = 5000,
                   chrsPerChunk = 1) {
    
    stopifnot(is(bs, "BSseq"))
    
    if (!(is.null(cutoff) || length(cutoff) %in% seq_len(2))) 
        stop("'cutoff' has to be either NULL or a vector of length 1 or 2")
    if (length(cutoff) == 2) 
        cutoff <- sort(cutoff)
    if (is.null(cutoff) | abs(cutoff) > 1 | abs(cutoff) == 0) 
        stop("Must specify a value for cutoff between 0 and 1")
    subverbose <- max(as.integer(verbose) - 1L, 0)
    
    if(minNumRegion < 3){
      stop("minNumRegion must be at least 3")
    }
    
    # check statistic name
    if (!(stat %in% c("L", "area", "beta", "stat", "avg"))) {
        stop("Specified '", stat, 
            "' as the test statistic which is not ", 
            "in the results. Please specify a valid name from one of ",
            "L, area, beta, stat, or avg")
    }
    
    # informative message about blocks if block=TRUE; check for increased
    # smoothing window
    if (block){
      message("Searching for large scale blocks with at least ",
              blockSize, " basepairs.")
      
      if(minInSpan < 100 && bpSpan < 2000 && maxGapSmooth < 1e5){
        warning("When block=TRUE, it is recommended to increase the values ",
                "of minInSpan, bpSpan, and maxGapSmooth in order to widen ",
                "the smoothing window")
      }
    }
    
    # convert covariates to column numbers if characters
    if (is.character(testCovariate)) {
        if(length(testCovariate) > 1)
          stop("Only one testCovariate can be specified")
        if(is.character(adjustCovariate)){
          if(sum(testCovariate %in% adjustCovariate) > 0)
            stop("adjustCovariate can't contain testCovariate")
        }
        if(is.character(matchCovariate)){
          if(sum(testCovariate %in% matchCovariate))
            stop("matchCovariate can't contain testCovariate")
        }
        testCovariate <- which(colnames(pData(bs)) == testCovariate)
        if (length(testCovariate) == 0) {
            stop("testCovariate not found in pData(). ",
                 "Please specify a valid testCovariate")
        }
    }
    
    if (is.character(adjustCovariate)) {
        if(is.character(matchCovariate)){
          if(matchCovariate == adjustCovariate)
            stop("matchCovariate can't be identical to adjustCovariate")
        }
        adjustCovariate <- which(colnames(pData(bs)) %in% adjustCovariate)
        if (length(adjustCovariate) == 0) {
            stop("adjustCovariate not found in pData(). ",
                "Please specify a valid adjustCovariate")
        }
    }
    
    # check that chrsPerChunk value makes sense
    if (chrsPerChunk != 1){
      if (chrsPerChunk%%1 != 0){
        stop("chrsPerChunk must be an integer")
      }else if(chrsPerChunk < 1){
        stop("chrsPerChunk must be strictly positive")
      }else if(chrsPerChunk > length(unique(seqnames(bs)))){
        stop("chrsPerChunk can't be larger than the total",
             " number of chromosomes")
      }
    }
    
    # check that bs object is sorted since `bsseq::BSseq()` no longer 
    # automatically sorts to ensure loci from same chr are indexed consecutively
    if (is.unsorted(bs)) {
      stop("'bs' must be sorted. Use 'sort(bs)'.")
    }
    
    # construct the design matrix using the pData of bs
    if (ncol(pData(bs)) < max(testCovariate, adjustCovariate)) {
        stop("Error: pData(bs) has too few columns.  ","
              Please specify valid ", 
            "covariates to use in the analysis")
    }
    
    coeff <- seq(2,(2 + length(testCovariate) - 1))
    testCov <- pData(bs)[, testCovariate]
    if (is.factor(testCov)) # drop unused levels of test
      testCov <- droplevels(testCov)
    fact <- TRUE
    sampleSize <- table(testCov)[names(table(testCov)) %in% pData(bs)[,testCovariate]]
    if (length(unique(testCov)) == 1) {
        message("Warning: only one unique value of the specified ", 
                "covariate of interest.  Assuming null comparison and ",
                "splitting sample group into two equal groups")
        testCov <- rep(1, length(testCov))
        testCov[seq_len(round(length(testCov)/2))] <- 0
    }else if (length(unique(testCov)) > 2 && !is.numeric(testCov)) {
        message("Performing a global test of H0: no difference among ",
                length(unique(testCov)), " groups (assuming the test ",
                "covariate ", colnames(pData(bs))[testCovariate],
                " is a factor).")
        coeff <- seq(coeff, coeff + length(unique(testCov)) - 2)
    }else if (length(unique(testCov)) > 2 && is.numeric(testCov)) {
        message("Assuming the test ",
              "covariate ", colnames(pData(bs))[testCovariate],
              " is continuous.")
        fact <- FALSE
    }else{
        message("Assuming the test ",
              "covariate ", colnames(pData(bs))[testCovariate],
              " is a factor.")
        if(min(sampleSize) < 2)
          stop("At least one group has only one sample! ",
               "Replicates are required to run dmrseq.")
        testCov <- as.factor(testCov)
    }
    
    if (!is.null(adjustCovariate)) {
        mmdat <- data.frame(testCov = testCov)
        adjustCov <- pData(bs)[, adjustCovariate, drop = FALSE]
        
        # check for number of unique values per adjust cov
        nunq <- apply(adjustCov, 2, function(x) length(unique(x)))
        if (any(nunq < 2))
          stop("At least one adjust covariate is constant across samples.",
               " Please remove this covariate from the model and try again.")
        
        # remove any empty factor levels
        for (f in 1:ncol(adjustCov)){
          if (is.factor(adjustCov[,f]))
            adjustCov[,f] <- droplevels(adjustCov[,f])
        }
        
        mmdat <- cbind(mmdat, adjustCov)
        frm <- paste0("~", paste0(colnames(mmdat), collapse = " + "))
        design <- model.matrix(as.formula(frm), data=mmdat)
        colnames(design)[coeff] <- colnames(pData(bs))[testCovariate]
        coeff.adj <- (max(coeff) + 1):(ncol(design))
    } else {
        design <- model.matrix(~testCov)
        colnames(design)[coeff] <- colnames(pData(bs))[testCovariate]
        coeff.adj <- NULL
    }
    
    # check for empty factor levels in design matrix
    if (sum(colSums(design) == 0) > 0){
      which.empty <- which(colSums(design) == 0)
      design <- design[,-which.empty]
    }
    
    # check that p <= n
    if (ncol(bs) < ncol(design) + 1)
      stop("Not enough degrees of freedom to estimate ", ncol(design)-1,
           " covariates using ", ncol(bs), " samples. Please use a larger ",
           "number of samples, or specify fewer adjust covariates.")
  
    # check for incompatible args
    if (fact && !is.null(matchCovariate) && length(unique(testCov)) > 2)
      stop("matchCovariate can't be used when testCovariate is not a 2-group ",
           "factor. Perhaps you'd like to add an adjustCovariate instead?")
    
    if(!is.null(matchCovariate) && choose(nrow(design), min(sampleSize)) >= 5e5)
      stop("matchCovariate can't be used when the sample size is large enough ",
           "to yield more than 500000 possible permutations. ",
           "Perhaps you'd like to add an adjustCovariate instead?")
    
    # check for interaction terms (not yet supported)
    if (length(coeff) > 1 && any(rowSums(design[,coeff]) > 1))
      stop("Interaction terms in testCovariate are not yet supported.")
    
    if (length(unique(testCov)) == 2) {
        message("Condition: ",
            unique(pData(bs)[, testCovariate][which(design[, coeff] == 1)]), 
            " vs ", 
            unique(pData(bs)[, testCovariate][which(design[, coeff] == 0)]))
    }
    if (!is.null(adjustCovariate)) {
      message("Adjusting for covariate (s): ", 
              paste(colnames(pData(bs))[adjustCovariate], collapse = ", "))
    }
    if (!is.null(matchCovariate)) {
        if (length(matchCovariate) > 1)
          stop("Covariate matching can only be carried out for one",
              " covariate")
        if (length(unique(testCov)) > 2)
          stop("Covariate matching can only be carried out for 2-group",
               " comparisons")
        if (is.character(matchCovariate)) {
            if (sum(grepl(matchCovariate, colnames(pData(bs)))) == 0) {
                stop("Error: no column in pData() found that matches ",
                      "the matchCovariate")
            } else if (length(grep(matchCovariate, colnames(pData(bs)))) > 1) {
                stop("Error: matchCovariate matches more than one ",
                      "column in pData()")
            }
            mC <- grep(matchCovariate, colnames(pData(bs)))
        } else {
            stopifnot(matchCovariate <= ncol(pData(bs)))
        }
      message("Matching permutations on covariate: ", 
              colnames(pData(bs))[mC])
    }
    
    # check for loci with missing data
    if (fact){
      lev <- unique(pData(bs)[[testCovariate]])
      filter <- NULL
      for (l in seq_along(lev)){
        filter <- rbind(filter,
              1*(DelayedMatrixStats::rowSums2(getCoverage(bs)[,pData(bs)[[testCovariate]] == 
                                           lev[l], drop = FALSE]) == 0))
      }
      filter <- which( apply(filter, 2, max) > 0 )
  
      if (length(filter) > 0) {
        stop(length(filter), " loci have zero coverage in all samples ",
             "of at least one condition. Please remove these loci ", 
             "before running dmrseq")
      }
      
    }else{
      filter <- DelayedMatrixStats::rowSums2(getCoverage(bs)==0) >= ncol(bs) - 1
      if(sum(filter) > 0)
        stop(sum(filter), " loci have zero coverage in at least ",
             ncol(bs) - 1, " samples. Please remove these loci ", 
             "before running dmrseq")
    }
    
    # register the parallel backend
    BiocParallel::register(BPPARAM)
    backend <- paste0("BiocParallel:", class(bpparam())[1])
    
    if (bpparam()$workers == 1) {
      if (verbose) {
        mes <- "Using a single core (backend: %s)."
        message(sprintf(mes, backend))
      }
      parallel <- FALSE
    } else {
      if (verbose) {
        mes <- paste0("Parallelizing using %s workers/cores ", 
                      "(backend: %s).")
        message(sprintf(mes, bpparam()$workers, backend))
      }
      parallel <- TRUE
    }
    message("Computing on ", chrsPerChunk, 
            " chromosome(s) at a time.\n")
    
    message("Detecting candidate regions with coefficient larger than ",
                   unique(abs(cutoff)), 
           " in magnitude.")
    OBS <- bumphunt(bs=bs, design = design, 
                    coeff = coeff, coeff.adj = coeff.adj, minInSpan = minInSpan,
                    minNumRegion = minNumRegion, cutoff = cutoff, 
                    maxGap = maxGap, maxGapSmooth = maxGapSmooth, 
                    smooth = smooth, bpSpan = bpSpan, verbose = verbose, 
                    parallel = parallel, block = block, blockSize = blockSize,
                    chrsPerChunk = chrsPerChunk, fact = fact,
                    adjustCovariate = adjustCovariate)
   
    # check that at least one candidate region was found; if there were none 
    # there is no need to go on to compute permutation tests...
    
    if (length(OBS) > 0) {
        message("* ", nrow(OBS), " candidates detected")
        FLIP <- NULL
        # configure the permutation matrix for two group comparisons
        if (length(unique(design[, coeff[1]])) == 2 && 
            length(coeff) == 1 &&
            choose(nrow(design), min(sampleSize)) < 5e5 ) {
            if (verbose) {
                message("Performing balanced permutations of ",
                        "condition across samples ", 
                  "to generate a null distribution of region test statistics")
            }
            perms <- combn(seq(1, nrow(design)), min(sampleSize))
            
            # Remove redundant permutations (if balanced)
            if (length(unique(table(design[,coeff]))) == 1){
              perms <- perms[, seq_len(ncol(perms)/2)]
            }
            
            # restrict to unique permutations that don't include any 
            # groups consisting of all identical conditions
            rmv <- NULL
            for (p in seq_len(ncol(perms))){
              if (length(unique(design[perms[,p],coeff])) == 1){
                rmv <- c(rmv, p)
              }
            }
            if (length(rmv) > 0 )
              perms <- perms[,-rmv]
            
            # subsample permutations based on similarity to original partition
            # gives preference to those with the least similarity
            if (maxPerms < ncol(perms)) {
               similarity <- apply(perms, 2, function(x) {
                 max(table(design[x,coeff]))
               })
               perms.all <- perms
               perms <- NULL
               levs <- sort(unique(similarity))
               l <- 1
               num <- 0
               while(!(num == maxPerms) && l <= length(levs)) {
                 keep <- sample(which(similarity == levs[l]), 
                                min(maxPerms-num, sum(similarity == levs[l])) )
                 perms <- cbind(perms, perms.all[,keep])
                 l <- l + 1
                 num <- ncol(perms) 
               }
            }
        } else {
            # Next consider a multilevel, or continuous covariate where the 
            # covariate will be permuted in an unrestricted manner
            if (verbose) {
                message("Performing unrestricted permutation of", 
                  " covariate of interest across samples ", 
                  "to generate a null distribution of region test statistics")
            }
            perms <- as.matrix(seq_len(nrow(design)))

            for (p in seq_len(maxPerms)) {
                tries <- 0
                candidate <- sample(seq_len(nrow(design)), nrow(design))
                # check that the permutation is not a duplicate, and not 
                # equal to the original
                while ((sum(apply(perms, 2, function(x) 
                                all.equal(x, candidate)) == TRUE) > 0 || 
                       sum(apply(perms, 2, function(x) 
                         all.equal(x, rev(candidate))) == TRUE) > 0) &&
                       tries <= 20) {
                  candidate <- sample(seq(seq_len(nrow(design))), nrow(design))
                  tries <- tries + 1
                }
                # save the permutation to the permutation matrix
                if (tries <= 20){
                  perms <- cbind(perms, candidate)
                }
            }
            perms <- perms[,-1] # remove original
        }
        
        pData.orig <- pData(bs)
        levs <- unique(pData.orig[[testCovariate]])
        # Now rerun on permuted designs and concatenate results
        for (j in seq_len(ncol(perms))) {
            if (verbose) {
                message("\nBeginning permutation ", j)
            }
            reorder <- perms[, j]
            designr <- design
            
            if (length(unique(design[, coeff[1]])) == 2 && 
                length(coeff) == 1 && 
                !nrow(perms) == nrow(designr)) {
                designr[, coeff] <- 0
                designr[reorder, coeff] <- 1
                pData(bs)[[testCovariate]] <- levs[1]
                pData(bs)[[testCovariate]][reorder] <- levs[2]
                
                if (!all(sort(pData.orig[[testCovariate]]) ==
                              sort(pData(bs)[[testCovariate]]))){
                  designr[, coeff] <- 1
                  designr[reorder, coeff] <- 0
                  pData(bs)[[testCovariate]] <- levs[2]
                  pData(bs)[[testCovariate]][reorder] <- levs[1]
                }
                
                xr <- NULL
                for (rd in seq_len(nrow(pData.orig))) {
                  match <- which(pData.orig[[testCovariate]] %in%
                              pData(bs)[rd,][[testCovariate]])
                  taken <- which(match %in% xr)
                  if (length(taken) > 0)
                    match <- match[-taken]
                  if (length(match) > 0)
                    xr <- c(xr, match[1])
                }
                if(length(coeff.adj) > 0){
                  pData(bs)[,adjustCovariate] <- 
                    pData.orig[xr,adjustCovariate]
                }
            } else {
                designr[, coeff] <- designr[reorder, coeff]
                pData(bs) <- pData.orig[reorder, , drop = FALSE]
            }
            
            # if matchCovariate is not null, restrict permutations such that 
            # null comparisons are balanced for the values of 
            # pData$matchCovariate this avoids comparison of,
            # say two different individuals in the null, that the comparison of 
            # interest is tissue type. Not matching would mean the null is 
            # really not null
            if (!is.null(matchCovariate)) {
                permLabel <- paste0(paste0(pData(bs)[designr[, coeff[1]] == 1, 
                                                     mC], collapse = "_"), 
                  "vs", paste0(pData(bs)[(1 - designr[, coeff[1]]) == 1, 
                                           mC], collapse = "_"))
                
                c1 <- unlist(strsplit(permLabel, "vs"))[1]
                c2 <- unlist(strsplit(permLabel, "vs"))[2]
                
                c1 <- unlist(strsplit(c1, "_"))
                c2 <- unlist(strsplit(c2, "_"))
                
                keepPerm <- 1 * (sum(c1 %in% c2) > 0 &&
                                 sum(c2 %in% c1) > 0)
                
                if (keepPerm == 0) {
                  if (verbose) {
                    message(paste0("Skipping permutation ", 
                                   gsub("vs", " vs ", permLabel)))
                  }
                  next
                }
            } else {
                permLabel <- j
            }
            
            res.flip.p <- bumphunt(bs=bs, design = designr, 
                                   coeff = coeff, 
                                   coeff.adj = coeff.adj,
                                   minInSpan = minInSpan, 
                                   minNumRegion = minNumRegion, cutoff = cutoff,
                                   maxGap = maxGap, maxGapSmooth = maxGapSmooth,
                                   smooth = smooth, bpSpan = bpSpan, 
                                   verbose = verbose, parallel = parallel,
                                   block = block, blockSize = blockSize,
                                   chrsPerChunk = chrsPerChunk, fact = fact,
                                   adjustCovariate = adjustCovariate)
            
            if (verbose) {
              message("* ", j, " out of ", ncol(perms), 
                      " permutations completed (",
                      nrow(res.flip.p), " null candidates)")
            }
            
            if (!is.null(res.flip.p)) {
                res.flip.p$permNum <- permLabel
                FLIP <- rbind(FLIP, res.flip.p)
            }
        }
        
        # restore original pData
        pData(bs) <- pData.orig
        
        # if no candidates were found in permutation
        # provide informative error message
        if (is.null(FLIP)){
          warning("No candidate regions found in permutation, so inference ",
               "can't be carried out. ",
               "Try decreasing the cutoff, or running on a larger ",
               "dataset if you are currently using a subset.")
          OBS$pval <- NA
          OBS$qval <- NA
        }else if (nrow(FLIP) < 0.05*nrow(OBS)){
          message("Note: Very few null candidate regions were found.",
               "For more accurate and sensitive inference, ",
               "try decreasing the cutoff, or running on a larger ",
               "dataset if you are currently using a subset.")
        }
        
        if (!is.null(FLIP)){
          # if there are more than 1 million candidate null regions, 
          # take a random sample
          # of 1 million of them
          if (nrow(FLIP) > 1e+06) {
            rs <- sample(seq_len(nrow(FLIP)), 1e+06, replace = FALSE)
            FLIP <- FLIP[rs, ]
          }
          
          # which column of results to use as test statistic ?  
          # check statistic name
          if (!(stat %in% c(colnames(OBS), "avg"))) {
            stop("Specified '", stat, 
                 "' as the test statistic which is not ", 
                 "in the results. Please specify a valid name from one of ",
                 "L, area, beta, or stat")
          } else if (stat == "avg") {
            OBS$avg <- OBS$area/OBS$L
            FLIP$avg <- FLIP$area/FLIP$L
          }
          
          whichStatO <- which(colnames(OBS) == stat)
          whichStatF <- which(colnames(FLIP) == stat)
          
          # Faster way to compute the p-values that doesn't use multiple cores 
          # Step 1: sort the permuted statistics vector
          perm.ordered <- c(sort(abs(FLIP[, whichStatF]), 
                                 method = "quick"), Inf)
          
          # Step 2: find the first instance in the sorted vector where the 
          # permuted value is greater than the observed and use this to 
          # determine the number of permuted values that are greater than or  
          # equal to theobserved
          pval <- rep(NA, nrow(OBS))
          pval[!is.na(OBS[, whichStatO])] <- (1 + 
                        vapply(abs(OBS[!is.na(OBS[, whichStatO]), whichStatO]),
            function(x) length(perm.ordered) - min(which(x <= perm.ordered)),
                        numeric(1))) / (1 + sum(!is.na(FLIP[, whichStatF])))
          
          # missing test statistics cause Inf for the p-value calculation 
          # instead propagate the missing values
          pval[abs(pval) == Inf] <- NA
          
          pval <- data.frame(x = pval, y = p.adjust(pval, method = "BH"))
          
          OBS$pval <- pval$x
          OBS$qval <- pval$y
        }
        
        # convert output into GRanges, with indexStart/indexEnd as IRanges
        indexIR <- IRanges(OBS$indexStart, OBS$indexEnd)
        OBS.gr <- makeGRangesFromDataFrame(OBS[,-c(4:5)], 
                                           keep.extra.columns = TRUE)
        OBS.gr$index <- indexIR
        names(OBS.gr) <- NULL
        
        # sort on pval overall (currently sorted within chromsome)
        OBS.gr <- OBS.gr[order(OBS.gr$pval, -abs(OBS.gr$stat)),]
        
        return(OBS.gr)
    } else {
        message("No candidate regions pass the cutoff of ", unique(abs(cutoff)))
        return(NULL)
    }
}
Any scripts or data that you put into this service are public.
dmrseq documentation built on April 18, 2021, 6 p.m.
rdrr.io home R language documentation Run R code online
CRAN packages Bioconductor packages R-Forge packages GitHub packages
Note that we can't provide technical support on individual packages. You should contact the package authors for that.
dmrseq
Detection and inference of differentially methylated regions from Whole Genome Bisulfite Sequencing

R/dmrseq.R
In dmrseq: Detection and inference of differentially methylated regions from Whole Genome Bisulfite Sequencing

Defines functions dmrseq

Documented in dmrseq

Try the dmrseq package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

dmrseq Detection and inference of differentially methylated regions from Whole Genome Bisulfite Sequencing

R/dmrseq.R In dmrseq: Detection and inference of differentially methylated regions from Whole Genome Bisulfite Sequencing

Defines functions dmrseq

Documented in dmrseq

Try the dmrseq package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

dmrseq
Detection and inference of differentially methylated regions from Whole Genome Bisulfite Sequencing

R/dmrseq.R
In dmrseq: Detection and inference of differentially methylated regions from Whole Genome Bisulfite Sequencing