R/remp.R

Defines functions .coverageStats_GENE .countCoveredGene .coverageStats_RE .countREByGene .predictREMP .QTF .modelTrain .aggregateNeib .loadMethy .toIndicator remp

Documented in remp

#' @title Repetitive element methylation prediction
#'
#' @description
#' \code{remp} is used to predict genomewide methylation levels of locus-specific repetitive elements (RE).
#' Two major RE types in human, Alu element (Alu) and LINE-1 (L1) are available.
#'
#' @param methyDat A \code{\link{RatioSet}}, \code{\link{GenomicRatioSet}}, \code{\link{DataFrame}},
#' \code{data.table}, \code{data.frame}, or \code{matrix} of Illumina BeadChip methylation data
#' (450k or EPIC array) or Illumina methylation percentage estimates by sequencing. See Details. 
#' Alternatively, user can also specify a pre-built data template (see \code{\link{rempTemplate}}).
#' \code{remp} to carry out the prediction. See \code{\link{rempTemplate}}. With template specified, \code{methyDat},
#' \code{REtype}, \code{parcel}, and \code{work.dir} can be skipped.
#' @param REtype Type of RE. Currently \code{"Alu"}, \code{"L1"}, and \code{"LTR"} are supported. If \code{NULL}, 
#' the type of RE will be extracted from \code{parcel}.
#' @param Seq.GR A \code{\link{GRanges}} object containing genomic locations of the CpGs profiled by sequencing
#' platforms. This parameter should not be \code{NULL} if the input methylation data \code{methyDat} are
#' obtained by sequencing. Note that the genomic location can be in either hg19 or hg38 build. 
#' See details in \code{\link{initREMP}}.
#' @param parcel An \code{\link{REMParcel}} object containing necessary data to carry out the
#' prediction. If \code{NULL}, \code{REtype} must specify a type of RE so that the function can search the
#' \code{.rds} data file in \code{work.dir} exported by \code{\link{initREMP}} (with \code{export = TRUE})
#' or \code{\link{saveParcel}}.
#' @param work.dir Path to the directory where the annotation data generated by \code{\link{initREMP}}
#' are saved. Valid when the argument \code{parcel} is missing. If not specified, temporary directory
#' \code{tempdir()} will be used. If specified, the directory path has to be the same as the
#' one specified in \code{\link{initREMP}} or in \code{\link{saveParcel}}.
#' @param win An integer specifying window size to confine the upstream and downstream flanking
#' region centered on the predicted CpG in RE for prediction. Default = \code{1000}. See Details.
#' @param method Name of model/approach for prediction. Currently \code{"rf"} (Random Forest),
#' \code{"xgbTree"} (Extreme Gradient Boosting), \code{"svmLinear"} (SVM with linear kernel), \code{"svmRadial"}
#' (SVM with radial kernel), and \code{"naive"} (carrying over methylation values of the closest
#' CpG site) are available. Default = \code{"rf"} (Random Forest). See Details.
#' @param autoTune Logical parameter. If \code{TRUE}, a 3-time repeated 5-fold cross validation
#' will be performed to determine the best model parameter. If \code{FALSE}, the \code{param} option
#' (see below) must be specified. Default = \code{TRUE}. Auto-tune will be disabled using Random Forest.
#' See Details.
#' @param param A list specifying fixed model tuning parameter(s) (not applicable for Random Forest, see Details).
#' For Extreme Gradient Boosting, \code{param} list must contain '\code{$nrounds}', '\code{$max_depth}',
#' '\code{$eta}', '\code{$gamma}', '\code{$colsample_bytree}', '\code{$min_child_weight}', and '\code{$subsample}'.
#' See \code{xgbTree} in package \code{caret}. For SVM, \code{param} list must contain '\code{$C}' (cost) for linear kernel
#' or '\code{$sigma}' and '\code{$C}' for radial basis function kernel. This parameter is valid only
#' when \code{autoTune = FALSE}.
#' @param seed Random seed for Random Forest model for reproducible prediction results.
#' Default is \code{NULL}, which generates a seed.
#' @param ncore Number of cores used for parallel computing. By default, max number of cores available
#' in the machine will be utilized. If \code{ncore = 1}, no parallel computing is allowed.
#' @param BPPARAM An optional \code{\link{BiocParallelParam}} instance determining the parallel back-end to
#' be used during evaluation. If not specified, default back-end in the machine will be used.
#' @param verbose Logical parameter. Should the function be verbose?
#'
#' @details
#' Before running \code{remp}, user should make sure the methylation data have gone through
#' proper quality control, background correction, and normalization procedures. Both beta value
#' and M value are allowed. Rows represents probes and columns represents samples. For array data,
#' please make sure to have row names that specify the Illumina probe ID (i.e. cg00000029). For sequencing
#' data, please provide the genomic location of CpGs in a \code{\link{GRanges}} obejct and
#' specify it using \code{Seq.GR} parameter. \code{win = 1000} is based on previous findings showing that
#' neighboring CpGs are more likely to be co-modified within 1000 bp. User can specify narrower window size
#' for slight improvement of prediction accuracy at the cost of less predicted RE. Window size greater than 1000 is not
#' recommended as the machine learning models would not be able to learn much userful information
#' for prediction but introduce noise. Random Forest model (\code{method = "rf"}) is recommented
#' as it offers more accurate prediction and it also enables prediction reliability functionality.
#' Prediction reliability is estimated by conditional standard deviation using Quantile Regression Forest.
#' Please note that if parallel computing is allowed, parallel Random Forest
#' (powered by package \code{\link{ranger}}) will be used automatically. The performance of
#' Random Forest model is often relatively insensitive to the choice of \code{mtry}.
#' Therefore, auto-tune will be turned off using Random Forest and \code{mtry} will be set to one third
#' of the total number of predictors. For SVM, if \code{autoTune = TRUE}, preset tuning parameter
#' search grid can be access and modified using \code{\link{remp_options}}.
#'
#' @return A \code{\link{REMProduct}} object containing predicted RE methylation results.
#'
#' @seealso See \code{\link{initREMP}} to prepare necessary annotation database before running \code{remp}.
#'
#' @examples
#' # Obtain example Illumina example data (450k)
#' if (!exists("GM12878_450k")) 
#'   GM12878_450k <- getGM12878("450k")
#' 
#' # Make sure you have run 'initREMP' first. See ?initREMP.
#' if (!exists("remparcel")) {
#'   data(Alu.hg19.demo)
#'   remparcel <- initREMP(arrayType = "450k",
#'                         REtype = "Alu",
#'                         annotation.source = "AH",
#'                         genome = "hg19",
#'                         RE = Alu.hg19.demo,
#'                         ncore = 1,
#'                         verbose = TRUE)
#' }
#' 
#' # With data template pre-built. See ?rempTemplate.
#' if (!exists("template")) 
#'   template <- rempTemplate(GM12878_450k, 
#'                            parcel = remparcel, 
#'                            win = 1000, 
#'                            verbose = TRUE)
#' 
#' # Run remp with pre-built template:
#' remp.res <- remp(template, ncore = 1)
#' 
#' # Or run remp without pre-built template (identical results):
#' \dontrun{
#'   remp.res <- remp(GM12878_450k, 
#'                    REtype = "Alu", 
#'                    parcel = remparcel, 
#'                    ncore = 1,
#'                    verbose = TRUE)
#' }
#' 
#' remp.res
#' details(remp.res)
#' rempB(remp.res) # Methylation data (beta value)
#' 
#' # Extract CpG location information. 
#' # This accessor is inherit from class 'RangedSummarizedExperiment')
#' rowRanges(remp.res)
#' 
#' # RE annotation information
#' rempAnnot(remp.res)
#' 
#' # Add gene annotation
#' remp.res <- decodeAnnot(remp.res, type = "symbol")
#' rempAnnot(remp.res)
#' 
#' # (Recommended) Trim off less reliable prediction
#' remp.res <- rempTrim(remp.res)
#' 
#' # Obtain RE-level methylation (aggregate by mean)
#' remp.res <- rempAggregate(remp.res)
#' rempB(remp.res) # Methylation data (beta value)
#' 
#' # Extract RE location information
#' rowRanges(remp.res)
#' 
#' # Density plot across predicted RE
#' remplot(remp.res)
#' 
#' @export
remp <- function(methyDat = NULL, 
                 REtype = c("Alu", "L1", "LTR"), 
                 Seq.GR = NULL,
                 parcel = NULL, 
                 work.dir = tempdir(), 
                 win = 1000,
                 method = c("rf", "xgbTree", "svmLinear", "svmRadial", "naive"),
                 autoTune = TRUE, 
                 param = NULL, 
                 seed = NULL, 
                 ncore = NULL, 
                 BPPARAM = NULL,
                 verbose = FALSE) {
  currenT <- Sys.time()

  if (is.null(methyDat)) stop("Methylation dataset (methyDat) is missing.")
  if (!is.null(Seq.GR)) {
    .isGROrStop(Seq.GR)
    names(Seq.GR) <- NULL
  }
  if (!is.null(param) & !is(param, "list")) stop("Tuning parameter(s) (param) must be a list object.")

  method <- match.arg(method)
  if (method != "naive" & !autoTune & is.null(param)) {
    message("Tuning parameter for method = '", method, "' is missing!")
    stop("You turned off auto-tune. Please specify tuning parameter(s).")
  }

  if (method %in% c("rf", "xgbTree")) {
    if (is.null(seed)) {
      seed <- sample(.Random.seed, 1)
    }
    message("Using random seed = ", seed)

    if (method == "rf") {
      autoTune <- FALSE # No need to tune RF mtry
      param <- list(mtry = round(length(remp_options(".default.predictors")) / 3)) # one third of number of predictors
      if (verbose) {
        message("Note: Auto-tune is disabled for Random Forest and mtry is set to 
                one third of number of predictors (", param$mtry, ").")
      }
    }
  }

  if (is.null(ncore)) {
    ncore <- 1
  }

  if (!is(methyDat, "template")) {
    if (win > remp_options(".default.max.flankWindow")) {
      stop(
        "Flanking window size cannot be greater than ",
        remp_options(".default.max.flankWindow"),
        ". Please see ?remp_options for details."
      )
    }

    ## Groom methylation data
    methyDat <- grooMethy(methyDat, Seq.GR, verbose = verbose)
    arrayType <- gsub("IlluminaHumanMethylation", "", methyDat@annotation["array"])
    
    methyDat <- minfi::getM(methyDat)
    
    ## Get the REMParcel object ready
    if (is.null(parcel)) {
      if (is.null(REtype)) {
        stop("REtype and parcel parameters cannot be both empty!")
      } else {
        REtype <- match.arg(REtype)
      }

      subDirName <- paste0("REMP.data.", arrayType)
      work.dir <- .forwardSlashPath(work.dir)
      data.dir <- .forwardSlashPath(file.path(work.dir, subDirName))

      path_to_parcel <- file.path(
        data.dir,
        paste0(
          "REMParcel_",
          REtype, "_",
          arrayType, "_",
          remp_options(".default.max.flankWindow"),
          ".rds"
        )
      )

      if (file.exists(path_to_parcel)) {
        remparcel <- readRDS(path_to_parcel)
      } else {
        stop("Necessary annotation data files are missing. Please make sure:
             1) you have run initREMP() first to generate the annotation data. See ?initREMP for details; 
             2) the RE type (Alu/L1) and/or platform (450k/EPIC/Sequencing) match the annotation data generated by initREMP().
             3) parameter 'parcel' is speficied, if the generated annotation data were not exported to local machine;
             4) if exported, your working directory specified is the same as you did in initREMP().")
      }
    } else {
      .isREMParcelOrStop(parcel)
      if (getParcelInfo(parcel)$platform != arrayType) {
        stop(paste0(
          "The methylation data type: ", arrayType,
          " does not match the REMParcel type: ",
          getParcelInfo(parcel)$platform
        ))
      }
      remparcel <- parcel
    }

    TEMPLATE <- rempTemplate(methyDat, Seq.GR, remparcel, win, FALSE)
  } else {
    TEMPLATE <- methyDat
  }

  ## Setup backend for paralell computing
  be <- getBackend(ncore, BPPARAM, verbose)
  ncore <- BiocParallel::bpnworkers(be)

  message(
    "Start RE methylation prediction with ",
    BiocParallel::bpnworkers(be), " core(s) ..."
  )

  REtype <- TEMPLATE$REtype # make sure REtype is consistent with REMParcel
  genome <- TEMPLATE$genome
  arrayType <- TEMPLATE$arrayType
  methyDat <- TEMPLATE$NBCpG_methyDat
  RE_NeibCpG <- TEMPLATE$NBCpG_GR
  RE_CpG_ILMN_DATA <- TEMPLATE$RECpG_methyDat
  RE_CpG_ILMN <- TEMPLATE$RECpG_GR

  refgene_main <- TEMPLATE$refgene
  RE_refGene.original <- TEMPLATE$RE

  samplenames <- colnames(methyDat)
  sampleN <- length(samplenames)

  #######################################

  if ("naive" == method) {
    BEST_TUNE <- DataFrame(matrix(NA, nrow = sampleN, ncol = 1))
    colnames(BEST_TUNE) <- "Not_Applicable"
    libToLoad <- NULL
    method_text <- "Naive (nearest CpG)"
    QCname_text <- "N/A"
  }
  if ("rf" == method) {
    BEST_TUNE <- DataFrame(matrix(NA, nrow = sampleN, ncol = 1))
    colnames(BEST_TUNE) <- "mtry"
    libToLoad <- NULL
    method_text <- "Random Forest"
    QCname_text <- "Quantile Regression Forest"
  }
  if ("xgbTree" == method) {
    BEST_TUNE <- DataFrame(matrix(NA, nrow = sampleN, ncol = 7))
    colnames(BEST_TUNE) <- c("nrounds", "max_depth", "eta", "gamma", "colsample_bytree", "min_child_weight", "subsample")
    libToLoad <- NULL
    method_text <- "Extreme Gradient Boosting"
    QCname_text <- "N/A"
  }
  if ("svmLinear" == method) {
    BEST_TUNE <- DataFrame(matrix(NA, nrow = sampleN, ncol = 1))
    colnames(BEST_TUNE) <- "cost"
    libToLoad <- c("kernlab")
    method_text <- "SVM(Linear)"
    QCname_text <- "N/A"
  }
  if ("svmRadial" == method) {
    BEST_TUNE <- DataFrame(matrix(NA, nrow = sampleN, ncol = 2))
    colnames(BEST_TUNE) <- c("sigma", "cost")
    libToLoad <- c("kernlab")
    method_text <- "SVM(Radial)"
    QCname_text <- "N/A"
  }
  rownames(BEST_TUNE) <- samplenames

  if (verbose) {
    message("Prediction model: ", method_text, "\nQC model: ", QCname_text)
  }

  cpgRanges <- unique(RE_NeibCpG[, "RE.Index"])
  RE_annotation <- subsetByOverlaps(RE_refGene.original, cpgRanges)
  # Note: Some REs have overlapping regions thus there can be more than one REs mapped to one RE-CpG.
  RE_annotation_name <- colnames(mcols(RE_annotation))
  regionCode <- mcols(RE_annotation)[remp_options(".default.genomicRegionColNames")]
  RE_annotation <- RE_annotation[, RE_annotation_name[!RE_annotation_name %in%
    remp_options(".default.genomicRegionColNames")]]

  # Initiate container
  REMP_PREDICT_CpG <- matrix(NA, nrow = length(cpgRanges), ncol = sampleN)
  REMP_PREDICT_QC <- matrix(NA, nrow = length(cpgRanges), ncol = sampleN)
  REMP_PREDICT_IMP <- setNames(DataFrame(matrix(NA,
    nrow = length(remp_options(".default.predictors")),
    ncol = sampleN
  )), samplenames)
  rownames(REMP_PREDICT_IMP) <- remp_options(".default.predictors")

  ## RE coverage
  RE_COVERAGE <- .coverageStats_RE(RE_annotation, regionCode, cpgRanges, RE_CpG_ILMN,
    REtype,
    indent = "    ", verbose
  )

  # Gene coverage
  GENE_COVERAGE <- .coverageStats_GENE(regionCode, refgene_main,
    REtype,
    indent = "    ", verbose
  )

  ## Prediction
  tRun <- .timeTrace(currenT)
  currenT <- tRun$t

  for (i in seq_len(sampleN)) {
    message("Predicting sample ", samplenames[i], " ...")
    RE_CpG_ILMN_OneMethyDat <- RE_CpG_ILMN_DATA[, samplenames[i]]
    # identical(RE_CpG_ILMN$Index, names(RE_CpG_ILMN_OneMethyDat))

    ### To be predicted
    if(verbose) message("    Loading methylation data for prediction...")
    RE_unprf_neib <- .loadMethy(methyDat[, i, drop = FALSE], RE_NeibCpG)
    # summary(RE_unprf_neib$distance)

    ### For model training
    RE_unprf_neib <- unique(RE_unprf_neib[, c("RE.Index", remp_options(".default.predictors"))])
    HITS <- findOverlaps(RE_unprf_neib, RE_CpG_ILMN, ignore.strand = TRUE)
    RE_prf_neib <- RE_unprf_neib[queryHits(HITS), ]
    RE_prf_neib$Methy <- RE_CpG_ILMN_OneMethyDat[subjectHits(HITS)]
    # summary(RE_prf_neib$distance.min2)
    # cor(RE_prf_neib$Methy.min, RE_prf_neib$Methy)

    P_basic <- .modelTrain(
      RE_prf_neib, remp_options(".default.predictors"), method,
      autoTune, param, be, seed, verbose
    )

    best_tune <- as.numeric(P_basic$bestTune)

    if (method != "naive") {
      REMP_PREDICT_IMP[, i] <- P_basic$importance[remp_options(".default.predictors"), "Overall"]
      for (k in seq_len(length(best_tune)))
      {
        BEST_TUNE[i, k] <- best_tune[k]
      }
    }

    ##############################################################################################

    newdata <- as.matrix(mcols(RE_unprf_neib)[, remp_options(".default.predictors")])

    if (method != "rf") {
      if (ncore > 1) {
        BiocParallel::bpstart(be)
        .bploadLibraryQuiet(libToLoad, be)
        ITER <- .iblkrow(newdata, chunks = ncore)
        REMP_PREDICT_CpG[, i] <- BiocParallel::bpiterate(ITER, .predictREMP,
          model = P_basic$model,
          BPPARAM = be,
          REDUCE = c,
          reduce.in.order = TRUE
        )
        BiocParallel::bpstop(be)
      } else {
        REMP_PREDICT_CpG[, i] <- .predictREMP(newdata, P_basic$model)
      }
      REMP_PREDICT_QC <- NULL
    } else {
      REMP_PREDICT_CpG[, i] <- predict(P_basic$model, newdata,
        type = "response",
        num.threads = ncore
      )$predictions
      REMP_PREDICT_QC[, i] <- .QTF(
        rangerObj = P_basic$model,
        x = as.matrix(mcols(RE_prf_neib)[, remp_options(".default.predictors")]),
        y = as.numeric(mcols(RE_prf_neib)[, "Methy"]),
        newX = newdata,
        seed = seed,
        num.threads = ncore
      )
    }

    tRun <- .timeTrace(currenT)
    currenT <- tRun$t
    message(
      "    ", samplenames[i], " completed! ", sampleN - i,
      " sample(s) left ...", tRun$t_text
    )
  }

  message("Done.")

  remproduct <- REMProduct(
    REtype = REtype, genome = genome, 
    platform = arrayType, win = as.character(win),
    predictModel = method_text, QCModel = QCname_text,
    rempM = REMP_PREDICT_CpG, rempQC = REMP_PREDICT_QC,
    cpgRanges = cpgRanges, sampleInfo = BEST_TUNE,
    REannotation = RE_annotation,
    RECpG = RE_CpG_ILMN,
    regionCode = regionCode,
    refGene = refgene_main,
    varImp = REMP_PREDICT_IMP,
    REStats = RE_COVERAGE, GeneStats = GENE_COVERAGE,
    Seed = seed
  )
  return(remproduct)
} ## End of remp



## --------------------------------------
## Internal functions

.toIndicator <- function(object) {
  genomicRegionInd <- matrix(0L,
    nrow = length(object),
    ncol = length(remp_options(".default.genomicRegionColNames"))
  )
  genomicRegionInd[!is.na(mcols(object)[, remp_options(".default.genomicRegionColNames")])] <- 1L
  genomicRegionInd <- DataFrame(genomicRegionInd)
  colnames(genomicRegionInd) <- remp_options(".default.genomicRegionColNames")
  mcols(object)[, remp_options(".default.genomicRegionColNames")] <- genomicRegionInd
  return(object)
}

.loadMethy <- function(methyDatOne, RE_NeibCpG) {
  RE_NeibCpG$Methy.all <- methyDatOne[RE_NeibCpG$Methy.ptr, ]
  RE_NeibCpG$Methy.min <- methyDatOne[RE_NeibCpG$Methy.ptr.min, ]
  RE_NeibCpG$Methy.min2 <- methyDatOne[RE_NeibCpG$Methy.ptr.min2, ]

  RE_NeibCpG.DF <- mcols(RE_NeibCpG)[, c("RE.CpG.ID", "distance_cat", "Methy.all")]
  catList <- sort(unique(RE_NeibCpG.DF$distance_cat))

  Methy_all_agg <- .aggregateNeib(
    Methy.all ~ RE.CpG.ID, RE_NeibCpG.DF,
    function(x) sd(x),
    c("RE.CpG.ID", "Methy.std")
  )

  ## Moving average
  cati <- NULL
  for (i in catList)
  {
    cati <- c(cati, i)
    sub_RE_NeibCpG.DF <- RE_NeibCpG.DF[RE_NeibCpG.DF$distance_cat %in% cati, ]
    sub_Methy_all_agg <- .aggregateNeib(
      Methy.all ~ RE.CpG.ID,
      sub_RE_NeibCpG.DF,
      function(x) mean(x),
      c("RE.CpG.ID", paste0("Methy.mean.mov", i))
    )
    Methy_all_agg <- merge(Methy_all_agg, sub_Methy_all_agg,
      by = "RE.CpG.ID", all.x = TRUE
    )
  }

  ## carry over missing
  for (j in catList + 2)
  {
    current <- Methy_all_agg[, j]
    k <- 1
    while (any(is.na(current)) & j + k <= 6) {
      current[is.na(current)] <- Methy_all_agg[is.na(current), j + k]
      k <- k + 1
    }
    Methy_all_agg[, j] <- current
  }

  RE_NeibCpG_meta <- mcols(RE_NeibCpG)
  mcols(RE_NeibCpG) <- cbind(
    RE_NeibCpG_meta,
    Methy_all_agg[base::match(
      RE_NeibCpG_meta$RE.CpG.ID,
      Methy_all_agg$RE.CpG.ID
    ), -1]
  )
  return(RE_NeibCpG)
}

.aggregateNeib <- function(relation, d, FUN, newColNames) {
  d_agg <- aggregate(relation, data = d, FUN)
  d_agg <- DataFrame(d_agg[, 1], DataFrame(d_agg[, -1]))
  return(setNames(d_agg, newColNames))
}

.modelTrain <- function(d, varname, method, autoTune, param,
                        be, seed, verbose) {
  ncore <- BiocParallel::bpnworkers(be)

  d <- as.data.frame(mcols(d)[, c("Methy", varname)])

  if (autoTune) {
    trC.method <- "repeatedcv"
    trC.number <- 5
    trC.repeats <- 3
  } else {
    trC.method <- "none"
    trC.number <- 1
    trC.repeats <- NA
  }

  if (method == "naive") {
    modelFit <- list(
      model = NULL, QC = NULL, importance = NULL, bestTune = NULL,
      name = "naive"
    )
  } else {

    # set up the tuning grid
    if (method == "rf") {
      param_text <- "mtry"
    }

    if (method == "svmLinear") {
      if (autoTune) param <- remp_options(".default.svmLinear.tune")
      grid <- expand.grid(C = param$C)
      param_text <- "cost"
    }

    if (method == "svmRadial") {
      if (autoTune) param <- remp_options(".default.svmRadial.tune")
      grid <- expand.grid(sigma = param$sigma, C = param$C)
      param_text <- c("sigma", "cost")
    }

    if (method == "xgbTree") {
      if (autoTune) param <- remp_options(".default.xgbTree.tune")
      grid <- expand.grid(
        nrounds = param$nrounds, max_depth = param$max_depth, eta = param$eta,
        gamma = param$gamma, colsample_bytree = param$colsample_bytree,
        min_child_weight = param$min_child_weight, subsample = param$subsample
      )
      param_text <- c("nrounds", "max_depth", "eta", "gamma", "colsample_bytree", "min_child_weight", "subsample")
    }

    if (verbose & !autoTune) {
      message(
        "    Pre-specified tuning parameter: ",
        paste(paste(param_text, unlist(param), sep = " = "), collapse = ", ")
      )
    }

    if (method == "rf") {
      doQC <- TRUE
      model.tune <- ranger::ranger(Methy ~ .,
        data = d, mtry = param$mtry, num.threads = ncore,
        importance = "permutation", keep.inbag = FALSE,
        scale.permutation.importance = TRUE, seed = seed
      )
      var_imp <- ranger::importance(model.tune)
      var_imp <- data.frame(
        Predictor = names(var_imp),
        Overall = var_imp
      )
      best_param <- model.tune$mtry
      model <- model.tune
    } else {
      set.seed(seed)
      if (ncore > 1) {
        trC <- caret::trainControl(
          method = trC.method, number = trC.number,
          repeats = trC.repeats, allowParallel = TRUE
        )
        cluster <- parallel::makeCluster(ncore)
        doParallel::registerDoParallel(cluster)
        model.tune <- caret::train(Methy ~ .,
          data = d, method = method,
          tuneGrid = grid, trControl = trC, importance = TRUE
        )
        parallel::stopCluster(cluster)
        registerDoSEQ()
      } else {
        trC <- caret::trainControl(
          method = trC.method, number = trC.number,
          repeats = trC.repeats, allowParallel = FALSE
        )
        model.tune <- caret::train(Methy ~ .,
          data = d, method = method,
          tuneGrid = grid, trControl = trC, importance = TRUE
        )
      }

      var_imp <- caret::varImp(model.tune)
      var_imp <- data.frame(
        Predictor = rownames(var_imp$importance),
        Overall = var_imp$importance
      )
      best_param <- model.tune$bestTune
      model <- model.tune$finalModel
    }

    if (verbose & autoTune) {
      message(
        "    Best tuning parameter found: ",
        paste(paste(param_text, unlist(best_param), sep = " = "), collapse = ", ")
      )
    }

    modelFit <- list(
      model = model,
      importance = var_imp,
      bestTune = best_param,
      name = method
    )
  }
  return(modelFit)
}

# Prediction reliability based on Quantile random forest
.QTF <- function(rangerObj, x, y, newX, seed, num.threads) {
  nodesX <- predict(rangerObj, x,
    num.threads = num.threads,
    type = "terminalNodes"
  )$predictions
  nnodes <- max(nodesX)
  ntree <- ncol(nodesX)
  n <- nrow(x)
  valuesNodes <- matrix(nrow = nnodes, ncol = ntree)

  set.seed(seed)
  for (tree in seq_len(ntree)) {
    shuffledNodes <- nodesX[rank(ind <- sample(seq_len(n), n)), tree]
    useNodes <- sort(unique(as.numeric(shuffledNodes)))
    valuesNodes[useNodes, tree] <- y[ind[base::match(useNodes, shuffledNodes)]]
  }

  predictNodes <- predict(rangerObj, newX, num.threads,
    type = "terminalNodes"
  )$predictions
  valuesPredict <- 0 * predictNodes
  for (tree in seq_len(ntree)) {
    valuesPredict[, tree] <- valuesNodes[ predictNodes[, tree], tree]
  }
  result <- apply(valuesPredict, 1, sd)
  return(result)
}

.predictREMP <- function(newdata, model) {
  if (is.null(model)) {
    data.frame(newdata)$Methy.min # newdata is matrix
  } else {
    if (any(class(model) %in% c("ksvm", "vm"))) {
      kernlab::predict(model, newdata)
    } # kernlab has its own predict function
    else {
      predict(model, newdata)
    }
  }
}

# Count the number of RE by gene type (NM/NR/Total), used by
# .coverageStats_RE()
.countREByGene <- function(RE_annotation, RE_list) {
  Total <- length(RE_annotation)
  RE_annotation_sub <- RE_annotation[runValue(RE_annotation$Index) %in% RE_list]
  NM <- sum(!is.na(RE_annotation_sub$InNM))
  NR <- sum(!is.na(RE_annotation_sub$InNR))
  Gene <- sum(!is.na(RE_annotation_sub$InNM) | !is.na(RE_annotation_sub$InNR))
  Intergenic <- sum(is.na(RE_annotation_sub$InNM) & is.na(RE_annotation_sub$InNR))

  return(c(Total = Total, NM = NM, NR = NR, Gene = Gene, Intergenic = Intergenic))
}

.coverageStats_RE <- function(RE_annotation, regionCode, cpgRanges, RE_CpG_ILMN,
                              REtype, indent, verbose) {
  mcols(RE_annotation) <- cbind(mcols(RE_annotation), regionCode)
  mcols(cpgRanges) <- mcols(RE_annotation[base::match(as.character(cpgRanges$RE.Index), runValue(RE_annotation$Index))])

  RE_annotation_prf <- subsetByOverlaps(RE_annotation, RE_CpG_ILMN, ignore.strand = TRUE)
  cpgRanges_prf <- subsetByOverlaps(cpgRanges, RE_CpG_ILMN, ignore.strand = TRUE)

  # Profiled RE for prediction
  cvr_RE_win_list <- runValue(RE_annotation_prf$Index)

  # Profiled RE for prediction + predicted unprofiled RE
  cvr_unRE_win_list <- runValue(RE_annotation$Index)

  REStats <- DataFrame(cbind(
    .countREByGene(RE_annotation_prf, cvr_RE_win_list),
    .countREByGene(cpgRanges_prf, cvr_RE_win_list),
    .countREByGene(RE_annotation, cvr_unRE_win_list),
    .countREByGene(cpgRanges, cvr_unRE_win_list)
  ))

  colnames(REStats) <- c("Trained_RE", "Trained_RECpG", "Predict_RE", "Predict_RECpG")

  if (verbose) {
    .showREStats(REStats, REtype, "message", indent, TRUE)
  }

  return(REStats)
}

# Count the number of genes covered by RE, used by
# .coverageStats_GENE()
.countCoveredGene <- function(regionCode, refgene_main) {
  InNM <- na.omit(regionCode$InNM)
  InNR <- na.omit(regionCode$InNR)

  NM_ind <- as.numeric(unique(unlist(strsplit(InNM, "[|]"))))
  NR_ind <- as.numeric(unique(unlist(strsplit(InNR, "[|]"))))
  gene_ind <- c(NM_ind, NR_ind)

  NM <- length(unique(refgene_main$GeneSymbol[NM_ind]))
  NR <- length(unique(refgene_main$GeneSymbol[NR_ind]))
  Gene <- length(unique(refgene_main$GeneSymbol[gene_ind]))

  return(c(NM = NM, NR = NR, Gene = Gene))
}

# indent = "    "
.coverageStats_GENE <- function(regionCode, refgene_main,
                                REtype, indent, verbose) {
  ## Count the total number of gene (protein coding gene and noncoding RNA
  ## gene)
  totgene <- length(unique(refgene_main$GeneSymbol))
  totgene_NM <- length(unique(refgene_main[refgene_main$type == "NM", ]$GeneSymbol))
  totgene_NR <- length(unique(refgene_main[refgene_main$type == "NR", ]$GeneSymbol))

  GeneStats <- DataFrame(rbind(
    c(totgene_NM, totgene_NR, totgene),
    .countCoveredGene(regionCode, refgene_main)
  ))

  rownames(GeneStats) <- c("Total", "Predicted RE")

  if (verbose) {
    .showGeneStats(GeneStats, REtype, "message", indent)
  }

  return(GeneStats)
}

Try the REMP package in your browser

Any scripts or data that you put into this service are public.

REMP documentation built on Nov. 8, 2020, 8:05 p.m.