R/MethPed.R

#####################################################################################
## MethPed      : Generic Function
#####################################################################################
## Code concept : Mohammad Tanvir Ahamed
## Maintain     : Mohammad Tanvir Ahamed
## Year         : 2015
#####################################################################################
#' MethPed: a DNA methylation classifier tool for the identification of pediatric brain tumor subtypes.
#' @usage MethPed(data, TargetID="TargetID", prob = TRUE,...)
#' @description The current version of MethPed can classify the following tumor diagnoses/subgroups:
#' Diffuse Intrinsic Pontine Glioma (DIPG),
#' Ependymoma,
#' Embryonal tumors with multilayered rosettes (ETMR),
#' Glioblastoma (GBM),
#' Medulloblastoma (MB) - Group 3 (MB_Gr3),
#' Group 4 (MB_Gr3),
#' Group WNT (MB_WNT),
#' Group SHH (MB_SHH) and
#' Pilocytic Astrocytoma (PiloAstro).
#' @param data Data for classification. Data can be in class "ExpressionSet", "matrix" or "data.frame". See MethPed vignette for details.
#' @param TargetID Name of the "Probe" column in data
#' @param prob If 'TRUE' (Default value), the return value will be conditional probability from which tumor
#' group the sample belongs to. 'FALSE' will only return value for the tumor group which has the maximum
#' conditional probability for a sample and other non-maximum values will be zero.
#' @param ... More parameter to add
#' @details Classification of pediatric tumors into biologically defined subtypes is challenging,
#' and multifaceted approaches are needed. For this aim, we developed a diagnostic classifier
#' based on DNA methylation profiles. We offer MethPed as an easy-to-use toolbox that allows
#' researchers and clinical diagnosticians to test single samples as well as large cohorts
#' for subclass prediction of pediatric brain tumors.
#'
#' The MethPed classifier uses the Random Forest (RF) algorithm to classify unknown pediatric
#' brain tumor samples into sub-types. The classification proceeds with the selection of the
#' beta values needed for the classification.
#'
#' The computational process proceeded in two stages. The first stage commences with a reduction
#' of the probe pool or building the training probe dataset for classification. We have named this
#' dataset as “predictors”. The second stage is to apply the RF algorithm to classify the probe data
#' of interest based on the training probe dataset (predictors).
#'
#' For the construction of the training probe pool (predictors), methylation data generated by the
#' Illumina Infinium HumanMethylation 450 BeadChip arrays were downloaded from the Gene Expression
#' Omnibus (GEO). Four hundred seventy-two cases were available, representing several brain tumor
#' diagnoses (DIPG, glioblastoma, ETMR, medulloblastoma, ependymoma, pilocytic astrocytoma) and their
#' further subgroups.
#'
#' The data sets were merged and probes that did not appear in all data sets were filtered away. In addition,
#' about 190,000 CpGs were removed due to SNPs, repeats and multiple mapping sites. The final data set
#' contained 206,823 unique probes and nine tumor classes including the medulloblastoma subgroups. K–neighbor
#' imputation was used for missing probe data.
#'
#' After that, a large number of regression analyses were performed to select the 100 probes per tumor class
#' that had the highest predictive power (AUC values). Based on the identified 900 methylation sites, the nine
#' pediatric brain tumor types could be accurately classified using the multiclass classification algorithm MethPed.
#' @return The output of the algorithm is partitioned in 6 parts :
#' \describe{
#'            \item{"$target_id"}{ Probes name in main data set }
#'            \item{"$probes"}{ Number of probes}
#'            \item{"$sample"}{ Number of samples}
#'            \item{"$probes_missing"}{ Names of the probes that were included in the original classifier
#'                                      but are missing from the data at hand.}
#'            \item{$oob_err.rate}{ If a large number of probes are missing this could potentially lead to a drop
#'                          in the predictive accuracy of the model. Note that, the out-of-bag error is 1.7
#'                          percent of the original MethPed classifier.}
#'            \item{$predictions}{ The last and most interesting output contains the predictions. We chose not to
#'                          assign tumors to one or other group but give probabilities of belonging in
#'                          one or other subtype. See vignette for details.}
#'          }
#'
# #> myClassification <- MethPed(MethPed_sample)
#'
#' @author
#' Anna Danielsson, Mohammad Tanvir Ahamed, Szilárd Nemes and Helena Carén, University of Gothenburg, Sweden.
#'
#' @references
#' [1]  Anna Danielsson, Szilárd Nemes, Magnus Tisell, Birgitta Lannering, Claes Nordborg, Magnus Sabel, and Helena Carén. "MethPed: A DNA Methylation Classifier Tool for the Identification of Pediatric Brain Tumor Subtypes". Clinical Epigenetics 2015, 7:62, 2015
#'
#' [2]  Breiman, Leo (2001). "Random Forests". Machine Learning 45 (1): 5-32. doi:10.1023/A:1010933404324
#'
#' [3] Troyanskaya, O., M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein, and R. B. Altman. "Missing Value Estimation Methods for DNA Microarrays." Bioinformatics 17.6 (2001): 520-25.
#'
#' @seealso  See \url{http://www.clinicalepigeneticsjournal.com/content/7/1/62} for more details.
#' @name MethPed
#  @importFrom dplyr select
#' @importFrom randomForest randomForest
#' @importFrom grDevices dev.new dev.off rainbow
#' @importFrom graphics barplot legend par
#' @importFrom stats predict
#' @import Biobase
#' @examples
#'
#' #################### Loading and view sample data
#' data(MethPed_sample)
#' head(MethPed_sample,10)
#'
#' #################### Check dimention of sample data
#' dim(MethPed_sample)  # Check number pof probes and samples in data
#'
#' #################### Checking missing value in the data
#' missingIndex <- checkNA(MethPed_sample)
#'
#' #################### Apply MethPed to sample data (Probability for all tumor group)
#' myClassification<-MethPed(MethPed_sample)
#' myClassification<-MethPed(MethPed_sample,prob=TRUE)
#'
#' #################### Apply MethPed to sample data (Only maximum probability expected)
#' myClassification_max<-MethPed(MethPed_sample,prob=FALSE)
#'
#' #################### Summary of results
#' summary(myClassification)
#' summary(myClassification)
#'
#' #################### Barplot of conditional prediction probability on different samples
#' par(mai = c(1, 1, 1, 2), xpd=TRUE)
#' mat<-t(myClassification$predictions)
#' mycols <- c("green",rainbow(nrow(mat),start=0,end=1)[nrow(mat):1],"red")
#' barplot(mat,  col = mycols,
#'         beside=FALSE,axisnames=TRUE,
#'         ylim=c(0,1),xlab= "Sample",ylab="Probability")
#' legend( ncol(mat)+0.5,1,
#'        legend = rownames(mat),fill = mycols,xpd=TRUE, cex = 0.6)
#'
#' ## Generic function to plot
#' plot(myClassification) # myClassification should be an object in "methPed" class
#'
#' @export
MethPed <- function(data,TargetID="TargetID",prob = TRUE, ...) { UseMethod("MethPed",data)}


#####################################################################################
## Different method for generic function "MethPed" (Methods defined for MethPed package)
#####################################################################################
##########################################
###  Method: default
#' @export
MethPed.default <- function(data,
                            TargetID = "TargetID",
                            prob = TRUE,
                            ...)
{
  stop(
    "\nMethPed require data in class 'data.frame' or 'ExpressionSet' or 'matrix. See MethPed vignette for data pre-processing.",
    call. = FALSE
  )
}
##########################################
### Method: ExpressionSet
#' @export
MethPed.ExpressionSet <- function(data,
                                  TargetID = "TargetID",
                                  prob = TRUE,
                                  ...)
{
  mat <- exprs(data)
  res <- MethPed.matrix (data = mat, TargetID, prob)
  res
}
##########################################
### Method: matrix
#' @export
MethPed.matrix <- function(data,
                           TargetID = "TargetID",
                           prob = TRUE,
                           ...)
{
  mat <- data.frame(data)
  mat$TargetID <- rownames(mat)
  rownames(mat) <- NULL
  res <- MethPed.data.frame(data = mat, TargetID, prob)
  res
}
##########################################
### Method: data.frame
#' @export
MethPed.data.frame <- function(data,
                               TargetID = "TargetID",
                               prob = TRUE,
                               ...)
{
  res <- MethPed_main(data, TargetID, prob)
  res

}
##########################################





#####################################################################################
## MethPed helper function that will call main MethPed algorithm (Main Random forest Analysis)
#####################################################################################
MethPed_main <- function(data,
                         TargetID,
                         prob,
                         ...)
{
  if (class(data) != 'data.frame')
    stop(
      class(data),
      "\nMethPed require data in class 'data.frame'. See MethPed vignette for data pre-processing details.",
      call. = FALSE
    )
  else {
    if (!TargetID %in% names(data))
      stop(
        "\nThe column name of probes '",
        TargetID,
        "' is incorrect. Check probe's column name of input data.",
        call. = FALSE
      )
    else {
      message("\nProbe's column name in data                 : ", TargetID)
      if (!any(data[, TargetID] %in% colnames(predictors)))
        stop(
          "\nNot a single probe names in input data match with probe names of predictor. Check probes name column in input data.",
          call. = FALSE
        )
      else {
        message("Probe name integrity of data with predictor : OK")
        #if (!(deparse(substitute(prob))) %in% c(TRUE, FALSE))
        if (!((prob)) %in% c(TRUE, FALSE))
          stop("\nCheck 'prob' argument       : TRUE / FALSE\n", call. = FALSE)
        else {
          if (length(which(is.na(data) == TRUE)) != 0)
            stop("\nThese is missing value in data.", call. = FALSE)
          else {
            message("Missing value in data                       : No missing data point")
            message("Data summary                                : ",
                    nrow(data),
                    " Probes and ",
                    ncol(data) - 1,
                    " Sample \n")
            message("\nInitiating  data analysis......\n")
            message("Classification is being processed on ",
                    ntree,
                    " tree\n")
            res <- NULL
            res$target_id <- TargetID
            res$probes <- nrow(data)
            res$sample <- ncol(data) - 1
            model <-
              MethPed_analysis(Data = data , TargetID = "TargetID")
            res$probes_missing <- model$probes_missing
            res$oob_err.rate <- model$oob_err.rate
            if (prob == TRUE)
              res$predictions <- model$predictions
            if (prob == FALSE)
            {
              mat <- model$predictions
              f <- function(x)
              {
                o <- rep(0, length(x))
                w <- which(x == max(x))
                r <-  if (length(w) > 1) {
                  o
                }
                else {
                  o[w] <- 1
                  o
                }
                r
              }
              mat1 <- t(apply(mat, 1, f))
              colnames(mat1) <- colnames(mat)
              mat2 <- mat1 * mat
              res$predictions <- mat2
            }
            #res$predictions<-round(model$predictions)}
            message(
              "\nMissing probes: ",
              length(res$probes_missing),
              " out of 900 probes are missing\n"
            )
            message("Finished analysis data\n\n")
            class(res) <- "methped"
            res

          }

        }
      }
    }

  }
}

#####################################################################################
## Main Random forest Analysis
######################################################################################
# Code concept : Szilárd Nemes
# Maintain     : Mohammad Tanvir Ahamed
# Year         : 2015
######################################################################################

MethPed_analysis <-
  function(Data,
           TargetID = "TargetID",
           prob = TRUE,
           ...)
  {
    ###### Data Selection
    predictionData <-
      Data[Data$TargetID %in% colnames(predictors),] # Selects from the probes needed for the classification and transposes the matix
    predictionData <- droplevels(predictionData)
    #predictionData.trans <- as.data.frame(t(dplyr::select(predictionData, -contains('ID'))))
    predictionData.trans <-
      as.data.frame(t(predictionData[, -(match(TargetID, names(predictionData)))]))
    colnames(predictionData.trans) <-
      predictionData$TargetID
    ###### Checks if all probes used for predcition are contained in the new data
    ##  If probes are misisng a new RF will be constructed
    if (length(setdiff(colnames(predictors)[1:899], Data$TargetID)) > 0)
    {
      predictorsRestricted <-
        predictors[, colnames(predictors) %in% predictionData$TargetID]
      predictorsRestricted$subtype <- predictors$subtype
      rf_meth_ped_reduced <-
        randomForest::randomForest(
          subtype ~ .,
          data = predictorsRestricted,
          mtry = mtry,
          ntree = ntree,
          importance = TRUE,
          proximity = TRUE,
          do.trace = 100
        )
      rfList <-
        list(
          probes_missing = paste(setdiff(
            colnames(predictors)[1:899], Data$TargetID
          )),
          oob_err.rate = rf_meth_ped_reduced$err.rate[1000, 'OOB'] * 100,
          predictions = as.data.frame(
            predict(rf_meth_ped_reduced, predictionData.trans, type = "prob")
          )
        )
      return(rfList)
    }
    ## If probes are not missing, the original RF MethPed will be used for classification
    else
    {
      rf_meth_ped <-
        randomForest::randomForest(
          subtype ~ .,
          data = predictors,
          mtry = mtry,
          ntree = ntree,
          importance = TRUE,
          proximity = TRUE,
          do.trace = TRUE
        )
      rfList1 <-
        list(
          probes_missing = "No missing probe",
          oob_err.rate = 1.70,
          predictions = predict(rf_meth_ped, predictionData.trans, type =
                                  "prob")
        )
      return(rfList1)
    }
    message("Finished data analysis")
  }
mashranga/MethPed documentation built on May 29, 2019, 4:40 a.m.