R/classify.R

#' Machine learning interface for RNA-Seq data
#'
#' This package applies several machine learning methods, including SVM, bagSVM, Random Forest and CART, to RNA-Seq data.
#'
#' \tabular{ll}{
#'   Package: \tab MLSeq\cr
#'   Type: \tab Package\cr
#'   License: \tab GPL (>= 2)\cr
#' }
#'
#' @author Gokmen Zararsiz, Dincer Goksuluk, Selcuk Korkmaz, Vahap Eldem, Izzet Parug Duru, Turgay Unver, Ahmet Ozturk
#'
#' -----------------
#'
#' Maintainers:
#'
#' Gokmen Zararsiz, \email{gokmenzararsiz@erciyes.edu.tr}
#'
#' Dincer Goksuluk \email{dincer.goksuluk@hacettepe.edu.tr}
#'
#' Selcuk Korkmaz \email{selcukorkmaz@hotmail.com}
#'
#' @docType package
#' @name MLSeq-package
#' @rdname MLSeq-package
#' @aliases MLSeq-package
#' @keywords package
NULL


#' Cervical cancer data
#'
#' Cervical cancer data measures the expressions of 714 miRNAs of human samples. There are 29 tumor and 29 non-tumor cervical
#' samples and these two groups are treated as two separete classes.
#'
#' @format A data frame with 58 observations on the following 715 variables.
#'
#' @source \url{http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2880020/#supplementary-material-sec}
#'
#' @references Witten, D., et al. (2010) Ultra-high throughput sequencing-based small RNA discovery and discrete statistical biomarker
#' analysis in a collection of cervical tumours and matched controls. BMC Biology, 8:58
#'
#' @author Gokmen Zararsiz, Dincer Goksuluk, Selcuk Korkmaz, Vahap Eldem, Izzet Parug Duru, Turgay Unver, Ahmet Ozturk
#'
#' @docType data
#' @name cervical
#' @rdname cervical
#' @aliases cervical
#' @keywords cervical data
#'
#' @examples
#' data(cervical)
NULL


#' Fitting classification models to sequencing data
#'
#' This function fits classification algorithms to sequencing data and measures model performances using various statistics
#'
#' In RNA-Seq studies, normalization is used to adjust between-sample differences for further analysis. In this package,
#' "deseq" and "tmm" normalization methods are available. "deseq" estimates the size factors by dividing each sample by
#' the geometric means of the transcript counts. "tmm" trims the lower and upper side of the data by log fold changes to
#' minimize the log-fold changes between the samples and by absolute intensity. After normalization, it is useful to
#' transform the data for classification. MLSeq package has "voomCPM" and "vst" transformation methods. "voomCPM"
#' transformation applies a logarithmic transformation (log-cpm) to normalized count data. Second transformation method is
#' the "vst" transformation and this approach uses an error modeling and the concept of variance stabilizing transformations
#' to estimate the mean-dispersion relationship of data.
#'
#' For model validation, k-fold cross-validation ("cv" option in MLSeq package) is a widely used technique. Using this technique,
#' training data is randomly splitted into k non-overlapping and equally sized subsets. A classification model is trained on (k-1)
#' subsets and tested in the remaining subsets. MLSeq package also has the repeat option as "rpt" to obtain more generalizable models.
#' Giving a number of m repeats, cross validation concept is applied m times.
#'
#' @param data a \code{DESeqDataSet} object, see the constructor functions
#' \code{\link[DESeq2]{DESeqDataSet}},
#' \code{\link[DESeq2]{DESeqDataSetFromMatrix}},
#' \code{\link[DESeq2]{DESeqDataSetFromHTSeqCount}} in DESeq2 package.
#' @param method a character string indicating the name of classification method. There are four methods available to perform classification:
#' \itemize{
#' \item \code{svm}: support vector machines using radial-based kernel function
#' \item \code{bagsvm}: support vector machines with bagging ensemble
#' \item \code{randomForest}: random forest algorithm
#' \item \code{cart}: classification and regression trees algorithm
#' }
#' @param normalize a character string indicating the name of normalization method for count data.
#' Available options are:
#' \itemize{
#' \item \code{none}: Normalization is not applied. Count data is used for classification.
#' \item \code{deseq}: deseq normalization.
#' \item \code{tmm}: Trimmed mean of \code{M} values.
#' }
#' @param transformation a character string indicating the normalization method. Note that transformation method is applied after normalization.
#' Available options are \code{vst}: variance stabilizing transformation and \code{voomCPM}: voom transformation (log of counts-per-million).
#' @param control a list including all the control parameters passed to model training process. This arguement is a wrapper for the
#' arguement \code{trControl} from caret package. See \bold{?trainControl} for details.
#' @param B an integer. It is the number of bootstrap samples for method \code{bagsvm}. Default is 100.
#' @param ref a character string indicating the user defined reference class. Default is \code{NULL}. If NULL is selected,
#' first category of class labels is used as reference.
#' @param \dots optional arguments for \code{train(...)} function from \code{caret} package.
#'
#' @return an \code{MLSeq} object for trained model.
#'
#' @author Gokmen Zararsiz, Dincer Goksuluk, Selcuk Korkmaz, Vahap Eldem, Izzet Parug Duru, Turgay Unver, Ahmet Ozturk
#'
#' @references
#'
#' Kuhn M. (2008). Building predictive models in R using the caret package. Journal of Statistical Software,
#' (\url{http://www.jstatsoft.org/v28/i05/})
#'
#' Anders S. Huber W. (2010). Differential expression analysis for sequence count data. Genome Biology, 11:R106
#'
#' Witten DM. (2011). Classification and clustering of sequencing data using a poisson model. The Annals of Applied Statistics, 5(4), 2493:2518
#'
#' Charity WL. et al. (2014) Voom: precision weights unlock linear model analysis tools for RNA-Seq read counts, Genome Biology,
#' 15:R29, doi:10.1186/gb-2014-15-2-r29
#'
#' Witten D. et al. (2010) Ultra-high throughput sequencing-based small RNA discovery and discrete statistical biomarker analysis
#' in a collection of cervical tumours and matched controls. BMC Biology, 8:58
#'
#' Robinson MD, Oshlack A (2010). A scaling normalization method for differential expression analysis of RNA-Seq data.
#' Genome Biology, 11:R25, doi:10.1186/gb-2010-11-3-r25
#'
#' @keywords RNA-seq classification
#'
#' @seealso \code{\link{predictClassify}}, \code{\link[caret]{train}}, \code{\link[caret]{trainControl}}
#'
#' @examples
#' library(DESeq2)
#' data(cervical)
#'
#' # a subset of cervical data with first 150 features.
#' data <- cervical[c(1:150), ]
#'
#' # defining sample classes.
#' class <- data.frame(condition = factor(rep(c("N","T"), c(29, 29))))
#'
#' n <- ncol(data)  # number of samples
#' p <- nrow(data)  # number of features
#'
#' # number of samples for test set (20% test, 80% train).
#' nTest <- ceiling(n*0.2)
#' ind <- sample(n, nTest, FALSE)
#'
#' # train set
#' data.train <- data[ ,-ind]
#' data.train <- as.matrix(data.train + 1)
#' classtr <- data.frame(condition=class[-ind, ])
#'
#' # train set in S4 class
#' data.trainS4 <- DESeqDataSetFromMatrix(countData = data.train,
#'                    colData = classtr, formula(~ condition))
#' data.trainS4 <- DESeq(data.trainS4, fitType = "local")
#'
#' ## Number of repeats (repeats) might change model accuracies ##
#' # Classification and Regression Tree (CART) Classification
#' cart <- classify(data = data.trainS4, method = "cart",
#'           normalize = "deseq", transformation = "vst", ref = "T",
#'           control = trainControl(method = "repeatedcv", number = 5,
#'                                  repeats = 3, classProbs = TRUE))
#' cart
#'
#' # Random Forest (RF) Classification
#' # rf <- classify(data = data.trainS4, method = "randomforest",
#' #         normalize = "deseq", transformation = "vst", ref = "T",
#' #         control = trainControl(method = "repeatedcv", number = 5,
#' #                                repeats = 3, classProbs = TRUE))
#' # rf
#'
#' @import methods DESeq2 Biobase
#'
#' @importFrom edgeR DGEList calcNormFactors
#' @importFrom stats model.matrix predict relevel xtabs
#' @importFrom Biobase ExpressionSet exprs
#' @importFrom limma voom
#' @importFrom caret bagControl confusionMatrix svmBag trainControl bag
#'
#' @export
classify <- function (data, method = c("svm", "bagsvm", "randomforest", "cart"),
                      normalize = c("deseq", "none", "tmm"), transformation = c("vst", "voomCPM"),
                      control = trainControl(method = "repeatedcv", number = 5, repeats = 10),
                      B = 100, ref = NULL, ...){
  if (!is.null(ref)) {
    if (!is.character(ref))
      stop("Reference class should be \"character\"")
  }
  if (is.null(ref)) {
    ref = levels(data$condition)[1]
  }
  if (class(data)[1] != "DESeqDataSet") {
    stop("Data should be a \"DESeqDataSet Object\" of S4 class.")
  }
  if (is.null(method)) {
    stop("Classification method is not specified.")
  }
  method = match.arg(method)
  normalize = match.arg(normalize)
  transformation = match.arg(transformation)
  if ((normalize == "tmm" & transformation == "vst")) {
    transformation = "voomCPM"
    warning("\"vst\" transformation can be applied with \"deseq\" normalization. \"voom-CPM\" transformation is used.")
  }
  if (normalize == "none") {
    transformation = "NULL"
    warning("\"transformation\" method is ignored since normalization is not applied.")
  }
  conditions = as.factor(data$condition)
  conditions = relevel(conditions, which(levels(conditions) == ref))
  counts = counts(data)
  org.classes = conditions
  org.class.levels = levels(org.classes)
  ctrl <- control

  if (normalize == "none") {
    counts = t(counts)
    conditions = conditions
    dataexp = data.frame(counts, conditions)
  }
  if (normalize == "tmm") {
    counts = counts(data)
    y <- DGEList(counts = counts, genes = rownames(counts))
    y <- calcNormFactors(y)
    design <- model.matrix(~conditions)
    v <- voom(y, design, plot = FALSE)
    dataexp = data.frame(t(v$E), conditions)
    counts = dataexp[, -length(dataexp)]
    conditions = as.factor(dataexp[, length(dataexp)])
  }
  if (normalize == "deseq") {
    data = estimateSizeFactors(data)
    if (transformation == "vst") {
      data = estimateDispersions(data, fitType = "local")
      datavst = getVarianceStabilizedData(data)
      datavst <- as.matrix(datavst)
      cond = data.frame(conditions, row.names = colnames(datavst))
      datavst = ExpressionSet(datavst, AnnotatedDataFrame(cond))
      dataexp = data.frame(t(exprs(datavst)), conditions)
      counts = dataexp[, -length(dataexp)]
      conditions = as.factor(dataexp[, length(dataexp)])
    }
    if (transformation == "voomCPM") {
      counts = counts(data, normalized = TRUE)
      y <- DGEList(counts = counts, genes = rownames(counts))
      design <- model.matrix(~conditions)
      v <- voom(y, design, plot = FALSE)
      dataexp = data.frame(t(v$E), conditions)
      counts = dataexp[, -length(dataexp)]
      conditions = as.factor(dataexp[, length(dataexp)])
    }
  }
  if (method == "svm") {
    train <- train(counts, conditions, method = "svmRadial",
                   trControl = ctrl, ...)
  }
  if (method == "bagsvm") {
    train <- train(counts, conditions, method = "bag", B = B,
                   bagControl = bagControl(fit = svmBag$fit, predict = svmBag$pred,
                                           aggregate = svmBag$aggregate), trControl = ctrl,
                   ...)
  }
  if (method == "randomforest") {
    train <- train(counts, conditions, method = "rf", trControl = ctrl,
                   ...)
  }
  if (method == "cart") {
    train <- train(counts, conditions, method = "rpart",
                   trControl = ctrl, ...)
  }
  train.pred = predict(train)
  tbl.trn = table(Predicted = train.pred, Actual = org.classes)
  confM = confusionMatrix(tbl.trn, reference = ref)
  result = new("MLSeq", confusionMat = confM, trainedModel = train,
               method = method, normalization = normalize, transformation = transformation,
               ref = ref)
  result
}
gokmenzararsiz/MLSeq documentation built on May 17, 2019, 7:41 a.m.