classify: Fitting classification models to sequencing data
In gokmenzararsiz/MLSeq: Machine Learning Interface for RNA-Seq Data


This function fits classification algorithms to sequencing data and measures model performance using various statistics.

classify(data, method = c("svm", "bagsvm", "randomforest", "cart"),
  normalize = c("deseq", "none", "tmm"), transformation = c("vst",
  "voomCPM"), control = trainControl(method = "repeatedcv", number = 5,
  repeats = 10), B = 100, ref = NULL, ...)

`data`	a `DESeqDataSet` object; see the constructor functions `DESeqDataSet`, `DESeqDataSetFromMatrix`, and `DESeqDataSetFromHTSeqCount` in the DESeq2 package.
`method`	a character string indicating the name of the classification method. Four methods are available: `svm`: support vector machines with a radial basis kernel function; `bagsvm`: support vector machines with a bagging ensemble; `randomforest`: random forest algorithm; `cart`: classification and regression trees algorithm.
`normalize`	a character string indicating the name of the normalization method for count data. Available options are: `none`: no normalization is applied and the raw count data are used for classification; `deseq`: DESeq normalization; `tmm`: trimmed mean of M values.
`transformation`	a character string indicating the transformation method. Note that the transformation is applied after normalization. Available options are `vst`: variance stabilizing transformation, and `voomCPM`: voom transformation (log of counts-per-million).
`control`	a list including all the control parameters passed to the model training process. This argument is a wrapper for the argument `trControl` from the caret package. See ?trainControl for details.
`B`	an integer. It is the number of bootstrap samples for method `bagsvm`. Default is 100.
`ref`	a character string indicating the user-defined reference class. Default is `NULL`. If `NULL`, the first category of the class labels is used as the reference.
`...`	optional arguments for `train(...)` function from `caret` package.
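
The default described for `ref` can be illustrated with plain base R factors. This is only a sketch of what "first category of class labels" means; how `classify` handles `ref` internally is not shown here, and the labels and variable names below are hypothetical.

```r
# Hypothetical class labels, mirroring the "N"/"T" conditions used in the Examples.
condition <- factor(c("N", "T", "T", "N", "T"))

# With no explicit reference, the first factor level plays that role.
default_ref <- levels(condition)[1]

# Choosing "T" as the reference moves it to the first level;
# the labels themselves are unchanged.
condition_T <- relevel(condition, ref = "T")
new_ref <- levels(condition_T)[1]

print(default_ref)
print(new_ref)
```

Factor levels are sorted alphabetically by default, so with "N" and "T" labels the default reference would be "N" unless `ref = "T"` is supplied.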

In RNA-Seq studies, normalization is used to adjust for between-sample differences before further analysis. In this package, the "deseq" and "tmm" normalization methods are available. "deseq" estimates a size factor for each sample as the median of the ratios of its counts to the geometric means of the transcript counts across samples, and divides the counts of each sample by its size factor. "tmm" trims the lower and upper sides of the data, both by log-fold change and by absolute intensity, to minimize the log-fold changes between the samples. After normalization, it is useful to transform the data for classification. The MLSeq package has the "voomCPM" and "vst" transformation methods. "voomCPM" applies a logarithmic transformation (log-cpm) to the normalized count data. "vst" uses error modeling and the concept of variance stabilizing transformations: it estimates the mean-dispersion relationship of the data and transforms the counts so that the variance is approximately independent of the mean.
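
As a rough illustration of the two ideas above, the following base R sketch computes median-of-ratios size factors and a log-cpm transformation on a made-up count matrix. It is not the MLSeq or DESeq2 implementation: the toy data and variable names are hypothetical, the log-cpm formula (offsets of 0.5 and 1) follows the voom paper, and using raw column sums as library sizes is an assumption made for simplicity.

```r
# Toy count matrix (made-up values): rows are transcripts, columns are samples.
# Sample s2 is sequenced 2x deeper than s1, and s3 is 4x deeper.
counts <- matrix(c( 10,  20,  40,
                   100, 200, 400,
                     5,  10,  20,
                    50, 100, 200),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("g", 1:4), paste0("s", 1:3)))

# Median-of-ratios size factors: for each sample, take the median over
# transcripts of (count / geometric mean of that transcript across samples).
# The toy data contain no zeros, so log() is safe here.
log_geo_means <- rowMeans(log(counts))
size_factors <- apply(counts, 2, function(sample_counts) {
  exp(median(log(sample_counts) - log_geo_means))
})

# Normalized counts: divide each column by its size factor.
norm_counts <- sweep(counts, 2, size_factors, "/")

# log-cpm as in the voom paper: log2((count + 0.5) / (library size + 1) * 1e6).
# Assumption: library size is the raw column sum.
lib_size <- colSums(counts)
log_cpm <- log2(sweep(counts + 0.5, 2, lib_size + 1, "/") * 1e6)

print(size_factors)
print(norm_counts)
```

Because the three toy samples differ only by sequencing depth, the normalized columns come out identical, which is the behavior a depth normalization is meant to produce.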

For model validation, k-fold cross-validation is a widely used technique. With this technique, the training data are randomly split into k non-overlapping subsets of approximately equal size. A classification model is trained on (k-1) subsets and tested on the remaining subset, and this is done in turn for each of the k subsets. Cross-validation can also be repeated to obtain more generalizable models: given m repeats, the whole k-fold procedure is applied m times. In this function, both settings are passed through the `control` argument, for example `trainControl(method = "repeatedcv", number = 5, repeats = 10)` as in the default shown above.
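
The splitting scheme described above can be sketched in base R as follows. This only illustrates how repeated k-fold indices can be generated; it is not how caret builds its resamples, and `make_folds` and the other names are hypothetical.

```r
set.seed(42)  # arbitrary seed so the toy split is reproducible

n <- 58  # number of samples (hypothetical; matches the size of the example data)
k <- 5   # number of folds
m <- 3   # number of repeats

# Assign each sample to one of k folds at random, keeping fold sizes
# as equal as possible, and return the sample indices of each fold.
make_folds <- function(n, k) {
  fold_id <- sample(rep(seq_len(k), length.out = n))
  split(seq_len(n), fold_id)
}

# Repeat the whole k-fold split m times with fresh random assignments.
repeated_folds <- lapply(seq_len(m), function(r) make_folds(n, k))

# Within one repeat, each fold serves once as the test set
# while the remaining k-1 folds form the training set.
test_idx <- repeated_folds[[1]][[1]]
train_idx <- setdiff(seq_len(n), test_idx)

fold_sizes <- sapply(repeated_folds[[1]], length)
print(fold_sizes)
```

With m repeats of k folds, a model is fitted k * m times in total, which is why increasing `repeats` increases run time.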

An `MLSeq` object containing the trained model.

Gokmen Zararsiz, Dincer Goksuluk, Selcuk Korkmaz, Vahap Eldem, Izzet Parug Duru, Turgay Unver, Ahmet Ozturk

Kuhn M. (2008). Building predictive models in R using the caret package. Journal of Statistical Software, 28(5). http://www.jstatsoft.org/v28/i05/

Anders S, Huber W. (2010). Differential expression analysis for sequence count data. Genome Biology, 11:R106

Witten DM. (2011). Classification and clustering of sequencing data using a Poisson model. The Annals of Applied Statistics, 5(4), 2493-2518

Law CW. et al. (2014). Voom: precision weights unlock linear model analysis tools for RNA-Seq read counts. Genome Biology, 15:R29, doi:10.1186/gb-2014-15-2-r29

Witten D. et al. (2010) Ultra-high throughput sequencing-based small RNA discovery and discrete statistical biomarker analysis in a collection of cervical tumours and matched controls. BMC Biology, 8:58

Robinson MD, Oshlack A (2010). A scaling normalization method for differential expression analysis of RNA-Seq data. Genome Biology, 11:R25, doi:10.1186/gb-2010-11-3-r25

predictClassify, train, trainControl

library(MLSeq)   # provides classify() and the cervical data set
library(DESeq2)
data(cervical)

# a subset of cervical data with first 150 features.
data <- cervical[c(1:150), ]

# defining sample classes.
class <- data.frame(condition = factor(rep(c("N","T"), c(29, 29))))

n <- ncol(data)  # number of samples
p <- nrow(data)  # number of features

# number of samples for test set (20% test, 80% train).
set.seed(2128)  # arbitrary seed so the random train/test split is reproducible
nTest <- ceiling(n*0.2)
ind <- sample(n, nTest, FALSE)

# train set
data.train <- data[ ,-ind]
data.train <- as.matrix(data.train + 1)
classtr <- data.frame(condition=class[-ind, ])

# train set in S4 class
data.trainS4 <- DESeqDataSetFromMatrix(countData = data.train,
                   colData = classtr, formula(~ condition))
data.trainS4 <- DESeq(data.trainS4, fitType = "local")

## Number of repeats (repeats) might change model accuracies ##
# Classification and Regression Tree (CART) Classification
cart <- classify(data = data.trainS4, method = "cart",
          normalize = "deseq", transformation = "vst", ref = "T",
          control = trainControl(method = "repeatedcv", number = 5,
                                 repeats = 3, classProbs = TRUE))
cart

# Random Forest (RF) Classification
# rf <- classify(data = data.trainS4, method = "randomforest",
#         normalize = "deseq", transformation = "vst", ref = "T",
#         control = trainControl(method = "repeatedcv", number = 5,
#                                repeats = 3, classProbs = TRUE))
# rf