classify: Fitting classification models to sequencing data



Description

This function fits classification algorithms to sequencing data and measures model performance using various statistics.

Usage

classify(data, method = "rpart", B = 25, ref = NULL,
  class.labels = NULL, preProcessing = c("deseq-vst", "deseq-rlog",
  "deseq-logcpm", "tmm-logcpm", "logcpm"), normalize = c("deseq", "TMM",
  "none"), control = NULL, ...)

Arguments

data

a DESeqDataSet object; see the constructor functions DESeqDataSet, DESeqDataSetFromMatrix and DESeqDataSetFromHTSeqCount in the DESeq2 package.

method

a character string indicating the name of the classification method. Methods are implemented from the caret package. Run availableMethods() for a list of available methods.

B

an integer giving the number of bootstrap samples for bagging classifiers, for example "bagFDA" and "treebag". Default is 25.

ref

a character string indicating the user-defined reference class. Default is NULL. If NULL, the first category of the class labels is used as the reference.
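The effect of ref can be pictured with a plain factor. This is only a base R sketch under the assumption that the reference class behaves like the first level of a factor; the labels are hypothetical and the way classify handles ref internally may differ.

```r
# Hypothetical class labels for six samples (not taken from any data set).
labels <- factor(c("N", "T", "T", "N", "T", "N"))

# With ref = NULL, the first category of the labels acts as the reference.
default_ref <- levels(labels)[1]

# Assumption: choosing ref = "T" is comparable to moving "T" to the first level.
releveled <- relevel(labels, ref = "T")
custom_ref <- levels(releveled)[1]

print(c(default = default_ref, custom = custom_ref))
```

The choice of reference does not change the data; it only determines which class is treated as the baseline when class-specific statistics are reported.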

class.labels

a character string giving a column name of colData(...). The column of colData() that matches this name is used as the class labels of the samples. If NULL, the first column is used as class labels. Default is NULL.

preProcessing

a character string indicating the name of the preprocessing method. This option covers both the normalization and the transformation of the raw sequencing data. Available options are:

  • deseq-vst: Normalization is applied with the deseq median ratio method. Variance-stabilizing transformation is applied to the normalized data.

  • deseq-rlog: Normalization is applied with deseq median ratio method. Regularized logarithmic transformation is applied to the normalized data.

  • deseq-logcpm: Normalization is applied with deseq median ratio method. Log of counts-per-million transformation is applied to the normalized data.

  • tmm-logcpm: Normalization is applied with trimmed mean of M values (TMM) method. Log of counts-per-million transformation is applied to the normalized data.

  • logcpm: Normalization is not applied. Log of counts-per-million transformation is used for the raw counts.

IMPORTANT: See Details for further information.
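To make the normalize-then-transform idea concrete, here is a minimal base R sketch of median ratio normalization followed by a log-cpm transformation. The count matrix is hypothetical, and the 0.5 and 1 offsets in the log-cpm step are an assumption borrowed from the voom-style formula; the exact computation inside MLSeq, DESeq2 or edgeR may differ.

```r
# Hypothetical counts: rows are genes, columns are samples.
# Each sample is an exact multiple (1x, 2x, 4x) of the same profile.
counts <- matrix(c(10, 20, 40,
                   100, 200, 400,
                   5, 10, 20),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:3), paste0("s", 1:3)))

# Median ratio size factors: compare each sample with the per-gene
# geometric mean and take the median ratio.
log_geo_means <- rowMeans(log(counts))
keep <- is.finite(log_geo_means)   # drops genes containing a zero count
size_factors <- apply(counts, 2, function(cnts) {
  exp(median((log(cnts) - log_geo_means)[keep & cnts > 0]))
})

# Normalized counts: divide every sample by its size factor.
norm_counts <- sweep(counts, 2, size_factors, "/")

# Log counts-per-million on the normalized counts (offsets are an assumption).
lib_size <- colSums(norm_counts)
log_cpm <- log2(sweep(norm_counts + 0.5, 2, lib_size + 1, "/") * 1e6)

print(size_factors)
print(round(log_cpm, 2))
```

Because the three toy samples differ only in sequencing depth, normalization makes their columns identical, which is the behaviour the normalization step is meant to achieve.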

normalize

a character string indicating the type of normalization. Should be one of 'deseq', 'TMM' and 'none'. Default is 'deseq'. This option applies to discrete and voom-based classifiers, since no transformation is applied to the raw counts. For caret-based classifiers, use the argument 'preProcessing' instead.

control

a list including all the control parameters passed to the model training process. This argument should be defined using the wrapper functions trainControl for caret-based classifiers, discreteControl for discrete classifiers (PLDA, PLDA2 and NBLDA) and voomControl for voom-based classifiers (voomDLDA, voomDQDA and voomNSC). See the related functions for further details.
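The pairing between method and control wrapper can be summarized in a small lookup. The helper control_family below is hypothetical (it is not part of MLSeq) and only returns the name of the wrapper that the description above associates with a given method.

```r
# Hypothetical helper, not part of MLSeq: names the control wrapper
# that matches a given classification method.
control_family <- function(method) {
  discrete_methods <- c("PLDA", "PLDA2", "NBLDA")
  voom_methods <- c("voomDLDA", "voomDQDA", "voomNSC")
  if (method %in% discrete_methods) {
    "discreteControl"
  } else if (method %in% voom_methods) {
    "voomControl"
  } else {
    # Every other method is assumed to be a caret-based classifier.
    "trainControl"
  }
}

print(sapply(c("rf", "PLDA", "voomNSC"), control_family))
```

For instance, method = "PLDA" would be combined with something like control = discreteControl(number = 5, repeats = 2, tuneLength = 10), as in the Examples section.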

...

optional arguments passed to selected classifiers.

Details

MLSeq provides both microarray-based and discrete-based classifiers, along with preprocessing approaches. The preprocessing covers normalization techniques, i.e. the deseq median ratio (Anders and Huber, 2010) and trimmed mean of M values (Robinson and Oshlack, 2010) methods, and transformation techniques, i.e. variance-stabilizing transformation (vst) (Anders and Huber, 2010), regularized logarithmic transformation (rlog) (Love et al., 2014), logarithm of counts per million reads (log-cpm) (Robinson et al., 2010) and variance modeling at the observational level (voom) (Law et al., 2014). Users can load their raw RNA-Seq count data, preprocess it, build one of numerous classification models, optimize the model parameters and evaluate model performance.

The MLSeq package offers a variety of algorithms for the classification of RNA-Seq data. These classifiers fall into two classes: i) microarray-based classifiers applied after a proper transformation, and ii) discrete-based classifiers.

The first option is to transform the RNA-Seq data to bring it hierarchically closer to microarrays and then apply microarray-based algorithms. These methods are implemented from the caret package. Run availableMethods() for a list of available methods. Note that the voom transformation returns both the transformed gene-expression matrix and a matrix of precision weights with the same dimensions, so the classifier should take both matrices into account. Zararsiz (2015) presented voom-based diagonal discriminant classifiers and the sparse voom-based nearest shrunken centroids classifier.

The second option is to build new discrete-based classifiers for RNA-Seq data. Two methods are currently available in the literature. Witten (2011) modeled the counts with a Poisson distribution and proposed the sparse Poisson linear discriminant analysis (PLDA) classifier. The author suggested a power transformation to deal with the overdispersion problem. Dong et al. (2016) extended this approach to a negative binomial linear discriminant analysis (NBLDA) classifier. More detailed information can be found in the referenced papers.
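The idea behind the discrete-based option can be illustrated with a simplified Poisson discriminant rule in base R. This is a sketch in the spirit of Witten (2011), not the MLSeq or PoiClaClu implementation: it omits the sparsity (shrinkage) step and the power transformation, uses total-count size factors, and the toy counts, the smoothing constant beta and the helper predict_poisson are all assumptions made for illustration.

```r
# Hypothetical training counts: rows are samples, columns are genes.
x <- rbind(c(50, 5, 20),
           c(60, 4, 25),
           c(5, 55, 22),
           c(6, 48, 18))
y <- factor(c("N", "N", "T", "T"))

# Expected counts under no class effect: N_ij = s_i * g_j.
s <- rowSums(x) / sum(x)   # sample size factors (total-count, an assumption)
g <- colSums(x)            # gene totals
N <- outer(s, g)

# Class-specific multipliers d_kj, smoothed by a small constant beta.
beta <- 1
d <- t(sapply(levels(y), function(k) {
  idx <- y == k
  (colSums(x[idx, , drop = FALSE]) + beta) /
    (colSums(N[idx, , drop = FALSE]) + beta)
}))
prior <- as.numeric(table(y)) / length(y)

# Poisson log-likelihood score per class; the largest score wins.
predict_poisson <- function(x_new) {
  s_new <- sum(x_new) / sum(x)
  scores <- drop(log(d) %*% x_new) - s_new * drop(d %*% g) + log(prior)
  names(scores)[which.max(scores)]
}

print(predict_poisson(c(40, 6, 15)))
print(predict_poisson(c(4, 50, 20)))
```

Working directly on counts avoids the transformation step needed by the microarray-based route, at the cost of assuming a specific count distribution.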

Value

an MLSeq object for the trained model.

Author(s)

Dincer Goksuluk, Gokmen Zararsiz, Selcuk Korkmaz, Vahap Eldem, Ahmet Ozturk and Ahmet Ergun Karaagaoglu

References

Kuhn M. (2008). Building predictive models in R using the caret package. Journal of Statistical Software, 28(5), http://www.jstatsoft.org/v28/i05/

Anders S, Huber W. (2010). Differential expression analysis for sequence count data. Genome Biology, 11:R106

Witten DM. (2011). Classification and clustering of sequencing data using a Poisson model. The Annals of Applied Statistics, 5(4), 2493-2518

Law et al. (2014) Voom: precision weights unlock linear model analysis tools for RNA-Seq read counts, Genome Biology, 15:R29, doi:10.1186/gb-2014-15-2-r29

Witten D. et al. (2010) Ultra-high throughput sequencing-based small RNA discovery and discrete statistical biomarker analysis in a collection of cervical tumours and matched controls. BMC Biology, 8:58

Robinson MD, Oshlack A (2010). A scaling normalization method for differential expression analysis of RNA-Seq data. Genome Biology, 11:R25, doi:10.1186/gb-2010-11-3-r25

Love MI, Huber W, Anders S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12):550, doi:10.1186/s13059-014-0550-8

Dong et al. (2016). NBLDA: negative binomial linear discriminant analysis for RNA-Seq data. BMC Bioinformatics, 17(1):369, doi:10.1186/s12859-016-1208-1

Zararsiz G (2015). Development and Application of Novel Machine Learning Approaches for RNA-Seq Data Classification. PhD thesis, Hacettepe University, Institute of Health Sciences, June 2015.

See Also

predictClassify, train, trainControl, voomControl, discreteControl

Examples

## Not run: 
library(MLSeq)   # provides classify() and the cervical data set
library(DESeq2)
data(cervical)

# a subset of cervical data with first 150 features.
data <- cervical[c(1:150), ]

# defining sample classes.
class <- data.frame(condition = factor(rep(c("N","T"), c(29, 29))))

n <- ncol(data)  # number of samples
p <- nrow(data)  # number of features

# number of samples for test set (30% test, 70% train).
set.seed(1)  # arbitrary seed so that the random split is reproducible
nTest <- ceiling(n*0.3)
ind <- sample(n, nTest, FALSE)

# train set
data.train <- data[ ,-ind]
data.train <- as.matrix(data.train + 1)
classtr <- data.frame(condition = class[-ind, ])

# train set in S4 class
data.trainS4 <- DESeqDataSetFromMatrix(countData = data.train,
                   colData = classtr, formula(~ 1))

## Number of repeats (repeats) might change model accuracies
## 1. caret-based classifiers:
# Random Forest (RF) Classification
rf <- classify(data = data.trainS4, method = "rf",
               preProcessing = "deseq-vst", ref = "T",
               control = trainControl(method = "repeatedcv", number = 5,
                                      repeats = 2, classProbs = TRUE))
rf

# 2. Discrete classifiers:
# Poisson Linear Discriminant Analysis
pmodel <- classify(data = data.trainS4, method = "PLDA", ref = "T",
                   class.labels = "condition", normalize = "deseq",
                   control = discreteControl(number = 5, repeats = 2,
                                             tuneLength = 10, parallel = TRUE))
pmodel

# 3. voom-based classifiers:
# voom-based Nearest Shrunken Centroids
vmodel <- classify(data = data.trainS4, normalize = "deseq", method = "voomNSC",
                   class.labels = "condition", ref = "T",
                   control = voomControl(number = 5, repeats = 2, tuneLength = 10))
vmodel

## End(Not run)

dncR/MLSeq documentation built on May 17, 2020, 6:45 p.m.