cross_validation: Cross validation method


Description

The ML-based classification model is trained and tested with N-fold cross validation method.

Usage

cross_validation(seed = 1, method = c("randomForest", "svm", "nnet" ), 
                 featureMat, positives, negatives, cross = 5, cpus = 1, ...)

Arguments

seed

an integer specifying the random seed used to randomly partition the dataset into folds.

method

a character string specifying the machine learning method. Possible values are "randomForest", "nnet" or "svm".

featureMat

a numeric feature matrix.

positives

a character vector recording positive samples.

negatives

a character vector recording negative samples.

cross

the number of folds for cross validation.

cpus

an integer specifying the number of CPUs to be used for parallel computing.

...

further parameters passed on to cross validation; these are the same as the parameters accepted by the classifier function.

Details

In machine learning, the cross validation method has been widely used to evaluate the performance of ML-based classification models (classifiers).

For N-fold cross validation, positive and negative samples are randomly partitioned into N groups with approximately equal numbers of samples, and each group is used in turn to test the performance of the ML-based classifier trained on the other N-1 groups of positive and negative samples.
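This random partitioning step can be sketched in base R as follows (a minimal illustration with made-up sample names and N = 5; this is not the package's internal implementation):

```r
## Sketch of randomly partitioning samples into N folds of roughly equal size
## (illustrative only; sample names and N are hypothetical)
set.seed(1)                           # analogous to the 'seed' argument
positives <- paste0("pos", 1:23)
negatives <- paste0("neg", 1:23)
N <- 5

## sample() shuffles the names; cut() assigns each position to one of N groups
foldOf <- function(samples, N) {
  split(sample(samples), cut(seq_along(samples), breaks = N, labels = FALSE))
}
posFolds <- foldOf(positives, N)
negFolds <- foldOf(negatives, N)

## Fold i serves as the test set; the remaining N-1 folds form the training set
i <- 1
positives.test  <- posFolds[[i]]
positives.train <- unlist(posFolds[-i], use.names = FALSE)
```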

For each round of cross validation, the prediction accuracy of the ML-based classifier is assessed with receiver operating characteristic (ROC) curve analysis. The ROC curve is a two-dimensional plot of the false positive rate (FPR, x-axis) against the true positive rate (TPR, y-axis) at all possible score thresholds. The area under the ROC curve (AUC) is used to quantitatively score the prediction accuracy of the ML-based classifier. The AUC value ranges from 0 to 1.0, with a higher AUC indicating better prediction accuracy.

After all N groups have been used in turn as the testing set, the N sets of (FPR, TPR) pairs are imported into the R package ROCR to visualize the ROC curves. The mean of the N AUC values is then taken as the overall performance of the ML-based classification model.
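For one fold, the (FPR, TPR) pairs and the AUC can be computed with ROCR roughly as below (the prediction scores are hypothetical, and the exact calls inside cross_validation may differ):

```r
library(ROCR)

## Hypothetical classifier scores for one test fold
positives.test.score <- c(0.9, 0.8, 0.75, 0.6)
negatives.test.score <- c(0.7, 0.4, 0.3, 0.2)

scores <- c(positives.test.score, negatives.test.score)
labels <- c(rep(1, length(positives.test.score)),   # 1 = positive
            rep(0, length(negatives.test.score)))   # 0 = negative

pred <- prediction(scores, labels)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")  # (FPR, TPR) pairs
auc  <- performance(pred, measure = "auc")@y.values[[1]]

plot(perf)   # ROC curve for this fold
```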

Value

A list recording the results from each fold of cross validation, with each element containing the components:

positves.train

positive samples used to train prediction model.

negatives.train

negative samples used to train prediction model.

positives.test

positive samples used to test prediction model.

negatives.test

negative samples used to test prediction model.

ml

machine learning method.

classifier

prediction model constructed with the best parameters obtained from training dataset.

positives.train.score

scores of positive samples in training dataset predicted by classifier.

negatives.train.score

scores of negative samples in training dataset predicted by classifier.

positives.test.score

scores of positive samples in testing dataset predicted by classifier.

negatives.test.score

scores of negative samples in testing dataset predicted by classifier.

train.AUC

AUC value of the ML-based classifier on the training dataset.

test.AUC

AUC value of the ML-based classifier on the testing dataset.

Author(s)

Chuang Ma, Xiangfeng Wang

Examples

## Not run: 

   ##generate expression feature matrix
   sampleVec1 <- c(1, 2, 3, 4, 5, 6)
   sampleVec2 <- c(1, 2, 3, 4, 5, 6)
   featureMat <- expFeatureMatrix( expMat1 = ControlExpMat, sampleVec1 = sampleVec1, 
                                   expMat2 = SaltExpMat, sampleVec2 = sampleVec2, 
                                   logTransformed = TRUE, base = 2,
                                   features = c("zscore", "foldchange", "cv", "expression"))

   ##positive samples
   positiveSamples <- as.character(sampleData$KnownSaltGenes)
   ##unlabeled samples
   unlabelSamples <- setdiff( rownames(featureMat), positiveSamples )
   idx <- sample(length(unlabelSamples))
   ##randomly selecting a set of unlabeled samples as negative samples
   negativeSamples <- unlabelSamples[idx[1:length(positiveSamples)]]

   ##five-fold cross validation
   seed <- randomSeed() #generate a random seed
   cvRes <- cross_validation(seed = seed, method = "randomForest", 
                             featureMat = featureMat, 
                             positives = positiveSamples, 
                             negatives = negativeSamples, 
                             cross = 5, cpus = 1,
                             ntree = 100 ) ##parameters for random forest algorithm

   ##get AUC values for five rounds of cross validation
   aucVec <- rep(0, 5)
   for( i in 1:5 )
     aucVec[i] <- cvRes[[i]]$test.AUC
  
   
   ##average AUC values as the final performance of the ML-based classifier
   mean(aucVec)

 

## End(Not run)

mlDNA documentation built on May 2, 2019, 2:15 p.m.