Description

Perform cross validation as k-fold cross validation, Leave-One-Out cross validation (LOOCV), or grouped cross validation (GCV).
Usage

## kbsvm(......, cross=0, noCross=1, .....)
## please use kbsvm for cross validation and do not call the
## performCrossValidation method directly

## S4 method for signature 'ExplicitRepresentation'
performCrossValidation(object, x, y, sel,
    model, cross, noCross, groupBy, perfParameters, verbose)
Arguments

object: a kernel matrix or an explicit representation

x: an optional set of sequences

y: a response vector

sel: sample subset for which cross validation should be performed

model: a KeBABS model

cross: an integer value K > 0 indicates that k-fold cross validation should be performed. The value -1 selects Leave-One-Out (LOO) cross validation (see Details). Default=0

noCross: an integer value larger than 0 specifying the number of repetitions of cross validation. This parameter is only relevant if 'cross' is different from 0. Default=1

groupBy: allows grouping of samples during cross validation. The parameter is only relevant when 'cross' is larger than 1. It is an integer vector or factor with the same length as the number of samples used for training and specifies for each sample the group it belongs to. Samples from the same group are never spread over more than one fold. Grouped cross validation can also be used in grid search for each grid point. Default=NULL

perfParameters: a character vector with one or several values from the set "ACC", "BACC", "MCC", "AUC" and "ALL". "ACC" stands for accuracy, "BACC" for balanced accuracy, "MCC" for the Matthews correlation coefficient, "AUC" for the area under the ROC curve, and "ALL" for all four. This parameter defines which performance measures are collected during cross validation for display purposes. The summary values are computed as the mean of the fold values. AUC computation from pooled decision values requires a calibrated classifier output and is currently not supported. Default=NULL

verbose: boolean value that indicates whether KeBABS should print additional messages showing the internal processing logic in a verbose manner. The default value depends on the R session verbosity option. Default=getOption("verbose")
Details

Overview
Cross validation (CV) provides an estimate of the generalization performance of a model, based on repeated training on different subsets of the data and evaluation of the prediction performance on the remaining data not used for training. Depending on the strategy used to split the data, different variants of cross validation exist. KeBABS implements k-fold cross validation, Leave-One-Out cross validation, and Leave-Group-Out cross validation, which is a specific variant of k-fold cross validation. Cross validation is invoked with kbsvm by setting the parameters cross and noCross. It can either be used for a given kernel and specific values of the SVM hyperparameters to compute the cross validation error of a single model, or in conjunction with grid search (see gridSearch) and model selection (see modelSelection) to determine the performance of multiple models.
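For a single model, cross validation is started directly from kbsvm, for instance as follows (a minimal sketch using the TFBS data shipped with KeBABS; the kernel and the cost value are chosen for illustration only):

library(kebabs)
data(TFBS)  ## provides the sequences 'enhancerFB' and the labels 'yFB'
## 5-fold cross validation of a single model with a spectrum kernel
model <- kbsvm(x=enhancerFB, y=yFB, kernel=spectrumKernel(k=3),
               pkg="LiblineaR", svm="C-svc", cost=10,
               cross=5, noCross=1)
cvResult(model)  ## cross validation error of this single model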
k-fold Cross Validation and Leave-One-Out Cross Validation (LOOCV)

For k-fold cross validation the data is split into k roughly equally sized subsets called folds. Samples are assigned to the folds randomly. In k successive training runs one of the folds is held out in round-robin manner for predicting the performance, while the other k-1 folds together are used as training data. Typical values for the number of folds k are 5 or 10, depending on the number of samples used for CV. For LOOCV the fold size decreases to 1 and only a single sample is held out for performance prediction, so one cross validation run requires as many training runs as there are sequences used for CV.
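The random fold assignment can be pictured with a few lines of plain R (an illustrative sketch only, not the KeBABS-internal implementation):

n <- 25                                  ## number of samples
k <- 5                                   ## number of folds
folds <- sample(rep(1:k, length.out=n))  ## random fold label for each sample
table(folds)                             ## folds are of roughly equal size
## for LOOCV each sample forms its own hold-out fold:
foldsLOO <- 1:n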
Grouped Cross Validation (GCV)
For grouped cross validation samples are assigned to groups by the user before running cross validation, e.g. via clustering of the sequences. The predefined group assignment is passed to CV with the parameter groupBy in kbsvm. GCV is a special version of k-fold cross validation which respects group boundaries by never spreading the samples of one group over multiple folds. In this way the group(s) in the test fold do not occur during training, and learning is forced to concentrate on more complex features instead of the simple features that split the groups. For GCV the parameter cross must be smaller than or equal to the number of groups.
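The group assignment is simply an integer vector or factor aligned with the training samples, as the following sketch shows (hypothetical groups; the kbsvm call is commented out because 'seqs' and 'labels' are placeholders):

## hypothetical group membership for 12 samples, e.g. from clustering
groups <- c(1,1,1,2,2,3,3,3,4,4,4,4)
## grouped 4-fold CV keeps all samples of a group in the same fold;
## note that 'cross' must not exceed the number of groups
## model <- kbsvm(x=seqs, y=labels, kernel=gappyPairKernel(k=1, m=4),
##                pkg="LiblineaR", svm="C-svc", cost=10,
##                cross=4, groupBy=groups)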
Cross Validation Result
The cross validation error, which is the average of the prediction errors over all held-out folds, is used as an estimate of the generalization error of the model associated with the cross validation run. For classification the fraction of incorrectly classified samples and for regression the mean squared error (MSE) is used as prediction error. Multiple cross validation runs can be performed by setting the parameter noCross. The cross validation result can be extracted from the model object returned by cross validation with the cvResult accessor. It contains the mean CV error over all runs, the CV errors of the single runs and the CV error for each fold. The CV result object can be plotted with the method plot, showing the variation of the CV error over the different runs as a barplot. With the parameter perfParameters in kbsvm the accuracy, the balanced accuracy and the Matthews correlation coefficient can be requested as additional performance measures to be recorded in the CV result object, which is of interest especially for unbalanced datasets.
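Assuming a model trained with cross validation and recorded performance parameters, the result is accessed as in the following sketch ('model' stands for the return value of a kbsvm call with cross > 0 and perfParameters set):

res <- cvResult(model)  ## CV result object stored in the model
res                     ## mean CV error plus per-run and per-fold values
plot(res)               ## barplot of the CV error across the runs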
Value

Cross validation stores the cross validation results in the KeBABS model object returned by kbsvm. They can be retrieved with the accessor cvResult.
Author(s)

Johannes Palme <kebabs@bioinf.jku.at>

http://www.bioinf.jku.at/software/kebabs
References

J. Palme, S. Hochreiter, and U. Bodenhofer (2015). KeBABS: an R package for kernel-based analysis of biological sequences. Bioinformatics, 31(15):2574-2576. DOI: 10.1093/bioinformatics/btv176.
Examples

## load transcription factor binding site data
data(TFBS)
enhancerFB
## select a few samples for training - here for demonstration purpose
## normally you would use 70 or 80% of the samples for training and
## the rest for test
## train <- sample(1:length(enhancerFB), length(enhancerFB) * 0.7)
## test <- c(1:length(enhancerFB))[-train]
train <- sample(1:length(enhancerFB), 50)
## create a kernel object for the gappy pair kernel with normalization
gappy <- gappyPairKernel(k=1, m=4)
## show details of kernel object
gappy
## run cross validation with the kernel on C-svc in LiblineaR for cost=10
model <- kbsvm(x=enhancerFB[train], y=yFB[train], kernel=gappy,
pkg="LiblineaR", svm="C-svc", cost=10, cross=3)
## show cross validation result
cvResult(model)
## Not run:
## perform five cross validation runs
model <- kbsvm(x=enhancerFB[train], y=yFB[train], kernel=gappy,
pkg="LiblineaR", svm="C-svc", cost=10, cross=10, noCross=5)
## show cross validation result
cvResult(model)
## plot cross validation result
plot(cvResult(model))
## run Leave-One-Out cross validation
model <- kbsvm(x=enhancerFB[train], y=yFB[train], kernel=gappy,
pkg="LiblineaR", svm="C-svc", cost=10, cross=-1)
## show cross validation result
cvResult(model)
## run grouped cross validation with full data
## on coiled coil dataset
##
## In this example the groups were determined through single linkage
## clustering of sequence similarities derived from ungapped heptad-specific
## pairwise alignment of the sequences. The variable 'ccgroups' contains
## the pre-calculated group assignments for the individual sequences.
data(CCoil)
ccseq
head(yCC)
head(ccgroups)
gappyK1M6 <- gappyPairKernel(k=1, m=6)
## run k-fold CV without groups
model <- kbsvm(x=ccseq, y=as.numeric(yCC), kernel=gappyK1M6,
pkg="LiblineaR", svm="C-svc", cost=10, cross=3, noCross=2,
perfObjective="BACC", perfParameters=c("ACC", "BACC"))
## show result without groups
cvResult(model)
## run grouped CV
model <- kbsvm(x=ccseq, y=as.numeric(yCC), kernel=gappyK1M6,
pkg="LiblineaR", svm="C-svc", cost=10, cross=3,
noCross=2, groupBy=ccgroups, perfObjective="BACC",
perfParameters=c("ACC", "BACC"))
## show result with groups
cvResult(model)
## For grouped CV the samples in the held-out fold are from a group which
## is not present in training on the other folds. The similar CV error
## with and without groups shows that learning is not just assigning
## labels based on similarity within the groups but is focusing on features
## that are indicative of the class also in the CV without groups. For
## GCV no information about group membership of the samples in the held-out
## fold is present in the model. This example is meant to show how GCV
## is performed. Because of package size limitations no dataset for which
## GCV is strictly necessary is available in this package.
## End(Not run)