Description

Perform grid search with one or multiple sequence kernels on one or multiple SVMs with one or multiple SVM parameter sets.

Usage

```
## kbsvm(...., kernel=list(kernel1, kernel2), pkg=pkg1, svm=svm1,
## cost=cost1, ...., cross=0, noCross=1, ....)
## kbsvm(...., kernel=kernel1, pkg=pkg1, svm=svm1,
## cost=c(cost1, cost2), ...., cross=0, noCross=1, ....)
## kbsvm(...., kernel=kernel1, pkg=c(pkg1, pkg1, pkg1),
## svm=c(svm1, svm2, svm3), cost=c(cost1, cost2, cost3), ....,
## cross=0, noCross=1, ....)
## kbsvm(...., kernel=kernel1, pkg=c(pkg1, pkg2, pkg3),
## svm=c(svm1, svm2, svm3), cost=c(cost1, cost2, cost3), ....,
## cross=0, noCross=1, ....)
## kbsvm(...., kernel=list(kernel1, kernel2, kernel3), pkg=c(pkg1, pkg2),
## svm=c(svm1, svm2), cost=c(cost1, cost2), ...., cross=0,
## noCross=1, ....)
## for details see below
```

Arguments

For the parameter `kernel` and all other parameters see `kbsvm`.

Details

Overview

To simplify the selection of an appropriate sequence kernel (including the
setting of kernel parameters), SVM implementation, and SVM hyperparameters,
KeBABS provides grid search functionality. Beyond running the same learning
task for different settings of the SVM hyperparameters, grid search is
understood here in the broader sense of finding good values for all major
variable parts of the learning task, which includes:

- selection of the sequence kernel and standard kernel parameters: spectrum, mismatch, gappy pair or motif kernel

- selection of the kernel variant: regular, annotation-specific, position-specific or distance-weighted kernel variants

- selection of the SVM implementation via package and SVM

- selection of the SVM hyperparameters for the SVM implementation

KeBABS supports the joint variation of any combination of these learning
aspects together with cross validation (CV) to find the best selection based
on cross validation performance. After the grid search, the performance
values of the different settings and the best setting of the grid search
run can be retrieved from the KeBABS model with the accessor
`modelSelResult`.

Grid search is started with the method `kbsvm` by passing multiple values
to parameters for which regular training uses only a single value. Multiple
values can be passed for the parameter `kernel` as a list of kernel
objects, and for the parameters `pkg`, `svm` and the hyperparameters of the
used SVMs as vectors (numeric or integer vector, dependent on the
hyperparameter). The parameter `cost` in the usage section above is just
one representative of the SVM hyperparameters that can be varied in grid
search. The following types of grid search are supported (for examples see
below):

- variation of one or multiple hyperparameter(s) for a given SVM implementation and one specific kernel by passing the hyperparameter values as vectors

- variation of the kernel parameters of a single kernel: for the sequence kernels, in addition to the standard kernel parameters like k for the spectrum kernel or m for the gappy pair kernel, the analysis can be performed in a position-independent or position-dependent manner with multiple distance weighting functions and different parameter settings for the distance weighting functions (see `positionMetadata`), or with or without annotation-specific functionality (see `annotationMetadata`) using one specific or multiple annotations, resulting in considerable variation possibilities on the kernel side. The kernel objects for the different parameter settings must be precreated and are passed as a list to `kbsvm`. Usually each kernel performs best at different hyperparameter values. Therefore, in general, varying only the kernel parameters without also varying the hyperparameter values does not make sense; both must be varied together as described below.

- variation of multiple SVMs from the same or different R packages, with identical or different SVM hyperparameters (dependent on the formulation of the SVM objective), for one specific kernel

- combination of the previous three variants, as far as runtime allows (see also the runtime hints below)

For collecting performance values, grid search is organized in a
matrix-like manner, with the different kernel objects representing the rows
and the different hyperparameter settings (or SVM plus hyperparameter
settings) the columns of the matrix. If multiple hyperparameters are used
for a single SVM, the same entry in all hyperparameter vectors is treated
as one parameter set corresponding to a single column of the grid matrix.
The same applies to multiple SVMs: even when multiple SVMs are used from
the same package, the `pkg` parameter must have one entry for each entry in
the `svm` parameter (see examples below). The best performing setting is
reported dependent on the performance objective.

Instead of a single training and test cycle for each grid point, cross
validation should be used to get more representative results. In this case
CV is executed for each parameter setting. For larger datasets or kernels
with higher complexity, the runtime of the full grid search should be
limited through an adequate choice of the parameter `cross`.

Performance measures and performance objective

The usual performance measure for grid search is the cross validation
error, which is stored by default for each grid point. For datasets with an
unbalanced class distribution, other performance measures can be more
expressive. For such situations the accuracy, the balanced accuracy and the
Matthews correlation coefficient can also be stored for each grid point
(see parameter `perfParameters` in `kbsvm`). (The accuracy corresponds
fully to the CV error because it is just the inverted measure; it is
included for easier comparability with the balanced accuracy.) The
performance values can be retrieved from the model selection result object
with the accessor `performance`. The objective for selecting the best
performing parameter setting is by default the CV error. With the parameter
`perfObjective` in `kbsvm`, one of the other performance parameters
mentioned above can be chosen as objective instead of the cross validation
error.
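As a sketch of how these parameters combine, the following selects the best grid point by balanced accuracy instead of the CV error. The value codes "ACC", "BACC" and "MCC" are assumptions about the exact spelling accepted by `perfParameters` and `perfObjective`; see `?kbsvm` for the authoritative list.

```r
## sketch: grid search with additional performance measures and
## balanced accuracy as selection objective (assumes the TFBS data
## from the Examples section below)
library(kebabs)
data(TFBS)
specK3 <- spectrumKernel(k=3)
model <- kbsvm(x=enhancerFB, y=yFB, kernel=specK3,
               pkg="LiblineaR", svm="C-svc", cost=c(0.1, 1, 10),
               explicit="yes", cross=5,
               perfParameters=c("ACC", "BACC", "MCC"),
               perfObjective="BACC")
## retrieve the collected performance values for all grid points
performance(modelSelResult(model))
```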

Runtime Hints

When the parameter `showCVTimes` in `kbsvm` is set to TRUE, the runtime of
the individual cross validation runs is shown for each grid point. In this
way quick runtime estimates can be gathered by running the grid search for
a reduced grid and extrapolating the runtimes to the full grid. A progress
indication for grid search is available with the parameter `showProgress`
in `kbsvm`.

Dependent on the number of sequences, the complexity of the kernel
processing, the type of chosen cross validation and the degree of parameter
variation in the grid search, the runtime can grow drastically. One
possible strategy for reducing the runtime is a stepwise approach:
searching for areas of good performance with a first coarse grid search run
and then refining those areas with additional, more fine-grained grid
searches.
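The stepwise strategy can be sketched as two consecutive `kbsvm` calls; the cost grids and the choice of the refinement region below are illustrative assumptions, not recommendations.

```r
## coarse search: scan the cost hyperparameter over orders of magnitude
## (assumes the TFBS data from the Examples section below)
gappyK1M2 <- gappyPairKernel(k=1, m=2)
coarse <- kbsvm(x=enhancerFB, y=yFB, kernel=gappyK1M2,
                pkg="LiblineaR", svm="C-svc",
                cost=10^(-2:3), explicit="yes", cross=5)
modelSelResult(coarse)
## fine search: suppose the coarse run performed best around cost=10;
## refine with a denser grid in that region
fine <- kbsvm(x=enhancerFB, y=yFB, kernel=gappyK1M2,
              pkg="LiblineaR", svm="C-svc",
              cost=c(2, 5, 10, 20, 50), explicit="yes", cross=5)
modelSelResult(fine)
```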

The sequence kernels were implemented with a strong focus on runtime
performance, which brings a considerable improvement compared to other
implementations. KeBABS also provides an interface to the very fast SVM
implementations in package LiblineaR. Beyond these performance
improvements, KeBABS supports the generation of sparse explicit
representations for every sequence kernel, which can be used for learning
instead of the kernel matrix. In many cases, especially with a large number
of samples where the kernel matrix would become too large, this alternative
provides additional runtime benefits. The current implementation of grid
search does not make use of multi-core infrastructures; the entire
processing is done on a single core.

Value

Grid search stores the results in the KeBABS model. They can be retrieved
with the accessor `modelSelResult`.

Author(s)

Johannes Palme <[email protected]>

References

http://www.bioinf.jku.at/software/kebabs

J. Palme, S. Hochreiter, and U. Bodenhofer (2015). KeBABS: an R package
for kernel-based analysis of biological sequences.
*Bioinformatics*, 31(15):2574-2576.
DOI: 10.1093/bioinformatics/btv176.

See Also

`kbsvm`, `spectrumKernel`, `mismatchKernel`, `gappyPairKernel`,
`motifKernel`, `positionMetadata`, `annotationMetadata`,
`performModelSelection`

Examples

```
## load transcription factor binding site data
data(TFBS)
enhancerFB
## The C-svc implementation from LiblineaR is chosen for most of the
## examples because it is the fastest SVM implementation. With SVMs from
## other packages slightly better results could be achievable.
## To get a realistic image of possible performance values, kernel behavior
## and speed of grid search together with 10-fold cross validation a
## reasonable number of sequences is needed which would exceed the runtime
## restrictions for automatically executed examples. Therefore the grid
## search examples must be run manually. In these examples we use the full
## dataset for grid search.
train <- sample(1:length(enhancerFB), length(enhancerFB))
## grid search with single kernel object and multiple hyperparameter values
## create gappy pair kernel with normalization
gappyK1M3 <- gappyPairKernel(k=1, m=3)
## show details of single gappy pair kernel object
gappyK1M3
## grid search for a single kernel object and multiple values for cost
pkg <- "LiblineaR"
svm <- "C-svc"
cost <- c(0.01,0.1,1,10,100,1000)
model <- kbsvm(x=enhancerFB[train], y=yFB[train], kernel=gappyK1M3,
pkg=pkg, svm=svm, cost=cost, explicit="yes", cross=3)
## show grid search results
modelSelResult(model)
## Not run:
## create the list of spectrum kernel objects with normalization and
## kernel parameter values for k from 1 to 5
specK15 <- spectrumKernel(k=1:5)
## show details of the five spectrum kernel objects
specK15
## run grid search with several kernel parameter settings for the
## spectrum kernel with a single SVM parameter setting
## ATTENTION: DO NOT USE THIS VARIANT!
## This variant does not deliver comparable performance for the different
## kernel parameter settings because the best performing hyperparameter
## values are usually quite different for different kernel parameter
## settings or between different kernels; grid search for multiple kernel
## objects should be done as shown in the next example
pkg <- "LiblineaR"
svm <- "C-svc"
cost <- 2
model <- kbsvm(x=enhancerFB[train], y=yFB[train], kernel=specK15,
pkg=pkg, svm=svm, cost=cost, explicit="yes", cross=10)
## show grid search results
modelSelResult(model)
## grid search with multiple kernel objects and multiple values for
## hyperparameter cost
pkg <- "LiblineaR"
svm <- "C-svc"
cost <- c(0.01,0.1,1,10,50,100,150,200,500,1000)
model <- kbsvm(x=enhancerFB, sel=train, y=yFB[train], kernel=specK15,
pkg=pkg, svm=svm, cost=cost, explicit="yes", cross=10,
showProgress=TRUE)
## show grid search results
modelSelResult(model)
## grid search for a single kernel object with multiple SVMs
## from different packages
## here with display of cross validation runtimes for each grid point
## pkg, svm and cost vectors must have same length and the corresponding
## entry in each of these vectors are one SVM + SVM hyperparameter setting
pkg <- rep(c("kernlab", "e1071", "LiblineaR"),3)
svm <- rep("C-svc", 9)
cost <- rep(c(0.01,0.1,1),each=3)
model <- kbsvm(x=enhancerFB[train], y=yFB[train], kernel=gappyK1M3,
pkg=pkg, svm=svm, cost=cost, explicit="yes", cross=3,
showCVTimes=TRUE)
## show grid search results
modelSelResult(model)
## run grid search for a single kernel with multiple SVMs from same package
## here all from LiblineaR: C-SVM, L2 regularized SVM with L2 loss and
## SVM with L1 regularization and L2 loss
## attention: for different formulations of the SVM objective use different
## values for the hyperparameters even if they have the same name
pkg <- rep("LiblineaR", 9)
svm <- rep(c("C-svc","l2rl2l-svc","l1rl2l-svc"), each=3)
cost <- c(1,150,1000,1,40,100,1,40,100)
model <- kbsvm(x=enhancerFB, sel=train, y=yFB[train], kernel=gappyK1M3,
pkg=pkg, svm=svm, cost=cost, explicit="yes", cross=3)
## show grid search results
modelSelResult(model)
## create the list of kernel objects for gappy pair kernel
gappyK1M15 <- gappyPairKernel(k=1, m=1:5)
## show details of kernel objects
gappyK1M15
## run grid search with progress indication with ten kernels and ten
## hyperparameter values for cost and 10 fold cross validation on full
## dataset (500 samples)
pkg <- rep("LiblineaR", 10)
svm <- rep("C-svc", 10)
cost <- c(0.0001,0.001,0.01,0.1,1,10,100,1000,10000,100000)
model <- kbsvm(x=enhancerFB, y=yFB, kernel=c(specK15, gappyK1M15),
pkg=pkg, svm=svm, cost=cost, cross=10, explicit="yes",
showCVTimes=TRUE, showProgress=TRUE)
## show grid search results
modelSelResult(model)
## End(Not run)
```

kebabs documentation built on May 2, 2018, 4:40 a.m.
