Train an SVM-model with a sequence kernel on biological sequences
## S4 method for signature 'BioVector'
kbsvm(x, y, kernel = NULL, pkg = "auto",
svm = "C-svc", explicit = "auto", explicitType = "auto",
featureType = "linear", featureWeights = "auto",
weightLimit = .Machine$double.eps, classWeights = numeric(0), cross = 0,
noCross = 1, groupBy = NULL, nestedCross = 0, noNestedCross = 1,
perfParameters = character(0), perfObjective = "ACC", probModel = FALSE,
sel = integer(0), features = NULL, showProgress = FALSE,
showCVTimes = FALSE, runtimeWarning = TRUE,
verbose = getOption("verbose"), ...)
## S4 method for signature 'XStringSet'
kbsvm(x, y, kernel = NULL, pkg = "auto",
svm = "C-svc", explicit = "auto", explicitType = "auto",
featureType = "linear", featureWeights = "auto",
weightLimit = .Machine$double.eps, classWeights = numeric(0), cross = 0,
noCross = 1, groupBy = NULL, nestedCross = 0, noNestedCross = 1,
perfParameters = character(0), perfObjective = "ACC", probModel = FALSE,
sel = integer(0), features = NULL, showProgress = FALSE,
showCVTimes = FALSE, runtimeWarning = TRUE,
verbose = getOption("verbose"), ...)
## S4 method for signature 'ExplicitRepresentation'
kbsvm(x, y, kernel = NULL, pkg = "auto",
svm = "C-svc", explicit = "auto", explicitType = "auto",
featureType = "linear", featureWeights = "auto",
weightLimit = .Machine$double.eps, classWeights = numeric(0), cross = 0,
noCross = 1, groupBy = NULL, nestedCross = 0, noNestedCross = 1,
perfParameters = character(0), perfObjective = "ACC", probModel = FALSE,
sel = integer(0), showProgress = FALSE, showCVTimes = FALSE,
runtimeWarning = TRUE, verbose = getOption("verbose"), ...)
## S4 method for signature 'KernelMatrix'
kbsvm(x, y, kernel = NULL, pkg = "auto",
svm = "C-svc", explicit = "no", explicitType = "auto",
featureType = "linear", featureWeights = "no",
classWeights = numeric(0), cross = 0, noCross = 1, groupBy = NULL,
nestedCross = 0, noNestedCross = 1, perfParameters = character(0),
perfObjective = "ACC", probModel = FALSE, sel = integer(0),
showProgress = FALSE, showCVTimes = FALSE, runtimeWarning = TRUE,
verbose = getOption("verbose"), ...)
x |
multiple biological sequences in the form of a
|
y |
response vector which contains one value for each sample in 'x'. For classification tasks this can be a character vector, a factor or a numeric vector; for regression tasks it must be a numeric vector. For numeric labels in binary classification the positive class must have the larger value; for factor or character labels the positive label must come first when the labels are sorted in descending order according to the C locale. If the parameter 'sel' is used to train on a sample subset, the response vector must have the same length as 'sel'. |
kernel |
a sequence kernel object or a string kernel from package kernlab. In case of grid search or model selection a list of sequence kernel objects can be passed to training. |
pkg |
name of package which contains the SVM implementation to be used
for training, e.g. |
svm |
name of the SVM used for the classification or regression task,
e.g. "C-svc". For gridSearch or model selection multiple SVMs can be passed
as character vector. For each entry in this character vector a corresponding
entry in the character vector for parameter |
explicit |
this parameter controls whether training should be performed
with the kernel matrix (see |
explicitType |
this parameter is only relevant when parameter 'explicit' is different from "no". The values "sparse" and "dense" indicate whether a sparse or dense explicit representation should be used. When the parameter is set to "auto" KeBABS selects a variant. Default="auto" |
featureType |
when the parameter is set to "linear" single features are used in the analysis (with a linear kernel matrix or a linear kernel applied to the linear explicit representation). When set to "quadratic" the analysis is based on feature pairs. For an SVM from LiblineaR (which does not support kernels) KeBABS generates a quadratic explicit representation. For the other SVMs a polynomial kernel of degree 2 is used for learning via explicit representation. In the case of learning via kernel matrix a quadratic kernel matrix (quadratic here in the sense of a linear kernel matrix with each element taken to the power of 2) is generated. Default="linear" |
featureWeights |
with the values "no" and "yes" the user can control whether feature weights are calculated as part of the training. When the parameter is set to "auto" KeBABS selects a variant (see below). Default="auto" |
weightLimit |
the feature weight limit is a single numeric value and allows pruning of feature weights. All feature weights with an absolute value below this limit are set to 0 and are not considered in the model and for further predictions. This parameter is only relevant when featureWeights are calculated in KeBABS during training. Default=.Machine$double.eps |
classWeights |
a named numeric vector of weights for the different classes, used for asymmetric class sizes. Each element must be named with one of the class names, but not all class names need to be present. Default=1 |
cross |
an integer value k > 0 indicates that k-fold cross validation should be performed. A value of -1 is used for Leave-One-Out (LOO) cross validation (see below). Default=0 |
noCross |
an integer value larger than 0 is used to specify the number of repetitions for cross validation. This parameter is only relevant if 'cross' is different from 0. Default=1 |
groupBy |
allows a grouping of samples during cross validation. The
parameter is only relevant when 'cross' is larger than 1. It is an integer
vector or factor with the same length as the number of samples used for
training and specifies for each sample to which group it belongs. Samples
from the same group are never spread over more than one fold. (see
|
nestedCross |
an integer value k > 0 indicates that a model selection with nested cross validation should be performed with a k-fold outer cross validation. The inner cross validation is defined with the 'cross' parameter (see below). Default=0 |
noNestedCross |
an integer value larger than 0 is used to specify the number of repetitions for the nested cross validation. This parameter is only relevant if 'nestedCross' is larger than 0. Default=1 |
perfParameters |
a character vector with one or several values from the set "ACC", "BACC", "MCC", "AUC" and "ALL". "ACC" stands for accuracy, "BACC" for balanced accuracy, "MCC" for Matthews Correlation Coefficient, "AUC" for area under the ROC curve and "ALL" for all four. This parameter defines which performance parameters are collected in cross validation, grid search and model selection for display purposes. The value "AUC" is currently not supported for multiclass classification. Default=character(0) |
perfObjective |
a single character string from the set "ACC", "BACC" and "MCC" (see previous parameter). The parameter is only relevant in grid search and model selection and defines which performance measure is used to determine the best performing parameter set. Default="ACC" |
probModel |
when setting this boolean parameter to TRUE a probability model is determined as part of the training (see below). Default=FALSE |
sel |
subset of indices into |
features |
feature subset of the specified kernel in the form of a character vector. When a feature subset is passed to the function all other features in the feature space are not considered for training (see below). A feature subset can only be used when a single kernel object is specified in the 'kernel' parameter. Default=NULL |
showProgress |
when setting this boolean parameter to TRUE the progress of a cross validation is displayed. The parameter is only relevant for cross validation. Default=FALSE |
showCVTimes |
when setting this boolean parameter to TRUE the runtimes of the cross validation runs are shown after the cross validation is finished. The parameter is only relevant for cross validation. Default=FALSE |
runtimeWarning |
when setting this boolean parameter to FALSE a warning for long runtimes will not be shown in case of large feature space dimension or large number of samples. Default=TRUE |
verbose |
boolean value that indicates whether KeBABS should print additional messages showing the internal processing logic in a verbose manner. The default value depends on the R session verbosity option. Default=getOption("verbose") |
... |
additional parameters which are passed to SVM training transparently. |
Overview
The kernel-related functionality provided in this package is specifically
centered around biological sequences, i.e. DNA-, RNA- or AA-sequences (see
also DNAStringSet, RNAStringSet and AAStringSet) and Support Vector Machine
(SVM) based methods. Apart from the implementation of the most relevant
kernels for sequence analysis (see spectrumKernel, mismatchKernel,
gappyPairKernel and motifKernel) KeBABS also provides a framework which
allows easy interworking with existing SVM implementations in other R
packages. In the current implementation the SVMs provided in the packages
kernlab, e1071 and LiblineaR are in focus. Starting with version 1.2.0 KeBABS
also contains the dense implementation of LIBSVM which is functionally
equivalent to the sparse implementation of LIBSVM in package e1071 but
additionally supports dense kernel matrices as preferred implementation for
learning via kernel matrices.
This framework can be considered a "meta-SVM" which provides a simple and
unified user interface to these SVMs for classification (binary and
multiclass) and regression tasks. The user calls the "meta-SVM" in a
classical SVM-like manner by passing sequence data, a sequence kernel with
kernel parameters and the SVM which should be used for the learning task
together with SVM parameters. KeBABS internally generates the relevant
representations (see getKernelMatrix or getExRep) from the sequence data
using the specified kernel, adapts parameters and formats to the selected SVM
and internally calls the actual SVM implementation in the requested package.
KeBABS unifies the result returned from the invoked SVM and returns a unified
data structure, the KeBABS model, which also contains the SVM-specific model
(see svmModel).
The KeBABS model is used in prediction (see predict) to predict the response
for new sequence data. On user request the feature weights are computed and
stored in the KeBABS model during training (see below). The feature weights
are used for the generation of prediction profiles (see
getPredictionProfile) which show the importance of sequence positions for a
specific learning task.
Training of biological sequences with a sequence kernel
Training is performed via the method kbsvm for classification and regression
tasks. The user passes sequence data, the response vector, a sequence kernel
object and the requested SVM along with SVM parameters to kbsvm and receives
the training results in the form of a KeBABS model object of class KBModel.
The accessor svmModel retrieves the SVM-specific model from the KeBABS model
object. However, for regular operation a detailed look into the SVM-specific
model is usually not necessary.
The standard data formats for sequences in KeBABS are the XStringSet-derived
classes DNAStringSet, RNAStringSet and AAStringSet. (When repeat regions are
coded as lowercase characters and should be excluded from the analysis, the
sequence data can be passed as BioVector, which also supports lowercase
characters, instead of in XStringSet format. Please note that the classes
derived from XStringSet are much more powerful than the BioVector-derived
classes and should be used in all cases where lowercase characters are not
needed.)
Instead of sequences, a precomputed explicit representation or a precomputed
kernel matrix can also be used for training. Examples for training with
kernel matrix and explicit representation can be found on the help page for
the prediction method predict.
Apart from SVM training, kbsvm can also be used for cross validation (see
crossValidation and parameters cross and noCross), grid search for SVM- and
kernel-parameter values (see gridSearch) and model selection (see
modelSelection and parameters nestedCross and noNestedCross).
Package and SVM selection
The user specifies the SVM implementation to be used for a learning task by
selecting the package with the pkg parameter and the SVM method in the
package with the svm parameter. Currently the packages kernlab, e1071 and
LiblineaR are supported. The names for SVM methods vary from package to
package, and KeBABS provides the following unified names which can be
selected across packages. The following table shows the available SVM
methods:
SVM name     | description                                                  |
------------ | ------------------------------------------------------------ |
C-svc:       | C classification (with L2 regularization and L1 loss)        |
l2rl2l-svc:  | classification with L2 regularization and L2 loss (dual)     |
l2rl2lp-svc: | classification with L2 regularization and L2 loss (primal)   |
l1rl2l-svc:  | classification with L1 regularization and L2 loss            |
nu-svc:      | nu classification                                            |
C-bsvc:      | bound-constraint SVM classification                          |
mc-natC:     | Crammer, Singer native multiclass                            |
mc-natW:     | Weston, Watkins native multiclass                            |
one-svc:     | one class classification                                     |
eps-svr:     | epsilon regression                                           |
nu-svr:      | nu regression                                                |
eps-bsvr:    | bound-constraint SVM regression                              |
Pairwise multiclass can be selected for C-svc and nu-svc if the label vector
contains more than two classes. For LiblineaR the multiclass implementation
is always based on "one against the rest" for all SVMs except for mc-natC
which implements native multiclass according to Crammer and Singer. The
following table shows which SVM method is available in which package:
SVM name     | kernlab | e1071 | LiblineaR |
------------ | ------- | ----- | --------- |
C-svc:       | x       | x     | x         |
l2rl2l-svc:  | -       | -     | x         |
l2rl2lp-svc: | -       | -     | x         |
l1rl2l-svc:  | -       | -     | x         |
nu-svc:      | x       | x     | -         |
C-bsvc:      | x       | -     | -         |
mc-natC:     | x       | -     | x         |
mc-natW:     | x       | -     | -         |
one-svc:     | x       | x     | -         |
eps-svr:     | x       | x     | -         |
nu-svr:      | x       | x     | -         |
eps-bsvr:    | x       | -     | -         |
SVM parameters
To avoid unnecessary changes of parameter names when switching between SVM
implementations in different packages, unified names for identical
parameters are available. They are translated by KeBABS to the SVM-specific
name. The obvious example is the cost parameter for the C-SVM. It is named C
in kernlab and cost in e1071 and LiblineaR. The unified name in KeBABS is
cost. If the parameter is passed to kbsvm in a package-specific version it is
translated back to the KeBABS name internally. This applies to the following
parameters, shown here with their unified names:
parameter name | description                                              |
-------------- | -------------------------------------------------------- |
cost:          | cost parameter of C-SVM                                  |
nu:            | nu parameter of nu-SVM                                   |
eps:           | epsilon parameter of eps-SVR and nu-SVR                  |
classWeights:  | class weights for asymmetrical class size                |
tolerance:     | tolerance as termination criterion for optimization      |
cross:         | number of folds in k-fold cross validation               |
Hint: If a tolerance value is specified in kbsvm the same value should be
used throughout the complete analysis to make results comparable.
The following table shows the relevance of the SVM parameters cost, nu and
eps for the different SVMs:
SVM name     | cost | nu | eps |
------------ | ---- | -- | --- |
C-svc:       | x    | -  | -   |
l2rl2l-svc:  | x    | -  | -   |
l2rl2lp-svc: | x    | -  | -   |
l1rl2l-svc:  | x    | -  | -   |
nu-svc:      | -    | x  | -   |
C-bsvc:      | x    | -  | -   |
mc-natC:     | x    | -  | -   |
mc-natW:     | x    | -  | -   |
one-svc:     | x    | -  | -   |
eps-svr:     | -    | -  | x   |
nu-svr:      | -    | x  | -   |
eps-bsvr:    | -    | -  | x   |
Hint: Please be aware that identical parameter names between different SVMs
do not necessarily mean that their values are also identical between
packages; they depend on the actual SVM formulation, which could be
different. For example, the cost parameter is identical between the C-SVMs in
packages kernlab, e1071 and LiblineaR but differs from the cost parameter in
l2rl2l-svc in LiblineaR, because the C-SVM uses a linear loss whereas the
l2rl2l-svc uses a quadratic loss.
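As a hedged illustration of the unified parameter names, the following sketch trains the same C-SVM through kernlab and through e1071 and uses the unified name cost in both calls. It assumes that the kebabs package and its TFBS example data (enhancerFB, yFB) are installed; the cost value 10 is arbitrary and only chosen for demonstration.

```r
library(kebabs)

## example data shipped with kebabs: sequences enhancerFB, labels yFB
data(TFBS)

## spectrum kernel for dimers
specK2 <- spectrumKernel(k=2)

## the unified parameter name 'cost' is used for both packages;
## KeBABS translates it to the package specific name
modelK <- kbsvm(x=enhancerFB, y=yFB, kernel=specK2,
                pkg="kernlab", svm="C-svc", cost=10)
modelE <- kbsvm(x=enhancerFB, y=yFB, kernel=specK2,
                pkg="e1071", svm="C-svc", cost=10)

## both calls return a KeBABS model that wraps the native SVM model
class(modelK)
class(svmModel(modelK))
class(svmModel(modelE))
```

Only the pkg argument changes between the two calls; the class of the native model returned by svmModel differs with the selected implementation.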
Feature weights
On user request (see parameter featureWeights) feature weights are computed
and stored in the model (for a detailed description see getFeatureWeights).
Pruning of feature weights can be achieved with the parameter weightLimit
which defines the cutoff below which small feature weights are not stored in
the model.
Hint: For training with a precomputed kernel matrix feature weights are not
available. For multiclass, prediction is currently not performed via feature
weights but natively in the SVM.
Cross validation, grid search and model selection
Cross validation can be controlled with the parameters cross and noCross. For
details on cross validation see crossValidation. Grid search can be performed
by passing multiple SVM parameter values as vector instead of a single value
to kbsvm. Also multiple sequence kernel objects and multiple SVMs can be used
for grid search. For details see gridSearch. For model selection nested cross
validation is used with the parameters nestedCross and noNestedCross for the
outer and cross and noCross for the inner cross validation. For details see
modelSelection.
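The two simplest cases described above can be sketched as follows. The example assumes the kebabs package with its TFBS example data and an installed LiblineaR; the number of folds and the cost values are arbitrary choices for illustration, not recommendations.

```r
library(kebabs)

## example data shipped with kebabs: sequences enhancerFB, labels yFB
data(TFBS)
specK2 <- spectrumKernel(k=2)

## 5-fold cross validation with a single cost value;
## accuracy and balanced accuracy are collected for display
cvModel <- kbsvm(x=enhancerFB, y=yFB, kernel=specK2,
                 pkg="LiblineaR", svm="C-svc", cost=10,
                 cross=5, perfParameters=c("ACC", "BACC"))

## retrieve the cross validation result from the KeBABS model
cvResult(cvModel)

## grid search: a vector of cost values instead of a single value,
## each grid point evaluated with 3-fold cross validation
gsModel <- kbsvm(x=enhancerFB, y=yFB, kernel=specK2,
                 pkg="LiblineaR", svm="C-svc", cost=c(0.1, 1, 10),
                 cross=3, explicit="yes")

## retrieve the grid search result from the KeBABS model
modelSelResult(gsModel)
```

Because fold assignment is random, the reported performance values vary between runs unless a seed is set beforehand with set.seed.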
Training with feature subset
After performing feature selection, repeating the learning task with a
feature subset can easily be achieved by specifying the subset with the
parameter features as character vector. The feature subset must be a subset
of the feature space of the sequence kernel passed in the parameter kernel.
Grid search and model selection with a feature subset can only be used for a
single sequence kernel object in the parameter kernel.
Hint: For normalized kernels all features of the feature space are used for
normalization, not just the feature subset. For a normalized motif kernel
(see motifKernel) only the features listed in the motif list are part of the
feature space. Therefore the motif kernel defined with the same feature
subset leads to a different result in the normalized case.
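A minimal sketch of training with a feature subset is shown below. It assumes the kebabs package with its TFBS example data, and it assumes that the feature weight matrix returned by featureWeights carries the feature names as column names. Picking the features with the largest absolute weight is only a stand-in for a real feature selection step and is not a method prescribed by KeBABS.

```r
library(kebabs)

## example data shipped with kebabs: sequences enhancerFB, labels yFB
data(TFBS)
specK2 <- spectrumKernel(k=2)

## first train on the full feature space with feature weights
fullModel <- kbsvm(x=enhancerFB, y=yFB, kernel=specK2,
                   pkg="e1071", svm="C-svc", cost=10, explicit="yes",
                   featureWeights="yes")
fw <- featureWeights(fullModel)

## illustrative selection only: keep up to 8 features with the
## largest absolute weight (assumes column names are feature names)
ranking <- order(abs(fw[1, ]), decreasing=TRUE)
topFeatures <- colnames(fw)[head(ranking, 8)]
topFeatures

## repeat the learning task restricted to the selected features
subModel <- kbsvm(x=enhancerFB, y=yFB, kernel=specK2,
                  pkg="e1071", svm="C-svc", cost=10, explicit="yes",
                  features=topFeatures)
subModel
```

Since the spectrum kernel here is created without an explicit normalization setting, the hint on normalized kernels above should be kept in mind when comparing the two models.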
Probability model
SVMs from the packages kernlab and e1071 support the generation of a
probability model using Platt scaling (for details see kernlab,
predict.ksvm, svm and predict.svm), allowing the computation of class
probabilities during prediction. The parameter probModel controls the
generation of a probability model during training (see also parameter
predictionType in predict).
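The interplay of training and prediction for class probabilities might look like the following sketch. It assumes the kebabs package with its TFBS example data; the seed, the 70/30 split and the cost value are arbitrary, and the value "probabilities" for predictionType is taken from the help page of predict, which should be consulted for the authoritative list of values.

```r
library(kebabs)

## example data shipped with kebabs: sequences enhancerFB, labels yFB
data(TFBS)

## reproducible split into 70% training and 30% test samples
set.seed(42)
train <- sample(seq_along(enhancerFB), 0.7 * length(enhancerFB))
test <- setdiff(seq_along(enhancerFB), train)

specK2 <- spectrumKernel(k=2)

## request a probability model as part of the training
model <- kbsvm(x=enhancerFB[train], y=yFB[train], kernel=specK2,
               pkg="e1071", svm="C-svc", cost=10, explicit="yes",
               probModel=TRUE)

## parameters of the fitted probability model
probabilityModel(model)

## class membership probabilities for the held-out sequences
prob <- predict(model, enhancerFB[test],
                predictionType="probabilities")
head(prob)
```

Without probModel=TRUE in the training call no probability model is stored, so the request for probabilities in predict depends on this training option.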
kbsvm: upon successful completion, the function returns a model of class
KBModel. Results for cross validation can be retrieved from this model with
the accessor cvResult, results for grid search or model selection with
modelSelResult. In case of model selection the results of the outer cross
validation loop can be retrieved with the accessor cvResult.
Johannes Palme <kebabs@bioinf.jku.at>
http://www.bioinf.jku.at/software/kebabs
J. Palme, S. Hochreiter, and U. Bodenhofer (2015) KeBABS: an R package
for kernel-based analysis of biological sequences.
Bioinformatics, 31(15):2574-2576.
DOI: 10.1093/bioinformatics/btv176.
predict, getKernelMatrix, getExRep, kernelParameters-method, spectrumKernel,
mismatchKernel, gappyPairKernel, motifKernel, getFeatureWeights
## load the package
library(kebabs)
## load transcription factor binding site data
data(TFBS)
enhancerFB
## we use 70% of the samples for training and the rest for test
train <- sample(1:length(enhancerFB), length(enhancerFB) * 0.7)
test <- c(1:length(enhancerFB))[-train]
## create the kernel object for dimers without normalization
specK2 <- spectrumKernel(k=2)
## show details of kernel object
specK2
## run training with kernel matrix on e1071 (via the
## dense LIBSVM implementation integrated in kebabs)
model <- kbsvm(x=enhancerFB[train], y=yFB[train], kernel=specK2,
pkg="e1071", svm="C-svc", C=10, explicit="no")
## show KeBABS model
model
## show class of KeBABS model
class(model)
## show native SVM model contained in KeBABS model
svmModel(model)
## show class of native SVM model
class(svmModel(model))
## Not run:
## examples for package and SVM selection
## now run the same samples with the same kernel on e1071 via
## explicit representation
model <- kbsvm(x=enhancerFB[train], y=yFB[train], kernel=specK2,
pkg="e1071", svm="C-svc", C=10, explicit="yes")
## show KeBABS model
model
## show native SVM model contained in KeBABS model
svmModel(model)
## show class of native SVM model
class(svmModel(model))
## run the same samples with the same kernel on e1071 with nu-SVM
model <- kbsvm(x=enhancerFB[train], y=yFB[train], kernel=specK2,
pkg="e1071", svm="nu-svc", nu=0.7, explicit="yes")
## show KeBABS model
model
## training with feature weights
model <- kbsvm(x=enhancerFB[train], y=yFB[train], kernel=specK2,
pkg="e1071", svm="C-svc", C=10, explicit="yes",
featureWeights="yes")
## show feature weights
dim(featureWeights(model))
featureWeights(model)[,1:5]
## training without feature weights
model <- kbsvm(x=enhancerFB[train], y=yFB[train], kernel=specK2,
pkg="e1071", svm="C-svc", C=10, explicit="yes",
featureWeights="no")
## show feature weights
featureWeights(model)
## pruning of feature weights
model <- kbsvm(x=enhancerFB[train], y=yFB[train], kernel=specK2,
pkg="e1071", svm="C-svc", C=10, explicit="yes",
featureWeights="yes", weightLimit=0.5)
dim(featureWeights(model))
## training with precomputed kernel matrix
## feature weights cannot be computed for precomputed kernel matrix
km <- getKernelMatrix(specK2, x=enhancerFB, selx=train)
model <- kbsvm(x=km, y=yFB[train], kernel=specK2,
pkg="e1071", svm="C-svc", C=10, explicit="no")
## training with precomputed explicit representation
exrep <- getExRep(enhancerFB, sel=train, kernel=specK2)
model <- kbsvm(x=exrep, y=yFB[train], kernel=specK2,
pkg="e1071", svm="C-svc", C=10, explicit="yes")
## computing of probability model via Platt scaling during training
## in prediction class membership probabilities can be computed
## from this probability model
model <- kbsvm(x=enhancerFB[train], y=yFB[train], kernel=specK2,
pkg="e1071", svm="C-svc", C=10, explicit="yes",
probModel=TRUE)
## show parameters of the fitted probability model which are the parameters
## probA and probB for the fitted sigmoid function in case of classification
## and the value sigma of the fitted Laplacian in case of a regression
probabilityModel(model)
## cross validation, grid search and model selection are also performed
## via the kbsvm method. Examples can be found on the respective help pages
## (see Details section)
## End(Not run)