Assign position related metadata and create a kernel object with position dependency
linWeight(d, sigma = 1)
expWeight(d, sigma = 1)
gaussWeight(d, sigma = 1)
swdWeight(d)
## S4 method for signature 'XStringSet'
## positionMetadata(x) <- value
## S4 method for signature 'BioVector'
## positionMetadata(x) <- value
## S4 method for signature 'XStringSet'
positionMetadata(x)
## S4 method for signature 'BioVector'
positionMetadata(x)
d | a numeric vector of distance values
sigma | a positive numeric value defining the peak width or, in the case of gaussWeight, the width of the bell function (for details see below)
x | biological sequences in the form of a
value | for assignment of position metadata, the value is an integer vector which gives the offset to the start position 1 for each sequence. Positive and negative offset values are possible. Without position metadata all sequences must be aligned and start at position 1. To delete position metadata, set value to NULL.
Position Dependent Kernel
For the standard spectrum kernel, kmers are considered independent of their
position when the similarity value between two sequences is calculated.
For position dependent kernels the position of a kmer/pattern is also of
importance. Position information for a pair of sequences can be used in
a sequenceKernel in three different ways, representing the full range of
position dependency:
Position independent kernel: ignores the position of patterns and takes only the number of their occurrences or their presence (see parameter presence in functions spectrumKernel, gappyPairKernel, motifKernel) in the sequences into account for similarity determination.
Distance weighted kernel: uses the position related distance between the occurrences of the same pattern in the two sequences in weighted form as contribution to the similarity value (see below under Distance Weighted Kernel).
Position specific kernel: considers patterns only if they occur at the same position in the two sequences (see below under Position Specific Kernel).
Position dependency is available in all kernels except the mismatch
kernel.
Distance Weighted Kernel
These kernels weight the contribution of a pattern to the similarity value
based on the distance between its start positions in the two sequences. The
user can define the distance weights by passing either a distance
weighting function or a weight vector to the kernel. Through this weighting
the degree of locality in the similarity consideration between two sequences
can be adjusted flexibly. Such a position dependent kernel can be used in
the same way as the normal position independent kernel variant. Distance
weighting can be used for all kernels in this package except the mismatch
kernel. The package provides four predefined weighting functions (see also
examples):
linWeight: a weighting function with linear decrease
expWeight: a weighting function with exponential decrease
gaussWeight: a bell-shaped weighting function with a decrease similar to a Gaussian distribution
swdWeight: the distance weighting function used in the Shifted Weighted Degree (SWD) kernel, which is similar to an exponential decrease but has a smaller peak and larger tails
User-defined functions can also be used for distance weighting
(see below).
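The four shapes can be compared numerically without plotting. The following is a minimal base R sketch, not the package code: the names linW, expW, gaussW and swdW are hypothetical stand-ins, and their formulas are assumptions about the shapes described above. Only the exponential form matches the expWeight code shown in the examples of this page.

```r
## Hypothetical stand-ins for the predefined weighting functions.
## The formulas are assumptions, not copied from the package source.
linW   <- function(d, sigma = 1) pmax(1 - abs(d) / sigma, 0)  # linear decrease
expW   <- function(d, sigma = 1) exp(-abs(d) / sigma)         # exponential decrease
gaussW <- function(d, sigma = 1) exp(-(d / sigma)^2)          # bell shape
swdW   <- function(d) 1 / (2 * (abs(d) + 1))                  # peak of 0.5 at d = 0

## evaluate all four on the same distance grid (sigma = 10 where applicable)
d <- -20:20
weights <- cbind(lin = linW(d, 10), exp = expW(d, 10),
                 gauss = gaussW(d, 10), swd = swdW(d))
rownames(weights) <- d

## each function is symmetric around distance 0
mirrored  <- unname(weights[rev(seq_along(d)), ])
symmetric <- all(abs(unname(weights) - mirrored) < 1e-12)

## weights at distances -10, 0 and 10
round(weights[c("-10", "0", "10"), ], 3)
```

Under these assumed formulas the linear weight reaches zero at a distance of sigma, while the other three stay positive for every distance.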
Position Specific Kernel
One variant of position dependent kernels is the position specific kernel.
This kernel takes patterns into account only if they are located at
identical positions in the two sequences. It is selected by passing
a distance weight value of 1 to the kernel, indicating that
the neighborhood of a pattern in the other sequence is irrelevant for the
similarity consideration. This kernel is in fact one end of the
spectrum (sic!) where locality is reduced to the exact location and the
normal position independent kernel is at the other end - not caring about
position at all. Through adjustment of sigma in the predefined functions
a continuous blending between these two extremes is possible for the degree
of locality. Position specific evaluation is selected by
setting the parameter distWeight to 1 in the functions
spectrumKernel, gappyPairKernel, motifKernel.
This parameter value is in fact interpreted as a numeric vector with 1 for
zero distance and 0 for all other distances.
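The range from position independent to position specific can be illustrated with a toy similarity in base R. This is a sketch under stated assumptions, not the package implementation: kmerPositions and weightedSim are hypothetical helpers, the similarity is unnormalized, and it is taken to be the sum of the distance weights over all pairs of occurrences of the same k-mer in the two sequences.

```r
## start positions of every k-mer in a character string (hypothetical helper)
kmerPositions <- function(s, k) {
    starts <- seq_len(nchar(s) - k + 1)
    split(starts, substring(s, starts, starts + k - 1))
}

## unnormalized toy similarity: sum of w(distance) over all pairs of
## occurrences of the same k-mer in the two sequences
weightedSim <- function(s1, s2, k, w) {
    p1 <- kmerPositions(s1, k)
    p2 <- kmerPositions(s2, k)
    shared <- intersect(names(p1), names(p2))
    sum(vapply(shared, function(m) {
        sum(w(outer(p1[[m]], p2[[m]], "-")))
    }, numeric(1)))
}

posIndependent <- function(d) rep(1, length(d))   # every pair counts fully
posSpecific    <- function(d) as.numeric(d == 0)  # 1 at distance 0, else 0
expDecay       <- function(d) exp(-abs(d) / 2)    # in between (sigma = 2)

## the second sequence is the first one shifted by one position
s1 <- "ACGTACGT"
s2 <- "TACGTACG"
sims <- c(independent = weightedSim(s1, s2, 3, posIndependent),
          weighted    = weightedSim(s1, s2, 3, expDecay),
          specific    = weightedSim(s1, s2, 3, posSpecific))
sims
```

Because the two toy sequences are shifted against each other by one position, no k-mer occurs at the same position in both. The position specific value is therefore 0, the position independent value counts all 9 matching pairs, and the distance weighted value lies in between.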
Positive Definiteness
Standard SVMs only support positive definite kernels / kernel matrices.
This means that the distance weighting function must be chosen such
that the resulting kernel is positive definite. Symmetry of the distance
weighting function is also important for positive definiteness. Unlike usual
distances, the relative distance value here can have positive and negative
values, depending on whether the pattern in the second sequence is located
at a higher or lower position than the pattern in the first sequence. The
predefined distance weighting functions except for swdWeight deliver a
positive definite kernel for all parameter settings. According to Sonnenburg
et al. 2005 the SWD kernel has empirically shown positive definiteness, but
this has not been proved for this kernel. If a weight vector with predefined
weights per distance is passed to the kernel instead of a distance weighting
function, positive definiteness of the kernel must also be ensured by
adequate selection of the weight values.
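A quick empirical check, which is not a proof, is to inspect the eigenvalues of a symmetric matrix. The following base R sketch uses two hypothetical helpers, isPosSemiDef and weightMatrix, and applies the check to toy matrices of weights over position differences. The same check could be applied to a kernel matrix generated for the sequences of interest.

```r
## TRUE if m is symmetric and all eigenvalues are >= -tol (hypothetical helper)
isPosSemiDef <- function(m, tol = 1e-8) {
    isSymmetric(m) &&
        min(eigen(m, symmetric = TRUE, only.values = TRUE)$values) >= -tol
}

## matrix of weights for all pairwise differences between positions 1..n
weightMatrix <- function(w, n) {
    d <- outer(seq_len(n), seq_len(n), "-")
    matrix(w(d), nrow = n, ncol = n)
}

## exponential decrease: no negative eigenvalues on this toy grid
okExp <- isPosSemiDef(weightMatrix(function(d) exp(-abs(d) / 2), 6))

## box-shaped weights (1 for |d| <= 1, 0 otherwise): symmetric,
## but the 3 x 3 weight matrix has a negative eigenvalue
okBox <- isPosSemiDef(weightMatrix(function(d) as.numeric(abs(d) <= 1), 3))

c(exponential = okExp, box = okBox)
```

The box-shaped case shows that symmetry alone is not sufficient: the weights are symmetric in the distance, yet the matrix is not positive semidefinite, so a weight vector of this kind should be treated with care.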
User-Defined Distance Function
For user-defined distance functions, symmetry and positive definiteness of
the resulting kernel are important. Such a function gets a numeric distance
vector 'x' as input (and possibly other parameters controlling the weighting
behavior) and returns a weight vector of identical length. When the function
is called with a missing parameter x, all other parameters must be supplied or
have appropriate default values. In this case the function must return a
new function with just the single parameter x, which calls the original
user-defined function with x and all the other parameters set to the values
passed in the call.
This behavior is needed for assignment of the function with missing
parameter x to the distWeight parameter in the kernel. At the time of kernel
definition the actual distance values are not available. Later, when
sequence data is passed to this kernel for generation of a kernel matrix or
an explicit representation, this single argument function is called to get
the distance dependent weights. The code for the predefined expWeight
function in the example section below shows how a user-specific
function can be set up.
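The two call modes described above can be checked in isolation. The following base R sketch uses a hypothetical function myBellWeight with an assumed bell shape. It is not one of the predefined functions and only demonstrates the calling contract.

```r
## hypothetical user-defined weighting function with a bell shape
myBellWeight <- function(x, sigma = 1) {
    ## validate all parameters except x first
    if (!is.numeric(sigma) || length(sigma) != 1 || sigma <= 0)
        stop("'sigma' must be a single positive number")
    ## without x: return a single argument function with sigma fixed
    if (missing(x))
        return(function(x) myBellWeight(x, sigma = sigma))
    if (!is.numeric(x))
        stop("'x' must be a numeric vector")
    ## symmetric weights of the same length as x
    exp(-(x / sigma)^2)
}

w5 <- myBellWeight(sigma = 5)                    # closure, distances not yet known
direct  <- myBellWeight(c(-5, 0, 5), sigma = 5)  # weights computed immediately
delayed <- w5(c(-5, 0, 5))                       # same weights via the closure
identical(direct, delayed)
```

The closure w5 has x as its only formal argument, which is the form in which a function is handed over in the distWeight parameter.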
Offset
To allow flexible alignment of sequence positions without redefining the
XStringSet or BioVector, an additional metadata element named offset can be
assigned to the sequence set via positionMetadata<- (see example
below). Position metadata is a numeric vector with the same number of
elements as the sequence set and gives for each sequence an offset to
position 1. When position metadata is not assigned to a sequence set,
position 1 is associated with the first character in each sequence of the
sequence set, i.e. in this case the sequences should be aligned such that
all have the same starting position with respect to the learning task
(e.g. all sequences start at a transcription start site). Offset information
is only evaluated in position dependent kernel variants.
The distance weighting functions return a numerical vector with distance weights.
Johannes Palme <kebabs@bioinf.jku.at>
http://www.bioinf.jku.at/software/kebabs
(Bodenhofer, 2009) – U. Bodenhofer, K. Schwarzbauer, M. Ionescu and
S. Hochreiter. Modelling position specificity in sequence kernels by fuzzy
equivalence relations.
(Sonnenburg, 2005) – S. Sonnenburg, G. Raetsch and B. Schoelkopf.
Large Scale Genomic Sequence SVM Classifiers.
J. Palme, S. Hochreiter, and U. Bodenhofer (2015) KeBABS: an R package
for kernel-based analysis of biological sequences.
Bioinformatics, 31(15):2574-2576, 2015.
DOI: 10.1093/bioinformatics/btv176.
spectrumKernel, gappyPairKernel, motifKernel, annotationMetadata, metadata, mcols
## plot predefined weighting functions for sigma=10
curve(linWeight(x, sigma=10), from=-20, to=20, xlab="pattern distance",
ylab="weight", main="Predefined Distance Weighting Functions", col="green")
curve(expWeight(x, sigma=10), from=-20, to=20, col="blue", add=TRUE)
curve(gaussWeight(x, sigma=10), from=-20, to=20, col="red", add=TRUE)
curve(swdWeight(x), from=-20, to=20, col="orange", add=TRUE)
legend('topright', inset=0.03, title="Weighting Functions", c("linWeight",
"expWeight", "gaussWeight", "swdWeight"),
fill=c("green", "blue", "red", "orange"))
text(14, 0.70, "sigma = 10")
## instead of user provided sequences in XStringSet format
## for this example a set of DNA sequences is created
## RNA- or AA-sequences can be used as well with the motif kernel
dnaseqs <- DNAStringSet(c("AGACTTAAGGGACCTGGTCACCACGCTCGGTGAGGGGGACGGGGTGT",
"ATAAAGGTTGCAGACATCATGTCCTTTTTGTCCCTAATTATTTCAGC",
"CAGGAATCAGCACAGGCAGGGGCACGGCATCCCAAGACATCTGGGCC",
"GGACATATACCCACCGTTACGTGTCATACAGGATAGTTCCACTGCCC",
"ATAAAGGTTGCAGACATCATGTCCTTTTTGTCCCTAATTATTTCAGC"))
names(dnaseqs) <- paste("S", 1:length(dnaseqs), sep="")
## create a distance weighted spectrum kernel with linear decrease of
## weights in a range of 20 bases
spec20 <- spectrumKernel(k=3, distWeight=linWeight(sigma=20))
## show details of kernel object
kernelParameters(spec20)
## this kernel can now be used in a classification or regression task
## in the usual way or a kernel matrix can be generated for use with
## another learning method
km <- spec20(x=dnaseqs, selx=1:5)
km[1:5,1:5]
## Not run:
## instead of a distance weighting function also a weight vector can be
## passed in the distWeight parameter but the values must be chosen such
## that they lead to a positive definite kernel
##
## in this example only patterns within a 5 base range are considered with
## slightly decreasing weights
specv <- spectrumKernel(k=3, distWeight=c(1,0.95,0.9,0.85,0.8))
km <- specv(dnaseqs)
km[1:5,1:5]
## position specific spectrum kernel
specps <- spectrumKernel(k=3, distWeight=1)
km <- specps(dnaseqs)
km[1:5,1:5]
## get position specific kernel matrix
km <- specps(dnaseqs)
km[1:5,1:5]
## example with offset to align sequence positions (e.g. the
## transcription start site), the value gives the offset to position 1
positionOne <- c(9,6,3,1,6)
positionMetadata(dnaseqs) <- positionOne
## show position metadata
positionMetadata(dnaseqs)
## generate kernel matrix with position-specific spectrum kernel
km1 <- specps(dnaseqs)
km1[1:5,1:5]
## example for a user defined weighting function
## please stick to the order as described in the comments below and
## make sure that the resulting kernel is positive definite
expWeightUserDefined <- function(x, sigma=1)
{
    ## check presence and validity of all parameters except for x
    if (!isSingleNumber(sigma))
        stop("'sigma' must be a number")
    ## if x is missing the function returns a closure where all parameters
    ## except for x have a defined value
    if (missing(x))
        return(function(x) expWeightUserDefined(x, sigma=sigma))
    ## pattern distance vector x must be numeric
    if (!is.numeric(x))
        stop("'x' must be a numeric vector")
    ## create vector of distance weights from the
    ## input vector of pattern distances x
    exp(-abs(x)/sigma)
}
## define kernel object with user defined weighting function
specud <- spectrumKernel(k=3, distWeight=expWeightUserDefined(sigma=5),
                         normalized=FALSE)
## End(Not run)