spectrumKernel: Spectrum Kernel
In kebabs: Kernel-Based Analysis Of Biological Sequences


Create a spectrum kernel object

spectrumKernel(k = 3, r = 1, annSpec = FALSE, distWeight = numeric(0),
  normalized = TRUE, exact = TRUE, ignoreLower = TRUE, presence = FALSE,
  revComplement = FALSE, mixCoef = numeric(0))

## S4 method for signature 'SpectrumKernel'
getFeatureSpaceDimension(kernel, x)

`k`	length of the substrings (also called kmers). This parameter defines the size of the feature space, i.e. the total number of features considered in this kernel is |A|^k, with |A| as the size of the alphabet (4 for DNA and RNA sequences and 21 for amino acid sequences). When multiple kernels with different k values should be generated, e.g. for model selection, a range such as k=3:5 can be specified. In this case a list of kernel objects with the individual k values from the range is returned. Default=3
`r`	exponent which must be > 0 (details see below). Default=1
`annSpec`	boolean that indicates whether sequence annotation should be taken into account (for details see the help page for `annotationMetadata`). For the annotation specific spectrum kernel the total number of features increases to |A|^k * |a|^k, with |A| as the size of the sequence alphabet and |a| as the size of the annotation alphabet. Default=FALSE
`distWeight`	a numeric distance weight vector or a distance weighting function (for details see the help page for `gaussWeight`). Default=numeric(0)
`normalized`	if this parameter is set, a kernel matrix or explicit representation generated with this kernel will be normalized (for details see below). Default=TRUE
`exact`	use exact character set for the evaluation (details see below). Default=TRUE
`ignoreLower`	ignore lowercase characters in the sequence. If the parameter is not set, lowercase characters are treated like uppercase characters. Default=TRUE
`presence`	if this parameter is set, only the presence of a kmer will be considered, otherwise the number of occurrences of the kmer is used. Default=FALSE
`revComplement`	if this parameter is set a kmer and its reverse complement are treated as the same feature. Default=FALSE
`mixCoef`	mixing coefficients for the mixture variant of the spectrum kernel. A numeric vector of length k is expected for this parameter with the unused components in the mixture set to 0. Default=numeric(0)
`kernel`	a sequence kernel object
`x`	one or multiple biological sequences in the form of a `DNAStringSet`, `RNAStringSet`, `AAStringSet` (or as `BioVector`)
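
The feature space sizes given for `k` and `annSpec` can be checked with plain arithmetic. The following is a minimal base R sketch that does not use the package; the alphabet sizes are the ones stated above, and the annotation alphabet of size 2 is a purely hypothetical value chosen for illustration.

```r
# feature space size |A|^k of the plain spectrum kernel
# (alphabet sizes as stated in the description of parameter k)
alphabetSize <- c(DNA = 4, RNA = 4, AA = 21)
k <- 3:5

# rows: alphabets, columns: k values
featDim <- outer(alphabetSize, k, FUN = "^")
colnames(featDim) <- paste0("k=", k)
featDim

# annotation specific variant: |A|^k * |a|^k
# assumption: DNA sequences, k=3 and a hypothetical annotation
# alphabet with 2 characters
annDim <- 4^3 * 2^3
annDim
```

The rapid growth with k, especially for amino acid sequences, is the reason why the choice of k has a strong effect on the size of an explicit representation.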

Creation of kernel object

The function 'spectrumKernel' creates a kernel object for the spectrum kernel. This kernel object can then be used with a set of DNA, RNA or AA sequences to generate a kernel matrix or an explicit representation for this kernel. The spectrum kernel uses all substrings of length k (also called kmers). For sequences shorter than k the self similarity (i.e. the value on the main diagonal in the square kernel matrix) is 0, and the explicit representation contains only zeros for such a sample. Depending on the learning task it might make sense to remove such sequences from the data set, as they do not contribute to the model but still influence performance values.
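
To illustrate the idea behind the kernel, the following base R sketch counts kmers in plain character strings and computes the unnormalized similarity as the dot product of the count vectors. It is a conceptual illustration only and not the implementation used by the package; the helper names `kmers` and `spectrumSim` are hypothetical.

```r
# all overlapping substrings of length k of a character string;
# a string shorter than k has no kmers
kmers <- function(s, k) {
    n <- nchar(s)
    if (n < k) return(character(0))
    starts <- seq_len(n - k + 1)
    substring(s, starts, starts + k - 1)
}

# unnormalized spectrum kernel value of two strings:
# dot product of their kmer count vectors
spectrumSim <- function(s1, s2, k) {
    k1 <- kmers(s1, k)
    k2 <- kmers(s2, k)
    feats <- union(k1, k2)
    x <- tabulate(match(k1, feats), nbins = length(feats))
    y <- tabulate(match(k2, feats), nbins = length(feats))
    sum(x * y)
}

## dimers of "ACGTACGT": AC=2, CG=2, GT=2, TA=1
spectrumSim("ACGTACGT", "ACGTACGT", k = 2)   # 4 + 4 + 4 + 1 = 13
## dimers of "ACGA": AC=1, CG=1, GA=1
spectrumSim("ACGTACGT", "ACGA", k = 2)       # 2 + 2 = 4
## sequence shorter than k: self similarity is 0
spectrumSim("AC", "AC", k = 3)               # 0
```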

For values different from 1 (the default value), parameter r leads to a transformation of similarities by taking each element of the similarity matrix to the power of r. Only integer values larger than 1 should be used for r in the context of SVMs requiring positive definite kernels. If normalized=TRUE, the feature vectors are scaled to the unit sphere before computing the similarity value for the kernel matrix. For two samples with the feature vectors x and y the similarity is computed as:

s=(x^T y)/(|x| |y|)
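
The normalization and the exponent can be reproduced by hand on an unnormalized kernel matrix. The sketch below uses the k=2 kernel matrix printed in the example output of this help page and plain base R; the helper name `normalizeKernel` is hypothetical. For this kind of normalization, applying the exponent before or after normalizing gives the same values, so the sketch makes no assumption about the order used inside the package.

```r
# unnormalized kernel matrix for k=2, copied from the example output
# of this help page (samples S1 to S5)
km <- matrix(c(230,  94, 172, 131,  94,
                94, 186, 121, 135, 186,
               172, 121, 242, 139, 121,
               131, 135, 139, 184, 135,
                94, 186, 121, 135, 186),
             nrow = 5, byrow = TRUE,
             dimnames = list(paste0("S", 1:5), paste0("S", 1:5)))

# s_ij = k_ij / sqrt(k_ii * k_jj), i.e. (x^T y) / (|x| |y|)
normalizeKernel <- function(K) {
    d <- sqrt(diag(K))
    K / outer(d, d)
}

kmNorm <- normalizeKernel(km)
round(kmNorm, 3)

# exponent r applied elementwise to the similarities
r <- 2
kmNormR <- kmNorm^r
round(kmNormR, 3)
```

After normalization the main diagonal contains only ones, and the identical sequences S2 and S5 have similarity 1.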

For an explicit representation generated with the feature map of a normalized kernel, the rows are normalized by dividing them by their Euclidean norm. For exact=TRUE the sequence characters are interpreted according to an exact character set. If the flag is not set, ambiguous characters from the IUPAC character set are also evaluated.
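
The row normalization of an explicit representation described above can be sketched in base R as follows. The count matrix is a small hypothetical example (dimer counts of "ACGTACGT" and "ACGA" plus an all-zero row for a sequence shorter than k) and not output of the package; the guard for zero rows is an assumption made here to avoid a division by zero.

```r
# hypothetical explicit representation: rows = samples, columns = dimers
exRep <- matrix(c(2, 2, 2, 1, 0,
                  1, 1, 0, 0, 1,
                  0, 0, 0, 0, 0),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("S1", "S2", "S3"),
                                c("AC", "CG", "GT", "TA", "GA")))

# Euclidean norm of each row
rowNorms <- sqrt(rowSums(exRep^2))
# assumption: leave all-zero rows unchanged instead of dividing by 0
rowNorms[rowNorms == 0] <- 1

# divide each row by its norm (the vector is recycled along the rows)
exRepNorm <- exRep / rowNorms

# the linear kernel on the normalized rows gives the normalized
# similarities s = (x^T y) / (|x| |y|)
kmNorm <- tcrossprod(exRepNorm)
round(kmNorm, 3)
```

The all-zero row keeps a self similarity of 0, in line with the behavior described for sequences shorter than k.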

The annotation specific variant (for details see annotationMetadata) and the position dependent variants (for details see positionMetadata) either in the form of a position specific or a distance weighted kernel are supported for the spectrum kernel. The generation of an explicit representation is not possible for the position dependent variants of this kernel.

Creation of kernel matrix

The kernel matrix is created with the function getKernelMatrix or via a direct call with the kernel object as shown in the examples below.

spectrumKernel: upon successful completion, the function returns a kernel object of class SpectrumKernel.

getFeatureSpaceDimension: dimension of the feature space as a numeric value

Johannes Palme <kebabs@bioinf.jku.at>

http://www.bioinf.jku.at/software/kebabs

(Leslie, 2002) – C. Leslie, E. Eskin and W.S. Noble. The Spectrum Kernel: A String Kernel for SVM Protein Classification.

(Bodenhofer, 2009) – U. Bodenhofer, K. Schwarzbauer, M. Ionescu and S. Hochreiter. Modelling position specificity in sequence kernels by fuzzy equivalence relations.

(Mahrenholz, 2011) – C.C. Mahrenholz, I.G. Abfalter, U. Bodenhofer, R. Volkmer and S. Hochreiter. Complex networks govern coiled-coil oligomerizations - predicting and profiling by means of a machine learning approach.

J. Palme, S. Hochreiter, and U. Bodenhofer (2015) KeBABS: an R package for kernel-based analysis of biological sequences. Bioinformatics, 31(15):2574-2576, 2015. DOI: 10.1093/bioinformatics/btv176.

kernelParameters-method, getKernelMatrix, getExRep, mismatchKernel, motifKernel, gappyPairKernel, SpectrumKernel

## load the package (this also attaches Biostrings)
library(kebabs)

## instead of user provided sequences in XStringSet format
## a set of DNA sequences is created for this example;
## RNA or AA sequences can be used with the spectrum kernel as well
dnaseqs <- DNAStringSet(c("AGACTTAAGGGACCTGGTCACCACGCTCGGTGAGGGGGACGGGGTGT",
                          "ATAAAGGTTGCAGACATCATGTCCTTTTTGTCCCTAATTATTTCAGC",
                          "CAGGAATCAGCACAGGCAGGGGCACGGCATCCCAAGACATCTGGGCC",
                          "GGACATATACCCACCGTTACGTGTCATACAGGATAGTTCCACTGCCC",
                          "ATAAAGGTTGCAGACATCATGTCCTTTTTGTCCCTAATTATTTCAGC"))
names(dnaseqs) <- paste0("S", seq_along(dnaseqs))

## create the kernel object for dimers without normalization
speck <- spectrumKernel(k=2, normalized=FALSE)
## show details of kernel object
speck

## generate the kernel matrix with the kernel object
km <- speck(dnaseqs)
dim(km)
km[1:5,1:5]

## alternative way to generate the kernel matrix
km <- getKernelMatrix(speck, dnaseqs)
km[1:5,1:5]

## Not run: 
## plot heatmap of the kernel matrix
heatmap(km, symm=TRUE)

## End(Not run)

Spectrum Kernel: k=2, normalized=FALSE
[1] 5 5
An object of class "KernelMatrix"
    S1  S2  S3  S4  S5
S1 230  94 172 131  94
S2  94 186 121 135 186
S3 172 121 242 139 121
S4 131 135 139 184 135
S5  94 186 121 135 186
An object of class "KernelMatrix"
    S1  S2  S3  S4  S5
S1 230  94 172 131  94
S2  94 186 121 135 186
S3 172 121 242 139 121
S4 131 135 139 184 135
S5  94 186 121 135 186