annotationSpecificKernel: Annotation Specific Kernel
In kebabs: Kernel-Based Analysis Of Biological Sequences


Assign annotation metadata to sequences and create a kernel object which evaluates annotation information

Show biological sequence together with annotation

showAnnotatedSeq(x, sel = 1, ann = TRUE, pos = TRUE, start = 1,
  end = width(x)[sel], width = NA)

## S4 method for signature 'XStringSet'
## annotationMetadata(x, annCharset= ...) <- value

## S4 method for signature 'BioVector'
## annotationMetadata(x, annCharset= ...) <- value

## S4 replacement method for signature 'BioVector'
annotationMetadata(x, ...) <- value

## S4 method for signature 'XStringSet'
annotationMetadata(x)

## S4 method for signature 'BioVector'
annotationMetadata(x)

## S4 method for signature 'XStringSet'
annotationCharset(x)

## S4 method for signature 'BioVector'
annotationCharset(x)

`x`	biological sequences in the form of a `DNAStringSet`, `RNAStringSet`, `AAStringSet` (or as `BioVector`)
`sel`	single index into x for displaying a specific sequence. Default=1
`ann`	show annotation information along with the sequence
`pos`	show position information
`start`	first position to be displayed; by default the full sequence is shown
`end`	last position to be displayed or use parameter 'width'
`width`	number of positions to be displayed or use parameter 'end'
`...`	additional parameters which are passed transparently.
`value`	character vector with annotation strings with the same length as the number of sequences. Each annotation string must have the same number of characters as the corresponding sequence. In addition to the characters defined in the annotation character set, the character "-" can be used in the annotation strings for masking sequence parts.
`annCharset`	character string listing all characters used in annotation, sorted in ascending order according to the C locale; up to 32 characters are possible

Annotation information for sequences

For the annotation specific kernel, additional annotation information is added to the sequence data. The annotation for one sequence consists of a character string with a single annotation character per position, i.e. the annotation string has the same length as the sequence. The character set used for annotation is user-defined at the level of the XStringSet and can contain up to 32 different characters. Each biological sequence needs an associated annotation string consisting of characters from this character set. The evaluation of annotation information during generation of a kernel matrix or an explicit representation can be activated per kernel object.

Assignment of annotation information

The annotation character set is a character string listing all allowed annotation characters in ascending order (according to the C locale, see parameter annCharset). Any single-byte ASCII character in the decimal range 32 to 126, except 45, is allowed. The character '-' (ASCII decimal 45) is used for masking sequence parts which should not be evaluated. Because of this special masking function, it must not be part of an annotation character set.
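The rules above can be checked with a few lines of base R. The following is only a sketch: the helper isValidAnnCharset is hypothetical and not part of kebabs, and rejecting duplicate characters (strictly ascending order) is an assumption of this sketch rather than a documented rule.

```r
## Hypothetical helper, not part of kebabs: checks a candidate annotation
## character set against the rules described in the text above.
isValidAnnCharset <- function(annCharset) {
    ## must be a single, non-missing character string
    if (!is.character(annCharset) || length(annCharset) != 1 ||
        is.na(annCharset))
        return(FALSE)
    codes <- utf8ToInt(annCharset)
    ## between 1 and 32 characters
    if (anyNA(codes) || length(codes) < 1 || length(codes) > 32)
        return(FALSE)
    ## printable ASCII (32 to 126) without the masking character '-' (45)
    if (any(codes < 32 | codes > 126 | codes == 45))
        return(FALSE)
    ## ascending by byte value, i.e. C locale order
    ## (strictly ascending is an assumption: duplicates are rejected)
    all(diff(codes) > 0)
}

isValidAnnCharset("ei")       # sorted, allowed characters only
isValidAnnCharset("ie")       # not in ascending order
isValidAnnCharset("-ei")      # contains the masking character
isValidAnnCharset("abcdefg")  # heptad character set
```

Comparing character codes instead of using sort() keeps the check independent of the locale of the running R session.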

The annotation character set is assigned to the sequence set with the annotationMetadata function (see below). It is stored in the metadata list as the named element annotationCharset and can be stored along with other metadata assigned to the sequence set. The annotation strings for the individual sequences are represented as a character vector and are assigned to the XStringSet as element related metadata, together with the annotation character set. Element related metadata is stored in a DataFrame, and the columns of this data frame represent the different types of metadata that can be assigned in parallel. The column name for the sequence related annotation information is "annotation" (see the Examples section for an example of annotation metadata assignment). Annotation metadata can be assigned to a sequence set together with position metadata (see positionMetadata).
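Before assigning annotation strings it can help to inspect them against the constraints described above. The following base R sketch uses a hypothetical helper, summarizeAnnotation, which is not part of kebabs; it works on plain character vectors instead of an XStringSet (an assumption made to keep the sketch self-contained) and only reports what it finds. It does not reproduce the checks performed by annotationMetadata.

```r
## Hypothetical helper, not part of kebabs. Sequences are given as plain
## character strings here instead of an XStringSet.
summarizeAnnotation <- function(seqs, annot, annCharset) {
    ## one annotation string per sequence
    stopifnot(length(seqs) == length(annot))
    allowed <- strsplit(annCharset, "", fixed = TRUE)[[1]]
    annChars <- strsplit(annot, "", fixed = TRUE)
    data.frame(
        ## annotation string as long as its sequence?
        lengthOK = nchar(seqs) == nchar(annot),
        ## number of masked positions ("-")
        masked = vapply(annChars, function(ch) sum(ch == "-"),
                        integer(1)),
        ## characters that are neither in the character set nor "-"
        unlisted = vapply(annChars,
                          function(ch) sum(!(ch %in% c(allowed, "-"))),
                          integer(1))
    )
}

res <- summarizeAnnotation(seqs = c("ACGTAC", "TTGACA"),
                           annot = c("eeeiii", "--eexi"),
                           annCharset = "ei")
res
```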

Annotation Specific Kernel Processing

The annotation specific variant of a kernel, e.g. of the spectrum kernel, appends the annotation characters corresponding to a specific k-mer to this k-mer and treats the resulting pattern as one feature, the basic unit for similarity determination. The full feature space of an annotation specific spectrum kernel is the Cartesian product of the set of all possible sequence patterns with the set of all possible annotation patterns. Depending on the number of characters in the annotation character set, the feature space grows drastically compared to the normal spectrum kernel: for k-mers of length k over a sequence alphabet A and an annotation character set C, it has |A|^k * |C|^k dimensions instead of |A|^k. However, through annotation the similarity consideration between two sequences can be split into independent parts that are considered separately, e.g. coding/non-coding or exon/intron. For amino acid sequences, for example, a heptad annotation (a usually periodic pattern of 7 characters, a to g) can be used, as in the prediction of coiled coil structures (see reference Mahrenholz, 2011).
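To make the feature construction concrete, the following base R sketch counts annotation specific k-mers for a single sequence and computes an unnormalized kernel value as the dot product of two count vectors. It is an illustration only and not the kebabs implementation: the helpers annSpecFeatures and annSpecKernelValue are hypothetical, sequences are plain character strings, and dropping every window that touches a masked position is this sketch's reading of the masking rule.

```r
## Hypothetical helpers, not the kebabs implementation.

## Count annotation specific k-mers: each feature is the k-mer followed
## by the k annotation characters at the same positions.
annSpecFeatures <- function(dnaSeq, annot, k = 3) {
    stopifnot(nchar(dnaSeq) == nchar(annot), nchar(dnaSeq) >= k)
    starts <- seq_len(nchar(dnaSeq) - k + 1)
    kmers <- substring(dnaSeq, starts, starts + k - 1)
    annKmers <- substring(annot, starts, starts + k - 1)
    ## assumption: windows touching a masked position ("-") are skipped
    keep <- !grepl("-", annKmers, fixed = TRUE)
    table(paste0(kmers[keep], annKmers[keep]))
}

## Unnormalized kernel value: dot product over the shared features.
annSpecKernelValue <- function(f1, f2) {
    common <- intersect(names(f1), names(f2))
    sum(as.numeric(f1[common]) * as.numeric(f2[common]))
}

f1 <- annSpecFeatures("ACGTAC", "eeeiii")
f2 <- annSpecFeatures("ACGTAC", "eeeeee")
names(f1)                   # "ACGeee" "CGTeei" "GTAeii" "TACiii"
annSpecKernelValue(f1, f2)  # 1: only "ACGeee" is shared
annSpecKernelValue(f1, f1)  # 4

## feature space size for DNA, k = 3, two annotation characters
4^3 * 2^3                   # 512, compared to 4^3 = 64 without annotation
```

The two inputs contain the same DNA sequence, yet they share only one feature because their annotation differs from the third k-mer onwards. This is the splitting into separately considered parts described above.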

The flag annSpec passed during creation of a kernel object controls whether annotation information is evaluated by the kernel (see functions spectrumKernel, gappyPairKernel, motifKernel). In this way, annotated sequences can be evaluated both with and without annotation by using two different kernel objects (see examples below). The annotation specific kernel variant is available for all kernels in this package except for the mismatch kernel.

annotationMetadata function

With this function, annotation metadata can be assigned to sequences defined as XStringSet (or BioVector). The sequence annotation strings are stored as element related information and can be retrieved with the method mcols. The characters used for annotation are stored as the annotation character set for the sequence set and can be retrieved with the method metadata. For the assignment of annotation metadata to biological sequences, this function should be used instead of the lower level functions metadata and mcols. The function annotationMetadata performs several checks and also takes care that other metadata or element metadata assigned to the object is kept. Annotation metadata are deleted if the parameters annCharset and annotation are set to NULL.

showAnnotatedSeq function

This function displays individual sequences aligned with the annotation string, with 50 positions per line. The two header lines show the start position for each block of 10 characters.

Accessor-like methods

The method annotationMetadata<- assigns annotation metadata to a sequence set. The annotation character set must also be specified in the assignment. Annotation characters which are not listed in the character set are treated like invalid sequence characters: they interrupt open patterns and lead to a restart of the pattern search at this position.
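For a spectrum-type kernel, the effect of such an interruption can be pictured as follows: a k-mer only contributes if all k of its annotation characters belong to the character set. The base R sketch below returns the start positions of those windows. The helper validWindowStarts is hypothetical, and the sketch reflects one interpretation of the behavior described above, not the actual pattern search in kebabs.

```r
## Hypothetical helper, not part of kebabs: start positions of all
## length-k windows whose annotation characters are all in the
## annotation character set. Masked ("-") and unlisted characters
## interrupt a window in this sketch.
validWindowStarts <- function(annot, annCharset, k) {
    chars <- strsplit(annot, "", fixed = TRUE)[[1]]
    allowed <- strsplit(annCharset, "", fixed = TRUE)[[1]]
    ok <- chars %in% allowed
    n <- length(chars)
    if (n < k)
        return(integer(0))
    starts <- seq_len(n - k + 1)
    isValid <- vapply(starts, function(i) all(ok[i:(i + k - 1)]),
                      logical(1))
    starts[isValid]
}

## "x" is not in the character set "ei": windows 1 to 3 are interrupted
validWindowStarts("eexiiie", "ei", k = 3)  # 4 5
## masked prefix: only the window starting at position 3 remains
validWindowStarts("--eee", "ei", k = 3)    # 3
```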

annotationMetadata: a character vector with the annotation strings

annotationCharset: a character vector with the annotation character set

Johannes Palme <kebabs@bioinf.jku.at>

http://www.bioinf.jku.at/software/kebabs

C.C. Mahrenholz, I.G. Abfalter, U. Bodenhofer, R. Volkmer and S. Hochreiter (2011) Complex networks govern coiled coil oligomerization - predicting and profiling by means of a machine learning approach. Mol. Cell. Proteomics. DOI: 10.1074/mcp.M110.004994.

J. Palme, S. Hochreiter, and U. Bodenhofer (2015) KeBABS: an R package for kernel-based analysis of biological sequences. Bioinformatics, 31(15):2574-2576. DOI: 10.1093/bioinformatics/btv176.

spectrumKernel, gappyPairKernel, motifKernel, positionMetadata, metadata, mcols

## load the package
library(kebabs)

## create a set of annotated DNA sequences;
## instead of user provided sequences in XStringSet format,
## a small set of DNA sequences is created for this example
x <- DNAStringSet(c("AGACTTAAGGGACCTGGTCACCACGCTCGGTGAGGGGGACGGGGTGT",
                    "ATAAAGGTTGCAGACATCATGTCCTTTTTGTCCCTAATTATTTCAGC",
                    "CAGGAATCAGCACAGGCAGGGGCACGGCATCCCAAGACATCTGGGCC",
                    "GGACATATACCCACCGTTACGTGTCATACAGGATAGTTCCACTGCCC",
                    "ATAAAGGTTGCAGACATCATGTCCTTTTTGTCCCTAATTATTTCAGC"))
names(x) <- paste0("S", seq_along(x))
## define the character set used in annotation
## the masking character '-' is not part of the character set
anncs <- "ei"
## annotation strings for each sequence as character vector
## in the third and fourth sample a part of the sequence is masked
annotStrings <- c("eeeeeeeeeeeeiiiiiiiiieeeeeeeeeeeeeeeeiiiiiiiiii",
                  "eeeeeeeeeiiiiiiiiiiiiiiiiiiieeeeeeeeeeeeeeeeeee",
                  "---------eeeeeeeeeeeeeeeeiiiiiiiiiiiiiiiiiiiiii",
                  "eeeeeeeeeeeeeeeeeeeeeeeiiiiiiiiiiiiiiiiiiii----",
                  "eeeeeeeeeeeeiiiiiiiiiiiiiiiiiiiiiiieeeeeeeeeeee")
## assign metadata to the DNAStringSet object
annotationMetadata(x, annCharset=anncs) <- annotStrings
## show annotation
annotationMetadata(x)
annotationCharset(x)

## show sequence 3 aligned with annotation string
showAnnotatedSeq(x, sel=3)

## create annotation specific spectrum kernel
speca <- spectrumKernel(k=3, annSpec=TRUE, normalized=FALSE)

## show details of kernel object
kernelParameters(speca)

## this kernel object can now be used in a classification or regression
## task in the usual way, or the kernel can be used, for example, to
## generate the kernel matrix for another learning method in another R
## package.
kma <- speca(x)
kma[1:5,1:5]
## generate a dense explicit representation for annotation-specific kernel
era <- getExRep(x, speca, sparse=FALSE)
era[1:5,1:8]

## when a standard spectrum kernel is used with annotated
## sequences, the annotation information is not evaluated
spec <- spectrumKernel(k=3, normalized=FALSE)
km <- spec(x)
km[1:5,1:5]

## finally delete annotation metadata if no longer needed
annotationMetadata(x) <- NULL
## show empty metadata
annotationMetadata(x)
annotationCharset(x)