Assign annotation metadata to sequences and create a kernel
object which evaluates annotation information
Show biological sequence together with annotation
showAnnotatedSeq(x, sel = 1, ann = TRUE, pos = TRUE, start = 1, end = width(x)[sel], width = NA)

## S4 method for signature 'XStringSet'
## annotationMetadata(x, annCharset= ...) <- value

## S4 method for signature 'BioVector'
## annotationMetadata(x, annCharset= ...) <- value

## S4 replacement method for signature 'BioVector'
annotationMetadata(x, ...) <- value

## S4 method for signature 'XStringSet'
annotationMetadata(x)

## S4 method for signature 'BioVector'
annotationMetadata(x)

## S4 method for signature 'XStringSet'
annotationCharset(x)

## S4 method for signature 'BioVector'
annotationCharset(x)
x: biological sequences in the form of an XStringSet (or a BioVector)
sel: single index into x selecting the sequence to display. Default = 1
ann: if TRUE, show annotation information along with the sequence
pos: if TRUE, show position information
start: first position to be displayed; by default the full sequence is shown
end: last position to be displayed; alternatively use the parameter 'width'
width: number of positions to be displayed; alternatively use the parameter 'end'
...: additional parameters which are passed transparently
value: character vector of annotation strings, one per sequence. Each annotation string must have the same number of characters as the corresponding sequence. In addition to the characters defined in the annotation character set, the character "-" can be used in annotation strings to mask parts of a sequence.
annCharset: character string listing all characters used in the annotation, sorted in ascending order according to the C locale; up to 32 characters are possible
Annotation information for sequences
For the annotation-specific kernel, additional annotation information is added to the sequence data. The annotation for one sequence consists of a character string with a single annotation character per position, i.e. the annotation string has the same length as the sequence. The character set used for annotation is user-defined at the level of the XStringSet and can contain up to 32 different characters. Each biological sequence needs an associated annotation string consisting of characters from this character set. Evaluation of annotation information during kernel processing, i.e. when a kernel matrix or an explicit representation is generated, can be activated per kernel object.
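The constraints in this paragraph can be checked with a few lines of base R. The following is only a sketch: `checkAnnotation` is a hypothetical helper, not a function of the package, and it assumes that sequences and annotation strings are given as plain character vectors. The masking character "-" is accepted in addition to the character set.

```r
# Hypothetical helper (not part of the package): checks, per sequence, that the
# annotation string has one character per position and uses only allowed characters.
checkAnnotation <- function(seqs, annot, charset) {
    chars <- strsplit(charset, "")[[1]]
    stopifnot(length(chars) <= 32)              # at most 32 annotation characters
    stopifnot(length(seqs) == length(annot))    # one annotation string per sequence
    allowed <- c(chars, "-")                    # "-" is accepted for masking
    sameLength <- nchar(seqs) == nchar(annot)   # one character per position
    validChars <- vapply(strsplit(annot, ""),
                         function(a) all(a %in% allowed), logical(1))
    sameLength & validChars
}

seqs  <- c("AGACTT", "ATAAAG")
annot <- c("eeeiii", "eeii")        # second string is too short
checkAnnotation(seqs, annot, "ei")  # TRUE FALSE
```

The result has one logical value per sequence, so a failing entry can be located directly by its index.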
Assignment of annotation information
The annotation character set is a character string listing all allowed annotation characters in alphabetical order. Any single-byte ASCII character in the decimal range 32 to 126, except 45, is allowed. The character '-' (ASCII decimal 45) is reserved for masking sequence parts which should not be evaluated and therefore must not be part of an annotation character set.
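As a rough illustration of these rules, the check below tests a candidate character set in base R. `isValidAnnCharset` is a hypothetical name, and the sketch assumes that "sorted" means strictly ascending ASCII codes (so duplicates are rejected) and that the limit of 32 characters applies; the package's own checks may differ in detail.

```r
# Hypothetical validity check for an annotation character set (assumptions:
# strictly ascending ASCII codes, 1 to 32 characters).
isValidAnnCharset <- function(charset) {
    codes <- utf8ToInt(charset)                    # character codes of the string
    inRange   <- all(codes >= 32 & codes <= 126)   # printable single-byte ASCII
    noMask    <- !(45L %in% codes)                 # '-' is reserved for masking
    ascending <- all(diff(codes) > 0)              # sorted, no duplicates
    sizeOk    <- length(codes) >= 1 && length(codes) <= 32
    inRange && noMask && ascending && sizeOk
}

isValidAnnCharset("ei")    # TRUE
isValidAnnCharset("ie")    # FALSE: not in ascending order
isValidAnnCharset("e-i")   # FALSE: contains the masking character
```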
The annotation character set is assigned to the sequence set with the annotationMetadata function (see below). It is stored in the metadata list as the named element annotationCharset and can be stored along with other metadata assigned to the sequence set. The annotation strings for the individual sequences are represented as a character vector and can be assigned to the XStringSet, together with the annotation character set, as element-related metadata. Element-related metadata is stored in a DataFrame whose columns represent the different types of metadata that can be assigned in parallel. The column name for the sequence-related annotation information is "annotation" (see the Examples section for an example of annotation metadata assignment). Annotation metadata can be assigned to a sequence set together with position metadata (see positionMetadata).
Annotation Specific Kernel Processing
The annotation-specific variant of a kernel, e.g. of the spectrum kernel, appends the annotation characters corresponding to a specific k-mer to this k-mer and treats the resulting pattern as one feature, the basic unit for similarity determination. The full feature space of an annotation-specific spectrum kernel is the Cartesian product of the set of all possible sequence patterns with the set of all possible annotation patterns. Depending on the number of characters in the annotation character set, the feature space grows drastically compared to the normal spectrum kernel. In return, annotation allows the similarity between two sequences to be split into independent parts that are considered separately, e.g. coding/non-coding or exon/intron. For amino acid sequences, a heptad annotation (a usually periodic pattern of 7 characters, a to g) can be used, as in the prediction of coiled coil structures (see reference Mahrenholz, 2011).
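The feature construction described above can be sketched in base R. `annotatedKmers` is a hypothetical helper that works on plain character strings and ignores masking; it only illustrates how a k-mer and its annotation characters are joined into one feature, and how the feature space grows.

```r
# Hypothetical sketch: build annotation-specific k-mer features by appending
# the annotation characters of each k-mer to the k-mer itself.
annotatedKmers <- function(s, a, k) {
    starts <- seq_len(max(0, nchar(s) - k + 1))    # start position of every k-mer
    paste0(substring(s, starts, starts + k - 1),   # sequence pattern
           substring(a, starts, starts + k - 1))   # matching annotation pattern
}

annotatedKmers("AGACT", "eeiii", k = 3)
# "AGAeei" "GACeii" "ACTiii"

# Feature space size for DNA with k = 3:
4^3            # 64 features for the normal spectrum kernel
4^3 * 2^3      # 512 features with the two annotation characters "e" and "i"
```

With a seven-character heptad annotation and k = 3, the annotation part alone contributes a factor of 7^3 = 343.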
The parameter annSpec, passed during creation of a kernel object, controls whether annotation information is evaluated by the kernel (see functions spectrumKernel, gappyPairKernel, motifKernel). In this way, sequences with annotation can be evaluated both annotation-specifically and without annotation by using two different kernel objects (see examples below). The annotation-specific kernel variant is available for all kernels in this package except the mismatch kernel.
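To see how much the two variants can differ, the following base R sketch computes an unnormalized spectrum kernel value for one pair of sequences, once without and once with annotation. `spectrumValue` is a hypothetical function written for this illustration; its `annSpec` argument only mimics the role of the kernel parameter, and masking is not handled.

```r
# Hypothetical sketch: unnormalized spectrum kernel value between two sequences,
# optionally annotation-specific. Not the package implementation.
spectrumValue <- function(s1, a1, s2, a2, k, annSpec = FALSE) {
    feats <- function(s, a) {
        starts <- seq_len(max(0, nchar(s) - k + 1))
        kmers <- substring(s, starts, starts + k - 1)
        if (annSpec)   # append the annotation characters to each k-mer
            kmers <- paste0(kmers, substring(a, starts, starts + k - 1))
        table(kmers)   # feature counts
    }
    f1 <- feats(s1, a1)
    f2 <- feats(s2, a2)
    common <- intersect(names(f1), names(f2))
    sum(as.numeric(f1[common]) * as.numeric(f2[common]))
}

s <- "ACGTACG"
spectrumValue(s, "eeeeiii", s, "iiiieee", k = 3)                  # 7
spectrumValue(s, "eeeeiii", s, "iiiieee", k = 3, annSpec = TRUE)  # 2
```

The two sequences are identical, so the normal variant counts every shared 3-mer, while the annotation-specific variant only counts 3-mers that also carry the same annotation pattern.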
With annotationMetadata, annotation metadata can be assigned to sequences defined as XStringSet (or BioVector). The sequence annotation strings are stored as element-related information and can be retrieved with the method mcols. The characters used for annotation are stored as the annotation character set for the sequence set and can be retrieved with the method metadata. For the assignment of annotation metadata to biological sequences, this function should be used instead of the lower-level functions metadata and mcols. The function annotationMetadata performs several checks and also takes care that other metadata or element metadata assigned to the object is kept. Annotation metadata is deleted when NULL is assigned, i.e. annotationMetadata(x) <- NULL.
showAnnotatedSeq displays an individual sequence aligned with its annotation string, with 50 positions per line. The two header lines show the start position of each block of 10 characters.
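The line layout can be approximated in base R as follows. This is a simplified sketch under stated assumptions: `alignedLines` is a hypothetical helper, the input is a non-empty plain character string, and the two position header lines are left out, so the real output looks different.

```r
# Hypothetical sketch: split a sequence and its annotation string into chunks of
# 50 positions and interleave them (sequence chunk, then annotation chunk).
alignedLines <- function(s, a, lineWidth = 50) {
    starts <- seq(1, nchar(s), by = lineWidth)
    ends   <- pmin(starts + lineWidth - 1, nchar(s))
    # rbind() places each sequence chunk above its annotation chunk;
    # as.vector() reads the matrix column by column
    as.vector(rbind(substring(s, starts, ends), substring(a, starts, ends)))
}

s <- strrep("ACGT", 15)                            # 60 positions
a <- paste0(strrep("e", 50), strrep("i", 10))
cat(alignedLines(s, a), sep = "\n")
```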
The method annotationMetadata<- assigns annotation metadata to a sequence set. The annotation character set must also be specified in the assignment. Annotation characters which are not listed in the character set are treated like invalid sequence characters: they interrupt open patterns and lead to a restart of the pattern search at this position.
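The effect of annotation characters outside the character set can be illustrated with a short base R sketch. `validAnnotatedKmers` is a hypothetical helper, and it assumes that the masking character "-" and unlisted characters behave the same way for k-mer extraction: any k-mer window that touches such a position is dropped.

```r
# Hypothetical sketch: keep only k-mers whose annotation window consists entirely
# of characters from the annotation character set.
validAnnotatedKmers <- function(s, a, k, charset) {
    chars  <- strsplit(charset, "")[[1]]
    ok     <- strsplit(a, "")[[1]] %in% chars   # FALSE for "-" and unlisted characters
    starts <- seq_len(max(0, nchar(s) - k + 1))
    # a window is kept only if all k annotation positions are valid
    keep   <- vapply(starts, function(i) all(ok[i:(i + k - 1)]), logical(1))
    starts <- starts[keep]
    paste0(substring(s, starts, starts + k - 1),
           substring(a, starts, starts + k - 1))
}

validAnnotatedKmers("CAGGAATC", "---eeeii", k = 3, charset = "ei")
# "GAAeee" "AATeei" "ATCeii"
```

The first three windows overlap the masked prefix and produce no feature; extraction resumes at the first position where a complete window is valid again.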
annotationMetadata: a character vector with the annotation
annotationCharset: a character vector with the annotation
Johannes Palme
C.C. Mahrenholz, I.G. Abfalter, U. Bodenhofer, R. Volkmer and S. Hochreiter (2011) Complex networks govern coiled coil oligomerization - predicting and profiling by means of a machine learning approach. Mol. Cell. Proteomics. DOI: 10.1074/mcp.M110.004994.
J. Palme, S. Hochreiter and U. Bodenhofer (2015) KeBABS: an R package for kernel-based analysis of biological sequences. Bioinformatics, 31(15):2574-2576. DOI: 10.1093/bioinformatics/btv176.
library(kebabs)

## create a set of annotated DNA sequences
## instead of user provided sequences in XStringSet format
## for this example a set of DNA sequences is created
x <- DNAStringSet(c("AGACTTAAGGGACCTGGTCACCACGCTCGGTGAGGGGGACGGGGTGT",
                    "ATAAAGGTTGCAGACATCATGTCCTTTTTGTCCCTAATTATTTCAGC",
                    "CAGGAATCAGCACAGGCAGGGGCACGGCATCCCAAGACATCTGGGCC",
                    "GGACATATACCCACCGTTACGTGTCATACAGGATAGTTCCACTGCCC",
                    "ATAAAGGTTGCAGACATCATGTCCTTTTTGTCCCTAATTATTTCAGC"))
names(x) <- paste("S", 1:length(x), sep="")

## define the character set used in annotation
## the masking character '-' is not part of the character set
anncs <- "ei"

## annotation strings for each sequence as character vector
## in the third and fourth sample a part of the sequence is masked
annotStrings <- c("eeeeeeeeeeeeiiiiiiiiieeeeeeeeeeeeeeeeiiiiiiiiii",
                  "eeeeeeeeeiiiiiiiiiiiiiiiiiiieeeeeeeeeeeeeeeeeee",
                  "---------eeeeeeeeeeeeeeeeiiiiiiiiiiiiiiiiiiiiii",
                  "eeeeeeeeeeeeeeeeeeeeeeeiiiiiiiiiiiiiiiiiiii----",
                  "eeeeeeeeeeeeiiiiiiiiiiiiiiiiiiiiiiieeeeeeeeeeee")

## assign metadata to DNAStringSet object
annotationMetadata(x, annCharset=anncs) <- annotStrings

## show annotation
annotationMetadata(x)
annotationCharset(x)

## show sequence 3 aligned with annotation string
showAnnotatedSeq(x, sel=3)

## create annotation specific spectrum kernel
speca <- spectrumKernel(k=3, annSpec=TRUE, normalized=FALSE)

## show details of kernel object
kernelParameters(speca)

## this kernel object can now be used in a classification or regression
## task in the usual way or you can use the kernel for example to generate
## the kernel matrix for use with another learning method in another R
## package.
kma <- speca(x)
kma[1:5,1:5]

## generate a dense explicit representation for annotation-specific kernel
era <- getExRep(x, speca, sparse=FALSE)
era[1:5,1:8]

## when a standard spectrum kernel is used with annotated
## sequences the annotation information is not evaluated
spec <- spectrumKernel(k=3, normalized=FALSE)
km <- spec(x)
km[1:5,1:5]

## finally delete annotation metadata if no longer needed
annotationMetadata(x) <- NULL

## show empty metadata
annotationMetadata(x)
annotationCharset(x)