OverlapEncodings-class: OverlapEncodings objects
In GenomicAlignments: Representation and manipulation of short genomic alignments


The OverlapEncodings class is a container for storing the "overlap encodings" returned by the encodeOverlaps function.

## -=-= OverlapEncodings getters =-=-

## S4 method for signature 'OverlapEncodings'
Loffset(x)
## S4 method for signature 'OverlapEncodings'
Roffset(x)
## S4 method for signature 'OverlapEncodings'
encoding(x)
## S4 method for signature 'OverlapEncodings'
levels(x)
## S4 method for signature 'OverlapEncodings'
flippedQuery(x)

## -=-= Coercing an OverlapEncodings object =-=-

## S4 method for signature 'OverlapEncodings'
as.data.frame(x, row.names=NULL, optional=FALSE, ...)

## -=-= Low-level encoding utilities =-=-

encodingHalves(x, single.end.on.left=FALSE, single.end.on.right=FALSE,
                  as.factors=FALSE)
Lencoding(x, ...)
Rencoding(x, ...)

## S4 method for signature 'ANY'
njunc(x)

Lnjunc(x, single.end.on.left=FALSE)
Rnjunc(x, single.end.on.right=FALSE)

isCompatibleWithSplicing(x)

`x`	An OverlapEncodings object. For the low-level encoding utilities, `x` can also be a character vector or factor containing encodings.
`row.names`	`NULL` or a character vector.
`optional`	Ignored.
`...`	Extra arguments passed to the `as.data.frame` method for OverlapEncodings objects are ignored. Extra arguments passed to `Lencoding` or `Rencoding` are passed down to `encodingHalves`.
`single.end.on.left, single.end.on.right`	By default the 2 halves of a single-end encoding are considered to be NAs. If `single.end.on.left` (resp. `single.end.on.right`) is `TRUE`, then the left (resp. right) half of a single-end encoding is considered to be the unmodified encoding.
`as.factors`	By default `encodingHalves` returns the 2 encoding halves as a list of 2 character vectors parallel to the input. If `as.factors` is `TRUE`, then it returns them as a list of 2 factors parallel to the input.

Given a query and a subject of the same length, both list-like objects with top-level elements typically containing multiple ranges (e.g. IntegerRangesList objects), the "overlap encoding" of the i-th element in query and i-th element in subject is a character string describing how the ranges in query[[i]] are qualitatively positioned relative to the ranges in subject[[i]].

The encodeOverlaps function computes those overlap encodings and returns them in an OverlapEncodings object of the same length as query and subject.

The topic of working with overlap encodings is covered in detail in the "OverlapEncodings" vignette located in this package (GenomicAlignments) and accessible with vignette("OverlapEncodings").

In the following code snippets, x is an OverlapEncodings object typically obtained by a call to encodeOverlaps(query, subject).

length(x): Get the number of elements (i.e. encodings) in x. This is equal to length(query) and length(subject).

Loffset(x), Roffset(x): Get the "left offsets" and "right offsets" of the encodings, respectively. Both are integer vectors of the same length as x.

Let's denote Qi = query[[i]], Si = subject[[i]], and [q1,q2] the range covered by Qi, i.e. q1 = min(start(Qi)) and q2 = max(end(Qi)). Then Loffset(x)[i] is the number L of ranges at the head of Si that are strictly to the left of all the ranges in Qi, i.e. L is the greatest value such that end(Si)[k] < q1 - 1 for all k in seq_len(L). Similarly, Roffset(x)[i] is the number R of ranges at the tail of Si that are strictly to the right of all the ranges in Qi, i.e. R is the greatest value such that start(Si)[length(Si) + 1 - k] > q2 + 1 for all k in seq_len(R).
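The definition of L and R above can be sketched in base R. This is only an illustration under stated assumptions: the ranges are plain integer start/end vectors rather than IRanges objects, the coordinates are made up, and `count_offsets` is a hypothetical helper that is not part of GenomicAlignments.

```r
# Hypothetical helper: counts the leading/trailing ranges of Si that lie
# strictly to the left/right of the range [q1, q2] covered by Qi.
count_offsets <- function(q_start, q_end, s_start, s_end) {
  q1 <- min(q_start)
  q2 <- max(q_end)
  # cumprod() drops to 0 at the first FALSE, so summing it counts
  # the uninterrupted run of TRUE values at the head of the vector.
  L <- sum(cumprod(s_end < q1 - 1))
  R <- sum(cumprod(rev(s_start) > q2 + 1))
  c(Loffset = as.integer(L), Roffset = as.integer(R))
}

# Made-up subject with 4 ranges and a query with 2 ranges.
s_start <- c(1L, 101L, 201L, 301L)
s_end   <- c(50L, 150L, 250L, 350L)
q_start <- c(120L, 201L)
q_end   <- c(150L, 230L)

offs <- count_offsets(q_start, q_end, s_start, s_end)

# Indices of the ranges kept in the "trimmed" subject tSi.
kept <- offs[["Loffset"]] + seq_len(length(s_start) - sum(offs))
```

Here only the first and last subject ranges are trimmed, so `kept` designates the two middle ranges.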

encoding(x): Factor of the same length as x where the i-th element is the encoding obtained by comparing each range in Qi with all the ranges in tSi = Si[(1+L):(length(Si)-R)] (tSi stands for "trimmed Si"). More precisely, here is how this encoding is obtained:

All the ranges in Qi are compared with tSi[1], then with tSi[2], etc. At each step (one step per range in tSi), comparing all the ranges in Qi with tSi[k] is done with rangeComparisonCodeToLetter(pcompare(Qi, tSi[k])). So at each step, we end up with a vector of M single letters (where M is length(Qi)).
Each vector obtained previously (1 vector per range in tSi, all of them of length M) is turned into a single string (called "encoding block") by pasting its individual letters together.
All the encoding blocks (1 per range in tSi) are pasted together into a single long string and separated by colons (":"). An additional colon is prepended to the long string and another one appended to it.
Finally, a special block containing the value of M is prepended to the long string. The final string is the encoding.
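The pasting steps above can be sketched in base R. The sketch assumes the per-step letters have already been computed; the letters used here are simply read off the example encoding "4:jmmm:agmm:aagm:aaaf:" that appears in the Examples of this page, and the variable names are illustrative only.

```r
# letters_per_step[[k]] holds the M letters obtained when all the ranges
# in Qi are compared with tSi[k] (values taken from the example encoding).
letters_per_step <- list(
  c("j", "m", "m", "m"),
  c("a", "g", "m", "m"),
  c("a", "a", "g", "m"),
  c("a", "a", "a", "f")
)
M <- length(letters_per_step[[1]])

# One encoding block per range in tSi: paste the M letters together.
blocks <- vapply(letters_per_step, paste, character(1), collapse = "")

# Join the blocks with colons, then add a leading and a trailing colon.
long_string <- paste0(":", paste(blocks, collapse = ":"), ":")

# Prepend the special block containing M.
encoding_string <- paste0(M, long_string)
encoding_string
```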

levels(x): Equivalent to levels(encoding(x)).

flippedQuery(x): Whether or not the top-level element in query used for computing the encoding was "flipped" before the encoding was computed. Note that this flipping generally affects the "left offset" and the "right offset", in addition to the encoding itself.

In the following code snippets, x is an OverlapEncodings object.

as.data.frame(x): Return x as a data frame with columns "Loffset", "Roffset" and "encoding".

In the following code snippets, x can be an OverlapEncodings object, or a character vector or factor containing encodings.

encodingHalves(x, single.end.on.left=FALSE, single.end.on.right=FALSE, as.factors=FALSE): Extract the 2 halves of paired-end encodings and return them as a list of 2 character vectors (or 2 factors) parallel to the input.

Paired-end encodings are obtained by encoding paired-end overlaps i.e. overlaps between paired-end reads and transcripts (typically). The difference between a single-end encoding and a paired-end encoding is that all the blocks in the latter contain a "--" separator to mark the separation between the "left encoding" and the "right encoding".

See examples below and the "Overlap encodings" vignette located in this package for examples of paired-end encodings.

Lencoding(x, ...), Rencoding(x, ...): Extract the "left encodings" and "right encodings" of paired-end encodings.

Equivalent to encodingHalves(x, ...)[[1]] and encodingHalves(x, ...)[[2]], respectively.
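To make the role of the "--" separator concrete, here is a base R sketch that splits one encoding string into its two halves. It is not the package's implementation: `split_paired_encoding` is a hypothetical helper, and it assumes that each half is itself laid out like a single-end encoding (special block first, colon-separated blocks, trailing colon). Single-end input yields two NAs, mirroring the default described for `single.end.on.left` and `single.end.on.right`.

```r
# Hypothetical helper, for illustration only.
split_paired_encoding <- function(enc) {
  # strsplit() drops the empty string after the trailing colon.
  blocks <- strsplit(enc, ":", fixed = TRUE)[[1]]
  if (!all(grepl("--", blocks, fixed = TRUE))) {
    # Not a paired-end encoding: both halves are NA.
    return(c(left = NA_character_, right = NA_character_))
  }
  halves <- strsplit(blocks, "--", fixed = TRUE)
  left  <- vapply(halves, `[`, character(1), 1)
  right <- vapply(halves, `[`, character(1), 2)
  c(left  = paste0(paste(left,  collapse = ":"), ":"),
    right = paste0(paste(right, collapse = ":"), ":"))
}

paired_halves <- split_paired_encoding("3--1:jmm--b:agm--i:")
single_halves <- split_paired_encoding("4:jmmm:agmm:aagm:aaaf:")
```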

njunc(x), Lnjunc(x, single.end.on.left=FALSE), Rnjunc(x, single.end.on.right=FALSE): Extract the number of junctions in each encoding by looking at their first block (aka special block). If an element xi in x is a paired-end encoding, then Lnjunc(xi), Rnjunc(xi), and njunc(xi) return njunc(Lencoding(xi)), njunc(Rencoding(xi)), and Lnjunc(xi) + Rnjunc(xi), respectively.
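The relationship between the special block and the junction counts can be sketched in base R. The sketch assumes that the number of junctions of a read is its number of ranges minus one (so a special block of "4" means 3 junctions); `count_junctions` is a hypothetical helper written for illustration, not the package's code.

```r
# Hypothetical helper: derive junction counts from the special block.
count_junctions <- function(enc) {
  # The special block is everything before the first colon.
  special <- sub(":.*$", "", enc)
  # A paired-end special block looks like "3--1": one size per end.
  sizes <- strsplit(special, "--", fixed = TRUE)
  # Assumption: junctions = ranges - 1, summed over the ends.
  vapply(sizes, function(m) sum(as.integer(m) - 1L), integer(1))
}

encodings <- c("4:jmmm:agmm:aagm:aaaf:", "3--1:jmm--b:agm--i:")
junction_counts <- count_junctions(encodings)
```

For the paired-end element, the total is the sum of the per-end counts, which matches the identity njunc(xi) = Lnjunc(xi) + Rnjunc(xi) stated above.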

isCompatibleWithSplicing(x): Returns a logical vector parallel to x indicating whether the corresponding encoding describes a splice compatible overlap i.e. an overlap that is compatible with the splicing of the transcript.

WARNING: For paired-end encodings, isCompatibleWithSplicing considers that the encoding is splice compatible if its 2 halves are splice compatible. This can produce false positives, for example if the right end of the alignment is located upstream of the left end in transcript space. In that case the paired-end read could not come from this transcript. To eliminate these false positives, one would need to have access to, and look at, the positions of the left and right ends in transcript space. This can be done with extractQueryStartInTranscript.

Hervé Pagès

The "OverlapEncodings" vignette in this package.
The encodeOverlaps function for computing "overlap encodings".
The pcompare function in the IRanges package for the interpretation of the strings returned by encoding.
The GRangesList class defined and documented in the GenomicRanges package.

## ---------------------------------------------------------------------
## A. BASIC MANIPULATION OF AN OverlapEncodings OBJECT
## ---------------------------------------------------------------------

example(encodeOverlaps)  # to generate the 'ovenc' object

length(ovenc)
Loffset(ovenc)
Roffset(ovenc)
encoding(ovenc)
levels(ovenc)
nlevels(ovenc)
flippedQuery(ovenc)
njunc(ovenc)

as.data.frame(ovenc)
njunc(levels(ovenc))

## ---------------------------------------------------------------------
## B. WORKING WITH PAIRED-END ENCODINGS (POSSIBLY MIXED WITH SINGLE-END
##    ENCODINGS)
## ---------------------------------------------------------------------

encodings <- c("4:jmmm:agmm:aagm:aaaf:", "3--1:jmm--b:agm--i:")

encodingHalves(encodings)
encodingHalves(encodings, single.end.on.left=TRUE)
encodingHalves(encodings, single.end.on.right=TRUE)
encodingHalves(encodings, single.end.on.left=TRUE,
                          single.end.on.right=TRUE)

Lencoding(encodings)
Lencoding(encodings, single.end.on.left=TRUE)
Rencoding(encodings)
Rencoding(encodings, single.end.on.right=TRUE)

njunc(encodings)
Lnjunc(encodings)
Lnjunc(encodings, single.end.on.left=TRUE)
Rnjunc(encodings)
Rnjunc(encodings, single.end.on.right=TRUE)

## ---------------------------------------------------------------------
## C. DETECTION OF "SPLICE COMPATIBLE" OVERLAPS
## ---------------------------------------------------------------------

## Reads that are compatible with the splicing of the transcript can
## be detected with a regular expression (the regular expression below
## assumes that reads have at most 2 junctions):
regex0 <- "(:[fgij]:|:[jg].:.[gf]:|:[jg]..:.g.:..[gf]:)"
grepl(regex0, encoding(ovenc))  # read4 is NOT "compatible"

## This was for illustration purposes only. In practice you don't need
## to (and should not) use this regular expression. Use the
## isCompatibleWithSplicing() utility function instead:
isCompatibleWithSplicing(ovenc)