In arendsee/synder: Use Synteny to Filter Genomic Search

List of all vignettes:

This work is funded by the National Science Foundation grant NSF-IOS 1546858.

Top level commands

Search

This is the primary function of synder. A set of intervals in one genome (the query) is mapped to a set of intervals in another genome (the target). The input intervals may fall between query-side syntenic intervals, in which case, the search interval also will fall inbetween target syntenic intervals.

The output is a table with the following fields:

query interval name (e.g. AT1G20300)
query chromosome name
query start position
query stop position
target chromosome name
search interval start position on target chromsome
search interval stop position on target chromsome
search interval strand ('+' / '-')
score
contiguous set id
lower flag
0 lower bound is inside a syntenic interval
1 lower bound is between intervals in a contiguous set
2 lower bound does not overlap the contiguous set
3 lower bound is beyond any syntenic interval (near end of scaffold)
upper flag - see lower flag
between flag - 0 if query overlaps a syntenic interval, 1 otherwise

Filter

filter removes links that disagree with the synteny map.

The filter function takes two syntenic maps and finds the congruent links. Given two syntenic maps, A and B, the query-side intervals in A are mapped to target side search intervals using the syntenic map B. Then the target-side intervals in A that overlap the predicted search intervals are printed.

The most obvious usage case takes as input the results of BLASTing a sequence against a genome. Often such searches will have many hits in a swooping e-value gradient. The highest scoring hit may not be the orthologous one (for example, a weakly similar, but long hit may outscore the nearly identical, but truncated, true ortholog). synder filter will find the hits that are concordant with genomic context, reducing possibly thousands of hits to only a few.

Map

Find all query-side syntenic blocks that overlap the input interval. Then map these blocks to the target-side and print the result. If an input interval overlaps no query bock, the flanks are printed.

The output will have the columns:o 1. input interval name (e.g. AT1G01010) 2. target contig name (e.g. Chr1) 3. target start position 4. target stop position 5. missing flag, 0 if input overlaps no block, 1 otherwise

Count

Like map except it counts the number of overlaps, rather than printing them. The output is a TAB-delimited list of sequence names and counts.

Dump

Builds the internal synteny datastructure and prints the results.

For example, given the file

que   100    200    tar   1100   1200   100   +
que   1100   1800   tar   1500   1900   100   +
que   1400   1700   tar   1600   2000   100   +
que   1200   1600   tar   1700   2100   100   +
que   1300   1900   tar   1800   2200   100   +

R> synder::dump('que-tar.syn', trans='p')
que  100   200   tar  1100  1200  101.000000  +  0
que  1100  1900  tar  1500  2200  700.084964  +  0

In the above example, dump shows that blocks 2-5 are merged (since they are doubly-overlapping) and shows the results of the score transformations. It is this dumped synteny map that would have been used in any filter or search operations.

Also note the addition of a 9th column. This column specifies that contiguous set. In this case, all blocks, after merging, are in the same set.

Inputs:

The synteny map must be TAB-delimited, with no header, and must have the following fields:

qseqid - query contig id (e.g. Chr1)
qstart - query interval start
qstop - query interval stop
sseqid - target contig id
sstart - target interval start
sstop - target interval stop
score - score of the syntenic match*
strand - relative orientation

* score can be any numeric value, it will be transformed as specified by the -x option

The target and query genome lengths files must be TAB-delimited with columns: ,

Terms:

query genome - the genome referenced by the input intervals
target genome - the genome to which input intervals are mapped
synteny map - a set of query/target interval pairs inferred to be syntenic
block - a single pair of intervals from the synteny map
contig - a chromosome, scaffold, contig (synder doesn't distinguish between them)
interval adjacency - two intervals are adjacent if they are on the same contig and no other interval is fully contained between them
block adjacency - two blocks are adjacent if 1) the intervals are adjacent on both the query and target sides and 2) both have the same sign
query context - all blocks that overlap or are adjacent to the query interval
contiguous set - a set where block i is adjacent to block i+1
search intervals - a set of intervals on the target genome where the ortholog of the query interval is expected to be (the ortholog search space)

Algorithm:

Execution order for the synder search command

load synteny map
transform scores
merge doubly-overlapping blocks
determine contiguous sets
find query-side, contextually-relevant blocks for each input sequence
map to overlapping contiguous sets
calculate score for input sequence relative to each contiguous set
calculate search interval relative to each contiguous set

1. Load Synteny Map

The input synteny map must have the following columns:

qseqid query contig id (e.g. Chr1)
qstart query interval start
qstop query interval stop
sseqid target contig id
sstart target interval start
sstop target interval stop
score score of the syntenic match
strand relative orientation

2. Transform scores

Internally, scores for each block are assumed to be additive. However, there are many possible forms of input scores. Low numbers may represent good matches (e.g. e-value) or high numbers (e.g. bit scores). Scores may be additive (bitscores) or averaged (percent similarity).

For this reason, the user must specify a transform for the score column (column 7 of synteny map). This transform can be one of the following:

  S' = S            default, no transformation)
  S' = L * S        transform from score densities
  S' = L * S / 100  transform from percent identity
  S' = -log(S)      transform from e-values or p-values

Where S is input score, S' is the transformed score, and L interval length

3. Merge doubly-overlapping blocks

If two blocks overlap on both the query and target side, they should be merged. In a perfect synteny map, this would never occur. But in practice, it is common, and convolutes the formation of contiguous sets.

For example the following syntenic map

que 100 200 tar 100 200 100 +
que 300 400 tar 300 400 100 +
que 310 390 tar 310 390 100 +
que 500 600 tar 500 600 100 +

would be separated into two contiguous sets with the bounds (100, 400) and (310, 600).

Now if a input intervals at (que, 250, 450) is searched, two search intervals with the same bounds will obtain: (200, 500) and (200, 500). Each will be flagged as unbound (e.g. an interval edge is between contiguous sets).

However, if the second and third blocks are merged, we obtain a single search interval flagged as bound on both ends.

In practice, many repetitive regions of the genome can be massively doubly-overlapping, resulting in many redundant search intervals. This throws of statistics and wastes time searching extra space.

To avoid such problems, synder merges any blocks that overlap on both the query and target sides.

The bounds of the merged blocks are simply the union.

But the scores also need to be merged. Originally, I used the equation:

 da (la - lo) + db (lb - lo) + lo (da + db) / 2            E1

Where * da and db are the score densities of blocks a and b * la, lb and lo are the lengths of a, b, and their overlap

E1 is problematic for two reasons. First, it doesn't really make sense to average the overlap scores. Second, iterative pairwise averaging is assymetric, giving higher weight to the blocks merged later.

I replaced this approach with taking the max of the overlap scores:

 da (la - lo) + db (lb - lo) + lo * max(da, db)            E2

This fixes the first problem.

The second problem is a bit trickier, since it would require tracking sub-intervals. It would be a pain to implement and probably would have little effect on any real dataset. So I am content with an imperfect solution for now.

4. Determine contiguous sets

Build an interval adjacency matrix for the query context intervals and for the target context intervals. AND them together to get a block adjacency matrix. From this matrix, extract paths of adjacent blocks. Synder will merge any blocks that overlap on both genomes, this ensures there is a unique path through the adjacency matrix.

5. Find contextual blocks

Mapping query intervals to target intervals would be trivial for a perfectly linear map, e.g.

                                      [-----------]
=====      ============     ==========             ============
  |              |               |                       |  
  |              |               |                       |  
=====      ============     ==========   <--->     ============

Where <---> is the query interval and [----] the search interval

Synder reduces all blocks in the genome into into contiguous sets, which are non-overlapping sets of intervals where all blocks are adjacent.

interval adjacency - two intervals are adjacent if they are on the same contig and no other interval is fully contained between them
block adjacency - two blocks are adjacent if 1) the intervals are adjacent on both the query and target sides and 2) both have the same sign
query context - all blocks that overlap or are adjacent to the query interval
contiguous set - a set where block i is adjacent to block $i+1$
search intervals - a set of intervals on the target genome where the ortholog of the query interval is expected to be (the ortholog search space)

6. Reduce to overlapping sets

Once all overlapping and flanking query-side blocks are identified.

7. Calculate scores relative to contiguous sets

Especially with the high $k$, it is important to be able to rank search intervals. Queries that heavily overlap elements of contiguous sets are more reliable than ones that fall inbetween. Likewise, queries in dense contiguous sets, should rank higher than those in sparse ones (all else being equal).

Input scores for syntenic blocks are additive (assuming the user entered the correct transformation).

s = 0
for i in [a..b]
    if i in any block in c AND i in q
        s += d / L
    else if i in any block in c
        s += d * e^(-r * (dist(q, i)))
where
    c := the contiguous set
    q := query interval
    a := contig lower bound
    b := contig upper bound
    d := query syntenic score
    L := query length
    dist := function calculating distance

So all blocks in the contiguous set contribute to the total score. The scores of blocks that do not overlap the query decay exponentially with distance from the query bound.

The score decay rate is controlled by the parameter $r$. A value of 0.001, the current default, indicates weight will fall to 0.5 by 1000 bases from the query. k=0 would give equal weight to all elements in the contiguous set, i.e. be more affected by context. A high value, e.g. $r=100$, would completely ignore context, basing score only on overlapping elements.

8. Calculate search interval relative to contiguous set

Exactly one search interval is created for each previously selected contiguous set.

The chromosome length file tells synder how long each chromosome is. Then if some input query is closer to the end than anything in the synteny map, the search interval bound can be set to the end of the chromosome, rather than infinity.

Find input interval, i, context in the query genome. This context consists of all query intervals Sq that either overlap or flank the input.
From Sq determine which contiguous sets ,Sc, the query overlaps or borders.
For each contiguous set, c, in Sc, classify each bound of i as:
1. anchored - if overlaps a member of c
2. bound - if is inbetween two members of c
3. unbound - if is more extreme than any member of the set, but is inbetween two contiguous sets
4. extreme - No entry is found

The snapping rules are detailed below:

KEY:
|x--  --y| - bounds of the contiguous block; start==x, stop==y
a========b - a syntenic block with start == a and stop == b
  <---q    - the query interval, with stop == q (start doesn't matter)
a==b--c==d - query bounding blocks in the contiguous set
[===]      - a non-bounding block in the same contiguous set
 ...  F=== - nearest non-adjacent block ***ON THE TARGET***, F=start
     ^     - search interval bound

Possible snap positions (relative to query)
  |x...[===]-----a=======b-----c=======d-----[===]...y|  ...  F===
                 ^       ^     ^       ^     ^                ^

We always snap to one bound of the relevant block. If the bound falls between the blocks (i1, j1) and (i2, j2), it would be reasonable to set the search interval to (j1 + 1, i2 - 1), however, this results in negative lenghts when the blocks are adjacent. So instead we set such intervals to to (j1, i2).

Query bound is precedes both contiguous set bounds

         |x-----y|
      ...a=======b
         ^
  <---q

Query bound falls within a contiguous set element

  |x...a=======b...y|
               ^
       <--q

Query bound falls between elements of the contiguous set

  |x...a=======b-------c=======d...y|
                       ^
                 <--q

Query is beyond the bounds of the contiguous set

  Target side           A=======B      F===   1. map b-\>B   
                                |------^      2. map B->F
                                |             3. set F as SI bound
  Query side     |x...--a=======b| ...
                                <---q

Query is beyond anything in the synteny map

  Target side           A=======B      THE_END   
                                |------^      
                                |             1. map b->B   
  Query side     |x...--a=======b| ...        2. B is contig extremum
                                <---q         3. set SI bound to contig length

arendsee/synder documentation built on May 10, 2019, 1:26 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

arendsee/synder
Use Synteny to Filter Genomic Search

In arendsee/synder: Use Synteny to Filter Genomic Search

Top level commands

Search

Filter

Map

Count

Dump

Inputs:

Terms:

Algorithm:

Query bound is precedes both contiguous set bounds

Query bound falls within a contiguous set element

Query bound falls between elements of the contiguous set

Query is beyond the bounds of the contiguous set

Query is beyond anything in the synteny map

R Package Documentation

Browse R Packages

We want your feedback!

arendsee/synder Use Synteny to Filter Genomic Search

In arendsee/synder: Use Synteny to Filter Genomic Search

Top level commands

Search

Filter

Map

Count

Dump

Inputs:

Terms:

Algorithm:

Query bound is precedes both contiguous set bounds

Query bound falls within a contiguous set element

Query bound falls between elements of the contiguous set

Query is beyond the bounds of the contiguous set

Query is beyond anything in the synteny map

R Package Documentation

Browse R Packages

We want your feedback!

arendsee/synder
Use Synteny to Filter Genomic Search