List of all vignettes:

This work is funded by the National Science Foundation grant NSF-IOS 1546858.

Top level commands

Search

This is the primary function of synder. A set of intervals in one genome (the query) is mapped to a set of intervals in another genome (the target). The input intervals may fall between query-side syntenic intervals, in which case, the search interval also will fall inbetween target syntenic intervals.

The output is a table with the following fields:

  1. query interval name (e.g. AT1G20300)
  2. query chromosome name
  3. query start position
  4. query stop position
  5. target chromosome name
  6. search interval start position on target chromsome
  7. search interval stop position on target chromsome
  8. search interval strand ('+' / '-')
  9. score
  10. contiguous set id
  11. lower flag
  12. 0 lower bound is inside a syntenic interval
  13. 1 lower bound is between intervals in a contiguous set
  14. 2 lower bound does not overlap the contiguous set
  15. 3 lower bound is beyond any syntenic interval (near end of scaffold)
  16. upper flag - see lower flag
  17. between flag - 0 if query overlaps a syntenic interval, 1 otherwise

Filter

filter removes links that disagree with the synteny map.

The filter function takes two syntenic maps and finds the congruent links. Given two syntenic maps, A and B, the query-side intervals in A are mapped to target side search intervals using the syntenic map B. Then the target-side intervals in A that overlap the predicted search intervals are printed.

The most obvious usage case takes as input the results of BLASTing a sequence against a genome. Often such searches will have many hits in a swooping e-value gradient. The highest scoring hit may not be the orthologous one (for example, a weakly similar, but long hit may outscore the nearly identical, but truncated, true ortholog). synder filter will find the hits that are concordant with genomic context, reducing possibly thousands of hits to only a few.

Map

Find all query-side syntenic blocks that overlap the input interval. Then map these blocks to the target-side and print the result. If an input interval overlaps no query bock, the flanks are printed.

The output will have the columns:o 1. input interval name (e.g. AT1G01010) 2. target contig name (e.g. Chr1) 3. target start position 4. target stop position 5. missing flag, 0 if input overlaps no block, 1 otherwise

Count

Like map except it counts the number of overlaps, rather than printing them. The output is a TAB-delimited list of sequence names and counts.

Dump

Builds the internal synteny datastructure and prints the results.

For example, given the file

que   100    200    tar   1100   1200   100   +
que   1100   1800   tar   1500   1900   100   +
que   1400   1700   tar   1600   2000   100   +
que   1200   1600   tar   1700   2100   100   +
que   1300   1900   tar   1800   2200   100   +
R> synder::dump('que-tar.syn', trans='p')
que  100   200   tar  1100  1200  101.000000  +  0
que  1100  1900  tar  1500  2200  700.084964  +  0

In the above example, dump shows that blocks 2-5 are merged (since they are doubly-overlapping) and shows the results of the score transformations. It is this dumped synteny map that would have been used in any filter or search operations.

Also note the addition of a 9th column. This column specifies that contiguous set. In this case, all blocks, after merging, are in the same set.

Inputs:

The synteny map must be TAB-delimited, with no header, and must have the following fields:

* score can be any numeric value, it will be transformed as specified by the -x option

The target and query genome lengths files must be TAB-delimited with columns: ,

Terms:

Algorithm:

Execution order for the synder search command

  1. load synteny map
  2. transform scores
  3. merge doubly-overlapping blocks
  4. determine contiguous sets
  5. find query-side, contextually-relevant blocks for each input sequence
  6. map to overlapping contiguous sets
  7. calculate score for input sequence relative to each contiguous set
  8. calculate search interval relative to each contiguous set

1. Load Synteny Map

The input synteny map must have the following columns:

  1. qseqid query contig id (e.g. Chr1)
  2. qstart query interval start
  3. qstop query interval stop
  4. sseqid target contig id
  5. sstart target interval start
  6. sstop target interval stop
  7. score score of the syntenic match
  8. strand relative orientation

2. Transform scores

Internally, scores for each block are assumed to be additive. However, there are many possible forms of input scores. Low numbers may represent good matches (e.g. e-value) or high numbers (e.g. bit scores). Scores may be additive (bitscores) or averaged (percent similarity).

For this reason, the user must specify a transform for the score column (column 7 of synteny map). This transform can be one of the following:

  S' = S            default, no transformation)
  S' = L * S        transform from score densities
  S' = L * S / 100  transform from percent identity
  S' = -log(S)      transform from e-values or p-values

Where S is input score, S' is the transformed score, and L interval length

3. Merge doubly-overlapping blocks

If two blocks overlap on both the query and target side, they should be merged. In a perfect synteny map, this would never occur. But in practice, it is common, and convolutes the formation of contiguous sets.

For example the following syntenic map

que 100 200 tar 100 200 100 +
que 300 400 tar 300 400 100 +
que 310 390 tar 310 390 100 +
que 500 600 tar 500 600 100 +

would be separated into two contiguous sets with the bounds (100, 400) and (310, 600).

Now if a input intervals at (que, 250, 450) is searched, two search intervals with the same bounds will obtain: (200, 500) and (200, 500). Each will be flagged as unbound (e.g. an interval edge is between contiguous sets).

However, if the second and third blocks are merged, we obtain a single search interval flagged as bound on both ends.

In practice, many repetitive regions of the genome can be massively doubly-overlapping, resulting in many redundant search intervals. This throws of statistics and wastes time searching extra space.

To avoid such problems, synder merges any blocks that overlap on both the query and target sides.

The bounds of the merged blocks are simply the union.

But the scores also need to be merged. Originally, I used the equation:

 da (la - lo) + db (lb - lo) + lo (da + db) / 2            E1

Where * da and db are the score densities of blocks a and b * la, lb and lo are the lengths of a, b, and their overlap

E1 is problematic for two reasons. First, it doesn't really make sense to average the overlap scores. Second, iterative pairwise averaging is assymetric, giving higher weight to the blocks merged later.

I replaced this approach with taking the max of the overlap scores:

 da (la - lo) + db (lb - lo) + lo * max(da, db)            E2

This fixes the first problem.

The second problem is a bit trickier, since it would require tracking sub-intervals. It would be a pain to implement and probably would have little effect on any real dataset. So I am content with an imperfect solution for now.

4. Determine contiguous sets

Build an interval adjacency matrix for the query context intervals and for the target context intervals. AND them together to get a block adjacency matrix. From this matrix, extract paths of adjacent blocks. Synder will merge any blocks that overlap on both genomes, this ensures there is a unique path through the adjacency matrix.

5. Find contextual blocks

Mapping query intervals to target intervals would be trivial for a perfectly linear map, e.g.

                                      [-----------]
=====      ============     ==========             ============
  |              |               |                       |  
  |              |               |                       |  
=====      ============     ==========   <--->     ============

Where <---> is the query interval and [----] the search interval

Synder reduces all blocks in the genome into into contiguous sets, which are non-overlapping sets of intervals where all blocks are adjacent.

6. Reduce to overlapping sets

Once all overlapping and flanking query-side blocks are identified.

7. Calculate scores relative to contiguous sets

Especially with the high $k$, it is important to be able to rank search intervals. Queries that heavily overlap elements of contiguous sets are more reliable than ones that fall inbetween. Likewise, queries in dense contiguous sets, should rank higher than those in sparse ones (all else being equal).

Input scores for syntenic blocks are additive (assuming the user entered the correct transformation).

s = 0
for i in [a..b]
    if i in any block in c AND i in q
        s += d / L
    else if i in any block in c
        s += d * e^(-r * (dist(q, i)))
where
    c := the contiguous set
    q := query interval
    a := contig lower bound
    b := contig upper bound
    d := query syntenic score
    L := query length
    dist := function calculating distance

So all blocks in the contiguous set contribute to the total score. The scores of blocks that do not overlap the query decay exponentially with distance from the query bound.

The score decay rate is controlled by the parameter $r$. A value of 0.001, the current default, indicates weight will fall to 0.5 by 1000 bases from the query. k=0 would give equal weight to all elements in the contiguous set, i.e. be more affected by context. A high value, e.g. $r=100$, would completely ignore context, basing score only on overlapping elements.

8. Calculate search interval relative to contiguous set

Exactly one search interval is created for each previously selected contiguous set.

The chromosome length file tells synder how long each chromosome is. Then if some input query is closer to the end than anything in the synteny map, the search interval bound can be set to the end of the chromosome, rather than infinity.

  1. Find input interval, i, context in the query genome. This context consists of all query intervals Sq that either overlap or flank the input.
  2. From Sq determine which contiguous sets ,Sc, the query overlaps or borders.
  3. For each contiguous set, c, in Sc, classify each bound of i as:
    1. anchored - if overlaps a member of c
    2. bound - if is inbetween two members of c
    3. unbound - if is more extreme than any member of the set, but is inbetween two contiguous sets
    4. extreme - No entry is found

The snapping rules are detailed below:

KEY:
|x--  --y| - bounds of the contiguous block; start==x, stop==y
a========b - a syntenic block with start == a and stop == b
  <---q    - the query interval, with stop == q (start doesn't matter)
a==b--c==d - query bounding blocks in the contiguous set
[===]      - a non-bounding block in the same contiguous set
 ...  F=== - nearest non-adjacent block ***ON THE TARGET***, F=start
     ^     - search interval bound

Possible snap positions (relative to query)
  |x...[===]-----a=======b-----c=======d-----[===]...y|  ...  F===
                 ^       ^     ^       ^     ^                ^

We always snap to one bound of the relevant block. If the bound falls between the blocks (i1, j1) and (i2, j2), it would be reasonable to set the search interval to (j1 + 1, i2 - 1), however, this results in negative lenghts when the blocks are adjacent. So instead we set such intervals to to (j1, i2).

Query bound is precedes both contiguous set bounds

         |x-----y|
      ...a=======b
         ^
  <---q

Query bound falls within a contiguous set element

  |x...a=======b...y|
               ^
       <--q

Query bound falls between elements of the contiguous set

  |x...a=======b-------c=======d...y|
                       ^
                 <--q

Query is beyond the bounds of the contiguous set

  Target side           A=======B      F===   1. map b-\>B   
                                |------^      2. map B->F
                                |             3. set F as SI bound
  Query side     |x...--a=======b| ...
                                <---q

Query is beyond anything in the synteny map

  Target side           A=======B      THE_END   
                                |------^      
                                |             1. map b->B   
  Query side     |x...--a=======b| ...        2. B is contig extremum
                                <---q         3. set SI bound to contig length


arendsee/synder documentation built on May 10, 2019, 1:26 p.m.