summarizeBlocks: Summarize Synteny Blocks
In dorolin/rearrvisr: Detect, Classify, and Visualize Genome Rearrangements


For each synteny block, summarize rearrangements and information on the alignment between the focal genome and the compared genome.

summarizeBlocks(SYNT, focalgenome, compgenome, ordfocal)

`SYNT`	A list of matrices that store data on different classes of rearrangements and additional information. `SYNT` must have been generated with the `computeRearrs` function (optionally filtered with the `filterRearrs` function).
`focalgenome`	Data frame representing the focal genome, containing the mandatory columns `$marker`, `$scaff`, `$start`, `$end`, and `$strand`, and optional further columns. Markers need to be ordered by their map position.
`compgenome`	Data frame representing the compared genome (e.g., an ancestral genome reconstruction, or an extant genome), with the first three columns `$marker`, `$orientation`, and `$car`, followed by columns alternating node type and node element. Markers need to be ordered by their node elements. `compgenome` must be the same data frame that was used to generate the list `SYNT` with the `computeRearrs` function.
`ordfocal`	Character vector with the IDs of the focal genome segments that will be summarized. The IDs must match (a subset of) the IDs in `focalgenome$scaff`.

focalgenome must contain the column $marker, a vector of either characters or integers with unique ortholog IDs that can be matched to the values in the rownames of SYNT and the $marker column of compgenome. Values can be NA for markers that have no ortholog. $scaff must be a character vector giving the name of the focal genome segment (i.e., chromosome or scaffold) of origin of each marker. $start and $end must be numeric vectors giving the location of each marker on its focal genome segment. $strand must be a vector of "+" and "-" characters giving the reading direction of each marker. Additional columns are ignored and may store custom information, such as marker names. Markers need to be ordered by their map position within each focal genome segment, for example by running the orderGenomeMap function. focalgenome may contain additional rows that were absent when running the computeRearrs function. However, all markers present in SYNT need to be contained in focalgenome, with the subset of shared markers being in the same order.
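The requirements above can be checked before calling summarizeBlocks. The following is a minimal base-R sketch, not part of the package: the toy data frame and its values are invented for illustration, and ordering by $start within each segment is assumed to stand in for "ordered by map position".

```r
# Toy focal genome; all values are invented for illustration.
focalgenome <- data.frame(
  marker = c(101L, 102L, NA, 104L, 105L),   # NA: marker without ortholog
  scaff  = c("1", "1", "1", "2", "2"),
  start  = c(100, 500, 900, 50, 400),
  end    = c(300, 700, 1200, 250, 650),
  strand = c("+", "-", "+", "+", "-"),
  stringsAsFactors = FALSE
)

# All mandatory columns are present.
required <- c("marker", "scaff", "start", "end", "strand")
stopifnot(all(required %in% names(focalgenome)))

# Ortholog IDs are unique once NA values are set aside.
ids <- focalgenome$marker[!is.na(focalgenome$marker)]
stopifnot(!anyDuplicated(ids))

# Column types and strand symbols match the description.
stopifnot(is.character(focalgenome$scaff),
          is.numeric(focalgenome$start),
          is.numeric(focalgenome$end),
          all(focalgenome$strand %in% c("+", "-")))

# Assumption: start positions increase within each segment.
sorted_within <- tapply(focalgenome$start, focalgenome$scaff,
                        function(x) !is.unsorted(x))
stopifnot(all(sorted_within))

# Segment IDs to summarize; they must occur in focalgenome$scaff.
ordfocal <- unique(focalgenome$scaff)
stopifnot(all(ordfocal %in% focalgenome$scaff))
```

Deriving ordfocal from focalgenome$scaff, as done here, guarantees the match; a hand-written vector such as c("1","2","3") should pass the same final check.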

A list of lists that summarizes the alignment between the focal genome and each PQ-tree, and records whether synteny blocks are part of different classes of rearrangements. The top-level list elements are focal genome segments, and the lower-level list elements contain information on synteny blocks and rearrangements for each focal genome segment. For details on PQ-trees see the description of the "compgenome" class in the Details section of the checkInfile function, Booth & Lueker 1976, Chauve & Tannier 2008, or the package vignette.

The names of the top-level list elements correspond to the strings in ordfocal. Each list element is itself a list containing the data frame $blocks and five numeric matrices $NM1, $NM2, $SM, $IV, and $IVsm, described below. In all six list elements, each synteny block is represented by a row. Note that separate blocks are also generated when the hierarchical structure of the underlying PQ-tree changes, so not every separate row reflects a rearrangement.

$blocks contains information on the alignment and structure of each PQ-tree. The columns $blocks$start and $blocks$end give the start and end positions of the synteny block in SYNT (positions start at 1 separately for each focal genome segment). $blocks$markerS and $blocks$markerE give the marker IDs of the first and last marker per block. $blocks$car gives the ID of the CAR. Nine columns per hierarchy level describe the structure of each PQ-tree and its alignment to the focal genome. Hierarchy levels of the PQ-trees are indicated by suffixes {1, 2, ...}. $blocks$type gives the node type. $blocks$elemS and $blocks$elemE give the first and last ID of the node elements per block. They correspond to the IDs in the odd columns of compgenome (note that some IDs within blocks or in-between might be missing when markers in the compared genome are absent from the focal genome). $blocks$node indicates whether the block contains PQ-tree nodes (value is 1) or only leaf elements (value is 0). The columns $blocks$nodeori, $blocks$subnode, $blocks$blockid, $blocks$blockori, and $blocks$premask summarize for each block the values in the list elements of SYNT with the corresponding names (described in the Value section in the documentation of the computeRearrs function). The column $blocks$nodeori1, for example, summarizes for each block the values in the second column (i.e., the first node level) of SYNT$nodeori.
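Because the nine per-level columns share a numeric suffix, the columns of one hierarchy level can be picked out by name. The sketch below uses a mock $blocks data frame whose column names follow the description above; every value in it is invented, and the helper level_columns is hypothetical, not a package function.

```r
# Mock $blocks data frame with one hierarchy level; values are invented
# and do not reflect the real coding used by computeRearrs.
blocks <- data.frame(
  start = c(1, 4), end = c(3, 6),
  markerS = c(101, 104), markerE = c(103, 106),
  car = c(1, 1),
  type1 = c("Q", "Q"), elemS1 = c(1, 4), elemE1 = c(3, 6),
  node1 = c(0, 0), nodeori1 = c(1, 1), subnode1 = c(0, 0),
  blockid1 = c("1", "2"), blockori1 = c(1, 1), premask1 = c(0, 0),
  stringsAsFactors = FALSE
)

# Hypothetical helper: names of the per-level columns present for `level`.
level_columns <- function(blocks, level) {
  stems <- c("type", "elemS", "elemE", "node", "nodeori",
             "subnode", "blockid", "blockori", "premask")
  intersect(paste0(stems, level), names(blocks))
}

# Alignment columns plus everything describing hierarchy level 1.
level1 <- blocks[, c("start", "end", "car", level_columns(blocks, 1))]
```

Matching on full column names rather than on a trailing digit avoids confusing level 1 with a level whose suffix merely ends in 1.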

The numeric matrices $NM1, $NM2, $SM, $IV, and $IVsm indicate whether blocks are part of different classes of rearrangements. $NM1 stores TransLocations between CARs Between focal Segments; $NM2 stores TransLocations between CARs Within focal Segments; $SM stores TransLocations within CARs Within focal Segments; $IV and $IVsm store InVersions within CARs within focal segments. In $IV, blocks that are part of a multi-marker inversion are tagged with 1, while in $IVsm, integers >0 indicate the positions of single-marker inversions (i.e., markers with switched orientation) within their blocks. Each rearrangement is represented by a separate column, and blocks that are part of a rearrangement have a tag value of >0. Note that some columns in $NM2 or $SM may be duplicated due to the functioning of the underlying algorithm in computeRearrs; although corresponding to the same rearrangement, these duplicated columns are nevertheless included for completeness. By default these columns will not be visualized with the genomeRearrPlot function. If no rearrangements were detected for a certain class, the matrix has zero columns. See the package vignette or the Value section in the documentation of the computeRearrs function for details on the meaning of different tag values in these matrices. Note that if SYNT has been filtered with the filterRearrs function, only the above matrices will be affected, while the information in $blocks will remain unchanged.
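Since each rearrangement occupies one column and tagged blocks have values >0, simple per-class counts follow from the matrix dimensions. The sketch below works on a mock list for a single focal segment with three blocks; the matrices and their values are invented, and duplicated columns are assumed to be exact copies of each other.

```r
# Mock summary for one focal genome segment (three blocks); values invented.
seg <- list(
  NM1  = matrix(numeric(0), nrow = 3, ncol = 0),   # class without rearrangements
  NM2  = matrix(c(0, 1, 0), nrow = 3),
  SM   = matrix(c(1, 0, 0,  1, 0, 0), nrow = 3),   # second column duplicates the first
  IV   = matrix(c(0, 1, 1), nrow = 3),
  IVsm = matrix(numeric(0), nrow = 3, ncol = 0)
)
classes <- c("NM1", "NM2", "SM", "IV", "IVsm")

# Number of columns (one per recorded rearrangement) in each class.
n_cols <- vapply(seg[classes], ncol, integer(1))

# Number of distinct columns, assuming duplicates are exact copies.
n_distinct <- vapply(seg[classes], function(m) {
  if (ncol(m) == 0L) 0L else sum(!duplicated(t(m)))
}, integer(1))

# Blocks (rows) x classes: TRUE where a block carries any tag > 0.
tagged <- vapply(seg[classes], function(m) rowSums(m > 0) > 0, logical(3))
```

For real output, the same calls can be applied to each element of the list returned by summarizeBlocks, for example with lapply over the focal genome segments.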

The returned data can be visualized with the genomeRearrPlot function.

Booth, K.S. & Lueker, G.S. (1976). Testing for the consecutive ones property, interval graphs, and graph planarity using PQ-Tree algorithms. Journal of Computer and System Sciences, 13, 335–379. doi: 10.1016/S0022-0000(76)80045-1.

Chauve, C. & Tannier, E. (2008). A methodological framework for the reconstruction of contiguous regions of ancestral genomes and its application to mammalian genomes. PLOS Computational Biology, 4, e1000234. doi: 10.1371/journal.pcbi.1000234.

checkInfile, computeRearrs, filterRearrs, genomeRearrPlot.

SYNT <- computeRearrs(TOY24_focalgenome, TOY24_compgenome, doubled = TRUE)

BLOCKS <- summarizeBlocks(SYNT, TOY24_focalgenome, TOY24_compgenome,
                          c("1","2","3"))

## Not run: 

## show summary for first focal genome segment
BLOCKS[[1]]

## End(Not run)