computeRearrs: Compute Rearrangements
In dorolin/rearrvisr: Detect, Classify, and Visualize Genome Rearrangements


Detect and classify rearrangements along a focal genome relative to an ancestral genome reconstruction or an extant genome.

computeRearrs(focalgenome, compgenome, doubled, remWgt = 0.05, splitnodes = TRUE, testlim = 100)

`focalgenome`	Data frame representing the focal genome, containing the mandatory columns `$marker`, `$scaff`, `$start`, `$end`, and `$strand`, and optional further columns. Markers need to be ordered by their map position.
`compgenome`	Data frame representing the compared genome (e.g., an ancestral genome reconstruction, or an extant genome), with the first three columns `$marker`, `$orientation`, and `$car`, followed by columns that alternate between node type and node element. Markers need to be ordered by their node elements.
`doubled`	Logical. If `TRUE`, markers in the ancestral genome reconstruction contain information about their orientation.
`remWgt`	A numeric value between `0` (inclusive) and `0.5` (exclusive). Sets the tagging weight for the component of a rearrangement that is less parsimonious to have changed position relative to the alternative component to `remWgt`, and that of the alternative component to `1 - remWgt`.
`splitnodes`	Logical. Split nodes into subnodes according to rearrangements that occurred one step further up the hierarchy during the rearrangement detection algorithm. `splitnodes = TRUE` prevents the same rearrangement from receiving tags across multiple levels of the PQ-tree hierarchy.
`testlim`	A positive integer specifying the maximum number of tests performed to detect markers that are part of complex rearrangements. A lower value can improve speed, but might lead to less optimal results. Set to `Inf` for exhaustive testing (not recommended for highly rearranged genomes).

focalgenome must contain the column $marker, a vector of either characters or integers with unique ortholog IDs that can be matched to the values in the $marker column of compgenome. Values can be NA for markers that have no ortholog. $scaff must be a character vector giving the name of the focal genome segment (i.e., chromosome or scaffold) of origin of each marker. $start and $end must be numeric vectors giving the location of each marker on its focal genome segment. $strand must be a vector of "+" and "-" characters giving the reading direction of each marker. Additional columns are ignored and may store custom information, such as marker names. Markers need to be ordered by their map position within each focal genome segment, for example by running the orderGenomeMap function. See Examples below for the focalgenome format.
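The layout described above can be sketched in base R. The marker IDs, segment names, and coordinates below are invented for illustration, and the ordering step uses base `order()` as a stand-in rather than the package's orderGenomeMap function.

```r
# Hypothetical toy data: all values are invented for illustration only
focal <- data.frame(
  marker = c(3L, 1L, NA, 2L, 4L),          # NA: marker without ortholog
  scaff  = c("chr1", "chr1", "chr1", "chr2", "chr2"),
  start  = c(5000, 100, 2500, 300, 9000),
  end    = c(5900, 950, 3100, 1200, 9800),
  strand = c("+", "-", "+", "+", "-"),
  stringsAsFactors = FALSE
)

# Order markers by map position within each focal genome segment
focal <- focal[order(focal$scaff, focal$start, focal$end), ]
rownames(focal) <- NULL

# Basic format checks mirroring the requirements stated in the text
ids <- focal$marker[!is.na(focal$marker)]
stopifnot(
  !anyDuplicated(ids),                      # ortholog IDs must be unique
  is.character(focal$scaff),
  is.numeric(focal$start), is.numeric(focal$end),
  all(focal$strand %in% c("+", "-"))
)
```

Extra columns, such as marker names, could be appended to this data frame without affecting the mandatory ones.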

compgenome must contain the column $marker, a vector of either characters or integers with unique ortholog IDs that can be matched to the values in the $marker column of focalgenome. $orientation must be a vector of "+" and "-" characters giving the reading direction of each marker in the compared genome. If doubled = FALSE, all values should be "+". $car must be an integer vector giving the location of each marker on its compared genome segment (i.e., Contiguous Ancestral Region, or CAR), analogous to contiguous sets of genetic markers on a chromosome, scaffold, or contig. Each CAR is represented by a PQ-tree (Booth & Lueker 1976; Chauve & Tannier 2008). The PQ structure of each CAR is defined by additional columns (at least two) that have to alternate between character vectors of node type ("P", "Q", or NA) in even columns, and integer vectors of node elements in odd columns (missing values are permitted past the fifth column). Every set of node type and node element column reflects the hierarchical structure of each PQ-tree, with the rightmost columns representing the lowest level of the hierarchy. P-nodes contain contiguous markers and/or nodes in arbitrary order, while Q-nodes contain contiguous markers and/or nodes in fixed order (including their reversal). For additional details on PQ-trees see Booth & Lueker 1976, Chauve & Tannier 2008, or the package vignette. See Examples below for the compgenome format.
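As a minimal sketch of the column layout described above, the following base R code builds a single CAR whose only node is a Q-node containing four markers. The data values and the column names `type1` and `elem1` are assumptions made for illustration; the code has not been checked against the package's own input validation, and deeper PQ-tree levels would add further pairs of node type and node element columns.

```r
# Hypothetical toy data: one CAR with a single Q-node holding four markers
comp <- data.frame(
  marker      = c(1L, 2L, 3L, 4L),
  orientation = c("+", "+", "-", "+"),
  car         = c(1L, 1L, 1L, 1L),
  type1       = c("Q", "Q", "Q", "Q"),   # node type (4th column, assumed name)
  elem1       = c(1L, 2L, 3L, 4L),       # node element (5th column, assumed name)
  stringsAsFactors = FALSE
)

# Order markers by CAR and node element
comp <- comp[order(comp$car, comp$elem1), ]

# Structural checks: node type in even columns, node element in odd columns
type_cols <- seq(4, ncol(comp), by = 2)
elem_cols <- seq(5, ncol(comp), by = 2)
stopifnot(
  identical(names(comp)[1:3], c("marker", "orientation", "car")),
  length(type_cols) == length(elem_cols),
  all(vapply(comp[type_cols],
             function(x) all(x %in% c("P", "Q", NA)), logical(1))),
  all(vapply(comp[elem_cols], is.integer, logical(1)))
)
```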

doubled = TRUE indicates that orientation information for the markers in the ancestral genome reconstruction is available. (This is the case, for example, when the genome was reconstructed with the software ANGES, Jones et al. 2012, using the option markers_doubled 1.) Orientation information facilitates detecting and classifying rearrangements as inversions or syntenic moves, and can help determine whether PQ-tree nodes are aligned to the focal genome in ascending (i.e., standard) or descending (i.e., inverted) direction.

remWgt provides the tagging weight for rearrangements consisting of alternative sets of markers, either of which may have caused an apparent nonsyntenic or syntenic move (e.g., a set of markers may have moved upstream, or alternatively another set of markers may have moved downstream). The set of markers that is more parsimonious to have changed position relative to the other set receives tag values equal to 1 - remWgt, while the alternative set of markers receives tag values equal to remWgt. Setting this argument to a non-default value may require adjusting the remThld argument in the genomeImagePlot and genomeRearrPlot functions accordingly.
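The weighting rule can be illustrated with a few lines of R. The helper `componentTags` is hypothetical and not part of the package; it only reproduces the arithmetic stated above, together with the permitted range of remWgt.

```r
# Hypothetical helper (not part of the package): tag values assigned to the
# two alternative components of a move for a given remWgt
componentTags <- function(remWgt = 0.05) {
  stopifnot(is.numeric(remWgt), remWgt >= 0, remWgt < 0.5)
  c(more_parsimonious = 1 - remWgt,   # set more likely to have moved
    less_parsimonious = remWgt)       # alternative set
}

componentTags()      # default weighting
componentTags(0.2)   # a less asymmetric weighting
```

Because remWgt is restricted to values below 0.5, the more parsimonious component always receives the larger tag value.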

A list of matrices that store data on different classes of rearrangements and additional information on the structure of each PQ-tree and its alignment to the focal genome. Markers are in rows, and the row names of each matrix correspond to the IDs in the $marker column of the focalgenome and compgenome data frames. The matrices contain all markers common to focalgenome and compgenome, and are ordered by their position in focalgenome.

The list elements $NM1, $NM2, $SM, and $IV are numeric matrices that store identified rearrangements. $NM1 stores translocations between CARs and between focal segments; $NM2 stores translocations between CARs but within focal segments; $SM stores translocations within CARs and within focal segments; $IV stores inversions within CARs and within focal segments. See the package vignette for a detailed explanation of these classes of rearrangements.

Each rearrangement is represented by a separate column. Except for NM1, which are identified across all focal segments, columns for individual focal segments are joined across rows to save space (i.e., for NM2, SM, and IV, which are identified within focal segments). To preserve the tabular format, these matrices are padded by zeros for focal segments with a non-maximal number of rearrangements, if necessary. If no rearrangements were detected for a certain class, the matrix has zero columns. Markers that are part of a rearrangement have a tag value of >0 within their respective column. Tagged markers within a column are not necessarily consecutive, for example, when a rearrangement is split into several parts through an insertion of a different CAR, or when a rearrangement has an upstream and a downstream component (i.e., when alternative sets of markers may have caused an apparent nonsyntenic or syntenic move). Note that some columns in $NM2 or $SM may be duplicated for a particular focal segment due to the functioning of the underlying algorithm; although corresponding to the same rearrangement, these duplicated columns are nevertheless included for completeness.

For NM1, markers that are part of a class I nonsyntenic move have a value of 0.5 if none of the involved CAR fragments is a focal segment - CAR fragment best hit. Otherwise, markers that are part of the CAR fragment assigned as focal segment - CAR fragment best hit have a value of 0, while markers that are part of any other, non-best hit CAR fragment have a value of 1. For NM2 and SM, markers that are part of a rearrangement with an upstream and a downstream component have a value of 1 - remWgt (or remWgt) when they are part of the component that is more (or less) parsimonious to have changed position; if both components are equally parsimonious to have changed position, both have a value of 0.5; all other markers that are part of a rearrangement have a value of 1. For IV, markers that are part of an inversion have a value of 1.
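To make the tag values concrete, the sketch below builds a mock matrix in the style of $SM and extracts the tagged markers per rearrangement. The matrix contents, the marker names, and the cutoff of 0.5 are invented for illustration; the real matrices are produced by computeRearrs.

```r
# Hypothetical mock of an $SM-style matrix: 6 markers (rows), 2 rearrangements
remWgt <- 0.05
SM <- matrix(0, nrow = 6, ncol = 2,
             dimnames = list(paste0("m", 1:6), NULL))
SM[c("m1", "m2"), 1] <- 1 - remWgt  # component more parsimonious to have moved
SM[c("m4", "m5"), 1] <- remWgt      # alternative component
SM["m6", 2] <- 1                    # rearrangement without alternative component

# Markers tagged (> 0) in each column, i.e., part of each rearrangement
tagged <- lapply(seq_len(ncol(SM)), function(j) rownames(SM)[SM[, j] > 0])

# Markers whose tag exceeds an illustrative cutoff of 0.5
strong <- lapply(seq_len(ncol(SM)), function(j) rownames(SM)[SM[, j] > 0.5])
```

In this mock, the first column shows that tagged markers within a rearrangement need not be consecutive: m3 lies between the two components and remains untagged.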

The list elements $NM1bS, $NM1bE, $NM2bS, $NM2bE, $SMbS, $SMbE, $IVbS, and $IVbE are numeric matrices that tag markers that denote the start ($*bS) and end ($*bE) elements for the four classes of rearrangements (i.e., the markers adjacent to rearrangement breakpoints). Each rearrangement is represented by a separate column, but columns for individual focal segments are joined for all matrices across rows (including $NM1bS and $NM1bE) to save space. Tag values correspond to the ones in $NM1, $NM2, $SM, and $IV.

The list elements $nodeori, $blockori, $blockid, $premask, and $subnode are matrices that store information on the structure of each PQ-tree, its alignment to the focal genome, and internal data. The first column of each matrix corresponds to the CAR level, and the following columns correspond to the hierarchical structure of each PQ-tree, with information on the lowest level stored in the last column.

$nodeori is a numeric matrix that stores the alignment direction of each Q-node to the focal genome, with 1 indicating ascending (i.e., standard), and -1 descending (i.e., inverted) alignment. Q-nodes that have no alignment direction (e.g., single-marker nodes) have a value of 9, and P-nodes are NA.

$blockori is a numeric matrix that stores the orientation of each synteny block, with 1 indicating ascending (i.e., standard), and -1 descending (i.e., inverted) orientation. Blocks that have no orientation (e.g., blocks containing a single marker, or a single PQ-tree branch) have a value of 9, and blocks that are part of P-nodes are NA.

$blockid is a character matrix that stores the ID of each synteny block within its node. For Q-nodes, IDs are consecutive and start at 1, separately for each node and each hierarchy level, and reflect the order of synteny blocks. Block IDs with ".1" or ".2" suffixes (in arbitrary order) indicate blocks that were subject to an additional subdivision step. For P-nodes, IDs are 0 unless the node is part of a rearrangement, in which case IDs indicate different rearrangements, but not block order.

$premask and $subnode are numeric matrices that store internal data used for the alignment and identification of rearrangements. Integers >0 in $subnode indicate subdivisions of the corresponding PQ-tree due to nonsyntenic or syntenic moves. All subdivisions have been searched separately for rearrangements one step further down the hierarchy. This is of main relevance when splitnodes = TRUE.
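The coding of $nodeori (1, -1, 9, and NA) can be translated into readable labels with a small helper. Both the mock matrix and the function `decodeOri` below are hypothetical and only illustrate the value scheme described above; they are not part of the package.

```r
# Hypothetical mock of a $nodeori-style matrix: the first column is the CAR
# level, the second column is one level further down the PQ-tree
nodeori <- matrix(c( 1,  1, -1, -1,
                     9, NA,  1,  9),
                  nrow = 4,
                  dimnames = list(paste0("m", 1:4), c("car", "level1")))

# Hypothetical helper: map the numeric codes to descriptive labels
decodeOri <- function(x) {
  lab <- character(length(x))
  lab[is.na(x)]            <- "P-node"
  lab[!is.na(x) & x == 1]  <- "ascending"
  lab[!is.na(x) & x == -1] <- "descending"
  lab[!is.na(x) & x == 9]  <- "no direction"
  dim(lab) <- dim(x)
  dimnames(lab) <- dimnames(x)
  lab
}

decoded <- decodeOri(nodeori)
```

The same coding of 1, -1, 9, and NA applies to $blockori, where NA marks blocks that are part of P-nodes.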

The returned data can be visualized with the genomeImagePlot function, or summarized and visualized with the summarizeBlocks and genomeRearrPlot functions. The returned rearrangements can be filtered by size with the filterRearrs function. Breakpoint coordinates of rearrangements can be extracted with the getBreakpoints function.

A detailed description of the implemented algorithm can be found in the Supplementary Information of the manuscript associated with the package.

Booth, K.S. & Lueker, G.S. (1976). Testing for the consecutive ones property, interval graphs, and graph planarity using PQ-Tree algorithms. Journal of Computer and System Sciences, 13, 335–379. doi: 10.1016/S0022-0000(76)80045-1.

Chauve, C. & Tannier, E. (2008). A methodological framework for the reconstruction of contiguous regions of ancestral genomes and its application to mammalian genomes. PLOS Computational Biology, 4, e1000234. doi: 10.1371/journal.pcbi.1000234.

Jones, B.R. et al. (2012). ANGES: reconstructing ANcestral GEnomeS maps. Bioinformatics, 28, 2388–2390. doi: 10.1093/bioinformatics/bts457.

filterRearrs, genomeImagePlot, getBreakpoints, summarizeBlocks, genomeRearrPlot, summarizeRearrs; orderGenomeMap to order the focalgenome data frame; convertPQtree or genome2PQtree to generate the compgenome data frame.

computeRearrs(TOY24_focalgenome, TOY24_compgenome, doubled = TRUE)

## Not run: 

## focalgenome format:
TOY24_focalgenome

## compgenome format:
TOY24_compgenome

## End(Not run)