View source: R/computeRearrs.R
Detect and classify rearrangements along a focal genome relative to an ancestral genome reconstruction or an extant genome
computeRearrs(focalgenome, compgenome, doubled, remWgt = 0.05,
  splitnodes = TRUE, testlim = 100)
focalgenome: Data frame representing the focal genome, containing the mandatory columns
compgenome: Data frame representing the compared genome (e.g., an ancestral genome reconstruction, or an extant genome), with the first three columns
doubled: Logical. If
remWgt: A numeric value between
splitnodes: Logical. Split nodes into subnodes according to rearrangements that occurred one step further up the hierarchy during the rearrangement detection algorithm.
testlim: A positive integer specifying the maximum number of tests performed to detect markers that are part of complex rearrangements. A lower value can improve speed, but might lead to less optimal results. Set to
focalgenome must contain the column $marker, a vector of either characters or integers with unique ortholog IDs that can be matched to the values in the $marker column of compgenome. Values can be NA for markers that have no ortholog. $scaff must be a character vector giving the name of the focal genome segment (i.e., chromosome or scaffold) of origin of each marker. $start and $end must be numeric vectors giving the location of each marker on its focal genome segment. $strand must be a vector of "+" and "-" characters giving the reading direction of each marker. Additional columns are ignored and may store custom information, such as marker names. Markers need to be ordered by their map position within each focal genome segment, for example by running the orderGenomeMap function. See Examples below for the focalgenome format.
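The column requirements above can be illustrated with a small hand-made data frame. This is only a sketch in base R: the marker IDs, segment names, and coordinates are invented, check_focal is a hypothetical helper (not a function of the package), and the call to order() merely stands in for the ordering step that orderGenomeMap is described as performing.

```r
# Hypothetical toy focal genome; all values are invented for illustration.
focal <- data.frame(
  marker = c(3L, 1L, 2L, NA, 4L),          # NA: marker without ortholog
  scaff  = c("chr1", "chr1", "chr1", "chr2", "chr2"),
  start  = c(500, 100, 300, 50, 400),
  end    = c(600, 200, 400, 150, 480),
  strand = c("+", "+", "-", "+", "-"),
  stringsAsFactors = FALSE
)

# Order markers by map position within each focal segment
# (base R stand-in for the ordering step; not the package's own code).
focal <- focal[order(focal$scaff, focal$start), ]
rownames(focal) <- NULL

# Hypothetical helper that checks the documented column requirements.
check_focal <- function(x) {
  required <- c("marker", "scaff", "start", "end", "strand")
  ids <- x$marker[!is.na(x$marker)]
  stopifnot(
    all(required %in% names(x)),
    !anyDuplicated(ids),                    # ortholog IDs must be unique
    is.character(x$scaff),
    is.numeric(x$start), is.numeric(x$end),
    all(x$strand %in% c("+", "-")),
    # markers must be sorted by position within each segment
    all(tapply(x$start, x$scaff, function(s) !is.unsorted(s)))
  )
  invisible(TRUE)
}

check_focal(focal)
```

Running such a check before calling computeRearrs can catch unsorted or duplicated markers early; it does not replace any validation the function may perform itself.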
compgenome must contain the column $marker, a vector of either characters or integers with unique ortholog IDs that can be matched to the values in the $marker column of focalgenome. $orientation must be a vector of "+" and "-" characters giving the reading direction of each marker in the compared genome. If doubled = FALSE, all values should be "+". $car must be an integer vector giving the location of each marker on its compared genome segment (i.e., Contiguous Ancestral Region, or CAR), analogous to contiguous sets of genetic markers on a chromosome, scaffold, or contig. Each CAR is represented by a PQ-tree (Booth & Lueker 1976; Chauve & Tannier 2008). The PQ structure of each CAR is defined by additional columns (at least two) that have to alternate between character vectors of node type ("P", "Q", or NA) in even columns, and integer vectors of node elements in odd columns (missing values are permitted past the fifth column). Every set of node type and node element column reflects the hierarchical structure of each PQ-tree, with the rightmost columns representing the lowest level of the hierarchy. P-nodes contain contiguous markers and/or nodes in arbitrary order, while Q-nodes contain contiguous markers and/or nodes in fixed order (including their reversal). For additional details on PQ-trees see Booth & Lueker 1976, Chauve & Tannier 2008, or the package vignette. See Examples below for the compgenome format.
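The column layout described above can be sketched as follows. The data frame is a hand-made illustration only: the names type1, elem1, type2, and elem2 are assumptions, the values are invented, and check_comp_layout is a hypothetical helper that tests nothing beyond the layout rules stated in the text. It does not verify that the columns encode a valid PQ-tree.

```r
# Hypothetical compared genome: one CAR whose four markers sit in a single
# Q-node; the second node level is unused (NA), which the text permits
# past the fifth column. Column names after the third are invented.
comp <- data.frame(
  marker      = c(1L, 2L, 3L, 4L),
  orientation = c("+", "+", "-", "+"),
  car         = c(1L, 1L, 1L, 1L),
  type1       = c("Q", "Q", "Q", "Q"),
  elem1       = c(1L, 2L, 3L, 4L),
  type2       = NA_character_,
  elem2       = NA_integer_,
  stringsAsFactors = FALSE
)

# Hypothetical helper: checks only the documented layout rules.
check_comp_layout <- function(x, doubled = TRUE) {
  stopifnot(
    all(c("marker", "orientation", "car") %in% names(x)),
    ncol(x) >= 5, (ncol(x) - 3) %% 2 == 0,   # type/element columns in pairs
    !anyDuplicated(x$marker),
    all(x$orientation %in% c("+", "-")),
    is.integer(x$car)
  )
  type_cols <- seq(4, ncol(x), by = 2)       # even columns: node type
  elem_cols <- seq(5, ncol(x), by = 2)       # odd columns: node elements
  stopifnot(
    all(vapply(x[type_cols],
               function(v) all(v %in% c("P", "Q", NA)), logical(1))),
    all(vapply(x[elem_cols], is.integer, logical(1)))
  )
  # Without orientation information all markers should be "+".
  if (!doubled) stopifnot(all(x$orientation == "+"))
  invisible(TRUE)
}

check_comp_layout(comp, doubled = TRUE)
```

In practice the compgenome data frame would be generated with convertPQtree or genome2PQtree rather than written by hand.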
doubled = TRUE indicates that orientation information for the markers in the ancestral genome reconstruction is available. This is the case, for example, when the genome was reconstructed with the software ANGES (Jones et al. 2012) using the option markers_doubled 1. Orientation information facilitates detecting and classifying rearrangements as inversions or syntenic moves, and can help determine whether PQ-tree nodes are aligned to the focal genome in ascending (i.e., standard) or descending (i.e., inverted) direction.
remWgt provides the tagging weight for rearrangements consisting of alternative sets of markers, either of which may have caused an apparent nonsyntenic or syntenic move (e.g., a set of markers may have moved upstream, or alternatively another set of markers may have moved downstream). The set of markers that is more parsimonious to have changed position relative to the other set receives tag values equal to 1 - remWgt, while the alternative set of markers receives tag values equal to remWgt. Setting this argument to a non-default value may require adjusting the remThld argument in the genomeImagePlot and genomeRearrPlot functions accordingly.
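The weighting can be made concrete with a few lines of arithmetic. The marker names and the split into two sets are invented, and the cutoff is purely illustrative; it is not a statement about how remThld is applied by the plotting functions.

```r
remWgt <- 0.05  # default value from the usage above

# Tag values for the two alternative sets of markers.
tag_likely      <- 1 - remWgt  # set more parsimonious to have moved
tag_alternative <- remWgt      # alternative set

# Hypothetical tag column for six markers: m1-m2 form the more
# parsimonious set, m4-m6 the alternative set, m3 is not rearranged.
tags <- c(m1 = tag_likely, m2 = tag_likely, m3 = 0,
          m4 = tag_alternative, m5 = tag_alternative, m6 = tag_alternative)

# An illustrative cutoff between the two weights separates the sets.
cutoff <- 0.5
names(tags)[tags > cutoff]
```

If remWgt were raised above such a cutoff, both sets would pass it, which is one way to see why a threshold may need adjusting together with remWgt.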
A list of matrices that store data on different classes of rearrangements and additional information on the structure of each PQ-tree and its alignment to the focal genome. Markers are in rows, and the row names of each matrix correspond to the IDs in the $marker column of the focalgenome and compgenome data frames. The matrices contain all markers common to focalgenome and compgenome, and are ordered by their position in focalgenome.
The list elements $NM1, $NM2, $SM, and $IV are numeric matrices that store identified rearrangements. $NM1 stores TransLocations between CARs Between focal Segments; $NM2 stores TransLocations between CARs Within focal Segments; $SM stores TransLocations within CARs Within focal Segments; $IV stores InVersions within CARs within focal segments. See the package vignette for a detailed explanation of these classes of rearrangements.
Each rearrangement is represented by a separate column. Except for NM1, whose rearrangements are identified across all focal segments, columns for individual focal segments are joined across rows to save space (i.e., for NM2, SM, and IV, which are identified within focal segments). To preserve the tabular format, these matrices are padded with zeros for focal segments with a non-maximal number of rearrangements, if necessary. If no rearrangements were detected for a certain class, the matrix has zero columns. Markers that are part of a rearrangement have a tag value >0 within their respective column. Tagged markers within a column are not necessarily consecutive, for example, when a rearrangement is split into several parts through an insertion of a different CAR, or when a rearrangement has an upstream and a downstream component (i.e., when alternative sets of markers may have caused an apparent nonsyntenic or syntenic move). Note that some columns in $NM2 or $SM may be duplicated for a particular focal segment due to the functioning of the underlying algorithm; although they correspond to the same rearrangement, these duplicated columns are nevertheless included for completeness.
For NM1, markers that are part of a class I nonsyntenic move have a value of 0.5 if none of the involved CAR fragments is a focal segment - CAR fragment best hit. Otherwise, markers that are part of the CAR fragment assigned as the focal segment - CAR fragment best hit have a value of 0, while markers that are part of any other (non-best hit) CAR fragment have a value of 1. For NM2 and SM, markers that are part of a rearrangement with an upstream and a downstream component have a value of 1 - remWgt (or remWgt) when they are part of the component that is more (or less) parsimonious to have changed position; if both components are equally parsimonious to have changed position, both have a value of 0.5; all other markers that are part of a rearrangement have a value of 1. For IV, markers that are part of an inversion have a value of 1.
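The tagging scheme can be explored on a mock matrix. The sketch below builds a stand-in for the $SM list element by hand (marker names and values are invented) and lists the markers tagged in each rearrangement column; with real output, the same indexing would apply to the matrices returned by computeRearrs.

```r
# Mock stand-in for $SM: six markers in rows, two rearrangement columns.
SM <- matrix(0, nrow = 6, ncol = 2,
             dimnames = list(paste0("m", 1:6), NULL))
SM[c("m1", "m2"), 1]       <- 0.95  # more parsimonious component (1 - remWgt)
SM[c("m4", "m5", "m6"), 1] <- 0.05  # alternative component (remWgt)
SM["m3", 2]                <- 1     # rearrangement without alternatives

# Markers with a tag value >0 in each rearrangement column.
tagged <- lapply(seq_len(ncol(SM)),
                 function(j) rownames(SM)[SM[, j] > 0])
tagged

# Number of rearrangement columns in which each marker is tagged.
rowSums(SM > 0)
```

Because columns are joined across focal segments and may be duplicated, the number of columns should not be read as the total number of rearrangements.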
The list elements $NM1bS, $NM1bE, $NM2bS, $NM2bE, $SMbS, $SMbE, $IVbS, and $IVbE are numeric matrices that tag markers that denote the start ($*bS) and end ($*bE) elements for the four classes of rearrangements (i.e., the markers adjacent to rearrangement breakpoints). Each rearrangement is represented by a separate column, but columns for individual focal segments are joined for all matrices across rows (including $NM1bS and $NM1bE) to save space. Tag values correspond to the ones in $NM1, $NM2, $SM, and $IV.
The list elements $nodeori, $blockori, $blockid, $premask, and $subnode are matrices that store information on the structure of each PQ-tree, its alignment to the focal genome, and internal data. The first column of each matrix corresponds to the CAR level, and the following columns correspond to the hierarchical structure of each PQ-tree, with information on the lowest level stored in the last column.
$nodeori is a numeric matrix that stores the alignment direction of each Q-node to the focal genome, with 1 indicating ascending (i.e., standard), and -1 descending (i.e., inverted) alignment. Q-nodes that have no alignment direction (e.g., single-marker nodes) have a value of 9, and P-nodes are NA.
$blockori is a numeric matrix that stores the orientation of each synteny block, with 1 indicating ascending (i.e., standard), and -1 descending (i.e., inverted) orientation. Blocks that have no orientation (e.g., blocks containing a single marker, or a single PQ-tree branch) have a value of 9, and blocks that are part of P-nodes are NA.
$blockid is a character matrix that stores the ID of each synteny block within its node. For Q-nodes, IDs are consecutive and start at 1, separately for each node and each hierarchy level, and reflect the order of synteny blocks. Block IDs with ".1" or ".2" suffixes (in arbitrary order) indicate blocks that were subject to an additional subdivision step. For P-nodes, IDs are 0 unless the node is part of a rearrangement, in which case IDs indicate different rearrangements, but not block order.
$premask and $subnode are numeric matrices that store internal data used for the alignment and identification of rearrangements. Integers >0 in $subnode indicate subdivisions of the corresponding PQ-tree due to nonsyntenic or syntenic moves. All subdivisions have been searched separately for rearrangements one step further down the hierarchy. This is of main relevance when splitnodes = TRUE.
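The numeric codes used in $nodeori can be translated into labels for inspection. The matrix below is a mock stand-in with invented values and invented column names, and decode_nodeori is a hypothetical helper that applies only the coding described above (1, -1, 9, NA).

```r
# Mock stand-in for $nodeori: first column is the CAR level, second a
# lower hierarchy level. matrix() fills column by column.
nodeori <- matrix(c( 1,  1, -1, -1,
                     9, NA,  1,  1),
                  nrow = 4,
                  dimnames = list(paste0("m", 1:4), c("car", "level1")))

# Hypothetical helper translating the documented codes into labels.
decode_nodeori <- function(x) {
  labels <- c("1" = "ascending", "-1" = "descending", "9" = "no direction")
  out <- unname(labels[as.character(x)])
  out[is.na(x)] <- "P-node"            # P-nodes are stored as NA
  array(out, dim = dim(x), dimnames = dimnames(x))
}

decoded <- decode_nodeori(nodeori)
decoded
```

The same approach would work for $blockori, where NA instead marks blocks that are part of P-nodes.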
The returned data can be visualized with the genomeImagePlot function, or summarized and visualized with the summarizeBlocks and genomeRearrPlot functions.
The returned rearrangements can be filtered by size with the filterRearrs function. Breakpoint coordinates of rearrangements can be extracted with the getBreakpoints function.
A detailed description of the implemented algorithm can be found in the Supplementary Information of the manuscript associated with the package.
Booth, K.S. & Lueker, G.S. (1976). Testing for the consecutive ones property, interval graphs, and graph planarity using PQ-Tree algorithms. Journal of Computer and System Sciences, 13, 335–379. doi: 10.1016/S0022-0000(76)80045-1.
Chauve, C. & Tannier, E. (2008). A methodological framework for the reconstruction of contiguous regions of ancestral genomes and its application to mammalian genomes. PLOS Computational Biology, 4, e1000234. doi: 10.1371/journal.pcbi.1000234.
Jones, B.R. et al. (2012). ANGES: reconstructing ANcestral GEnomeS maps. Bioinformatics, 28, 2388–2390. doi: 10.1093/bioinformatics/bts457.
filterRearrs, genomeImagePlot, getBreakpoints, summarizeBlocks, genomeRearrPlot, summarizeRearrs; orderGenomeMap to order the focalgenome data frame; convertPQtree or genome2PQtree to generate the compgenome data frame.
computeRearrs(TOY24_focalgenome, TOY24_compgenome, doubled = TRUE)

## Not run:
## focalgenome format:
TOY24_focalgenome

## compgenome format:
TOY24_compgenome

## End(Not run)