nocite: | @ma06, @booth76, @consortium07, @ranz07 ...
knitr::opts_chunk$set( collapse = TRUE, comment = "#>"##, ##tidy.opts=list(width.cutoff=80), ##tidy=TRUE )
##devtools::load_all(".") library(rearrvisr)
\newpage
This vignette describes the R package rearrvisr
, which can be used to detect, classify, and visualize genome rearrangements along a focal genome. The package requires two input files: one that gives the positions of orthologous markers in an extant genome (the focal genome), and one that specifies the organization of orthologous markers in an ancestral genome reconstruction or a second extant genome (the compared genome).
This vignette provides a tutorial on how to use the package functions and interpret their output by applying them step by step to a Drosophila example data set. The vignette also describes the main steps of the implemented algorithm on a simple toy example, and outlines the methods used to create the Drosophila example data set from 12 publicly available genome assemblies. See ?rearrvisr
for a brief overview of the functions of the package. A detailed description of the rearrvisr
algorithm can be found in the Supplementary information of the associated manuscript.
The Drosophila example data set used in this vignette (MEL_markers
, SIM_markers
, YAK_markers
, and MSSYE_PQTREE_HEUR
) was generated from 12 publicly available genome assemblies downloaded from Ensemble Release 91 (http://dec2017.archive.ensembl.org; D. melanogaster) or Ensemble Metazoa Release 37 (http://oct2017-metazoa.ensembl.org; other genomes), using steps and software as described below. An additional example data set (TOY24_focalgenome
and TOY24_compgenome
) was created for illustrative purposes. A short description of these example data can be accessed with the commands ?MEL_markers
, ?SIM_markers
, ?YAK_markers
, ?MSSYE_PQTREE_HEUR
, ?TOY24_focalgenome
, and ?TOY24_compgenome
.
Two input files are required: a representation of the focal genome in focalgenome
, for which rearrangements will be identified, and a representation of the compared genome in compgenome
, which serves as reference for the arrangement of markers within genome segments. Deviations in the order of focal markers from the order of reference markers indicate rearrangements. Genome segments are chromosomes, scaffolds, or contiguous sets of genetic markers. Genomes do not need to be assembled at chromosome-level, however the detection of rearrangements is inherently limited for assemblies that are highly fragmented. Markers denote orthologous genomic regions that are unique within each genome, such as genes or synteny blocks. Orthologous markers require the same ID in both input files, and need to be ordered by their absolute or relative position within their genome segments. Further details are below.
The focal genome representation in focalgenome
is an extant genome in a standard linear (one-dimensional) genome map format that associates markers to genome segments and gives their map positions (e.g., in base pairs) within them. Genome segments of the focal genome are chromosomes or scaffolds and are referred to as focal genome segments or focal segments.
The focal genome map needs to be a data frame with the mandatory columns marker (orthologous marker ID; can be NA
for markers that have no ortholog), scaff (name of the focal segment where the marker is located), start (absolute start position of the marker on its focal segment, for example in base pairs), end (absolute end position of the marker), and strand (the reading direction of the marker, either "+"
or "-"
). Additional columns are ignored and may store custom information, such as marker names. Markers need to be ordered by their map position within each focal segment, for example by running the orderGenomeMap()
function. More information on the focalgenome
file format is available from the ?checkInfile
command.
\medskip
## Example for the focal genome map of Drosophila melanogaster ## (marker start and end positions are midpoints +/- 1 base pair) head(MEL_markers)
To load a text file with a focal genome map into R
, commands like the following can be used:
\medskip
## Example for reading a focal genome map from file markersfile <- "~/path/to/focalgenome.txt" focalgenome <- read.table(file = markersfile, as.is = TRUE, col.names = c("marker", "scaff", "start", "end", "strand")) ## assure that column 'scaff' is a character vector focalgenome$scaff <- as.character(focalgenome$scaff)
Important: The column scaff must be a character vector, even when all focal segment names are numeric. This is assured by the above as.character(focalgenome$scaff)
command.
orderGenomeMap()
functionThe focal genome map in focalgenome
needs to be ordered so that markers appear according to their map position within each focal segment. Focal segments can be in arbitrary order. The function orderGenomeMap()
orders markers within each focal segment by their map position (i.e., the midpoint between positions given by the columns start and end in focalgenome
). Focal segments can be ordered by user-defined names, in alphabetical order, or by their size. See ?orderGenomeMap
for details.
\medskip
## Example for ordering the focal genome map of D. melanogaster ## make random order (for illustration only) myorder <- sample(1:nrow(MEL_markers)) MEL_markers_unordered <- MEL_markers[myorder, ] head(MEL_markers_unordered) ## order genome map by size of focal segments and marker position MEL_markers_reordered <- orderGenomeMap(MEL_markers_unordered, ordnames = "all", sortby = "size") head(MEL_markers_reordered)
checkInfile()
functionTo verify that the format of the focal genome map is correct, the checkInfile()
function can optionally be run. This function is also called internally from other functions of the package where focalgenome
is used as input. The function will return an error message when a problem has been detected, or nothing otherwise. See ?checkInfile
for more information.
\medskip
## Example for checking the focal genome map of D. melanogaster checkInfile(MEL_markers, "focalgenome", checkorder = TRUE)
Note: If the marker column was assigned the class numeric
rather than the required class integer
, run for example the command focalgenome <- type.convert(focalgenome, as.is = TRUE)
to fix the resulting error.
The compared genome representation in compgenome
can be an ancestral genome reconstruction or an extant genome, and is organized into genome segments called CARs (Contiguous Ancestral Regions; following Ma et al., 2006; Chauve and Tannier, 2008). For simplicity, compared genome segments are referred to as CARs throughout, but can equally be any contiguous sets of genetic markers of an extant genome. Each CAR is represented by a PQ-tree (described below; Booth and Lueker, 1976; Chauve and Tannier, 2008), which specifies the relative positions of markers within genome segments, incorporating ambiguity in marker order (i.e., all alternative orders).
The PQ-tree format joins all PQ-trees (i.e., CARs) of the compared genome representation in a single data frame compgenome
(details below). For simplicity, the joined PQ-trees in the compgenome
data frame are referred to as PQ-structure throughout, while individual PQ-trees are referred to as CARs or PQ-trees.
An ancestral genome representation in form of PQ-trees, as output by the software ANGES [@chauve08; @jones12], can be converted into a PQ-structure with the convertPQtree()
function. A one-dimensional genome map, for example of an extant genome, can be converted into a two-dimensional PQ-structure with the genome2PQtree()
function.
A PQ-tree is a combinatorial structure consisting of leaves, P-nodes, and Q-nodes (Booth and Lueker, 1976; Chauve and Tannier, 2008; Figure \ref{fig:pqtree}). This structure allows the representation of ambiguity in marker order in an ancestral genome reconstruction (i.e., encoding all possible arrangements of markers). Each marker is represented by a leaf, which is connected to the root of the PQ-tree (i.e., a CAR) through a hierarchical sequence of internal nodes (P-nodes or Q-nodes). The type of internal nodes and the root specify the order of their children (i.e., internal nodes or leaves at the subsequent hierarchy level): they are in arbitrary order for P-nodes, and in fixed order (including their reversal) for Q-nodes. The Figure \ref{fig:pqtree} illustrates two alternative marker orders (among many) that can be encoded by a single PQ-tree.
\newpage
\medskip
knitr::include_graphics("PQTree.png")
Figure \ref{fig:pqtree}: P-nodes are illustrated with ovals and Q-nodes with rectangles. Leaves (i.e., markers) are illustrated with pentagons and include their marker ID. Children of P-nodes are in arbitrary order, and children of Q-nodes are in fixed order (including their reversal). The marker orders 1 2 3 4 5 6 7 8 9 10
in A and 3 2 1 4 8 9 10 -7 -6 -5
in B can be encoded by the same PQ-tree, as branches that origin from P-nodes can be transposed, and those that origin from Q-nodes can be reversed in their order (i.e., turning the Q-node around), as indicated by the gray arrows.
To facilitate the handling of PQ-trees in R
, their hierarchical organization is encoded as a data frame compgenome
with markers in rows and a description of the structure of PQ-trees in columns (the PQ-tree format, or PQ-structure). The Figure \ref{fig:toy24tree} illustrates graphically the tabular PQ-structure of the example data TOY24_compgenome
. All branches that origin at a node are labeled with an integer ID (i.e., the node element ID) according to their position at that node, starting at 1
. When considered jointly through all hierarchy levels, these IDs provide the relative position of each marker within their PQ-tree (e.g., compare the branch labels in the Figure \ref{fig:toy24tree} with the entries in the rows of TOY24_compgenome
for each marker below). The full compared genome can be represented by a single PQ-tree with a P-node at its root (i.e., CARs are positioned in arbitrary order to each other), which is encoded by integer IDs at the CAR level.
\medskip
knitr::include_graphics("Toy24_compgenome.png")
The data frame compgenome
contains the mandatory columns marker (orthologous marker ID), orientation (the reading direction of the marker, either "+"
or "-"
), and car (ID of the CAR where the marker is located). Subsequent columns describe the internal structure of each CAR, where two columns are needed for each subordinate hierarchy level: the first column of each set gives the node type (either "P"
or "Q"
), and the second column the integer ID of the children of the node (i.e., the node element ID). NA
s are permitted from the second subordinate hierarchy level onwards and are required to fill otherwise empty cells that arise when the preceding hierarchy level encodes leaves. Markers need to be ordered by their node element IDs within the PQ-structure. More information on the PQ-tree format is available from the ?checkInfile
command.
\medskip
## Example for a compared genome representation in PQ-tree format ## ('TOY24_compgenome'; see Figure for a graphical representation) TOY24_compgenome
To load a text file with a compared genome representation into R
, commands like the following can be used:
\medskip
## Example for reading a compared genome representation from file pqtreefile <- "~/path/to/compgenome.txt" ## (A) format is whitespace separated with missing entries ## and no column names ncol <- max(count.fields(file = pqtreefile, sep = " ")) compgenome <- read.table(file = pqtreefile, fill = TRUE, col.names = c("marker", "orientation", "car", paste0(rep(c("type", "elem"), times = (ncol - 3) / 2), rep(1:((ncol - 3) / 2), each = 2))), as.is = TRUE, na.strings = "") ## (B) format has NA for missing entries and has column names compgenome <- read.table(file = pqtreefile, header = TRUE, as.is = TRUE)
convertPQtree()
functionThe ancestral genome reconstruction that is generated by the software ANGES [@chauve08; @jones12] is encoded as a text file with the first row giving the name of the ancestor (preceded by >
), followed by two rows for each CAR. The first row of each set gives the CAR ID (preceded by #CAR
), and the second row provides the PQ-tree structure of that CAR. The children of each node in the PQ-tree are enclosed by _P
and P_
markups for P-nodes, or by _Q
and Q_
markups for Q-nodes. Markers (i.e., orthologs) that belong to a particular node are located between its corresponding markups. Markers with reversed orientation are preceded by a -
sign. The opening (_P
or _Q
) and closing (P_
or Q_
) markups can be nested to allow the representation of the hierarchical structure of the PQ-tree. Such linearly encoded PQ-trees can be converted into a PQ-structure with the convertPQtree()
function. See ?convertPQtree
for more information.
Important: The convertPQtree()
function was designed to work with the ANGES *PQTREE*
output file (e.g., the MSSYE_PQTREE_HEUR
example file). It was not designed to work with the ancillary *PQRTREE*
or the *DOUBLED*
output files.
\medskip
## Example for a linearly encoded PQ-tree for #CAR7 of the ## Drosophila ancestral genome 'MSSYE' computed with the ## software ANGES (Jones et al., 2012) MSSYE_PQTREE_HEUR[which(MSSYE_PQTREE_HEUR[ ,1] == "#CAR7") + 1, ] ## Convert linearly encoded PQ-trees into PQ-structure MSSYE_compgenome <- convertPQtree(MSSYE_PQTREE_HEUR) ## Show converted #CAR7 MSSYE_compgenome[MSSYE_compgenome$car == 7, ]
Note: For the following examples it is assumed that the object MSSYE_compgenome
has been created with the above command.
To load a text file with an ancestral genome reconstruction from the software ANGES into R
and to convert it into a PQ-structure, commands like the following can be used:
\medskip
## Example for reading an ancestral genome reconstruction ## from the software ANGES from file and converting it into ## a PQ-structure pqtreefile <- "~/path/to/PQTREE" ## read file with the genome reconstruction rawtree <- read.table(file = pqtreefile, sep = ",", comment.char = "", as.is = TRUE) ## convert into PQ-structure compgenome <- convertPQtree(rawtree)
genome2PQtree()
functionAlternatively to a genome reconstruction in PQ-tree format, a one-dimensional genome map can be converted into a two-dimensional PQ-structure. This can be, for example, a genome map of an extant genome, or a genome map of an unambiguous genome reconstruction. The compared genome map has to be in the same format as the focal genome map (described above), and needs to be ordered by the map position of markers within each compared genome segment, for example by running the orderGenomeMap()
function. An unambiguously ordered genome representation can be seen as a subclass of a PQ-structure, where each genome segment is encoded by a single Q-node that only contains leaves as children. Accordingly, the PQ-structure of a converted genome map has exactly five columns (i.e., marker, orientation, car, and two columns for node type and node element). The genome2PQtree()
function performs such a conversion. See ?genome2PQtree
for details.
Important: Compared genome segments need to be contiguous sets of genetic markers. Genome segments that are (potentially) overlapping, such as minor scaffolds or contigs that were not assembled into chromosomes and might in fact be part of assembled chromosomes or enclosed in other scaffolds, need to be excluded.
\medskip
## Example for converting the genome map of D. simulans into ## a PQ-structure ## Exclude potentially overlapping minor scaffolds from genome map SIM_markers_chr <- SIM_markers[is.element(SIM_markers$scaff, c("2L", "2R", "3L", "3R", "4", "X")), ] ## Order genome map by specified chromosomes SIM_markers_chr <- orderGenomeMap(SIM_markers_chr, c("2L", "2R", "3L", "3R", "4", "X")) ## Show original genome head(SIM_markers_chr) ## Convert genome map into PQ-structure SIM_compgenome <- genome2PQtree(SIM_markers_chr) ## Show converted genome head(SIM_compgenome) ## Print a translation between chromosome names and CAR IDs head(data.frame(chr = unique(SIM_markers_chr$scaff), car = 1:length(unique(SIM_markers_chr$scaff)), stringsAsFactors = FALSE))
Note: CAR IDs will be assigned according to the order of compared genome segments in the genome map. Markers that are NA
in the genome map will be excluded from the PQ-structure.
checkInfile()
functionTo verify that the format of the compared genome representation is correct, the checkInfile()
function can optionally be run. This function is also called internally from other functions of the package where compgenome
is used as input. The function will return an error message when a problem has been detected, or nothing otherwise. See ?checkInfile
for more information.
\medskip
## Example for checking the compared genome representation of the ## Drosophila ancestral genome 'MSSYE' that was converted into ## 'MSSYE_compgenome' above checkInfile(MSSYE_compgenome, "compgenome", checkorder = TRUE)
Note: If the marker column was assigned the class numeric
rather than the required class integer
, run for example the command compgenome <- type.convert(compgenome, as.is = TRUE)
to fix the resulting error.
The computeRearrs()
function detects and classifies rearrangements with the rearrvisr
algorithm. The summarizeBlocks()
function summarizes the output of the computeRearrs()
function for each synteny block.
computeRearrs()
functionThe computeRearrs()
function requires as input files an ordered map of the focal genome (focalgenome
), for which rearrangements will be identified, and an ordered representation of the compared genome (compgenome
), which serves as reference for the arrangement of markers within genome segments. In addition, it needs to be specified whether markers in the ancestral genome reconstruction contain information about their orientation (doubled
). If orientation information is not available (i.e., doubled = FALSE
), all values in the orientation column of compgenome
should be "+"
. See ?computeRearrs
for more information on the input and output of the function. Further details are also provided in the walk-through example of the rearrvisr
algorithm below.
\medskip
## Example for identifying rearrangements in the D. melanogaster ## genome that occurred after divergence from the Drosophila ## ancestor 'MSSYE' ## identify rearrangements (may run a few seconds) SYNT_MEL_MSSYE <- computeRearrs(MEL_markers, MSSYE_compgenome, doubled = TRUE) ## show names of the matrices in the output names(SYNT_MEL_MSSYE)
Note: For the following examples it is assumed that the object SYNT_MEL_MSSYE
has been created with the above command.
A basal step of the rearrvisr
algorithm is the alignment of the compared genome representation in compgenome
to the sorted genome map of the focal species in focalgenome
. This can be reconstructed with the following command:
\medskip
## Example for reconstructing the alignment between the genome of ## the Drosophila ancestor 'MSSYE' and the D. melanogaster genome ## make alignment MEL_MSSYE<-merge(MEL_markers, MSSYE_compgenome, by = "marker", sort = FALSE) ## marker IDs should be identical to the ones in SYNT_MEL_MSSYE identical(as.character(MEL_MSSYE$marker), rownames(SYNT_MEL_MSSYE$NM1))
Note: For the following examples it is assumed that the alignment MEL_MSSYE
has been created with the above command.
Next steps: The data returned by the computeRearrs()
function can be filtered for rearrangements of a particular size with the filterRearrs()
function, visualized with the genomeImagePlot()
function, or summarized and visualized with the summarizeBlocks()
and genomeRearrPlot()
functions. Breakpoint coordinates for rearrangements can be extrated with the getBreakpoints()
function. A summary of the number of detected rearrangements and breakpoints can be obtained with the summarizeRearrs()
function.
Output: Each matrix in the output stores data on detected rearrangements and additional information. Markers are in rows, and the row names of each matrix correspond to the IDs in the marker column of focalgenome
and compgenome
.
The matrices $NM1
, $NM2
, $SM
, and $IV
contain rearrangements of four different classes, and are briefly described below. Further details and information on the other matrices are provided in the function documentation.
Rearrangements are classified as nonsyntenic moves between compared genome segments (i.e., CARs), analogous to interchromosomal rearrangements (NM1
and NM2
), and as syntenic moves or inversions within compared genome segments, analogous to intrachromosomal rearrangements (SM
and IV
). $NM1
stores class I N
onsyntenic M
oves; $NM2
stores class II N
onsyntenic M
oves; $SM
stores S
yntenic M
oves; $IV
stores I
nV
ersions. See the Figure \ref{fig:classes} for examples of these four classes of rearrangements.
\medskip
knitr::include_graphics("Classes.png")
Figure \ref{fig:classes}: Markers located on two focal genome segments (left and right) are represented by pentagons. Markers are arranged from left to right by their position in the focal genome. The top row shows markers colored according to their CAR of origin, and integers in the bottom row indicate the order of markers within the compared genome. (i), a class I nonsyntenic move (NM1
; i.e., marker 17
of the red CAR is located in the middle of the left focal segment, while the remaining markers of the red CAR are on the right focal segment; note that in this example, the rearranged marker 17
also belongs to class ii). (ii), a class II nonsyntenic move (NM2
; i.e., marker 12
of the purple CAR is located in the middle of the red CAR within the right focal segment). (iii), a syntenic move (SM
; i.e., marker 4
has changed position within the blue CAR within the left focal segment). (iv), an inversion (IV
; i.e., markers 7 8 9
are inverted to -9 -8 -7
within the purple CAR).
Each rearrangement is represented by a separate column. Except for NM1
, which are identified across focal segments, columns for individual focal segments are joined across rows to save space (i.e., for NM2
, SM
, and IV
, which are identified within focal segments). To preserve the tabular format, these matrices are filled by zeros for focal segments with a non-maximal number of rearrangements, if necessary. If no rearrangements were detected for a certain class, the matrix has zero columns.
Markers that are part of a rearrangement have a tag value of >0
within their respective column. Tagged markers within a column are not necessarily consecutive, for example, when a rearrangement is split into several parts through an insertion of a different CAR, or when a rearrangement has an upstream and a downstream component. Larger tag values (up to 1
) correspond to a higher confidence that the marker has been rearranged [^1]. This arises from the fact that nonsyntenic or syntenic moves can only be identified relative to each other. For example, if the marker order in the compared genome is 1 2 3 4 5
, and that in the focal genome is 1 3 4 2 5
, marker 2
might have been moved to the right, or markers 3 4
might have been moved to the left. In such a case, the markers that are more parsimonious to have been moved relative to the alternative markers receive tag values of 1 - remWgt
, while alternative markers receive tag values of remWgt
(remWgt
can be set by the user). If such a distinction cannot be made, all involved markers are tagged with 0.5
. Markers that are part of inversions are always tagged with 1
. Full details are given in the description of the rearrvisr
algorithm in the Supplementary information of the associated manuscript.
[^1]: Note that these values are not probabilities, rather labels to tag markers that have arrangements in the focal genome that are in conflict to their arrangement in the compared genome.
\medskip
## Example for extracting inversions that have been identified on ## chromosome 2L of D. melanogaster when compared to the genome of ## the Drosophila ancestor 'MSSYE' ## extract marker IDs for chromosome 2L from alignment MEL_2L_markers <- MEL_MSSYE$marker[MEL_MSSYE$scaff == "2L"] ## get position of markers on chromosome 2L in SYNT tmp <- is.element(rownames(SYNT_MEL_MSSYE$IV), MEL_2L_markers) ## show inverted markers on chromosome 2L SYNT_MEL_MSSYE$IV[rowSums(SYNT_MEL_MSSYE$IV) != 0 & tmp, , drop = FALSE] ## look at inverted marker 13315 in its genomic context mpos<-which(MEL_MSSYE$marker == 13315) MEL_MSSYE[(mpos - 4):(mpos + 4), c(1, 2, 5, 8, 7, 10)] ## this shows that the genomic region where marker 13315 is located ## has been aligned in reverse direction: strand and orientation show ## inverted directionality, and the direction of elements in elem1 is ## descending; within this context, marker 13315 has identical ## directionality, which indicates an inversion (in this case, a ## single-marker inversion)
summarizeBlocks()
functionThe summarizeBlocks()
function summarizes the output of the computeRearrs()
function for each synteny block in a more compact way. The function requires as input files the output of the computeRearrs()
function (the list SYNT
), and the data frames focalgenome
and compgenome
, which have been used to generate SYNT
. In addition, the IDs of the focal genome segments that should be summarized need to be specified with the character vector ordfocal
. See ?summarizeBlocks
for more information on the input and output of the function.
\medskip
## Example for summarizing rearrangements in the D. melanogaster ## genome that occurred after divergence from the Drosophila ## ancestor 'MSSYE' ## summarize rearrangements for each synteny block ## (may run a few seconds) BLOCKS_MEL_MSSYE <- summarizeBlocks(SYNT_MEL_MSSYE, MEL_markers, MSSYE_compgenome, c("2L", "2R", "3L", "3R", "X")) ## show names of the lists in the output names(BLOCKS_MEL_MSSYE) ## show names of matrices within lists names(BLOCKS_MEL_MSSYE[[1]])
Note: For the following examples it is assumed that the object BLOCKS_MEL_MSSYE
has been created with the above command.
Next steps: The data returned by the summarizeBlocks()
function can be visualized with the genomeRearrPlot()
function.
Output: The function output is a list of lists for each focal genome segment in ordfocal
. Each of these lists summarizes the alignment between focalgenome
and compgenome
, and the rearrangements in SYNT
.
Each list per focal genome segment contains the data frame $blocks
, which contains information on the alignment and structure of each PQ-tree, and five matrices $NM1
, $NM2
, $SM
, $IV
, and $IVsm
that store detected rearrangements. Each synteny block is represented by a row. Note that separate blocks are also generated when the hierarchical structure of the underlying PQ-tree changes, therefore not all independent rows are caused by a rearrangement. Further note that if the list SYNT
has been filtered with the filterRearrs()
function, only the detected rearrangements will be affected, while the information contained in $blocks
will remain unchanged.
\medskip
## print $blocks data frame for chromosome 3R of D. melanogaster BLOCKS_MEL_MSSYE$`3R`$blocks ## this shows that chromosome 3R is composed of 17 synteny blocks; ## details on the first four blocks are given below
In the $blocks
data frame, the columns start and end give the start and end positions of the synteny block in SYNT
(positions start at 1
separately for each focal genome segment), markerS and markerE give the marker IDs of the first and last marker per block, and car gives the ID of the aligned CAR.
\medskip
## Details on the boundary between the first and second block on ## chromosome 3R of D. melanogaster between markers 1747 and 1597 ## (node elements 2050 and 893; positions 323 and 324) MEL_MSSYE[MEL_MSSYE$scaff == "3R", c(1, 2, 5, 8, 7, 10)][313:334, ] ## this shows that the alignment changes from descending ## (i.e., inverted) to ascending (i.e., standard) direction ## at the block boundary
Additional columns in the $blocks
data frame provide more information on the alignment and structure of each CAR for all hierarchy levels of the underlying PQ-tree (the first hierarchy level has columns with the suffix 1
, the second 2
, and so on). See the documentation of the summarizeBlocks()
and the computeRearrs()
function for a full description of the columns in the $blocks
data frame.
There is only one hierarchy level for chromosome 3R of D. melanogaster in the example above. The column nodeori1 stores the alignment direction of the first level node in the PQ-tree of the aligned CAR, with 1
indicating ascending (i.e., standard), and -1
descending (i.e., inverted) alignment (given that the node is a Q-node and the alignment direction could be determined). Only one CAR (CAR 1
) has been aligned to chromosome 3R of D. melanogaster in descending direction. The column blockori1 stores the orientation of each synteny block, again with 1
indicating ascending (i.e., standard), and -1
descending (i.e., inverted) orientation (given that the bock orientation could be determined). In the above example, the first block has descending and the second block ascending direction, which is also reflected by the first and last node element IDs of each block, which are given in the columns elemS1 and elemE1.
Finally, the column blockid1 stores the ID of each synteny block within its node. For Q-nodes, block IDs reflect the order of synteny blocks in compgenome
. In the above example, the first block on chromosome 3R of D. melanogaster is the last block on CAR 1
of the MSSYE ancestor. Block IDs with .1
or .2
suffixes (in arbitrary order) indicate blocks that were subject to an additional subdivision step. In the above example, the third and fourth blocks were initially joined to a descending block with node elements 1346
and 1345
. Block subdivisions are used to assign a more parsimonious SM
instead of an IV
to a block.
\medskip
## Details on the boundary between the third and fourth block on ## chromosome 3R of D. melanogaster between node elements 1346 ## and 1345 (positions 776 and 777) MEL_MSSYE[MEL_MSSYE$scaff == "3R", c(1, 2, 5, 8, 7, 10)][772:781, ] ## this shows that the larger-scale alignment around the two node ## elements is ascending (i.e., from 1341 to 1350), while elements ## 1346 and 1345 form a descending block. As their strand and ## orientation show identical directionality, it is more ## parsimonious to assign a syntenic move instead of an inversion, ## which would require two additional single-marker inversions to ## achieve the required inverted directionality. Accordingly, the ## initial block of two elements is split into two single-element ## blocks
The matrices $NM1
, $NM2
, $SM
, $IV
, and $IVsm
summarize the four classes of rearrangements in SYNT
for the synteny blocks characterized in $blocks
(tag values within $NM1
, $NM2
, and $SM
are the same as in SYNT
, see above). Inversions that are part of a multi-marker inversion are stored in $IV
, and blocks part of such inversions are tagged with 1
(with separate columns for different inversions). Single-marker inversions (i.e., markers with switched orientation) are stored in $IVsm
. The positions of such single-marker inversions within their block are indicated by integers >0
(with separate columns for different single-marker inversions).
\medskip
## print $IV matrix for chromosome 3R of D. melanogaster BLOCKS_MEL_MSSYE$`3R`$IV ## this shows that two multi-marker inversions have been detected; ## the first one spans blocks 2:11, and the second one involves ## block 16
\medskip
## print $IVsm matrix for chromosome 3R of D. melanogaster BLOCKS_MEL_MSSYE$`3R`$IVsm ## this shows that two single-marker inversions have been ## detected, the first is on position 419 within block 2, the ## second on position 18 within block 17 ## look at the first single-marker inversion, which is located ## at position 743 (start position of block 2 + position of the ## inversion) MEL_MSSYE[MEL_MSSYE$scaff == "3R", c(1, 2, 5, 8, 7, 10)][738:746, ] ## this shows that marker 3588 is located in an ascending block, ## but has inverted directionality between strand and orientation
\medskip
## print $SM matrix for chromosome 3R of D. melanogaster BLOCKS_MEL_MSSYE$`3R`$SM ## this shows a total of eight syntenic moves, each involving a ## single block. In all cases, blocks are tagged with 0.5, ## indicating that it could not be distinguished which part of ## a syntenic move would be more parsimonious to have changed ## position relative to the alternative part (e.g., for the ## syntenic move between the third and fourth block on ## chromosome 3R, described above, it cannot be distinguished ## whether node element 1346 has moved one block up, or element ## 1345 one block down)
genomeImagePlot()
functionThe genomeImagePlot()
function visualizes rearrangements that were detected with the computeRearrs()
function along the focal genome. Markers that are part of different classes of rearrangements are tagged with different colors at their map position, allowing the visual examination of the extent, number, and location of rearrangements. As an example, the Figure \ref{fig:image-plot} visualizes rearrangements in the D. melanogaster genome that occurred after divergence from the Drosophila ancestor MSSYE.
The function requires as input files the output of the computeRearrs()
function (i.e., the list SYNT
), the data frame focalgenome
, and the IDs of the focal genome segments that should be plotted (the character vector ordfocal
). The list SYNT
may have optionally been filtered with the filterRearrs()
function.
The graphic can directly be output as PDF with the argument makepdf = TRUE
(which can be much faster than plotting to screen). By default, the function automatically determines the optimal width and height of the plot (this can be turned off by setting both makepdf = FALSE
and newdev = FALSE
). Various function arguments allow to adjust the appearance of the graphic, including font sizes, axes labels, and plot margins. See ?genomeImagePlot
for more information and examples on the usage of the function.
Note: The automatic determination of the optimal width and height of the graphic is not possible when using the RStudio graphical device (RStudioGD
) as default, as it does not accept width and height as arguments. In this case, the plot is send to an alternative device. (Only relevant when makepdf = FALSE
.)
\medskip
## Example for determining and changing the default graphics device ## determine current default graphics device getOption("device") ## see list of available graphics devices ?Devices ## for example, set new default to X11 options(device = "X11")
\medskip
## Example for visualizing rearrangements in the D. melanogaster ## genome that occurred after divergence from the Drosophila ## ancestor 'MSSYE'. The large maroon sector reveals a derived ## inversion on chromosome 3R, in line with previous work ## (Ranz et al., 2007) genomeImagePlot(SYNT_MEL_MSSYE, MEL_markers, c("2L", "2R", "3L", "3R", "X"), main = "D. melanogaster - MSSYE")
yaxlab<-c("2L", "2R", "3L", "3R", "X") mymar<-c(3,0.6+max(nchar(yaxlab))*0.41,2,1) vspace<-0 ## compute fig.height and fig.width for specs above myheight<-(length(yaxlab)+(vspace/5)*(length(yaxlab)-1))*0.7+ sum(mymar[c(1,3)])*0.2 ## 4.5 mywidth<-7+sum(mymar[c(2,4)])*0.2 ## 7.484 op <- options(device = "pdf") genomeImagePlot(SYNT_MEL_MSSYE, MEL_markers, yaxlab, main = "D. melanogaster - MSSYE", mar = mymar, vspace = vspace, newdev = FALSE) options(op)
Output: On the y-axis, each focal genome segment that was included in ordfocal
is plotted from top to bottom. Marker positions are on the x-axis. Vertical lines (ticks) indicate the positions of markers, rearrangement breakpoints, and different classes of rearrangements within six rows per focal segment.
The bottom row shows the positions of all markers contained in focalgenome
as gray ticks, and the positions of markers present in SYNT
as black ticks in the upper half of the bottom row. The second row from the bottom gives the positions of rearrangement breakpoints by red ticks. Ticks are drawn at the midpoints between the positions of the two markers present in SYNT
that are adjacent to the breakpoints. Only markers with black ticks can receive colored ticks in the remaining rows, which highlight markers that are part of different classes of rearrangements. Maroon indicates markers that are part of inversions within CARs within focal genome segments (IV
); purple indicates markers that are part of syntenic moves (SM
); blue indicates markers that are part of class II nonsyntenic moves (NM2
); green indicates markers that are part of class I nonsyntenic moves (NM1
). See the description of the rearrvisr
algorithm in the Supplementary information of the associated manuscript or ?computeRearrs
for more information on these classes of rearrangements.
Unless the argument remThld
is set to a value smaller than that of remWgt
in the computeRearrs()
function, only markers that are less parsimonious to have changed position relative to alternative markers are highlighted. Lighter tints denote markers that are part of large rearrangements, while darker shades denote markers that are part of small rearrangements. To distinguish between individual large rearrangements versus multiple short adjacent rearrangements, it may be helpful to take the position of breakpoints into account.
genomeRearrPlot()
functionThe genomeRearrPlot()
function visualizes the output of the summarizeBlocks()
function along the focal genome. The produced plot may be used to reconstruct how the compared genome segments were aligned to focal genome segments, to notice the structure of the aligned PQ-trees, and to comprehend why particular synteny blocks were tagged as rearranged by the computeRearrs()
function. As an example, the Figure \ref{fig:rearr-plot} visualizes rearrangements in the D. melanogaster genome that occurred after divergence from the Drosophila ancestor MSSYE.
The function requires as input files the output of the summarizeBlocks()
function (i.e., the list BLOCKS
), the data frame compgenome
, and the IDs of the focal genome segments that should be plotted (the character vector ordfocal
).
The graphic can directly be output as PDF with the argument makepdf = TRUE
(which can be much faster than plotting to screen). By default, the function automatically determines the optimal width and height of the plot (this can be turned off by setting both makepdf = FALSE
and newdev = FALSE
). Various function arguments allow to adjust the appearance of the graphic, including font sizes, axes labels, and plot margins. See ?genomeRearrPlot
for more information and examples on the usage of the function.
Note: The automatic determination of the optimal width and height of the graphic is not possible when using the RStudio graphical device (RStudioGD
) as default, as it does not accept width and height as arguments. In this case, the plot is send to an alternative device. (Only relevant when makepdf = FALSE
.) For an example on how to determine and change the default graphics device, see the section on the genomeImagePlot()
function above.
\medskip
## Example for visualizing rearrangements in the D. melanogaster ## genome that occurred after divergence from the Drosophila ## ancestor 'MSSYE' and that were summarized with the ## summarizeBlocks() function. Note that CAR 1 joins chromosomes ## 3L and 3R of D. melanogaster genomeRearrPlot(BLOCKS_MEL_MSSYE, MSSYE_compgenome, c("2L", "2R", "3L", "3R", "X"), main = "D. melanogaster - MSSYE", blockwidth = 1.15, y0pad = 3)
yaxlab<-c("2L", "2R", "3L", "3R", "X") plotelem<-c(1,1,1,1,1) space<-data.frame(nmark=0.65, carid=1, ndori=0.4*plotelem[1], blkid=1*plotelem[2], blkori=0.4*plotelem[3], elem=0.65*plotelem[4], stat=0.75*plotelem[5], rearr=0.25*plotelem[5], scaff=2.25) blockwidth<-1.15 cex.axis<-1.2 cex.main<-2 mymar<-c(0.4,0.6+max(nchar(yaxlab))*0.42*cex.axis,2.2*cex.main,0.4) remThld<-0.05 pad<-0+0.7 y0pad<-3 ## simplify BLOCKS BLOCKS_MEL_MSSYE<-rearrvisr:::simplifyBlockTags(BLOCKS_MEL_MSSYE) ## compute fig.height and fig.width for specs above mylim<-rearrvisr:::setRearrPlot(BLOCKS_MEL_MSSYE,yaxlab,space,blockwidth, mymar,remThld,pad,y0pad) ## mylim$height ## ## 14.44 ## mylim$width ## ## 17.2016 op <- options(device = "pdf") genomeRearrPlot(BLOCKS_MEL_MSSYE, MSSYE_compgenome, yaxlab, main = "D. melanogaster - MSSYE", remThld = remThld, mar = mymar, pad = 0, y0pad = y0pad, plotelem = plotelem, simplifyTags = TRUE, blockwidth = blockwidth, yaxlab = yaxlab, cex.main = cex.main, cex.axis = cex.axis, newdev = FALSE) options(op)
Note: The function argument y0pad
sets the amount of additional space between the bottom plot margin and the bottom plot area. Setting y0pad
too small may result in some rearrangements for the bottom-most focal segment to be invisible because they will be outside the bottom plot area. This can be avoided by setting y0pad
sufficiently large.
Output: On the y-axis, each focal genome segment that was included in ordfocal
is plotted from top to bottom. Each synteny block is represented by a column (note that separate blocks are also generated when the hierarchical structure of the underlying PQ-tree changes, therefore not all column boundaries are caused by a rearrangement).
The top part of each focal segment visualizes the data contained in the $blocks
data frame in BLOCKS
(i.e., information on the structure of each PQ-tree and its alignment to the focal genome). The top row gives the number of markers within each synteny block, calculated from the start and end columns in the $blocks
data frame. The second row gives the IDs of the CARs, corresponding to entries in the car column in the $blocks
data frame (different colors distinguish CARs).
For each hierarchy level, up to four additional rows (depending on the values in the argument plotelem
) show (i) the alignment orientation of the PQ-tree node to the focal genome (white rectangles and +
indicating ascending, and black rectangles and -
indicating descending alignment), (ii) the block IDs, (iii) the block orientation within its node (white rectangles and +
indicating ascending, and black rectangles and -
indicating descending orientation), and (iv) the range of element IDs for each block within its node and for its level of hierarchy. These four rows correspond to entries in the (i) nodeori*, (ii) blockid*, (iii) blockori*, and (iv) elemS* -- elemE* columns in the $blocks
data frame. See the summarizeBlocks()
function above for a description of the $blocks
data frame (using chromosome 3R of D. melanogaster as an example), and ?genomeRearrPlot
for further details.
The bottom part of each focal segment visualizes different classes of rearrangements in the matrices $NM1
, $NM2
, $SM
, $IV
, and $IVsm
in BLOCKS
. See the description of the rearrvisr
algorithm in the Supplementary information of the associated manuscript or the ?computeRearrs
documentation for more information on these classes of rearrangements. In contrast to the genomeImagePlot()
function, the genomeRearrPlot()
function allows to visualize rearrangements that are nested, and to distinguish between individual large rearrangements versus multiple short adjacent rearrangements.
Horizontal lines that are at identical height denote the same rearrangement (potentially disrupted by inserted CARs). Green indicates blocks that are part of class I nonsyntenic moves (NM1
); blue indicates blocks that are part of class II nonsyntenic moves (NM2
); purple indicates blocks that are part of syntenic moves (SM
); maroon indicates blocks that are part of inversions (IV
). Single-marker inversions (IVsm
) are indicated by a short vertical line at their position within their block, while multi-marker inversions are indicated by a horizontal line spanning the contributing block(s).
Lighter coloration denotes smaller weights for rearrangement tags in the respective matrices in BLOCKS
. Unless the argument remThld
is set to a value smaller than that of remWgt
in the computeRearrs()
function, only lines for blocks that are less parsimonious to have changed position relative to alternative blocks are plotted.
rearrvisr
algorithmThis section describes the main steps of the rearrvisr
algorithm on a simple data set. A detailed description can be found in the Supplementary information of the associated manuscript.
The example data TOY24_focalgenome
and TOY24_compgenome
were created for illustrative purposes and are included in the package. They can be regenerated with the following commands:
\medskip
## Create a simple example data set ## focalgenome TOY24_focalgenome <- data.frame( marker = as.integer(c(1,7,2,6:4,8:10,3,13:11,14,17:15,18,21,20,22:24,19)), scaff = as.character(rep(c(1,2,3), times = c(7,11,6))), start = as.integer(c(seq(10^6, by = 10^6, length.out = 7), seq(10^6, by = 10^6, length.out = 11), seq(10^6, by = 10^6, length.out = 6))), end = as.integer(c(seq(10^6+2, by = 10^6, length.out = 7), seq(10^6+2, by = 10^6, length.out = 11), seq(10^6+2, by = 10^6, length.out = 6))), strand = rep(rep(c("+", "-"), 4), times = c(4,2,4,3,8,1,1,1)), stringsAsFactors = FALSE) ## compgenome TOY24_rawtree <- matrix( c(">TOY", "#CAR1", "_Q 1 2 3 4 5 -6 7 8 Q_", "#CAR2", "_Q 9 _Q 10 Q_ _Q 11 12 13 Q_ Q_", "#CAR3", "_Q 14 Q_", "#CAR4", "_Q _P 15 16 17 18 P_ _Q _Q -19 20 21 Q_ _Q -22 -23 24 Q_ Q_ Q_"), nrow = 9) TOY24_compgenome <- convertPQtree(TOY24_rawtree)
Rearrangements are identified by aligning the compared genome representation in TOY24_compgenome
(i.e., the PQ-structure, which is the data frame joining all PQ-trees, or CARs, of the compared genome) to the sorted genome map of the focal species in TOY24_focalgenome
(Figure \ref{fig:toy24align}A and B; where 'sorted' means that the sequence of markers within each focal genome segment is ordered by map position). Only the set of markers common to both genomes is considered. Markers (orthologous genes or synteny blocks) are called doubled if orientation information is available for the focal and the compared genome, for example when a genome reconstruction was obtained with the software ANGES [@jones12] applying the option markers_doubled 1
[@ouangraoua11].
In practice, the alignment of the markers in TOY24_compgenome
to their orthologs in TOY24_focalgenome
is achieved by ordering the rows in TOY24_compgenome
so that the sequence of markers is identical to the sequence of markers in TOY24_focalgenome
.
\medskip
## PQ-structure in 'TOY24_compgenome' ordered so that the ## sequence of markers corresponds to the sequence of markers ## in the focal genome in 'TOY24_focalgenome' (i.e., ## 'TOY24_compgenome' is "aligned" to 'TOY24_focalgenome'; see ## Figure A and B for a graphical representation of the ## alignment) TOY24_compgenome[match(TOY24_focalgenome$marker, TOY24_compgenome$marker), ]
The alignment potentially disrupts the original sequence of PQ-structure elements (i.e., node element IDs) at one or more hierarchy levels (i.e., in columns $car, $elem1, $elem2, or $elem3). Such disruptions of the contiguity of aligned node element IDs indicate rearrangements. Each hierarchy level and each node is checked separately for rearrangements, starting at the CAR level and then descending through the hierarchy levels. The definition of a disrupted sequence of node element IDs differs among CARs, P-nodes, and Q-nodes. Briefly, for CARs and P-nodes, where children are in arbitrary order, the sequence of node element IDs does not need to be ordered, but the presence of scattered elements indicates a rearrangement (e.g., element 2
within the sequence 2 2 1 2 2 2
is scattered and indicates that the coherence of markers descending from child 2
is disrupted by markers descending from child 1
). For Q-nodes, the sequence of node element IDs needs to be ordered in either ascending or descending direction, any deviation from this (e.g., 3 1 2
) indicates a rearrangement (a syntenic move or an inversion).
If markers are doubled, their directionality in the focal genome relative to the compared genome is taken into account for the detection and classification of rearrangements within Q-nodes. The directionality can be identical (then the strand in both genomes is either "+"
or "-"
) or inverted (then the strand is "+"
in one genome but is "-"
in the other). Note that if the hierarchically lowest node of a given marker in the corresponding PQ-tree does not contain any other branches or markers (a single-marker node), the directionality is irrelevant, as nodes can be reversed. The directionality is also irrelevant for the detection of rearrangements at the CAR level or for P-nodes.
While markers part of an inversion can be unambiguously delimitated, syntenic moves between sets of markers are only defined relative to each other (e.g., within the sequence 2 2 1 2 2 2
, the first two elements 2
may have been moved from original positions two and three to new positions one and two, or element 1
may have been moved from original position one to its new position three). In many cases it cannot be distinguished which component of a syntenic move (i.e., which set of elements) was causal to a detected rearrangement. This uncertainty is addressed by assigning different tag values to alternative rearrangement components based on the principle of parsimony, where larger values indicate elements that are more parsimonious to have been moved.
\newpage
knitr::include_graphics("Alignment.png")
Figure \ref{fig:toy24align}: Alignment of the compared genome representation (i.e., PQ-structure) in TOY24_compgenome
to the focal genome map in TOY24_focalgenome
, and hierarchical subdivision of the aligned PQ-structure for rearrangement identification. A, graphical representation of the PQ-structure in TOY24_compgenome
in its original order. PQ-trees of four CARs are shown in colors. P-nodes are illustrated with ovals and Q-nodes with rectangles. Marker IDs are given by black integers within pentagons at the leaves of the PQ-trees. All branches are labeled with a colored integer ID (i.e., the node element ID) according to their position at their node of origin, starting at 1
for each node and hierarchy level. B, graphical representation of the focal genome map in TOY24_focalgenome
, ordered by map position. Markers are shown as pentagons, filled with different grays according to their focal segment (scaffolds I -- III) of origin. Black dashed lines show the alignment between the PQ-structure and the focal genome map. C, subdivision of the aligned PQ-structure by focal segments at CAR level (i.e., hierarchy level 1
). D, subdivision of the aligned PQ-structure by focal segments at node levels (i.e., hierarchy levels 2
-- 4
). Hierarchy levels are shown in rows. The colored integers within pentagons are node element IDs of the aligned PQ-trees. Asterisks indicate markers that have inverted directionality in the focal genome relative to the compared genome. Subdivisions (based on identical elements at higher hierarchy levels) are indicated by small vertical gaps between markers and connected to the preceding hierarchy by gray dashed lines. For the red CAR (CAR 2), the red dashed line indicates that node elements 1 2 3 3 3
are joined across the inserted marker from the blue CAR (CAR 1) for rearrangement detection at the node level. E, tags for rearranged markers. The green bar indicates a class I nonsyntenic move (NM1
); the blue bar indicates a class II nonsyntenic move (NM2
); purple bars indicate markers that are part of syntenic moves (SM
); maroon bars indicate markers that are part of inversions (IV
). A single-marker inversion (IVsm
) is indicated by a short vertical bar, while multi-marker inversions are indicated by horizontal bars. Lighter coloration for purple bars for markers 21
and 20
on scaffold III indicate half-weight tag values, as it cannot be distinguished which of the two is more parsimonious to have been moved. Bars are only shown for tag values greater than remWgt
.
NM1
)A CAR is identified as rearranged by a class I nonsyntenic move if it was aligned to more than one focal genome segment in a way that is incompatible with a potentially incomplete focal genome assembly. Tests are performed by considering all focal genome segments jointly. The algorithm takes into account that for the focal genome, not all scaffolds may have been placed within a chromosome. Likewise, CARs are only assumed to be contiguous sets of genetic markers, but do not need to represent full (ancestral) chromosomes. For example, CAR 4 on scaffolds II and III in Figure \ref{fig:toy24align}C is not identified as NM1
, as the CAR can join the ends of the two scaffolds. By contrast, the fragments of CAR 1 were aligned to scaffolds I and II in a way that is incompatible with a potentially incomplete focal genome assembly (i.e., the smaller fragment is located in the middle of scaffold II), indicating a NM1
. (Note that for the case where a large fragment of a CAR is located in the middle of a scaffold, and a small fragment of the same CAR fully spans a small scaffold, a NM1
will not be assigned, as the small fragment might in fact be placed in a gap between contigs on the large genome segment.)
\medskip
## Reconstruct the alignment of TOY24_compgenome to ## TOY24_focalgenome TOY24_aligned <- merge(TOY24_focalgenome, TOY24_compgenome, by = "marker", sort = FALSE)
\medskip
## Show alignment of CARs in TOY24_compgenome to ## scaffolds in TOY24_focalgenome TOY24_aligned[,c(1,2,7)]
\medskip
## Identify rearrangements in TOY24_focalgenome ## relative to TOY24_compgenome SYNT <- computeRearrs(TOY24_focalgenome, TOY24_compgenome, doubled = TRUE)
\medskip
## NM1 identified in TOY24_focalgenome relative to ## TOY24_compgenome SYNT$NM1
This shows that marker 3
was identified as NM1
(i.e., it has a tag value of 1
), as the marker is part of CAR 1 that was aligned to both scaffolds I and II (see Figure \ref{fig:toy24align}C). As the smaller fragment of CAR 1 is located in the middle of scaffold II, it cannot join with the other fragment of CAR 1 on scaffold I and thus must be involved in a rearrangement. By contrast, markers part of CAR 4 are not identified as NM1
(i.e., tag values are 0
), as the CAR can join the ends of scaffolds II and III.
NM2
)A CAR is identified as rearranged by a class II nonsyntenic move if its contiguity within a focal genome segment is disrupted when aligned to the focal genome. Tests are performed separately for each focal genome segment. Again, CARs are only assumed to be contiguous sets of genetic markers, but do not need to represent full (ancestral) chromosomes. For example, CAR 3 and CAR 4 on scaffold II in Figure \ref{fig:toy24align}C are not identified as NM2
, as the CARs may be part of a larger contiguous block that was split into smaller CARs for technical reasons (i.e., analogous to small contigs that could not be joined to larger contigs in a genome assembly). By contrast, the contiguity of CAR 2 on scaffold II is disruped by a fragment of CAR 1, indicating a NM2
.
\medskip
## NM2 identified in TOY24_focalgenome relative to ## TOY24_compgenome SYNT$NM2
This shows that marker 3
was identified as NM2
(i.e., it has a tag value of 1
), as it is part of CAR 1 that disrupts the contiguity of CAR 2 on scaffold II (see Figure \ref{fig:toy24align}C).
SM
) or inversions (IV
)A CAR is identified as rearranged by a syntenic move or an inversion if its contiguity within itself is disrupted when aligned to a focal genome segment. Tests are performed separately for each focal genome segment. Starting at the hierarchically highest and then descending to the lowest nodes, tests are performed for each node separately and differ between P-nodes and Q-nodes. For example, in Figure \ref{fig:toy24align}D, only CAR 1 with a single node was aligned to scaffold I, thus only one test needs to be performed.
\medskip
## Print hierarchically highest node elements for CAR 1 on ## scaffold I TOY24_aligned$elem1[TOY24_aligned$scaff == "1" & TOY24_aligned$car == 1]
This shows the order of elements in column $elem1 is not consistent with a Q-node, indicating rearrangements (see Figure \ref{fig:toy24align}A, B, and D).
\medskip
## Syntenic moves identified on scaffold I SYNT$SM[TOY24_aligned$scaff == "1", , drop = FALSE]
This shows that one SM
was identified that involves marker 7
(tagged with 0.95
, i.e., 1 - remWgt
) and the block of markers 2
, 6
, 5
, and 4
(tagged with 0.05
, i.e., remWgt
). The larger tag value for marker 7
indicates that this element is more parsimonious to have been rearranged (i.e., it is more parsimonious that marker 7
has moved to the left than that markers 2
, 4
, 5
, and 6
have moved to the right; see Figure \ref{fig:toy24align}A and B). The third column only contains zeros and is used for padding columns of other scaffolds.
\medskip
## Inversions identified on scaffold I SYNT$IV[TOY24_aligned$scaff == "1", , drop = FALSE]
This shows that one IV
was identified that involved markers 6
, 5
, and 4
(see Figure \ref{fig:toy24align}A and B).
For scaffold II in Figure \ref{fig:toy24align}D, both CAR 2 and the fragment of CAR 4 need to be tested for rearrangements. CAR 2 has three nodes and CAR 4 has one node aligned to scaffold II, thus at least four tests need to be performed.
\medskip
## Print hierarchically highest node elements for CAR 2 on ## scaffold II TOY24_aligned$elem1[TOY24_aligned$scaff == "2" & TOY24_aligned$car == 2]
This shows that the order of elements in column $elem1 is consistent with a Q-node (see Figure \ref{fig:toy24align}A, B, and D).
\medskip
## Print node elements from second branch of the hierarchically ## highest node (i.e., the first node at the second-highest ## hierarchy level) TOY24_aligned$elem2[TOY24_aligned$scaff == "2" & TOY24_aligned$car == 2 & TOY24_aligned$elem1 == 2]
This shows that only one marker descends from the node, thus it cannot be rearranged.
\medskip
## Print node elements from third branch of the hierarchically ## highest node (i.e., the second node at the second-highest ## hierarchy level) TOY24_aligned$elem2[TOY24_aligned$scaff == "2" & TOY24_aligned$car == 2 & TOY24_aligned$elem1 == 3]
This shows that the order of elements in column $elem2 is consistent with a Q-node, as the order can be reversed.
\medskip
## Print hierarchically highest node elements for the fragment ## of CAR 4 on scaffold II TOY24_aligned$elem1[TOY24_aligned$scaff == "2" & TOY24_aligned$car == 4]
This shows that all markers are located on the same branch and thus cannot be rearranged.
\medskip
## Print node elements from first branch of the hierarchically ## highest node (i.e., the node at the second-highest hierarchy ## level that was aligned to scaffold II) TOY24_aligned$elem2[TOY24_aligned$scaff == "2" & TOY24_aligned$car == 4 & TOY24_aligned$elem1 == 1]
This shows that the order of elements in column $elem2 is consistent with a P-node, where the order is arbitrary.
\medskip
## Rearrangements identified on scaffold II SYNT$SM[TOY24_aligned$scaff == "2", , drop = FALSE] SYNT$IV[TOY24_aligned$scaff == "2", , drop = FALSE]
This shows that no SM
and no IV
was identified. Although markers -13
, -12
, and -11
on scaffold II appear to be inverted relative to markers 11
, 12
, and 13
on CAR 2 (see Figure \ref{fig:toy24align}A and B), they are all on the same Q-node, where the order can be reversed. Similarly, markers 17
, 16
, and 15
on scaffold II appear to be rearranged relative to markers 15
, 16
, and 17
on CAR 4, but as they are on a P-node, their order is arbitrary.
For scaffold III in Figure \ref{fig:toy24align}D, only the fragment of one CAR was aligned and needs to be tested. The fragment of CAR 4 on scaffold III has three nodes, thus at least three tests need to be performed.
\medskip
## Print hierarchically highest node elements for CAR 4 on ## scaffold III TOY24_aligned$elem1[TOY24_aligned$scaff == "3" & TOY24_aligned$car == 4]
This shows that all markers are located on the same branch and thus cannot be rearranged.
\medskip
## Print node elements from second branch of the hierarchically ## highest node (i.e., the node at the second-highest hierarchy ## level that was aligned to scaffold III) TOY24_aligned$elem2[TOY24_aligned$scaff == "3" & TOY24_aligned$car == 4 & TOY24_aligned$elem1 == 2]
This shows that the order of elements in column $elem2 is not consistent with a Q-node, indicating rearrangements (see Figure \ref{fig:toy24align}A, B, and D). Specifically, the contiguity of elements of the Q-node on branch 1
is disrupted by elements of the Q-node on branch 2
. In this specific example, the last element (i.e., marker -19
in Figure \ref{fig:toy24align}B) will be identified as rearranged. As the marker is part of a node that still needs to be tested for rearrangements at the next hierarchy level, the computeRearrs()
function will by default perform tests separately for the two node fragments (i.e., for markers 21
and 20
, and for marker -19
) at the following hierarchy levels. This behavior is controlled by the splitnodes
argument in the computeRearrs()
function.
\medskip
## Create subnode IDs for the two node fragments ## (for illustration of the splitnodes = TRUE setting) TOY24_aligned$subnode <- numeric(nrow(TOY24_aligned)) TOY24_aligned$subnode[TOY24_aligned$marker == 19] <- 1
\medskip
## Print node elements from first branch of the hierarchically ## second-highest node (i.e., the first node at the third-highest ## hierarchy level) for the first subnode TOY24_aligned$elem3[TOY24_aligned$scaff == "3" & TOY24_aligned$car == 4 & TOY24_aligned$elem1 == 2 & TOY24_aligned$elem2 == 1 & TOY24_aligned$subnode == 0]
This shows that exactly two branches descend from the first subnode of the Q-node, requiring special treatment for the detection and classification of rearrangements. In this specific example, one can (i) reverse the order of elements, not requiring a rearrangement, or (ii) not reversing the order, requiring a rearrangement. However when testing the next hierarchy level for rearrangements (in this particular case the test is for single-marker inversions, see below), two inversions are required for case (i), as a reversal of the order of elements implicates a reversal of their directionality, which is not realized (i.e., markers 20
and 21
are both on the "+"
strand in either genome; see Figure \ref{fig:toy24align}A and B). For case (ii), no additional reversals are required. Thus, the order of the Q-node is not reversed, and a syntenic move is assigned.
\medskip
## Print node elements from first branch of the hierarchically ## second-highest node (i.e., the first node at the third-highest ## hierarchy level) for the second subnode TOY24_aligned$elem3[TOY24_aligned$scaff == "3" & TOY24_aligned$car == 4 & TOY24_aligned$elem1 == 2 & TOY24_aligned$elem2 == 1 & TOY24_aligned$subnode == 1]
This shows that only one marker descends from the second subnode, thus it cannot be rearranged in addition to the already identified rearrangement at the preceding hierarchy level.
\medskip
## Print node elements from second branch of the hierarchically ## second-highest node (i.e., the second node at the third-highest ## hierarchy level) TOY24_aligned$elem3[TOY24_aligned$scaff == "3" & TOY24_aligned$car == 4 & TOY24_aligned$elem1 == 2 & TOY24_aligned$elem2 == 2]
This shows that the order of elements in column $elem3 is consistent with a Q-node.
\medskip
## Syntenic moves identified on scaffold III SYNT$SM[TOY24_aligned$scaff == "3", , drop = FALSE]
This shows that two SM
were identified. The first is for marker 19
(tagged with 1
). The second involves markers 21
and 20
(both tagged with 0.5
). Having half-weight tags for either marker indicates that it cannot be distinguished which of the two elements is more parsimonious to have been rearranged (i.e., whether it is more parsimonious that marker 21
has moved to the left or marker 20
has moved to the right; see Figure \ref{fig:toy24align}A and B).
\medskip
## Inversions identified on scaffold III SYNT$IV[TOY24_aligned$scaff == "3", , drop = FALSE]
This shows that one IV
was identified that involved marker 22
(i.e., single-marker inversion; see Figure \ref{fig:toy24align}A and B). Single-marker inversions signify markers whose directionality deviates from their expected directionality, given the final alignment direction of all Q-nodes traversed from the root the marker, and the number of (nested) inversions that were assigned to the marker. All traversed Q-nodes for marker 22
were aligned in ascending direction and no inversion was assigned, thus the strand of marker 22
is expected to be identical in both genomes (i.e., both strands are expected to be "+"
or both to be "-"
). As this is not the case, a single-marker inversion was assigned.
\medskip
## Visualize the identified rearrangements with the ## genomeImagePlot() function genomeImagePlot(SYNT, TOY24_focalgenome, c("1", "2", "3"), main = "TOY24", yaxlab = c("I", "II", "III"))
ordfocal<-c("1", "2", "3") yaxlab<-c("I", "II", "III") mymar<-c(3,0.6+max(nchar(yaxlab))*0.41,2,1) vspace<-0 ## compute fig.height and fig.width for specs above myheight<-(length(yaxlab)+(vspace/5)*(length(yaxlab)-1))*0.7+ sum(mymar[c(1,3)])*0.2 ## 3.1 mywidth<-7+sum(mymar[c(2,4)])*0.2 ## 7.566 op <- options(device = "pdf") genomeImagePlot(SYNT, TOY24_focalgenome, ordfocal = ordfocal, yaxlab = yaxlab, main = "TOY24", mar = mymar, vspace = vspace, newdev = FALSE) options(op)
Figure \ref{fig:image-plot-TOY24}: The three focal genome segments are shown as separate panels. The x-axis gives the position of markers on their genome segments. Markers are indicated with short black ticks in the bottom row of each panel. Rearrangement breakpoints are indicated with red ticks and are drawn at the midpoints between the markers adjacent to the breakpoints. Markers part of different classes of rearrangements are highlighted in the upper four rows of each panel by colored ticks. From top to bottom: NM1
(green), NM2
(blue), SM
(purple), and IV
(maroon). Ticks are only shown for tag values greater than remWgt
.
\newpage
\medskip
## Visualize the identified rearrangements with the ## genomeRearrPlot() function ## compute blocks BLOCKS <- summarizeBlocks(SYNT, TOY24_focalgenome, TOY24_compgenome, c("1","2","3")) ## specify colors carcol <- c("#006DDB","#920000","#004949","#490092") textcol <- rep("white", length(carcol)) ## make plot genomeRearrPlot(BLOCKS, TOY24_compgenome, c("1","2","3"), main = "TOY24", yaxlab = c("I", "II", "III"), blockwidth = 1.15, y0pad = 3, carColors = carcol, carTextColors = textcol, sortColsBySize = FALSE)
BLOCKS <- summarizeBlocks(SYNT, TOY24_focalgenome, TOY24_compgenome, c("1","2","3")) carcol <- c("#006DDB","#920000","#004949","#490092") textcol <- rep("white", length(carcol)) ordfocal<-c("1", "2", "3") yaxlab<-c("I", "II", "III") plotelem<-c(1,1,1,1,1) space<-data.frame(nmark=0.65, carid=1, ndori=0.4*plotelem[1], blkid=1*plotelem[2], blkori=0.4*plotelem[3], elem=0.65*plotelem[4], stat=0.75*plotelem[5], rearr=0.25*plotelem[5], scaff=2.25) blockwidth<-1.15 cex.axis<-1.2 cex.main<-2 mymar<-c(0.4,0.6+max(nchar(yaxlab))*0.42*cex.axis,2.2*cex.main,0.4) remThld<-0.05 pad<-0+0.7 y0pad<-3 ## simplify BLOCKS BLOCKS<-rearrvisr:::simplifyBlockTags(BLOCKS) ## compute fig.height and fig.width for specs above mylim<-rearrvisr:::setRearrPlot(BLOCKS,ordfocal,space,blockwidth, mymar,remThld,pad,y0pad) ## mylim$height ## ## 7.52 ## mylim$width ## ## 4.6524 op <- options(device = "pdf") genomeRearrPlot(BLOCKS, TOY24_compgenome, ordfocal = ordfocal, main = "TOY24", yaxlab = yaxlab, remThld = remThld, carColors = carcol, carTextColors = textcol, sortColsBySize = FALSE, mar = mymar, pad = 0, y0pad = y0pad, plotelem = plotelem, simplifyTags = TRUE, blockwidth = blockwidth, cex.main = cex.main, cex.axis = cex.axis, newdev = FALSE) options(op)
Figure \ref{fig:rearr-plot-TOY24}: The three focal genome segments are shown as separate panels. Each synteny block is represented by a column along the x-axis. The top part of each panel shows information on the structure of each PQ-tree and its alignment to the focal genome, and the bottom part highlights blocks part of different classes of rearrangements by colored bars. Green, NM1
. Blue, NM2
. Purple, SM
. Maroon, IV
(with a horizontal bar for the multi-marker inversion, and a short vertical bar for the single-marker inversion). Lighter coloration for purple bars for scaffold III indicate half-weight tag values. Bars are only shown for tag values greater than remWgt
. See ?genomeRearrPlot
for further details on the top part of the panels.
\clearpage
Peptide sequences and annotation information (pep.all.fa
and gff3
files) for genes from 12 Drosophila species were downloaded on Dec 23 2017 from Ensemble Release 91 (http://dec2017.archive.ensembl.org; D. melanogaster) or Ensemble Metazoa Release 37 (http://oct2017-metazoa.ensembl.org; D. ananassae, D. erecta, D. grimshawi, D. mojavensis, D. persimilis, D. pseudoobscura, D. sechellia, D. simulans, D. virilis, D. willistoni, and D. yakuba). Sequences that were shorter than 50 amino acids or contained premature stop codons were excluded. 20,803 orthologous groups were identified with OMA standalone v2.2.0 [@altenhoff15], using as guidance tree the phylogeny published in Drosophila 12 Genomes Consortium (2007), and default settings otherwise. A single representative splicing variant per gene was identified by OMA, and alternative splicing variants were excluded from subsequent analyses. The retained number of genes per species were 15,052, 14,998, 14,969, 13,818, 14,581, 16,856, 15,845, 16,409, 15,355, 14,477, 15,490, and 16,039 for D. ananassae, D. erecta, D. grimshawi, D. melanogaster, D. mojavensis, D. persimilis, D. pseudoobscura, D. sechellia, D. simulans, D. virilis, D. willistoni, and D. yakuba, respectively.
Sequences of 4,792 OMA orthologous groups that only included one-to-one orthologous genes present in all 12 species were extracted and individually aligned with MAFFT v7.407 [@katoh02; @katoh13], using the iterative refinement method incorporating local pairwise alignment information (--localpair --maxiterate 1000
settings). Alignments with not more than 20% missing data (4,308 orthologous groups) were concatenated, and a phylogenetic tree was computed with RAxML v8.2.12 [@stamatakis14]. The best protein substitution model (JTT with empirical base frequencies) was determined by the program (-f a -# autoMRE -m PROTGAMMAAUTO --auto-prot=ml
settings). The resulting tree was rooted at the branch that best balanced the subtree lengths.
Genome maps were prepared for all retained genes (i.e., excluding low quality sequences and non-representative splicing variants) based on gene position information extracted from the gff3
files. Gene start and end positions were calculated as the average of CDS midpoints -/+ 1 base pair to avoid the occurrence of overlapping gene positions, which are not supported by the genome reconstruction software ANGES v1.01 [@jones12]. A few remaining overlaps between gene positions were resolved manually. The ANGES marker
input file was generated using the genome maps of all 12 Drosophila species and 20,803 OMA orthologous groups (from the OMA OrthologousGroups.txt
file) with the R script oma2anges.R
(available in the exec
directory of the rearrvisr
package). The ANGES tree
input file was based on the best rooted RAxML tree computed above (including branch lengths). The ancestral genome of the melanogaster subgroup 'MSSYE' (Drosophila 12 Genomes Consortium, 2007; i.e., separating D. melanogaster, D. simulans, D. sechellia, D. yakuba, and D. erecta from the remainder of the Drosophila species) was reconstructed using the ANGES master pipeline (anges_CAR.py
) and options markers_doubled 1
(infer ancestral marker orientation), markers_unique 2
(no duplicated markers), markers_universal 1
(no missing markers in ingroup), c1p_telomeres 0
(no telomeres), and c1p_heuristic 1
(using a greedy heuristic).
A total of 27,242 Ancestral Contiguous Sets [ACS; @jones12] were identified by ANGES, of which 26,594 ACS were organized into 20 CARs (648, or 2.4%, of ACS were discarded by the program). These CARs comprised a total of 8,973 ancestral markers, and 99.2% of them were grouped into four major CARs (i.e., the 20 CARs included 4,250, 1,914, 1,711, 1,026, 28, 23, 5, 3, 2, and 11 x 1 ancestral markers).
\clearpage \newpage
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.