knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.height = 6, fig.width = 6 )
syntR is an R package for identifying synteny blocks shared between two genetic maps. The core algorithm implemented in the package focuses on identifying synteny blocks via comparison of marker orders (e.g. from genetic maps of different species or populations). syntR identifies synteny blocks using a clustering algorithm tuned to the linear nature of genetic map data and generates an annotated list synteny block membership for each marker. syntR does not identify segmental duplications.
syntR requires marker positions (chromosome and map position for each marker) from two genetic maps. The marker orders in the two maps need to be separately inferred for each map prior to using syntR. If you need to build you maps prior to using syntR, we reccomend looking into R/qtl, ASMap, and Lep-MAP.
The input file has six columns. Each row represents the map position of a single marker (e.g. a SNP, a microsatillite, etc.) in two different maps.
library("syntR") data(ann_pet_map) knitr::kable(head(ann_pet_map), caption = "The example input file")
The columns contents are as follows:
Below, we work through an example using genetic map data from two sunflower species. Note that some of the details and tuning parameters will require tailoring to your specific maps.
We begin by loading the example data:
# install the package if needed! # devtools::install_github("ksamuk/syntR") # load the syntR package library("syntR") # load the example marker data data(ann_pet_map) as_tibble(ann_pet_map) # read in list of chromosome lengths for map1 # this is an optional file that defines the maximum # length of each chromosome (in cM) if this is known # NB: this is useful when the markers being compared # do not cover the entire map data(ann_chr_lengths) ann_chr_lengths # make_one_map places all the markers on a single scale # and adds padding between the chromosomes # to aid the algorithm and for visualization purposes map_list <- make_one_map(ann_pet_map, map1_max_chr_lengths = ann_chr_lengths)
make_one_map
outputs a list object containing three elements:
- [[1]]: the marker data file with 2 additional ordering columns
- [[2]] and [[3]]: the locations of the chromosome breaks between map 1 and map 2
We can use plot_maps()
to create a basic dot plot of our marker data:
# plot map plot_maps(map_df = map_list[[1]], map1_chrom_breaks = map_list[[2]], map2_chrom_breaks = map_list[[3]])
The chromosomes are annotated with their species-specific names, an the chromosome breaks are indicated by vertical and horizontal lines.
Notice that the maps look rather scattered. This is because the chromosomes are ordered arbitrarily (there is no inherent order to chromsomes). Top aid readability, we can reorder the order of chromosomes of species 2 to match the order in species 1 (plotted along the x-axis). We also rotate any "wholly-inverted" chromosomes so that they line up (at the chromosome scale) along the 1:1 line.
# reorder and flip some chromosomes map2_chr_order <- c("Pet8", "Pet9", "Pet12_16", "Pet15", "Pet16_17", "Pet17") flip_chrs <- c("Pet9") map_list <- make_one_map(ann_pet_map, map1_max_chr_lengths = ann_chr_lengths, map2_chr_order = map2_chr_order, flip_chrs = flip_chrs) plot_maps(map_df = map_list[[1]], map1_chrom_breaks = map_list[[2]], map2_chrom_breaks = map_list[[3]])
This results in a more sane-looking dot plot. However, there are clearly some translocations (e.g. Pet12_16 x Ann16) that do not conform to the 1:1 line.
The key tuning parameters of the syntR algorirthm are max_cluster_range
and, to a lesser extent, max_nn_dist
.
max_cluster_range
determines the maximum size of clusters of markers that will be collapsed to a singe point for further analysis. It defines the maximum distance that clustered markers can span. Lager values of max_cluster_range
will result in more markers being collapsed into a single point.
max_nn_dist
determines which markers are considered outliers and removed from further analysis. Markers that do not have a neighbour within max_nn_dist
are considered outliers. Larger values of max_nn_dist
will result in fewer markers being removed.
A simple method of finding the optimal combination of these two parameters is to fit a range of values for both and choose values that maximize coverage (% of the map assigned to synteny blocks) of the two maps and minimizes the number of outliers ("composite" in the plot below is a scaled sum of these metrics).
We provide an internal method for achieving this below. Note it is slightly computationally intensive and may take several minutes to complete.
If you'd like to skip this step, the default values of max_cluster_range = 2
and max_nn_dist = 10
are likely well suited to most genetics maps.
# find best parameter combination # run find_synteny_blocks with each parameter combination and collect summary statistics parameter_data <- test_parameters(map_list, max_cluster_range_list = seq(1, 5, by = 0.5), max_nn_dist_list = seq(10, 50, by = 10)) plot_summary_stats(parameter_data[[1]], "composite")
We choose the first (from left to right) composite coverage “peak” in the parameter matrix (see Ostevik et al. for details about this choice).
So, in this case, we select a max_cluster_range
of 2 and a max_nn_dsit
of 10.
With the two tuning parameters determined, we can run the primary function of syntR: find_synteny_blocks()
.
This function runs a set of marker data (contained in the map_list
object we created previously) through the syntR error-detection and synteny block identification algorithm described in the Ostevik et al. publication.
It can be run as follows:
# find synteny blocks synt_blocks <- find_synteny_blocks(map_list, max_cluster_range = 2, max_nn_dist = 10, plots = TRUE)
When plots=TRUE
, the function generates three plots showing the progress of the algorithm. In the first plot, mapping error is smoothed via a pre-clustering step. Next, outliers are flagged and removed. Finally, synteny blocks are identified via the syntR friends-of-friends clustering algorithm.
find_synteny_blocks()
outputs a list object with containing five data frames:
# print the contents of synt_blocks lapply(synt_blocks, as_tibble)
These elements are:
$marker_df
A data frame including the original list of markers along with their synteny block assignments.
$synteny_blocks_df
A data frame summarizing the span of each synteny block in each map (in map units) along with their inferred orientation (colinear, inverted, etc.) for synteny blocks with sufficient evidence of directionality.
$map1_breaks
A data frame listing the breakpoints between synteny blocks in map1.
$map2_breaks
A data frame listing the breakpoints between synteny blocks in map2.
$summary_stats
A data frame summarizing the results of the algorithm (coverage etc.)
We can plot the blocks with their orientation as below:
# plot synteny block orientations synt_blocks[[1]] %>% plot_maps(map_list[[2]], map_list[[3]], col = c("blue", "grey", "red", "black")[as.numeric(as.factor(.$orientation))], main = "Synteny block orientation", cex_val = 0.75)
In this case, blue = inverted, red = colinear and grey/black = unclassified.
If you have any questions about syntR, please email one of the authors or post an issue on www.github.com/ksamuk/syntR.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.