knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) library(snpStats) library(HaplotypeMiner) library(ggplot2)
HaplotypeMiner has been developed to allow defining haplotypes of genes of interest based on a gene-centered approach. This vignette explains the approach implemented by the package and provides an example analysis conducted on soybean (Glycine max), from parameter specification and data input to the analysis of the output of the package.
The haplotype definition approach of
HaplotypeMiner is gene-centered, that is, it defines haplotypes that represent putative alleles of genes of interest based on markers located on each side of the central position of the gene.
HaplotypeMiner essentially looks at a genomic interval surrounding a gene of interest and selects SNPs that are likely to be useful in defining haplotypes. This selection of SNPs is done in three steps (filtering, clustering and final selection) that will be described below.
The two basic inputs for
HaplotypeMiner are the position of a gene of interest and a file indicating the genotypes of a set of individuals at different SNP markers (Hapmap and VCF formats are currently supported). Providing population structure and/or kinship is optional, but recommended since these generally improve the accuracy of linkage disequilibrium estimates and thus the quality of the analysis.
The computation carried out by
HaplotypeMiner also depends on an extensive set of parameters that greatly influence the results of the analysis. These parameters will be described below along with the example code.
In the first step of the analysis,
HaplotypeMiner optionally filters SNPs based on various criteria. Most importantly, only SNPs located on the chromosome on which the gene is located and at a maximal (physical) distance from the central position of the gene are kept for further processing. The maximal distance to consider will depend on the pattern of linkage disequilibrium (LD) in the region where the gene is found, but distances of a few hundred kb up to 1 Mb will likely be a good first guess, at least in soybean.
HaplotypeMiner also only considers biallelic markers at the moment, and will filter out any multiallelic marker.
Filtering steps are optional since they can be performed externally, prior to analysing the data with
HaplotypeMiner. Including these filtering functionalities in the package ensures that users who want to experiment with different filtering parameters can do so iteratively without having to go back and forth between
R and other software. At the moment, SNPs can be filtered based on minor allele frequency, proportion of heterozygosity, proportion of missing data and minor allele count (see
?genotype_filter for a full description of filtering parameters).
After the initial filtering step,
HaplotypeMiner computes the linkage disequilibrium (using a function from package
LDcorSV) between all the pairs of markers that passed the filtering step. A clustering is then performed to reduce the set of markers to a set of informative markers. On each side of the gene, all marker pairs with a linkage disequilibrium coefficient higher than a given threshold are clustered and only one marker per cluster is kept, with the assumption that all markers in a cluster provide the same information.
After the clustering step, a final selection step is applied to keep only those markers that are considered informative in defining haplotypes in the region of interest. Two markers will be considered informative if they are located on different sides (5' vs 3') of the gene center and are in linkage disequilibrium with each other. This is based on the assumption that markers that are in linkage disequilibrium across the gene center are likely to also be in linkage disequilibrium with potentially functionally relevant genetic polymorphisms located inside the gene sequence.
As an example, we will use the package
HaplotypeMiner to define haplotypes of the GmGia maturity gene of soybean. This gene is located on the chromosome 10 in the soybean reference genome.
The genotyping data used for this example are simulated SNP array data for 67 soybean lines. This simulated dataset has been generated by extracting the genotypes at marker positions on the SoySNP50K array (Song et al. 2013) from previously published whole-genome resequencing data of a population of Canadian short-season soybean lines (Torkamaneh et al. 2018). In order to avoid uselessly large datasets, only SNPs located on chromosomes 6, 10, 19 and 20 have been included in the example dataset
SNP_data.hmp.txt distributed with the package.
The analysis parameters are built through a helper function called
haplo_params. This step separates the specification of the parameters from the actual computation of the haplotypes, and checks the validity of the input (to some extent). The parameters used in this analysis are listed and explained here. The description of other parameters can be found in the documentation of the function (
?haplo_params). The following function call builds these parameters:
# Storing the parameters of the analysis in an object params <- haplo_params( input_file = system.file("extdata", "SNP_data.hmp.txt", package = "HaplotypeMiner"), structure_file = system.file("extdata", "structure.txt", package = "HaplotypeMiner"), kinship_file = system.file("extdata", "kinship.txt", package = "HaplotypeMiner"), gene_db_file = system.file("extdata", "gene_db.txt", package = "HaplotypeMiner"), chr_db_file = system.file("extdata", "gmax_chr_sizes.txt", package = "HaplotypeMiner"), gene_name = "GmGia", R2_measure = "r2vs", cluster_R2 = "r2vs", max_missing_threshold = 0.6, max_het_threshold = 0.05, min_alt_threshold = 0.05, min_allele_count = 4, cluster_threshold = 0.8, max_marker_to_gene_distance = 250000, max_flanking_pair_distance = 250000, marker_independence_threshold = 0.5)
The parameters as listed above instruct
HaplotypeMiner to carry the analysis as follows:
input_file: name of the .hapmap or .vcf genotype file. Here, it is the example dataset
structure_file: name of the structure file (optional). Here, it is the example data
HaplotypeMiner. See this example structure file for formatting requirements.
kinship_file: name of the kinship file (optional). Here, it is the example data
HaplotypeMiner. See this example kinship file for formatting requirements.
gene_db_file: name of a file providing information about genes of interest. It is a 4-column table providing the names of the genes, the chromosome on which they are located, and their genomic start and end positions. A single such file can contain multiple genes, even though each analysis processes only one gene. The example version (
gene_db.txt) provided here contains information on four soybean maturity genes.
chr_db_file: name of the file providing information about the size of the chromosomes in the species of interest. The information in this file is only used for plotting purposes and the requirement for this file may eventually be removed. The example file (
gmax_chr_sizes.txt) provided here lists the size of the 20 chromosomes of the soybean reference genome.
gene_name: name of the gene for which haplotypes are to be defined. It has to match the name of a gene listed in
gene_db_file. Here, we are carrying the analysis for the gene
R2_measure: $R^2$ linkage disequilibrium measure to use for the final selection step. Here,
r2vsmeans $R^2$ corrected for structure and kinship.
cluster_R2: $R^2$ linkage disequilibrium measure to use for the clustering step. Defaults to the same measure as
R2_measure, but here we have been explicit for purposes of the example.
max_missing_threshold: the maximum proportion of missing genotype calls allowed for a marker (here 0.6).
max_het_threshold: the maximum proportion of heterozygous genotype calls allowed for a marker (here 0.05).
min_alt_threshold: the minimum frequency of the minor allele for a marker to be kept (here 0.05).
min_allele_count: the minimum number of times the minor allele has to be observed for a marker to be kept (here 4).
cluster_threshold: the minimum $R^2$ for two markers to be clustered together during the clustering step (here 0.8).
max_flanking_pair_distance: the maximum distance (in bp) that can separate two markers in linkage disequilibrium at the final selection step (here 250 kb).
max_marker_to_gene_distance: the maximum distance (in bp) from a marker to the center of the gene of interest (here 250 kb).
marker_independence_threshold: the minimum $R^2$ for two markers to be considered in linkage disequilibrium at the final selection step (here 0.5).
haplo_selection performs the actual computation that leads to the definition of haplotypes. The function will optionally print some information to the screen while it is running. For this example, you can expect it to run under 10 seconds, but it can take considerably longer on larger datasets. Measures will be explored explored to improve the performance of the package, but until then, it is recommended to pre-filter large SNP datasets with other tools prior to analyzing them with
HaplotypeMiner, as the package currently loads the full dataset in memory.
gmgia_haplotypes <- haplo_selection(params, verbose = TRUE)
haplo_selection calls various other functions for performing the different steps of the analysis. The single function call to
haplo_selection should satisfy the needs of most users, but for those interested in more control and troubleshooting, here is the list of functions called by
haplo_selection to do most of the work.
genotype_filterfilters the SNP dataset to keep only markers that are located in the vicinity of the gene and meet user-defined criteria.
compute_ldcomputes the linkage disequilibrium between all marker pairs at two different timesteps: marker clustering and final selection of marker pairs. If the same $R^2$ measure is used for both steps, then linkage disequilibrium is computed only once.
cluster_selectionselects pairs of markers in significant LD across the gene center. These will be used for defining haplotypes.
haplotypestakes the set of selected markers and uses them for defining possible haplotypes. Every unique combination of alleles at these positions is considered a different haplotype. This function also performs the assignment of individuals to haplotypes by checking each individual for compatibility with all haplotypes. Some individuals may be compatible with more than one haplotype if data is missing for some markers; these are not assigned to any haplotype.
You can consult the documentation of each of these functions and read the source code of
haplo_selection to see how they are implemented.
The output of
haplo_selection is a list that contains the following lists or data frames:
str(gmgia_haplotypes, max.level = 1)
As can be seen, the output keeps all the information accumulated throughout the haplotyping process. This is not very memory efficient, but should not be a problem for most datasets and has the advantage of packing all the results of the analysis in a single object. However, extracting the information from this object is not straightforward, which is why helper functions were created to generate informative output graphs and files that allow interpretation and troubleshooting with minimal knowledge of the underlying implementation.
haplo_output is the only function that most users should need for generating output. It is a wrapper around many different graphical and textual ouput functions, and provides a simple interface to select what ouput will be generated and saved to disk (see
?haplo_output for more details). For example, the object
graph_list generated below defines a list of graphs to be generated upon applying the function
haplo_output to the
gmgia_haplotypes object. The results will be output to a directory called
gmgia. Four types of graphs (
genotypes) can be generated for each step of the marker selection process, but not all these combinations are useful or make sense. We shall have a closer look at some of these graphs below.
# These commands will not run when building the vignette to avoid the side # effect of writing files to disk. However, you are encouraged to run # them on your computer and have a look at the results. graph_list <- list("All_markers" = "density", "Filtered_markers" = c("matrix", "distance", "genotypes"), "Clustered_markers" = c("matrix", "genotypes"), "Selected_clusters" = c("matrix", "genotypes"), "Selected_markers" = c("matrix", "genotypes"), "Haplotypes" = c("genotypes")) haplo_output(gmgia_haplotypes, output_dir = "gmgia", graphs = graph_list)
HaplotypeMiner does not allow overwriting a directory that already exists. If you are to perform the analysis several times, you will therefore need to either delete the results directory after each run or change the name of the output directory.
Running the commands above should have created a directory called "gmgia" containing all the output of
haplo_output. We will have a look at some of these results below. Notice how the functions that actually perform the task of generating the output (
distance_plot) are directly called as part of the vignette code. Most users should not need these functions, but they are exported by the package and can thus be used on their own if required. Graphical functions are implemented with
ggplot2; users who are familiar with this package can therefore add layers or control theme elements if they wish to do so.
The first file to look at before analyzing the results is the log file. This file gives the user information about the analysis that was performed, the number of distinct haplotypes that were identified, as well as the number of individuals assigned to each of the haplotypes. By default, results are saved to a file called
Log.txt, but the name can be changed and the output can also be sent to the console by setting
to_file = FALSE.
haplo_logfile(gmgia_haplotypes, to_file = FALSE)
So-called "genotype plots" allow visualizing the genotypes of the different individuals at the markers that have been kept at a given point of the analysis. For example, the following graph shows the genotypes of the 67 individuals for all the markers that have been kept following the initial filtering step. In this graph and in all following graphs, the double asterisks denote markers that have been kept to define haplotypes, while the vertical red bar indicates the position of the center of the gene. The letters on the right of the graph indicate the haplotype to which each individual has been assigned.
genotype_plot(snp_data = gmgia_haplotypes$Filtered_markers, gene_pos = gmgia_haplotypes$Parameters$Gene_center, kept_markers = gmgia_haplotypes$Haplotypes$Markers, assignment = gmgia_haplotypes$Haplotypes$Assignment, name_order = FALSE) + theme(axis.text = element_text(size = 0.1))
The "genotype plot" of the haplotypes also enables looking at the set of haplotypes defined at the end of the analysis:
genotype_plot(snp_data = gmgia_haplotypes$Haplotypes, gene_pos = gmgia_haplotypes$Parameters$Gene_center, kept_markers = gmgia_haplotypes$Haplotypes$Markers, assignment = NULL, name_order = TRUE)
In this case, haplotypes A and B were found to correspond to the e2-ns allele of the gene GmGia, whereas haplotypes C and D where found to correspond to the E2-in allele. There was therefore not a one-to-one correspondence between haplotypes and functionally relevant alleles, but
HaplotypeMiner was still able to classify individual in groups in a useful way.
The density plot is used to look at the density of markers along the chromosome. It is especially useful to look at the density of markers of the initial set of markers (prior to filtering), as it might indicate a lack of markers surrounding the position of the gene of interest. On the following figure, the red line is located at the center of the GmGia gene and suggests that this gene is located in a region with high marker density.
density_plot(snp_data = gmgia_haplotypes$All_markers, center_pos = gmgia_haplotypes$Parameters$Gene_center, chr_length = gmgia_haplotypes$Parameters$Chromosome_length)
These graphs are useful for diagnosing problems in the analysis and understanding the steps implemented in the package. It shows the patterns of linkage disequilibrium (LD) between different markers at different steps of the analysis. The following graph shows the pattern of LD between all markers after filtering, but before clustering. It shows obvious LD blocks:
ld_plot(snp_data = gmgia_haplotypes$Filtered_markers, center_pos = gmgia_haplotypes$Parameters$Gene_center, kept_markers = gmgia_haplotypes$Haplotypes$Markers) + theme(axis.text = element_text(size = 5))
The same type of graph shows that the number of markers has been reduced following clustering, removing redundant information:
ld_plot(snp_data = gmgia_haplotypes$Clustered_markers, center_pos = gmgia_haplotypes$Parameters$Gene_center, kept_markers = gmgia_haplotypes$Haplotypes$Markers)
The last type of graph that can be generated by
HaplotypeMiner is a graph of the relationship between linkage disequilibrium and physical distance for a set of markers. These graphs can be useful to identify the patterns of LD in the region surrounding the gene of interest, and can thus help in choosing distance and LD thresholds for the analysis. Here is a graph showing these data for the set of markers following filtering in this analysis. In this graph, the red line shows the $R^2$ threshold that has been used in the analysis for identifying pairs of markers.
distance_plot(snp_data = gmgia_haplotypes$Filtered_markers, center_pos = gmgia_haplotypes$Parameters$Gene_center, r2_threshold = gmgia_haplotypes$Parameters$Marker_independence_threshold)
The genotypes of the individuals for sets of markers generated throughout the analysis by
HaplotypeMiner can also be output as Hapmap or VCF files. By default, three Hapmap files are output: one file with the genotypes for the final set of clustered markers, one with the genotypes for the set of markers represented by these clusters, and a file with the set of haplotypes that were generated and their "genotypes" at different positions.
HaplotypeMiner is intended as a tool that is analogous to what one might do by analzying a set of markers visually, but with added reproducibility and robustness.
HaplotypeMiner should not be used as a black box to which genotype datasets are fed and output is blindly accepted; rather, arriving to meaningful results will require a good knowledge of the dataset at hand and will likely imply testing different sets of parameters and looking at the different kinds of graphs generated above.
HaplotypeMiner has so far only been tested with soybean. We do not know to what extent the model will hold for polyploid species, mainly outcrossing species, or species in which linkage disequilibrium decays more rapidly with physical distance than in soybean. If you test
HaplotypeMiner with a different species, we would be happy to know about the results and provide some advice if needed.
Song Q, DL Hyten, G Jia, CV Quigley, EW Fickus, RL Nelson, and PB Cregan. 2013. Development and evaluation of SoySNP50K, a high-density genotyping array for soybean. PloS One 8:e54985.
Torkamaneh D, J Laroche, A Tardivel, L O'Donoughue, E Cober, I Rajcan, and F Belzile. 2018. Comprehensive description of genomewide nucleotide and structural variation in short-season soya bean. Plant Biotechnology Journal 16:749-759.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.