In tpbilton/GUSMap: Genotyping Uncertainty with Sequencing Data and Linkage Mapping (GUSMap)

knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval=TRUE)
library(GUSMap)

In this tutorial, we will describe how to use GUSMap to perform linkage mapping and explain some of the functionality available. For this tutorial, we focus on linkage analysis for an F2 or intercross population as implemented in the IC class in GUSMap.

For this tutorial, we will use a simulated data set of a full-sib family (F1 population) consisting of 100 offspring, 3 chromosomes and 1000 SNPs to illustrate GUSMap. Although it is more appropriate to analyze this data set using the FS class, we use the data set to illustrate IC class. The file location to the VCF file (and associated pedigree file) for the data set can be extracted using the simDS function.

vcffile <- simDS() # extract filename
vcffile            # filename stored in object vcffile

Some important up front comments regarding GUSMap:

Having high read depth on the parents (e.g., sequenced more than once) can be hugely helpful to correctly infer the segregation type of more SNPs. However, for the progeny, it is of more value (and cost-efficient) to have more individuals in the family at lower depth than sequence fewer at higher depth.
GUSMap is designed for low coverage sequencing data which means that there is no need to pre-filter the data in terms of read depth. Furthermore, GUSMap includes standard filtering steps in terms of minor allele frequency (MAF), missing data and a segregation test (including a few other filters) which means one only needs to input a raw Variant Call Format (VCF) file obtained after the SNP calling processes into GUSMap.

Loading Data

At present, GUSMap can read in data stored in a Variant Call Format (VCF) file, provided that there is some form of allelic depth information (e.g., the number of reads for the reference and alternate alleles). This is done using the VCFtoRA function which takes in the name of a VCF file and converts it into what we call a reference/alternate (RA) file.

# convert VCF file to an RA file
rafile <- VCFtoRA(infilename = vcffile$vcf, direct = "./")

The arguments of the VCFtoRA function are:

infilename: The name of the VCF file.
direct: Specifies which directory to write the RA file to (relative to the current working directory).

Currently, VCFtoRA requires one of the fields AD, AO and RO, or DP4 to be present in the VCF file to extract allelic depth information. One other thing to note is that VCFtoRA returns the file location of the created RA file.

rafile # file path of RA file

The RA file created is a tab-delimited file with columns, CHROM (the chromosome name taken from the "#CHROM" column in the VCF file), POS (the position of the SNP taken from the "POS" column in the VCF file), and SAMPLES which consists of the sampleIDs used in the VCF (columns 10 and above in the VCF file). For example, the first five columns and first four rows of the RA file for the simulated data set are:

|CHROM |POS |P1_1 |P2_1 |P1_2 | |:------|:-------|:-----|:----|:----| |1 |866499 |25,0 |0,2 |9,0 | |1 |2070050 |8,0 |4,9 |9,0 | |1 |2232468 |1,0 |0,1 |2,0 | |1 |3459263 |12,13 |0,0 |0,1 |

The entry in the first row and the third column is "25,0" which means that there are 25 reads for the reference allele and no reads for the alternate allele at the SNP on chromosome 1 with position 866499 in sample P1_1. In addition, the fourth row of the third column is "12,13" which menas that there are 12 reads for the reference allele and 13 reads for the alternate allele at the SNP on chromosome 1 with position 3459263 in sample P1_1.

An RA file can then be loaded into R using the readRA function.

RAdata <- readRA(rafile = rafile, sampthres = 0.01, excsamp = NULL)

The arguments of the readRA function are:

rafile: Name of the RA file to be read in.
sampthres: Specifies the minimum sample depth of an individual before it is removed.
excsamp: Specifies IDs of any samples in the RA data set to be discarded (for problematic samples to be removed). The input for excsamp must correspond to the sample IDs in the RA file.

The RA data is now stored as an RA object.

class(RAdata)

Summary information of the RA object can be displayed by printing the object.

RAdata

Linkage mapping with GUSMap

In this section, we will discuss linkage mapping using GUSMap. The main steps to the process are:

Create a F2 (intercross) populations and perform appropriate filtering
Constructing linkage groups
Ordering markers across each linkage group
Computing maps for the ordered linkage groups

More details of the mathematics underpinning the methods in GUSMap can be found in @bilton2018genetics1. In the following sections, we shall discuss how each of these steps are performed in GUSMap.

Constructing a intercross population

Construction of a intercross (F2) population object is achieved using the makeIC function. At present, GUSMap only uses information from the progeny (F2 individuals in an F2 population). If parents and/or grandparents are included in the VCF that is read into GUSMap (as described previously), then they should be excluded in the creation of the intercross population. For the example data set, there are two samples for the maternal parent and two samples for the paternal parent. Sample IDs from the RA object can be extracted using the $extractVar

samID = RAdata$extractVar("indID")$indID
head(samID)

The first four IDs ("P1_1","P2_1","P1_2","P2_2") correspond to the two parents, while the remainder correspond to the progeny ("F1_001" to "F1_100"). We proceed to create the IC population using only the progeny samples as follows:

progeny = samID[-c(1:4)]
mySpecies <- makeIC(RAobj = RAdata, samID = progeny,
                      filter = list(MAF = 0.05, MISS = 0.5,
                        BIN = 100, DEPTH = 5, PVALUE = 0.01, MAXDEPTH=500))

The makeIC function takes an RA object, creates an intercross population and performs filtering on the data.

There are six different types of filtering available in GUSMap:

Minor allele frequency (MAF): SNPs are removed if the MAF is below the threshold since in intercross populations, the expected allele frequency is 0.5 (although one may want to be quite liberal with this filter to allow for true segregation distortion).
Percentage of missing data (MISS): SNPs are removed if the number of individuals without a single read for a given SNP is less than the threshold.
SNP binning distance (BIN): SNPs are binned together if the distance (in base pairs) between them is less than the threshold value. One SNP is then selected at random from each bin. This filtering is to ensure that there is only a single SNP on each sequence.
Parental depth (DEPTH): SNPs are removed if the read depth of each parent is below the threshold value. This filtering is performed since low read depth can result in parental heterozygous genotypes to be seen as homozygous resulting in incorrectly inferred segregation type.
Segregation test (PVALUE): A segregation test is performed to removed any SNPs in which the segregation type has been incorrectly inferred (e.g., parental genotypes called wrongly).
Maximum average SNP depth (MAXDEPTH): SNPs with an average depth above 500 are discarded. SNPs with really high depth compared to the rest of the SNPs often are repeative regions or simply probamatic. It is often best to discard these SNPs.

Again, more details of these various filter types is available in the package documentation of the makeIC function.

Linkage Groups

Computing 2-point recombination fraction estimates

Linkage group formation in GUSMap is based on 2-point recombination fraction estimates and associated LOD scores between all pairs of SNPs. These are computed using the function $rf_2pt.

mySpecies$rf_2pt(nClust = 3)

The $rf_2pt function is parallelized to reduce computational time, where the nClust argument specifies how many cores to use in the parallelization. Be careful not to set this to more than what is available on your computer. Note: this step can take some time, especially if there are a large number of SNPs in your data set.

Hint: Since computing the 2-point recombination fractions can take quite long, it is best to save the IC object as a R data file using the save function. For example:

save(mySpecies, file="ICobject.Rdata")

Once the 2-point recombination fraction estimates and associated LOD scores have been computed, we can plot the resulting recombination fraction estimates on a heatmap.

# plot 2-point rf matrix for BI and MI SNPs
mySpecies$plotChr(mat = "rf", lmai=0.5)

Heatmaps are helpful for examining genome assemblies if the SNPs where called using a reference genome. For our example, we see that the middle section of chromosome 2 should actually be with chromosome 3, and that the last quarter of chromosome 1 needs to be inverted (that is the marker order need to be reversed). Furthermore, there are a number of SNPs that are clearly on the wrong chromosome and should be mapped to one of the other chromosomes. For de novo assemblies, one can still plot the 2-point recombination fraction estimates although the resulting plot should look fairly random and so is usually not very informative.

One can also plot the matrix of 2-point LOD scores.

# plot 2-point LOD matrix for BI, PI and MI SNPs
mySpecies$plotChr(mat = "LOD", lmai=0.5)

Forming linkage groups

Using our 2-point recombination fraction estimates and LOD scores, we can proceed to forming linkage groups.

mySpecies$createLG(LODthres = 10, nComp = 10)

The performance of the linkage grouping algorithm is controlled by the parameters LODthres and nComp. Details of these arguments and how the linkage groups are formed is explained in the package documentation at:

?`$createLG`

Once the linkage groups have been formed, it is a good idea to look how well the grouping has performed by examining the heatmap of the 2-point recombination fraction estimates with the SNPs grouped according to the linkage groups. This can be done using the $plotLG function.

mySpecies$plotLG(interactive = F, LG = NULL)

The heatmap shows that there are seven linkage groups. We would expect there to be three linkage groups as there are 3 chromosomes. The heatmap suggests that linkage groups 3, 4, 5 and 7 are actually one single linkage group due to the low recombination fraction estimates between linkage groups. It is also possible that linkage group 6 is actually part of linkage group 1, however there is less evidence of this from the 2-point recombination fraction heatmap.

Note:

Setting interactive=TRUE for the $plotLG function results in an interactive plot being plotted. Interactive plots make interrogating the recombination fraction (or LOD) plot and finding problematic SNPs easier. We have set this to FALSE in this tutorial to keep the size of this file small. In your session, try producing the interactive plot.
At present, interactive plots generally work on a small number of SNPs (usually no more than 1000), so depending on the size of your linkage groups, you may need to plot each linkage group separately. You can use the LG argument to plot specific linkage groups.

Editing Linkage Groups

There are a number of functions in GUSMap for editing linkage groups. Ideally, one should obtain as many linkage groups as the number of chromosomes the species is known to have. However, in practice, the number of linkage groups may be more, or less, than the number of chromosomes depending on the LOD threshold used and whether there is enough linkage between SNPs on the same chromosome to group all the SNPs located on the same chromosome together in the same linkage group.

For our simulated data, we had seven linkage groups. From the heatmaps above, we see that linkage groups 3, 4, 5 and 7 appear part of the same linkage group. We can merge these linkage groups using the $mergeLG function.

mySpecies$mergeLG(LG = c(3,4,5,7))
mySpecies$print(what = "LG")

Note that after merging linkage groups, the linkage group which was previous denoted 6 is now denoted by 4. We could also merge this linkage group with linkage group 1, as suggested by moderately low recombination fraction estimates between the linkage groups. However, since we are not confident of this based on the heatmap, we decide to remove linkage group 4, which can be done using the $removeLG function.

mySpecies$removeLG(LG = 4)
mySpecies$print(what = "LG")

Alternatively, one could have also have removed the linkage group by removing all the SNPs present in the linkage group using the $removeSNP function.

mySpecies$removeSNP(snps = c(36,27,32,28,131))

Note that the $removeSNP function is more useful for removing SNP(s) from a linkage group if they appear problematic without deleting the entire linkage group.

Note: At this stage there is no undo functionality in GUSMap. So, once you have removed a SNP or linkage group, or merge two linkage groups, you cannot reverse what you have done. However, in the case where you want to undo edits to your linkage group, the best thing to do is to recreate the linkage groups using the $createLG function and proceed as you were doing before (and hence it helps to save you code in a script).

In practice, our choice for the LOD threshold previously might have been too high to begin with. Thus, we can also attempt to obtain the optimal number of linkage groups by trying different values for LODthres in the $createLG function.

mySpecies$createLG(LODthres = 8)

As we can see, a LOD threshold of 8 obtains the desired number of linkage groups in one go. In practice, however, there are likely to be lots small linkage groups (in addition to large linkage groups that correspond to the linkage groups) that either need to be merged with other linkage groups or removed using the functions described eariler.

Ordering Linkage Groups

Once we are happy with the linkage groups we have formed, we can proceed to order the SNPs in each linkage group. In GUSMap, marker ordering is performed using the multidimensional scaling (MDS) approach as described by @preedy2016tag and is implemented in the $orderLG function.

mySpecies$orderLG(mapfun = "haldane", weight = "LOD2", ndim = 30)

The two plots produced by the $orderLG function are:

The first two dimensions of the components of the MDS.
The heatmap of the 2-point recombination fraction for all SNPs on the linkage group ordered according to the MDS algorithm.

These plots are helpful for examining how well the algorithm has ordered the SNPs. Details of the MDS algorithm are given in @preedy2016tag and information for the arguments mapfun and weight can be found in the package documentation (The default arguments tend to give the best results).

?`$orderLG`

Some further checks on the order linkage maps for suspicious SNPs was performed and removed:

Compute Linkage Maps

Now that we have finished ordering and tidying up the linkage groups, we proceed to compute the linkage maps using the $computeMap function.

mySpecies$computeMap(chrom = NULL, nThreads = 3, rfthres=0.1)

The $computeMap function has a number of arguments but the main ones are chrom which specifies which linkage group to compute linkage maps for and nThreads which specifies the number of threads to use in the computation of the likelihood (using OpenMP if available). The resulting output gives the overall map distances and sequencing error estimates for each chromosome (in addition to the linkage group information).

The rfthres argument specifies what recombination fraction threshold to highlight SNPs with high recombination fraction estimates. This can be helpful for identifying potentially probabmatic SNPs. In this example, there are three SNPs with recombination fraction estimates above 0.1. The first two are not a concern here because there is only a small number of SNPs in the linkage group and based on the heatmaps displayed eariler, some large distances are expected around this portion of the map. However, RF-66 has an unrealistic recombination fraction estimate (of 1), which is the recombination rate between SNPs 24 and 173. Since SNP 173 is the last in the linkage group (there are 67 SNPs so RF-66 is the last recombination fraction estimate), it is best to delete this SNP and recompute the map for linkage group 1. In addition, there is a really large recombination fraction estimate in linkage group between SNPs 87 and 89. The output from the $orderLG function shows that there is quite a large distance between SNPs around this part of the map and SNP 89 appears to be quite far away from the curve. This suggests that removing SNP 89 as will might improve the results.

## remove SNP 173
mySpecies$removeSNP(c(124,133,89))
## Recompute linkage map
mySpecies$computeMap(chrom = c(1,2), nThreads = 3, rfthres=0.1)

The overall map distances now look reasonable and although there are a few largish recombination fraction estimates, they are not a concern as there appears to be large distances between SNPs at these sections of the maps.

GUSMap has a couple of functions for quickly examining the linkage maps computed. The first function is to plot the linkage maps.

mySpecies$plotLM(LG = NULL, fun = "haldane", col=c("cyan","lightblue","blue"))

The arguments for the $plotLM function are:

LG: The indices of the linkage groups to be plotted.
fun: The mapping function to use. Currently only the Haldane, Kosambi and Morgan mapping functions are available.
col: A vector of colors for the linkage maps.

In addition, we can also examine the order of the SNPs across the linkage groups relative to the original assembly.

mySpecies$plotSyn()

From the synteny plot above, we see that linkage group 1 corresponds to chromosome 3, linkage group 2 to chromosome 2 and linkage group 3 to chromosome 1. Furthermore, the plot shows that the linkage mapping analysis has moved the middle section of chromosome 2 to chromosome 3 (linkage group 1) and that it has inverted the SNP order at the end of chromosome 1 (linkage group 3).

Finally, one can extract the results of the linkage analysis in GUSMap by writing the results to a file using the $writeLM function.

mySpecies$writeLM(file = "mySpecies", inferGeno = TRUE)

This is allows users to produce their own custom plots in whatever software they are familiar with. The inferGeno argument indicates whether to compute the most likely haplotype for each chromosome in each individual, which is appended to the file as columns. For details of the information given in the file, see the package documentation.