knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval=TRUE)
library(GUSMap)

In this tutorial, we will describe how to use GUSMap to perform linkage mapping and explain some of the functionality available. For this tutorial, we will use a back-cross approach as implemented in the BC class. A back-cross approach may be desirable for two reasons:

a) There are not enough BI SNPs to perform a full-sib analysis using the FS class (see the FS tutorial for an introduction to the FS class) b) The recombination fraction is substantially different in the two parental meiosis.

For this tutorial, we will use a simulated data set of a full-sib family (F1 population) consisting of 100 offspring, 3 chromosomes and 1000 SNPs to illustrate GUSMap. The file location to the VCF file (and associated pedigree file) for the data set can be extracted using the simDS function.

vcffile <- simDS() # extract filename
vcffile            # filename stored in object vcffile

Some important up front comments regarding GUSMap:

  1. Having high read depth on the parents (e.g., sequencing more than once) can be hugely helpful to correctly infer the segregation type of more SNPs. However, for the progeny, it is of more value (and cost-efficient) to have more individuals in the family at lower depth than sequence fewer at higher depth. However, we recommend that it is best to ensure that the average depth for a given individual is at least 2.
  2. GUSMap is designed for low coverage sequencing data which means that there is no need to pre-filter the data in terms of read depth. Furthermore, GUSMap includes standard filtering steps in terms of minor allele frequency (MAF), missing data and a segregation test (including a few other filters) which means one only needs to input a raw Variant Call Format (VCF) file obtained after the SNP calling processes into GUSMap.

In addition, some terms that we will use throughout this tutorial are:

More details of this terminology and the mathematics underpinning the estimation of recombination fractions in GUSMap can be found in @bilton2018genetics1.

Loading Data

At present, GUSMap can read in data stored in a Variant Call Format (VCF) file, provided that there is some form of allelic depth information (e.g., the number of reads for the reference and alternate alleles). This is done using the VCFtoRA function which takes in the name of a VCF file and converts it into what we call a reference/alternate (RA) file.

# convert VCF file to an RA file
rafile <- VCFtoRA(infilename = vcffile$vcf, direct = "./", makePed=TRUE)

The arguments of the VCFtoRA function are:

  1. infilename: The name of the VCF file.
  2. direct: Specifies which directory to write the RA file to (relative to the current working directory).
  3. makePed: Specifies whether or not to initialize a pedigree file. Pedigree files are important for creating a full-sib family object and will be discussed later.

Currently, VCFtoRA requires one of the fields AD, AO and RO, or DP4 to be present in the VCF file to extract allelic depth information. One other thing to note is that VCFtoRA returns the file location of the created RA file.

rafile # file path of RA file

The RA file created is a tab-delimited file with columns, CHROM (the chromosome name taken from the "#CHROM" column in the VCF file), POS (the position of the SNP taken from the "POS" column in the VCF file), and SAMPLES which consists of the sampleIDs used in the VCF (columns 10 and above in the VCF file). For example, the first five columns and first four rows of the RA file for the simulated data set are:

|CHROM |POS |P1_1 |P2_1 |P1_2 | |:------|:-------|:-----|:----|:----| |1 |866499 |25,0 |0,2 |9,0 | |1 |2070050 |8,0 |4,9 |9,0 | |1 |2232468 |1,0 |0,1 |2,0 | |1 |3459263 |12,13 |0,0 |0,1 |

The entry in the first row and the third column is "25,0" which means that there are 25 reads for the reference allele and no reads for the alternate allele at the SNP on chromosome 1 with position 866499 in sample P1_1. In addition, the fourth row of the third column is "12,13" which menas that there are 12 reads for the reference allele and 13 reads for the alternate allele at the SNP on chromosome 1 with position 3459263 in sample P1_1.

An RA file can then be loaded into R using the readRA function.

RAdata <- readRA(rafile = rafile, sampthres = 0.01, excsamp = NULL)

The arguments of the readRA function are:

  1. rafile: Name of the RA file to be read in.
  2. sampthres: Specifies the minimum sample depth of an individual before it is removed.
  3. excsamp: Specifies IDs of any samples in the RA data set to be discarded (for problematic samples to be removed). The input for excsamp must correspond to the sample IDs in the RA file.

The RA data is now stored as an RA object.

class(RAdata)

Summary information of the RA object can be displayed by printing the object.

RAdata

Linkage mapping with GUSMap

In this section, we will discuss linkage mapping using GUSMap for a backcross population. The main steps to the process are:

  1. Create a backcross population and perform appropriate filtering
  2. Constructing linkage groups
  3. Ordering markers across each linkage group
  4. Computing maps for the ordered linkage groups

In the following sections, we shall discuss how each of these steps are performed in GUSMap for a backcross population.

Constructing a back-cross population

Construction of a 'backcross' population object is achieved using the makeBC function.

mySpecies <- makeBC(RAobj = RAdata, pedfile = vcffile$ped, inferSNPs = TRUE,
                      filter = list(MAF = 0.05, MISS = 0.5, BIN = 100,
                        DEPTH = 5, PVALUE = 0.01, MAXDEPTH=500))

The makeBC function takes an RA object and pedigree file, creates a backcross population and performs filtering on the data.

For this tutorial, we use the existing pedigree file stored in the package. The file name of this pedigree file is stored in the object vcffile$ped. More information regarding the format of the pedigree file can be found by looking at the package documentation for the makeBC function which can be accessed by:

?`$makeBC`

There are six different types of filtering available in GUSMap:

Again, more details of these various filter types is available in the package documentation of the makeBC function.

The makeBC function also has one additional argument, namely inferSNPs. If inferSNPs=TRUE, then for the SNPs where the segregation type could not be determined from the parental genotypes, an attempt is made to determine the segregation type from the progeny data only. This can help increase the number of SNPs if there are not enough SNPs when the segregation type was determined only from the parents. One caveat, however, is that for SNPs that are informative in only one parent, the progeny information cannot inform us which parent is the informative parent. We, therefore, refer to the SNPs determined to be informative in a single parent using the inferSNPs argument as semi-informative (SI) SNPs.

Linkage Groups

Computing 2-point recombination fraction estimates

Linkage group formation in GUSMap is based on 2-point recombination fraction estimates and associated LOD scores between all pairs of SNPs. These are computed using the function $rf_2pt.

mySpecies$rf_2pt(nClust = 3)

The $rf_2pt function is parallelized to reduce computational time, where the nClust argument specifies how many cores to use in the parallelization. Be careful not to set this to more than what is available on your computer. Note: this step can take considerable time, especially if there are a large number of SNPs in your data set.

Hint: Since computing the 2-point recombination fractions can take quite long, it is best to save the BC object as a R data file using the save function. For example:

save(mySpecies, file="BCobject.Rdata")

Once the 2-point recombination fraction estimates and associated LOD scores have been computed, we can plot the resulting recombination fraction estimates on a heatmap.

# plot 2-point rf matrix for BI and MI SNPs
mySpecies$plotChr(mat = "rf", parent = "maternal", lmai=0.5)

Heatmaps are helpful for examining genome assemblies if the SNPs where called using a reference genome. For our example, we see that the middle section of chromosome 2 should actually be with chromosome 3, and that the last quarter of chromosome 1 needs to be inverted (that is the marker order need to be reversed). Furthermore, there are a number of SNPs that are clearly on the wrong chromosome and should be mapped to one of the other chromosomes. For de novo assemblies, one can still plot the 2-point recombination fraction estimates although the resulting plot should look fairly random and so is usually not very informative.

One can also plot the matrix of 2-point LOD scores.

# plot 2-point LOD matrix for BI, PI and MI SNPs
mySpecies$plotChr(mat = "LOD", parent = "both", lmai=0.5)

In the above code, the parent argument has been set to "both" which means that all the SNPs will be plotted. The resulting plot looks somewhat patchy. This is because the LOD scores between PI and MI SNPs are zero as there is no information regarding recombination between these SNPs. The same also applies to the matrix of 2-point recombination fraction estimates.

Forming linkage groups

Using our 2-point recombination fraction estimates and LOD scores, we can proceed to forming linkage groups.

mySpecies$createLG(parent = "both", LODthres = 5, nComp = 10, reset = FALSE)

$createLG creates MI linkage groups (e.g., linkage groups with only MI SNPs) and PI linkage groups (e.g., linkage groups with only PI SNPs). The performance of the linkage grouping algorithm is controlled by the parameters LODthres and nComp. Details of these arguments and how the linkage groups are formed is explained in the package documentation at:

?`$createLG`

One key point with the $createLG function is that the MI and PI are grouped into separate linkage groups. We refer to these linkage groups as the "pseudo-testcross linkage groups".

A couple of additional points regarding the $createLG function:

Once the linkage groups have been formed, it is a good idea to look how well the grouping has performed by examining the heatmap of the 2-point recombination fraction estimates with the SNPs grouped according to the linkage groups. This can be done using the $plotLG function.

mySpecies$plotLG(parent = "both", interactive = FALSE, LG = NULL)

The heatmap shows that there are four MI linkage groups and two PI linkage groups. We would expect there to be three MI linkage groups and three PI linkage groups since there are 3 chromosomes, and thus there is one extra MI linkage group and one missing PI linkage group. In general, the linkage between SNPs from different linkage groups appears relatively low with high recombination fraction estimates. However, we see that linkage group 3 has high recombination fraction estimates with linkage group 5 which suggests that these two linkage groups should be joined together. Since linkage group 5 has many more SNPs that linkage group 3, it seems most likely that the segregation type for the SNPs in linkage group 3 were incorrectly inferred to be MI instead of PI (possibly due to sequencing errors and/or low read depth). Lastly, the heatmap suggests that linkage group 5 is actually two separate linkage groups (thus the missing linkage group) and so linkage group 5 needs to be split up into two separate linkage groups.

Notes:

Editing Linkage Groups

There are a number of functions in GUSMap for editing linkage groups. Ideally, one should obtain as many linkage groups as the number of chromosomes the species is known to have. However, in practice, the number of linkage groups may be more, or less, than the number of chromosomes depending on the LOD threshold used and whether there is enough linkage between SNPs on the same chromosome to group all the SNPs located on the same chromosome together in the same linkage group.

For our simulated data, we had four MI linkage groups (one more than the number of chromosomes being 3) and only two PI linkage groups. From the heatmaps above, we see that linkage group 5 appears to be two separate linkage groups and so needs to be split up. The best way to do this is recreating the paternal linkage groups using a higher LOD threshold.

mySpecies$createLG(parent = "paternal", LODthres = 10, nComp = 10)

Note: the parent argument is set to "paternal" which means that we are only recreating the PI linkage groups. With a LOD threshold of 10, the number of PI linkage groups becomes four. Again, we examine the heatmap.

mySpecies$plotLG(parent = "both", interactive = F)

The previous heatmaps also suggest that linkage group 3 should be combined with linkage group 7. We can merge these two linkage groups using the $mergeLG function.

mySpecies$mergeLG(LG = c(3,7), mergeTo="paternal")
mySpecies$print(what = "LG-pts")

Since linkage groups 2 and 7 are from different parental lines, the mergeTo argument specifies whether the resulting merged linkage group should be considered as PI SNPs (mergeTo="paternal") or MI SNPs (mergeTo="maternal").

Note: At this stage there is no undo functionality in GUSMap. So, once you have removed a SNP or linkage group, or merge two linkage groups, you cannot reverse what you have done. However, in the case where you want to undo edits to your linkage group, then the best thing is to recreate the linkage groups using the $createLG function and proceed as you were doing before (and hence it helps to save you code in a script).

In many cases, small linkage groups can have no obvious linkage with any other linkage groups and so may be spurious. In this case, a linkage group can be remove using the $removeSNPs function. For example, say we wanted to remove what was linkage group 3 (with two SNPs) instead of merging with what was linkage group 7, then we could have remove the linkage group using

mySpecies$removeLG(LG = 3)

Alternatively, one could also have removed all the SNPs in the linkage group.

mySpecies$removeSNP(snps = c(526,527))

Note that the $removeSNPs function is more useful for removing SNP(s) from a linkage group if they appear problematic without deleting the entire linkage group. This will be demonstrated later.

Adding more SNPs to existing linkage groups

Additional MI and PI SNPs (not currently in a linkage group) and the SI SNPs (which are not used in the $createLG function) can be added to the linkage groups using the $addSNPs function.

mySpecies$addSNPs(LODthres = 10, nComp = 10)

SNPs are added to the existing groups in a similar fashion as in the $createLG function. A couple of points to note:

Adding Both-Informative SNPs (optional)

If there are any BI SNPs that passed through the filters in the makeBC function, then these can be added to the pseudo-testcross linkage groups.

mySpecies$addBIsnps(LODthres = 10, nComp = 10)

A couple points regarding the addBIsnps function:

At this point, we need to make some important points regarding the linkage groups in a BC object:

After the BI SNPs have been added to the linkage groups, it is good practice to examine the heatmaps of the 2-point recombination fraction estimates for any badly behaving SNPs and edit the linkage groups accordingly if required.

Ordering Linkage Groups

Once we are happy with our pseudo-testcross linkage groups that we have formed, we can proceed to order the SNPs in each linkage group from the set of combined linkage groups. In GUSMap, marker ordering is performed using the multidimensional scaling (MDS) approach as described by @preedy2016tag and is implemented in the $orderLG function.

mySpecies$orderLG(mapfun = "haldane", weight = "LOD2", ndim = 30) 

The two plots produced by the $orderLG function are:

  1. The first two dimensions of the components of the MDS.
  2. The heatmap of the 2-point recombination fraction for all SNPs on the linkage group ordered according to the MDS algorithm.

These plots are helpful for examining how well the algorithm has ordered the SNPs. Details of the MDS algorithm are given in @preedy2016tag and information for the arguments mapfun and weight can be found in the package documentation.

?`$orderLG`

An important point is that if no BI SNPs have been added to the pseudo-testcross linkage groups using the $addBIsnps, the $orderLG function will proceed using the pseudo-testcross linkage groups without the BI SNPs.

At this stage, it is again a good idea to examine the linkage groups using the $plotLG function and discard any SNPs which appear to be problematic. Problematic SNPs are SNPs where the recombination fraction values appear to be lower or have a very different pattern than to the SNPs surrounding them. In GUSMap, individual SNPs can be removed from the linkage groups using the $removeSNP function.

Compute Linkage Maps

Now that we have finished ordering and tidying up the pseudo-testcross linkage groups, we proceed to compute the linkage maps using the $computeMap function.

mySpecies$computeMap(chrom = NULL, nThreads = 2, rfthres = 0.1)

The $computeMap function has a number of arguments but the main ones are chrom which specifies which linkage group to compute linkage maps for and nThreads which specifies the number of threads to use in the computation of the likelihood (using OpenMP if available). The resulting output gives the overall map distances and sequencing error estimates for each linkage group (in addition to the linkage group information). The rfthres argument specifies what threshold to use in reporting SNPs associated high recombination fraction values. This helps with determining which SNPs result in really large recombination fraction estimates and so can be removed using the $removeSNP function.

The rfthres argument specifies what recombination fraction threshold to highlight SNPs with high recombination fraction estimates. This can be helpful for identifying potentially problematic SNPs. In this example, there is one SNP with recombination fraction estimates above 0.1 on the second MI linkage group between the first and second SNP. It is probably best to remove the first SNP as often error prone SNPs end-up at the beginning or end of the linkage group. We therefore remove the first SNP (SNP 592) from the linkage group and recompute the map.

## remove SNP 173
mySpecies$removeSNP(c(592))
## Recompute linkage map
mySpecies$computeMap(chrom = c(2), nThreads = 3, rfthres=0.1)

GUSMap have a couple of functions for quickly examining the linkage maps computed. The first function is to plot the linkage maps.

mySpecies$plotLM(LG = NULL, fun = "haldane", col = c(rep("red",3),rep("blue",3)))

The arguments for the $plotLM function are:

In addition, we can also examine the order of the SNPs across the linkage groups relative to the original assembly.

mySpecies$plotSyn()

From the synteny plot above, we see that linkage groups 1 and 4 correspond to chromosome 3, linkage groups 2 and 5 to chromosome 2 and linkage groups 3 and 6 to chromosome 1. Furthermore, the plot shows that the linkage mapping analysis has moved the middle section of the chromosome 2 to chromosome 3 and that it has inverted the SNP order at the end of chromosome 1. In addition, the order of linkage groups 2, 4, 5 and 6 are inverted relative to the genomic assembly.

Finally, one can extract the results of the linkage analysis in GUSMap by writing the results to a file using the $writeLM function.

mySpecies$writeLM(file = "mySpecies")

This allows users to produce their own custom plots in whatever software they are familiar with. For details of the information given in the file, see the package documentation.

?`$writeLM`

References



tpbilton/GUSMap documentation built on Feb. 22, 2025, 12:27 p.m.