email: johntlovell@gmail.com -- website: lovelleeb.weebly.com -- github: github.com/jtlovell/qtlTools


BiocStyle::markdown()
library(knitr)
knitr::opts_chunk$set(echo = TRUE)

library(qtlToolsTutorials)
library(qtlTools)
data(markerData)

\newpage

Part 1: Overview

For this tutorial, we will be using some real data, collected from an F2 mapping population of P. hallii. To get the data, install the qtlToolsTutorials and qtlTools packages:

library(devtools)
install_github("jtlovell/qtlTools")
install_github("jtlovell/qtlToolsTutorials")
library(qtlToolsTutorials)
library(qtlTools)

data(markerData)
data("makeGeneticMapTutorialTmpData")
kable(cbind(markerData[1:10,1:3],apply(markerData[1:10,4:10],2,round,2)), 
            caption = "example of 7 libraries and 10 markers from the markerData dataset. Each value represents the proportion of reads mapping to the reference genome vs. the alternative genome for a given marker 'HAL2' and library (column names)")
hist(as.matrix(markerData[,-c(1:3)])*100, breaks = 50,
     main = "Distribution of % reads mapping to the reference genome",
     xlab = "% reads mapping to reference")

\newpage

Part 2: Make genotype calls

The hardest part of building a genetic (linkage) map is getting high-quality markers in a dense and consistent grid across the genome. NGS data can produce erroneous single marker calls, or results that look erroneous, but are actually OK. For example, it is difficult on a single-marker basis to distinguish between mapping bias and segregation distortion.

A good way to get around missing / erroneous data is to call markers in windows (or using imputations). Here, we use a qtlTools function swGenotype, where markers are binned and concensus genotypes are called. This is a nice approach if you have a good idea of the physical position of markers. However, if the markers are entirely ambiguous, other, more complicated approaches are needed.

swGenotype takes three vectors (chromosome id, physical position and marker id) and a matrix of the proportion of reference alleles with rows that match the chr, pos and marker.id vectors.

mar.chr = markerData$chr_HAL2
mar.pos = markerData$pos_HAL2
mar.id = markerData$HAL2
prop.ref = markerData[,-c(1:3)]
sw<-swGenotype(reference.prop = prop.ref,
               chr = mar.chr,
               pos = mar.pos,
               marker.id = mar.id)
plot(sw$means[sw$means$chr %in% c("Chr01","Chr02"),4]*100, 
     xlab = "marker order (Chr 1-2)", 
     ylab = "% reads mapping to reference",
     pch = 16, cex = 1.5, col = "grey")
points(prop.ref[sw$means$chr %in% c("Chr01","Chr02"),1]*100, cex = .25)
points((sw$softCalls[sw$softCalls$chr %in% c("Chr01","Chr02"),4]*100)+2,
       pch = 16,cex = .5, col = "purple")
points((sw$hardCalls[sw$hardCalls$chr %in% c("Chr01","Chr02"),4]*100)-2,
       pch = 16,cex = .5, col = "dodgerblue")
legend("bottomright",
       c("mean by window","raw calls","softcalls","hardcalls (maj. vote)"),
       pch = c(16,16,16), pt.cex = c(1.5,.2,.5,.5), col = c("grey","black","purple","dodgerblue"))

\newpage

Part 3: Reading markers into R/qtl

Overview ...

We now have a very complete, highly accurate genotype matrix. With markers coded as the proportion of reads (0, .5, or 1) mapping to the reference. We now want to convert these to 0/1/2 coding and make sure that all markers are unique (there is no use for identical markers in linkage map construction).

We also want to store the known location of each marker in the rownames of the calls. It is best to score the names as chr_pos. This allows us to access the physical positions later.

Extract calls, rename rows and drop metadata ...

hard.calls<-sw$hardCalls 
rownames(hard.calls)<-paste0(gsub("Chr0","",hard.calls$chr),"_",hard.calls$pos)
hard.calls<-hard.calls[,-c(1:3)]

Drop identical markers and transpose the matrix ...

hard.calls<-hard.calls[!duplicated(hard.calls),]
hard.calls<-t(hard.calls)

Recode the markers as 0/1/2 and save the matrix ...

geno2cross writes the genotype matrix into a R/qtl cross object, and parses the marker names into chromosome and position vectors.

hard.calls[hard.calls==1]<-2
hard.calls[hard.calls==0.5]<-1
hard.calls[hard.calls==0]<-0

geno2cross(hard.calls)

Write into R/qtl format ...

For the purposes of this tutorial, we want to pretend we don't know the physical location of the markers ... like if we were working with a species that lacks a complete de novo reference genome. To do this, we will tell the geno2cross function to ignore the specification of chr_position in the marker names and instead provide random chromosome and position data.

geno2cross(hard.calls,crossfile = "~/Desktop/cross.csv",
           chr = sample(1:9, replace = T, size = ncol(hard.calls)),
           pos = runif(ncol(hard.calls), min = 0, max = 100))

Read the cross file into R/qtl ...

cross<-read.cross("csv", file="~/Desktop/cross.csv",
                  genotypes=c(0,1,2), crosstype="f2")
summary(cross)

\newpage

Part 4: Making the genetic map

Examine genotype frequencies by individual

Before we actually build the map, we want to make sure that the individuals in our population all look like recombinants. Our results will be biased if some individuals are identical to the parents ... these individuals also won't contribute to QTL detection power, and should be dropped.

g <- pull.geno(cross)
gfreq <- apply(g, 1, function(a) table(factor(a, levels=1:3)))
gfreq <- t(t(gfreq) / colSums(gfreq))
par(mfrow=c(1,3), las=1)
for(i in 1:3)
  plot(gfreq[i,], ylab="Genotype frequency", main=c("AA", "AB", "BB")[i],
       ylim=c(0,1))

Drop non-recombinant or highly heterozygous individuals

cross<-subset(cross,
              ind = gfreq[2,]<.8 & # <80% heterozygosity
                gfreq[1,]>0.02 & # >2% AA homozygote
                gfreq[3,]>0.02) # >2% BB homozygote

Drop markers that are too similar ...

The main issue in making a genetic map (besides missing/erroneous data) is the need to calculate all pairwise recombination fractions among markers. Since this problem scales exponentially with the number of markers, and NGS marker sets usually number in the 1000's, we need to find a way to reduce our marker set to a managable number. One way to do this to split the markers up into blocks and drop very similar markers. This reduces the size of the problem, then lets us calculate the entire matrix recombination fraction.

cross.sub<-dropSimilarMarkers(cross, blockSize = 500,
                              rf.threshold = 0.02, runFullMatrix = T, 
                              byChr = F)
summary(cross.sub)
par(mfrow = c(1,1))
plot.rf(cross.sub)

\newpage

Form linkage groups ...

The second step in making a genetic map is to bin markers into chromosomes. With all pairwise recombination fractions in hand, this is simple.

lgmar<-formLinkageGroups(cross.sub, reorgMarkers=F, max.rf = .23,verbose = F)
head(lgmar)

Rename linkage groups ...

We then need to re-name the marker groupings so that they match their original chromosomes.

lgmar$true.chr<-splitText(rownames(lgmar))
lgmar$true.pos<-as.numeric(splitText(rownames(lgmar), num = 2))
marlist<-split(rownames(lgmar), as.factor(lgmar$true.chr))
cross2<-newLG(cross.sub, marlist, keep.rf = T)
plot.rf(cross2)

\newpage

Order markers within linkage groups ...

Now we need to order markers within chromosomes. Historically, this is a difficult, error-prone and time intensive step. However, there are now a few approaches that make this super simple. To order markers, we will use a Traveling Salesperson Problem Solver. TSP solvers find the shortest distance to connect all points. We operate on the recombination fraction matrix to do so, using the program concorde. We call this program using the R package TSP and the qtlTools function tspOrder.

cross3<-tspOrder(cross = cross2,
                 concorde_path = "/Users/John/Documents/concorde/TSP")
plot.rf(cross3)

Finalize the markers ...

We can prune the cross, dropping markers that are still too close and re-order markers that seem to have better orders than the one found by TSP. In our case, there is nothing to be done tho.

cross4<-reDoCross(cross3, window = 3, min.distance = 2,
                  initialEstMap = T, verbose = T)

Orient chromosomes to match reference ...

Finally, we may want to ensure that the chromosomes are oriented correctly. To do this, we go chromosome-by-chromosome and check whether the physical order matches the mapping order. We need to do this because the TSP solver performs agnostic to the input order of markers.

map<-pullMap(cross4)
map$phys.chr<-splitText(map$marker.name)
map$phys.pos<-as.numeric(splitText(map$marker.name, num = 2))/1e6
kable(head(map))
cross5<-matchMarkerOrder(cross4)

\newpage

Part 5: Some other considerations.

Segregation distortion...

Segregation distortion (the deviation of genotype frequecies from expected ratios) can be indicative of mapping bias, bad markers or erroneous genotype calls. Such errors make building a genetic map difficult ... However, often segregation distortion is real.

Take for example the P. hallii F2 population.

gt<-geno.table(cross5, scanone.output = T)
par(mfrow = c(2,1), las = 1)
plot(gt, chr = 2, lod = 3:5, col = c("darkred","cyan", "darkblue"), type = "n",
     ylab = "genotype frequency", xlab = "Chr02 mapping position (cM)",
     main = "no segregation distortion on Chr02")
abline(h = colMeans(gt[,5:7]), col = c("darkred","cyan", "darkblue"), lty = 2)
plot(gt, chr = 2, lod = 3:5, col = c("darkred","cyan", "darkblue"), add = T)

plot(gt, chr = 9, lod = 3:5, col = c("darkred","cyan", "darkblue"), type = "n",
     ylab = "genotype frequency", xlab = "Chr09 mapping position (cM)",
     main = "Strong segregation distortion on Chr09")
abline(h = colMeans(gt[,5:7]), col = c("darkred","cyan", "darkblue"), lty = 2)
plot(gt, chr = 9, lod = 3:5, col = c("darkred","cyan", "darkblue"), add = T)

\newpage

Erroneous data

Often we cannot control our error rate. For example, low coverage short read data may produce bad genotype calls ~0.1% of the time. Since we are dealing with a 276x242 genotype matrix, this error rate would produce ~70 genotyping errors.

We can easily deal with genotyping errors during QTL Mapping (see the tutorial); however, if genotyping errors are common they may affect our genetic map construction. It may be appropriate to make the genetic map, then subsequently drop suspicious marker-individual combinations.

We can do this as follows (this is slow, so just doing Chr08 for the example): First, calculate the LOD score for genotyping errors and pull out those errors above some threshold (2 is very conservative, I usually use 3 or higher).

el<-calc.errorlod(subset(cross5, chr = 8), error.prob=0.001, 
                  map.function="kosambi")
tel<- top.errorlod(el, cutoff = 2)
kable(tel, caption = "The markers with an error LOD score >2 on Chr08")
badlibs<-which(as.character(getid(el)) %in% tel$id[tel$chr == 8])
par(mfrow = c(1,1))
plotGeno(el, ind=badlibs, cutoff=2, min.sep=2)

We then go through this data.frame and drop all marker-by-individual combinations that appear erroneous.

cross.clean<-cross5
for(i in 1:nrow(tel)) {
  chr <- tel$chr[i]
  id <- tel$id[i]
  mar <- tel$marker[i]
  cross.clean$geno[[chr]]$data[cross5$pheno$id==id, mar] <- NA
}
cross6<-cross.clean
map1<-est.map(cross5, chr = 8, error.prob = 0.001, map.function = "kosambi")
map2<-est.map(cross6, chr = 8, error.prob = 0.001, map.function = "kosambi")
chrlen(map1) # original chr08 length
chrlen(map2) # new chr08 length

Appendix 1 - Installing concorde ...

To install concorde, get the program here: http://www.math.uwaterloo.ca/tsp/concorde/downloads/codes/src/co031219.tgz.

On a PC, installation is pretty straightforward. Not so much on a mac. See here: https://qmha.wordpress.com/2015/08/20/installing-concorde-on-mac-os-x/.

If you are on a mac, running macOS Sierra, I can share the install ... but I'm not sure how well it will work.



jtlovell/qtlToolsTutorials documentation built on May 29, 2019, 1:50 a.m.