\newpage
In this experiment, we selfed and outcrossed a set of ~70 teosinte landraces to obtain a progeny array of 4,875 individuals. The ~70 founders and all progeny were genotyped using GBS. We also re-sequenced 20 of the 70 founder lines (the others will be re-sequenced soon). Because of the high error rate of GBS data, which is especially problematic for calling heterozygous sites, we employed a phasing and imputation strategy to infer the expected genotypes by combining parentage and GBS information.
This document describes, step by step, the R package we developed to solve these problems.
Install devtools first, and then use it to install imputeR from GitHub.
```r
# install and load devtools
install.packages("devtools")
library(devtools)

# install and load imputeR
install_github("yangjl/imputeR")
library(imputeR)
```
Within the R console, type `?impute_parent` or `help(impute_parent)` to find help information about the function.

```r
?impute_parent
```
To load an HDF5 file, you need to install Vince's tasselr. If the installation fails, please follow the above links and install the required dependencies. Note that "rhdf5" is a Bioconductor package; you have to use biocLite to install it.
```r
# install devtools, then install the developmental version
# of tasselr using devtools
install.packages("devtools")
library(devtools)

# rhdf5 is a Bioconductor package
source("https://bioconductor.org/biocLite.R")
biocLite("rhdf5")

install_github("vsbuffalo/tasselr")

# load the package
library(tasselr)
```
The following lines reformat an *.h5 HDF5 file into an R object. You need to specify the path to the HDF5 file. Here is the teosinte data:
```r
# Note: at least 100G memory will be needed to load the hdf5 file
# load h5 file
teo <- initTasselHDF5("largedata/teo.h5", version = "5")
teo <- loadBiallelicGenotypes(teo, verbose = TRUE)

# reformat to an imputeR object; find help with ??imputeRob
ob1 <- imputeRob(h5 = teo, missingcode = 3)
save(file = "largedata/teo.RData", list = "ob1")
```
Here is the landrace data:
```r
# Note: at least 100G memory will be needed to load the hdf5 file
# load h5 file
land <- initTasselHDF5("largedata/maize_landrace.h5", version = "5")
land <- loadBiallelicGenotypes(land, verbose = TRUE)

# reformat to an imputeR object
ob2 <- imputeRob(h5 = land, missingcode = 3)
save(file = "largedata/bode_data.RData", list = "ob2")
```
\newpage
If we have the parents' WGS data, this step can be skipped.

We have an observed mom, an observed dad, and observed kids (selfed and outcrossed). The likelihood of the focal parent's genotype $G$ is

$L(G) = P(\theta \mid G),$

where $\theta$ is the observed data, consisting of the observed genotypes ($G'$) of mom, dad, and kids. So:
$$P(\theta \mid G_m) = \prod\limits_{i=1}^{k_1} P(G'_{i} \mid G_m) \times \prod\limits_{i=1}^{k_2} \sum\limits_{G_d \in \{0,1,2\}} P(G'_{i} \mid G_m, G_d)\,P(G_d) \times P(G'_m \mid G_m)$$

where $P(G'_{i} \mid G_m)$ is the probability of the $i$th selfed kid's observed genotype given mom's genotype $G_m$; $\sum_{G_d \in \{0,1,2\}} P(G'_{i} \mid G_m, G_d)\,P(G_d)$ is the probability of the $i$th outcrossed kid's observed genotype given $G_m$, summed over all possible dad genotypes $G_d$; and $P(G'_m \mid G_m)$ is the probability of mom's own observed genotype.
$P(G_m)$ and $P(G_d)$ are the genotype probabilities under Hardy-Weinberg equilibrium, estimated from the population.
We sum the log probabilities over all kids and take the genotype with the maximum likelihood as the focal parent's genotype. The function impute_parent implements this computation of mom's genotype probabilities.
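As a minimal sketch of this likelihood computation (not the package implementation): assume selfed kids only, a Hardy-Weinberg prior, and a hypothetical error model in which a miscall is split evenly between the other two genotypes; the allele frequency `p` and the `kids` vector below are made-up illustrative values.

```r
hom.error <- 0.02; het.error <- 0.8
p <- 0.3  # assumed non-reference allele frequency

# Hardy-Weinberg prior on parent genotypes 0/1/2
prior <- c((1 - p)^2, 2 * p * (1 - p), p^2)

# Mendelian transmission under selfing:
# rows = parent genotype 0/1/2, cols = kid's true genotype 0/1/2
Tm <- rbind(c(1, 0, 0),
            c(0.25, 0.5, 0.25),
            c(0, 0, 1))

# assumed observation-error model: a miscall is split evenly between
# the other two genotypes; rows = true, cols = observed
E <- rbind(c(1 - hom.error, hom.error / 2, hom.error / 2),
           c(het.error / 2, 1 - het.error, het.error / 2),
           c(hom.error / 2, hom.error / 2, 1 - hom.error))

P.obs <- Tm %*% E  # P(observed kid genotype | parent genotype)

kids <- c(0, 1, 0, 0, 1, 2, 0, 1)  # observed genotypes of selfed kids

# summed log probabilities over all kids, plus the log prior
loglik <- log(prior) + sapply(1:3, function(g) sum(log(P.obs[g, kids + 1])))
which.max(loglik) - 1  # maximum-likelihood parent genotype (here: 1)
```

With a mixture of observed 0s, 1s, and a 2 among the selfed kids, the heterozygous parent genotype dominates the likelihood even under the high heterozygote error rate.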
The simulation steps above were packaged into a function sim.array. This function simulates a GBS.array object for the following functions to use.
```r
# find help
?sim.array

# make the simulation repeatable
set.seed(1234)
GBS.array <- sim.array(size.array = 50, numloci = 100,
                       hom.error = 0.02, het.error = 0.8, rec = 0.25,
                       selfing = 0, imiss = 0.5, misscode = 3)
```
We now impute parent genotypes using impute_parent and extract the results using parentgeno. In the resulting table res, the first three columns are the probabilities of genotypes 0, 1, and 2. The 4th column is the odds ratio of the highest probability divided by the 2nd highest. gmax is the parent's genotype with the highest probability; gor is the parent's genotype with the highest probability whose odds ratio exceeds the specified threshold.
```r
# impute parent genotypes and extract results
inferred_geno_likes <- impute_parent(GBS.array, hom.error = 0.02,
                                     het.error = 0.8, imiss = 0.5)
res <- parentgeno(inferred_geno_likes, oddratio = 0.6931472, returnall = TRUE)
res$true_parent <- GBS.array@true_parents[[50]]$hap1 +
  GBS.array@true_parents[[50]]$hap2

# error rates
nrow(subset(res, gmax != true_parent)) / nrow(res)
nrow(subset(res, gor != true_parent & gor != 3)) / nrow(res)
```
We also ran ~100 simulations with size.array varying from 10 to 100; the results are plotted below. When size.array exceeded 40, the error rates of both methods dropped to almost zero, indicating that with more than 40 kids the parent genotype can be confidently imputed.
Then, for each parent, run impute_parent as in the toy example above. Note that you will need to supply a vector of allele frequencies at each locus estimated from the parents. If you do not supply this or leave p=NULL, a random allele frequency drawn from the neutral SFS will be used instead (this is Very Bad). Since parents are coded as 0, 1, or 2, for $N$ parents the allele frequency $p$ at a locus can be calculated as $p = \frac{\sum_{i=1}^{N} G_i}{2N}$, where $G_i$ is the $i$th parent's genotype.

An even better way to estimate the frequency: take only the parents (~50) with selfed offspring, average the genotypes of each parent's selfed offspring, sum these averages, and divide by ~100 (twice the number of parents). This should give a very good estimate of the allele frequency. I would also start by dropping any locus at which you don't observe the non-reference allele some minimum number of times (10 sounds good to me). Finally, keep in mind that the frequency we are working with here is that of the non-reference allele.
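A sketch of both estimators in base R (the genotype values and per-parent means below are made up purely for illustration):

```r
# parents' genotypes at one locus, coded 0/1/2 (illustrative values)
geno <- c(0, 1, 2, 1, 0, 0, 1, 2)
N <- length(geno)
p <- sum(geno) / (2 * N)  # non-reference allele frequency: 7/16

# the selfed-offspring version: for each parent with selfed kids,
# average the kids' genotypes, then sum and divide by 2 * (number of parents)
selfed_means <- c(0.0, 1.1, 2.0, 0.9, 0.0)
p2 <- sum(selfed_means) / (2 * length(selfed_means))  # 4/10
```

Each selfed-offspring mean is an estimate of that parent's genotype, so dividing the summed means by twice the number of parents recovers the same $\frac{\sum G_i}{2N}$ estimator with less GBS noise.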
A new function "create_array" reads in the Geno4imputeR object, calculates allele frequencies from the selfed offspring, and splits the data into families. A set of GBS.array objects will be written to the specified location for imputation.
```r
library(imputeR)
load(file = "largedata/teo.RData")
Geno4imputeR <- ob1
ped <- read.table("data/parentage_info.txt", header = TRUE)
ob2 <- create_array(Geno4imputeR, ped, outdir = "largedata/",
                    maf_cutoff = 0.002, lmiss_cutoff = 0.8,
                    imiss_cutoff = 0.8, size_cutoff = 40)

# run through each RData object and collect the data into 10 data.frames
# (chromosomes); I ran it on a cluster to reduce computing time
files <- list.files(path = "largedata/obs", pattern = "RData", full.names = TRUE)
for (JOBID in seq_along(files)) {
  o <- load(files[JOBID])
  tem <- impute_parent(GBS.array = obj, hom.error = 0.02,
                       het.error = 0.8, imiss = 0.5)
  res <- parentgeno(tem, oddratio = 0.69, returnall = TRUE)
  outfile <- gsub("RData", "csv", files[JOBID])
  write.table(res, outfile, sep = ",", row.names = FALSE, quote = FALSE)
}
```
The imputed parent (N=50) genotypes, separated by chromosome, have been deposited in this directory: ~/Documents/Github/phasing/largedata/ip/. These 10 files are available upon request.
\newpage
According to Bayes' theorem, we have:
$$P(H_f|\theta) \propto P(\theta|H_f) \times P(H_f)$$
Our prior assumption for $P(H_f)$ is that all possible haplotypes are equally likely. So,
$$P(H_f | \theta) \propto P(\theta | H_f)$$
Let $H_i$ denote the haplotype contributed by the other parent of the $i$th kid. All possible haplotypes $H_i$ ($i = 1, \dots, k$) inherited from the other parents constitute a partition of the sample space of the $k$ kids produced by the focal parent ($H_f$), and the haplotypes $H_i$ are pairwise mutually exclusive. Therefore,
$$P(H_f|\theta) \propto \sum_{i=1}^{k} P(\theta | H_f, H_i)$$
It can be further expressed as: $$P(H_f|\theta) \propto \sum_{i=1}^{k} P(K'_i \mid H_f, H_i)$$
$$P(H_f|\theta) \propto \sum_{i=1}^{k}\prod\limits_{l=1}^{n}{P(G'_{i,l} \mid H_f, H_i)}$$
where $K'_i$ is the $i$th kid's observed data and $G'_{i,l}$ is the $i$th kid's observed genotype at locus $l$.
$$P(\theta \mid H_m) = \prod\limits_{i=1}^{k_1} P(G'_{i} \mid H_m) \times \prod\limits_{i=1}^{k_2} P(G'_{i} \mid H_m, H_d)$$
We implemented a function phase_parent to do the phasing. It works in two steps via two major functions, phase_chunk and join_chunks. First, we obtain the most likely focal-parent haplotype in a window; we then extend the window by one bp at a time into a larger chromosomal chunk until the haplotype in a new window conflicts with the existing chunk, in which case a new chromosomal chunk is created. Second, we compute the probabilities of the haplotypes of two neighboring chunks; the most likely joined haplotype chunks are returned by join_chunks.
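The chunk-creation idea can be illustrated with a toy sketch (this is not the package implementation; the `conflict` vector below is made up, marking sites where the best in-window haplotype would disagree with the growing chunk):

```r
# made-up conflict flags: TRUE marks a site whose best in-window haplotype
# disagrees with the current chunk, so a new chunk starts there
conflict <- c(FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE)

chunk_id <- cumsum(conflict) + 1      # chunk label per site
chunks <- split(seq_along(conflict), chunk_id)  # site indices grouped by chunk
```

Here every `TRUE` increments the running chunk label, so the eight sites fall into three chunks; joining neighboring chunks is then a separate likelihood comparison, as described above.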
```r
# to make the random events repeatable
set.seed(123457)

# simulate a GBS.array object
GBS.array <- sim.array(size.array = 50, numloci = 100,
                       hom.error = 0.02, het.error = 0.8, rec = 0.25,
                       selfing = 0.5, imiss = 0.5, misscode = 3)

# get perfect parent genotypes
GBS.array <- get_true_GBS(GBS.array)

# get probability matrices
probs <- error_mx2(hom.error = 0.02, het.error = 0.8)

# phasing
phase <- phase_parent(GBS.array, win_length = 10, join_length = 10, verbose = TRUE)

# compute error rate
out <- phase_error_rate(GBS.array, phase)

set.seed(13567)
GBS.array <- sim.array(size.array = 50, numloci = 500,
                       hom.error = 0.02, het.error = 0.8, rec = 0.25,
                       selfing = 0.5, imiss = 0.5, misscode = 3)
GBS.array <- get_true_GBS(GBS.array)
probs <- error_mx(hom.error = 0.02, het.error = 0.8)
```
\newpage
$$P(H_k \mid \theta) \propto P(\theta \mid H_k) \times P(H_k)$$
$$P(H_k \mid \theta) \propto \left( \prod\limits_{i=1}^{k}{P(H'_i \mid H_k)} \right) \times P(H_k)$$
$$P(H_k \mid \theta) \propto \left( \prod\limits_{i=1}^{k}\prod\limits_{l=1}^{n}{P(G'_{i,l} \mid H_k)} \right) \times P(H_k)$$
$$P(\theta_k|H_k) = \prod\limits_{s=1}^n{P(G'_{s}|H_m, H_d)}$$
```r
# to make the random events repeatable
set.seed(123457)

# simulate a GBS.array object
GBS.array <- sim.array(size.array = 50, numloci = 10000,
                       hom.error = 0.02, het.error = 0.8, rec = 0.25,
                       selfing = 0.5, imiss = 0.5, misscode = 3)

# get perfectly phased parents
GBS.array <- get_phased(GBS.array, maxchunk = 3)

# get probability matrices
probs <- error_mx(hom.error = 0.02, het.error = 0.8, imiss = 0.5)

# impute the kids
phase <- impute_kid(GBS.array, winsize = 10, verbose = TRUE)

# compute error rate
g1 <- phase@gbs_kids[[1]]$hap1 + phase@gbs_kids[[1]]$hap2
g2 <- phase@true_kids[[1]]
sum(g1 != g2) / length(g1)
```