\newpage

Introduction

Crossing Scheme and Experimental Design

alt text

In this experiment, we selfed and outcrossed a set of ~70 teosinte landraces to get a progeny array composed of 4,875 individuals. The ~70 founders and all the progeny were genotyped using GBS. We also re-sequenced 20/70 founder lines (the others will be re-sequenced soon). Because of the high error rate of the GBS data, especially problematic for calling heterozygous sites, we employed a phasing and imputation strategy to infer the expected genotypes by combining parentage and GBS information.

This file is to document the R package we developed to solve the problems step by step.

Install and Usage

Install devtools first, and then use devtools to install imputeR from github.

# install and load devtools
devtools::install_github("hadley/devtools")
library(devtools)
# install and load imputeR
install_github("yangjl/imputeR")
library(imputeR)

How to find help

Within "R" console, type ?impute_parent or help(impute_parent) to find help information about the function.

?impute_parent

How to load hdf5 file

To load hdf5 file, you need to install Vince's tasselr. If you fail to install them, please follow the above links and install the required dependencies. Note package "rhdf5" is a Bioconductor package, you have to use biocLite to install it.

# install devtools and then install the devlopmental version 
# of tasselr and ProgenyArray using devtools. 
devtools::install_github("hadley/devtools")
library(devtools)
source("https://bioconductor.org/biocLite.R")
biocLite("rhdf5")
install_github("vsbuffalo/tasselr") 
# to load the packages
library(tasselr)

The following several lines help you to reformat the *.h5 HDF5 file into an R object. You need to specify the path of the HDF5 file. Here is the Teosinte data:

# Note: at least 100G memory will be needed to load the hdf5 file
# load h5file
teo <- initTasselHDF5("largedata/teo.h5", version="5")
teo <- loadBiallelicGenotypes(teo, verbose = TRUE)
# reformat to imputeR object
# find help with ??imputeRob
ob1 <- imputeRob(h5=teo, missingcode=3)
save(file="largedata/teo.RData", list="ob1")  

Here is the landrace data:

# Note: at least 100G memory was needed to load the hdf5 file
# load h5file
land <- initTasselHDF5("largedata/maize_landrace.h5", version="5")
land <- loadBiallelicGenotypes(land, verbose = TRUE)
# reformat to imputeR object
ob2 <- imputeRob(h5=land, missingcode=3)
save(file="largedata/bode_data.RData", list="ob2")

\newpage

Infer parental genotype

If we have parent's WGS data, this step can be skipped.

We have observed mom, dad and observed kids (selfed and outcrossed). The likelihood of focal parents' genotype is

$P(G) = P(\theta|G)$,

where $P(G)$ is the probability of the genotype given the observed data. This consists of observed genotypes ($G'$) of mom, dad and kids. So:

$$P(\theta|G_m) = \prod\limits_{i=1}^{k1}(G'{k1}|G_m)\times \prod\limits{i=1}^{k2}\sum_{d}^{0,1,2}(G'_{k2}|G_m, G_d) \times P(G'_m|G_m)$$

where $G'{k1}|G_m)$ is the probability of a selfed kid given mom's genotype $G_m$. $\sum{d}^{0,1,2}(G'_{k2}|G_m, G_d)$ is the probability of a outcrossed kid given mom's genotype $G_m$ and sum over all dad's genotype $G_d$.

$P(G_m), P(G_d)$ is the probability of the genotype according to Hardy-Weinberg equilibrium estimated from the population.

We sum of the log probabilities over all kid and get the maximum likelihood genotype of the focal parent. The function impute_parent was implemented to compute mom's genotype probabilities.

Simulation using imputeR package

Above simulation steps were packed into a funcion sim.array. This function will simulate a GBS.array object for the following functions to use.

# find help
?sim.array
# make the simulation repeatable
set.seed(1234)
GBS.array <- sim.array(size.array=50, numloci=100, hom.error = 0.02, het.error = 0.8,
  rec = 0.25, selfing = 0, imiss = 0.5, misscode = 3)

We now impute parent genotypes using impute_parent and extract results using parentgeno. In the resulting table res, the first three columns are the probabilities of genotype 0, 1, 2. The 4th column is the odd ratio of the highest divided by the 2nd highest probability. gmax parent's genotype with the highest probability. gor parent's genotype with the highest probability and OR bigger than the specified threshold.

#The stuff:
inferred_geno_likes <- impute_parent(GBS.array, hom.error=0.02, het.error=0.8, imiss=0.5)
res <- parentgeno(inferred_geno_likes, oddratio=0.6931472, returnall=TRUE)
res$true_parent <- GBS.array@true_parents[[50]]$hap1 + GBS.array@true_parents[[50]]$hap2
#error rates
nrow(subset(res, gmax != true_parent ))/nrow(res)
nrow(subset(res, gor != true_parent & gor !=3 ))/nrow(res)

We also did ~100 simulations with size.array varied from 10 to 100. The results was plotted as below. When the size.array increased to more than 40, the error rates using both methods reduced to almost zero. It indicated that, with more than 40 kids the parent genotype can be confidently imputed.

alt text

Real Data

Then, for each parent, run impute_parent as in the toy example above. Note that you will need to supply a vector of allele frequencies at each locus estimated from the parents. If you do not supply this or leave p=NULL, a random allele frequency drawn from the neutral SFS will be used instead (this is Very Bad). Since parents are coded as 0,1, or 2 for $N$ parents the allele frequency $p$ at a locus can be calculated as $\frac{\sum_{i=1}^Np_i}{2N}$.

An even better way to estimate frequency: taking only the parents (~50) with selfed offspring, simply average the genotypes of the selfed offspring, sum, and divide by ~100. This should get a really good estimate of allele frequency. I would also start by dropping any locus for which you don't observe the non reference allele some minimum number (10 sounds good to me) of times. Finally, keep in mind that the frequency we are working with here is the non reference allele.

A new function "create_array" will read in the Geno4imputeR object, calculate allele freq from the selfed offspring, and split the data into different families. A set of GBS.array objects will be generated in the specified location for imputation.

library(imputeR)

load(file="largedata/teo.RData")
Geno4imputeR <- ob1
ped <- read.table("data/parentage_info.txt", header =TRUE)

ob2 <- create_array(Geno4imputeR, ped, outdir="largedata/",
                    maf_cutoff=0.002, lmiss_cutoff=0.8, imiss_cutoff=0.8, size_cutoff=40)

# run through each RData objects and collected the data into 10 data.frames (chromosomes)
# I ran it using a cluster to reduce computing time.
files <- list.files(path="largedata/obs", pattern="RData", full.names=TRUE)
for(JOBID in 1:length(files)){
    o <- load(files[JOBID])
    tem <- impute_parent(GBS.array=obj, hom.error = 0.02, het.error = 0.8, imiss = 0.5)
    res <- parentgeno(tem, oddratio=0.69, returnall=TRUE)

    outfile <- gsub("RData", "csv", files[JOBID])
    write.table(res, outfile, sep=",", row.names=FALSE, quote=FALSE)
}

The imputed parent (N=50) genotypes (seprated by chromosomes) had been deposited in this dir: ~/Documents/Github/phasing/largedata/ip/. These 10 files will be available upon request.

\newpage

Phasing focal parent's haplotype

According to Bayes' theorem, we got:

$$P(H_f|\theta) \propto P(\theta|H_f) \times P(H_f)$$

Our prior assumption for $P(H_f)$ is that all possible haplotypes are equally likely. So,

$$P(H_f | \theta) \propto P(\theta | H_f)$$

We assume $H_i$ as the haplotype of the other parent for the $i$th kid. Because all possible haplotypes ($H_i$, p ranged from 1 to k) inherited from other parents constitute a partition of the sample space, which includes $k$ kids reproduced by the focal parent ($H_f$). The hapoltypes ($H_i$) are pairwise mutually exclusive. Therefore,

$$P(H_f|\theta) \propto \sum_{i=1}^{k} P(\theta | H_f, H_i)$$

It can be further expressed by: $$P(H_f|\theta) \propto \sum_{i=1}^{k} P(K'_i | H_f, H_i)$$

$$P(H_f|\theta) \propto \sum_{i=1}^{k}\prod\limits_{l=1}^{n}{P(G'_{i,l}|H_1, H_2)} $$

$$P(\theta|H_m) = \prod\limits_{i=1}^{k1}(G'{k1}|H_m)\times \prod\limits{i=1}^{k2}(G'_{k2}|H_m, H_d)$$

Simulated data

We implemented a function phase_parent to do the phasing work. It was conducted in two steps by two major functions phase_chunk and join_chunks. First, we get the most likely focal parent's haplotype in a window. We then extend the window by one bp each time into a larger chromosomal chunk until the new haplotype in a window conflict with the existing chunk. In the conflict case, a new chromosomal chunk will be created. Second, we compute the possibilities of the haplotypes of two neighboring chunks. The most likely joined haplotype chunks will be returned by the function join_chunks.

# to make the random events repeatable
set.seed(123457)
# simulate a GBS.array object
GBS.array <- sim.array(size.array=50, numloci=100, hom.error = 0.02, het.error = 0.8,
                       rec = 0.25, selfing = 0.5, imiss = 0.5, misscode = 3)
# get perfect parent genotype
GBS.array <- get_true_GBS(GBS.array)
# get probability matrices
probs <- error_mx2(hom.error=0.02, het.error=0.8)
# phasing   
phase <- phase_parent(GBS.array, win_length=10, join_length=10, verbose=TRUE)

# compute error rate
out <- phase_error_rate(GBS.array, phase)

set.seed(13567)
GBS.array <- sim.array(size.array=50, numloci=500, hom.error=0.02, het.error=0.8, rec=0.25, selfing=0.5, imiss=0.5, misscode=3)
GBS.array <- get_true_GBS(GBS.array)
probs <- error_mx(hom.error=0.02, het.error=0.8)

Real data

\newpage

Phasing and Imputing Kids

$P(H_k|\theta) \propto P(\theta|H_k) \times P(H_k)$
$P(H_k|\theta) \propto \left( \prod\limits_{i=k}{P(H'k|H_k)} \right) \times P(H_k)$
$P(H_k|\theta) \propto \left( \prod\limits
{i=k}\prod\limits_{l=1}^{n}{P(G'_{i,l}|H_k)} \right) \times P(H_k)$

$$P(\theta_k|H_k) = \prod\limits_{s=1}^n{P(G'_{s}|H_m, H_d)}$$

alt text

  1. We calculate the maximum likelihood of the inherited haplotypes for a given window.
  2. We find hte munimum path of recombinations.
  3. At the potential recombination sites, we will find the maximum likelihood of the breaking points.

Simulated data

# to make the random events repeatable
set.seed(123457)
# simulate a GBS.array object
GBS.array <- sim.array(size.array=50, numloci=10000, hom.error = 0.02, het.error = 0.8,
                       rec = 0.25, selfing = 0.5, imiss = 0.5, misscode = 3)
# get perfectly phased parents
GBS.array <- get_phased(GBS.array, maxchunk=3)
# get probability matrices
probs <- error_mx(hom.error=0.02, het.error=0.8, imiss=0.5)
# phasing   
phase <- impute_kid(GBS.array, winsize=10, verbose=TRUE)

# compute error rate
g1 <- phase@gbs_kids[[1]]$hap1 + phase@gbs_kids[[1]]$hap2
g2 <- phase@true_kids[[1]]
sum(g1!=g2)/length(g1)


yangjl/imputeR documentation built on May 4, 2019, 2:28 p.m.