The package BATools is used to perform genome-wide association using a various Bayesian models. It is a implemented using both Markov Chain Monte Carlo (MCMC) and Maximum a posteriori (MAP) algorithm.

The basic functions in BATools is bafit, which fits a genomic selection model using different prior selection. The main characteristic of this package are:

1. Introduction

Whole genome prediction (WGP) is an evolutionary development in animal breeding. Currently, many models have been developed for WGP, which included rrBLUP, BayesA, BayesB, SSVS, Bayesian Lasso, Antedepedence BayesA/B (Meuwissen et al. 2001, VanRaden 2008, de los Campos et al. 2009, Habier et al. 2011, and Yang and Tempelman 2012). The major difference of these models are different prior assumptions on marker effects. Software packages like BGLR and GenSel implement BayesA, BayesB and Bayes Lasso model using MCMC algorithm. No public software is available to implement Antedependence models. At the same time, no R package is available for implement BayesA/C using EM algorithm for animal breeding. BATools package provides tools to fit Antedependence models in addition to some of the most popular models and provider faster EM algorithms to fit the model. The table below is a comparison between BATools and BGLR:

Model/Algorithms | MCMC | EM ---------------- |------------------------| -------------------- rrBLUP | BATools/BGLR | BATools BayesA | BATools/BGLR | BATools BayesB | BATools/BGLR | under development SSVS | BATools/BGLR | BATools AnteBayesA | BATools | under development AnteBayesB | BATools | under development ssSSVS | BATools | under development ssGBLUP | BATools | under development

2. Basic Model

The basic model used by BATools is: $$ \boldsymbol{y} = \boldsymbol{X}\cdot \boldsymbol{b}+\boldsymbol{Z} \cdot \boldsymbol{g} + \boldsymbol{e}, $$ where:

Notice that for different models, the priors on $\boldsymbol{g}_i$ are different:

Furthermore, the Antedepedence models specify correlation structure for $\boldsymbol{g}$ based on the relative physical location of SNP markers along the chromosome : $$g_j={\begin{array} {rrr} \delta_j & \mbox{if} & j=1 \ t_{j,j-1}\delta_{j-1}+\delta_j & \mbox{if} & 2\leq j \leq m \end{array} $$ where $t_{j,j-1}\sim N\left ( \mu_t,\sigma^2_t \right )$

3. BATools example

In BATools, we adhered the data structure of the object gpData in the synbreed package. The input and output objects are named as baData and BAout, which are R object class list. Therefore, users can directly use synbreed object as the input for BATools, and vice versa. More detailed explanation about baData and BAout can be found in the package manual file.

We will use a toy dataset from the MSUPRP population to illustrate the use of BATools. The data in the demo is explained in https://github.com/chenchunyu88/batoolsdata/blob/master/MSUPRP.ipynb.

Load packages and data

rm(list=ls())
library(BATools)
data("Pig")

Set up initial values for the model

We choose to demonstrate how to fit BayesA using MCMC and MAP. We start with MCMC:

#Standardize genotype matrix
geno=std_geno(PigM,method="s",freq=PigAlleleFreq)
init=set_init("driploss",data=PigPheno,geno=geno,"id",df=5,pi_snp=1,h2=0.5,c=NULL,model="BayesA",centered=TRUE)
#or set your own starting values using 
#init=list(df=5,scale=0.01,pi=1) 
run_para=list(niter=2000,burnIn=1000,skip=10)
print_mcmc=list(piter=500)
update_para=list(df=FALSE,scale=TRUE,pi=FALSE)
op<-create.options(model="BayesA",method="MCMC",priors=NULL,init=init,
                   update_para=update_para,run_para=run_para,save.at="BayesA",print_mcmc=print_mcmc)

Fit the model

We then fit the model using MCMC for the trait driploss with the above setups:

BA<-baFit(driploss~sex,data=PigPheno,geno=geno ,genoid = ~id,options = op,map=PigMap,GWA="Win")
BA

Graphics

We can obtain the traceplot for MCMC:

par(mar=c(2,2,2,2))
plot(BA,type="trace",op=op)

We can also run cross-validation study

set.seed(1234)
PigPheno=createCV(data = PigPheno,k=5,"driploss")
cvBA<-baFit(driploss~sex,data=PigPheno,geno=geno ,genoid = ~id,options = op, train=~cv1)
cvBA
par(mfrow=c(1,1))
plot(cvBA)

EM algorithm

To use the MAP algorithm for BayesA in BATools, we first run an analysis using GBLUP:

##################run rrBLUP REML#####################
init=set_init("driploss",data=PigPheno,geno=geno,"id",df=5,pi_snp=1,h2=0.5,c=NULL,model="GBLUP",centered=TRUE)
#or set your own starting values using 
#init=list(df=5,scale=0.01,pi=1) 
run_para=list(maxiter=100)
update_para=list(df=FALSE,scale=TRUE,pi=FALSE)
op<-create.options(model="GBLUP",method="REML",priors=NULL,init=init,
                   update_para=update_para,run_para=run_para,save.at="GBLUP",print_mcmc=NULL)

###Tested it's the same with other REML packages using the default settings
gblup<-baFit(driploss~sex,data=PigPheno,geno=geno ,genoid = ~id,options = op,map=PigMap,GWA="Win")
gblup
set.seed(1234)
PigPheno=createCV(data = PigPheno,k=5,"driploss")
cvgblup<-baFit(driploss~sex,data=PigPheno,geno=geno ,genoid = ~id,options = op, train=~cv1)

Then we use rrBLUP results as starting values for MAP BayesA:

init=set_init("driploss",data=PigPheno,geno=geno,"id",
              df=5,scale=gblup$hyper_est[2],vare = gblup$hyper_est[1],g=gblup$ghat,
              beta=gblup$betahat,pi_snp=1,h2=0.5,c=NULL,model="BayesA",centered=TRUE,from="GBLUP")

run_para=list(maxiter=100)
update_para=list(df=FALSE,scale=TRUE,pi=FALSE)
op<-create.options(model="BayesA",method="MAP",priors=NULL,init=init,
                   update_para=update_para,run_para=run_para,save.at="mapBayesA",print_mcmc=NULL,D="P")

mapBA<-baFit(driploss~sex,data=PigPheno,geno=geno ,genoid = ~id,options = op,map = PigMap,GWA="Win")
mapBA
set.seed(1234)
PigPheno=createCV(data = PigPheno,k=5,"driploss")
cvmapBA<-baFit(driploss~sex,data=PigPheno,geno=geno ,genoid = ~id,options = op, train=~cv1)
cvmapBA

Graphics

Let's look at the estimated phenotypes v.s. true phenotypes for EM:

We can also compare the difference bewteen MCMC and EM:

plot(BA$ghat,mapBA$ghat,xlab="MCMC",ylab="EM",main="BayesA MCMC v.s. EM")
abline(a=0,b=1)

SSVS

Running SSVS is similar to running BayesA:

#This code demonstrate SSVS model
init=set_init("driploss",data=PigPheno,geno=geno,"id",df=5,pi_snp=0.001,h2=0.5,c=1000,model="SSVS",centered=TRUE)
#or set your own starting values using 
#init=list(df=5,scale=0.01,pi=1) 
run_para=list(niter=2000,burnIn=1000,skip=10)
print_mcmc=list(piter=500)
update_para=list(df=FALSE,scale=TRUE,pi=F)
op<-create.options(model="SSVS",method="MCMC",seed=1,priors=NULL,init=init,
                   update_para=update_para,run_para=run_para,save.at="SSVS",print_mcmc=print_mcmc)

SSVS<-baFit(driploss~sex,data=PigPheno,geno=geno ,genoid = ~id,options = op,map=PigMap,GWA="Win")
SSVS
set.seed(1234)
PigPheno=createCV(data = PigPheno,k=5,"driploss")
cvSSVS<-baFit(driploss~sex,data=PigPheno,geno=geno ,genoid = ~id,options = op, train=~cv1)

init=set_init("driploss",data=PigPheno,geno=geno,"id",
              df=5,scale=gblup$hyper_est[2],vare = gblup$hyper_est[1],g=gblup$ghat,
              beta=gblup$betahat,pi_snp=0.001,post_prob = NULL,h2=0.5,c=1000,model="SSVS",centered=TRUE,from="GBLUP")
run_para=list(maxiter=100)
update_para=list(df=FALSE,scale=TRUE,pi=FALSE)
op<-create.options(model="SSVS",method="MAP",priors=NULL,init=init,
                   update_para=update_para,run_para=run_para,save.at="mapSSVS",print_mcmc=NULL)
mapSSVS<-baFit(driploss~sex,data=PigPheno,geno=geno ,genoid = ~id,options = op,map = PigMap,GWA="Win")
mapSSVS
set.seed(1234)
PigPheno=createCV(data = PigPheno,k=5,"driploss")
cvmapSSVS<-baFit(driploss~sex,data=PigPheno,geno=geno ,genoid = ~id,options = op, train=~cv1)

We can also compare the difference bewteen MCMC and EM for SSVS:

plot(SSVS$ghat,mapSSVS$ghat,xlab="MCMC",ylab="EM",main="SSVS MCMC v.s. EM")
abline(a=0,b=1)

We can also compare the difference bewteen BayesA and SSVS for MCMC:

plot(BA$ghat,SSVS$ghat,xlab="BayesA",ylab="SSVS",main="BayesA v.s. SSVS in MCMC")
abline(a=0,b=1)

We can also compare the difference bewteen BayesA and SSVS for EM:

plot(mapBA$ghat,mapSSVS$ghat,xlab="BayesA",ylab="SSVS",main="BayesA v.s. SSVS in MAP")
abline(a=0,b=1)

BayesB

init=set_init("driploss",data=PigPheno,geno=geno,"id",df=5,pi_snp=0.001,h2=0.5,c=NULL,model="BayesB",centered=TRUE)
run_para=list(niter=2000,burnIn=1000,skip=10)
print_mcmc=list(piter=500)
update_para=list(df=FALSE,scale=TRUE,pi=F)
op<-create.options(model="BayesB",method="MCMC",priors=NULL,init=init,
                   update_para=update_para,run_para=run_para,save.at="BayesB",print_mcmc=print_mcmc)

BB<-baFit(driploss~sex,data=PigPheno,geno=geno ,genoid = ~id,options = op,map=PigMap,GWA="Win")
#### Cross-validation using BATools
set.seed(1234)
PigPheno=createCV(data = PigPheno,k=5,"driploss")
head(PigPheno)
cvBB<-baFit(driploss~sex,data=PigPheno,geno=geno ,genoid = ~id,options = op, train=~cv1)

GWA

We can use get_pvalues function to obtain p-values from EM algorithms.And the manhattan_plot function creates manhattan plots using those p-values.

par(mfrow=c(2,2))
man_plot_pvalue(gblup)
man_plot_pvalue(gblup,type="Win")
man_plot_prob(SSVS)
man_plot_prob(SSVS,type="Win")

Under development ...



chenchunyu88/BATools documentation built on May 19, 2019, 8:21 a.m.