The package BATools
is used to perform genome-wide association using a various Bayesian models. It is a implemented using both Markov Chain Monte Carlo (MCMC) and Maximum a posteriori (MAP) algorithm.
The basic functions in BATools
is bafit
, which fits a genomic selection model using different prior selection. The main characteristic of this package are:
Whole genome prediction (WGP) is an evolutionary development in animal breeding. Currently, many models have been developed for WGP, which included rrBLUP, BayesA, BayesB, SSVS, Bayesian Lasso, Antedepedence BayesA/B (Meuwissen et al. 2001, VanRaden 2008, de los Campos et al. 2009, Habier et al. 2011, and Yang and Tempelman 2012). The major difference of these models are different prior assumptions on marker effects. Software packages like BGLR
and GenSel
implement BayesA, BayesB and Bayes Lasso model using MCMC algorithm. No public software is available to implement Antedependence models. At the same time, no R package is available for implement BayesA/C using EM algorithm for animal breeding. BATools
package provides tools to fit Antedependence models in addition to some of the most popular models and provider faster EM algorithms to fit the model. The table below is a comparison between BATools
and BGLR
:
Model/Algorithms | MCMC | EM ---------------- |------------------------| -------------------- rrBLUP | BATools/BGLR | BATools BayesA | BATools/BGLR | BATools BayesB | BATools/BGLR | under development SSVS | BATools/BGLR | BATools AnteBayesA | BATools | under development AnteBayesB | BATools | under development ssSSVS | BATools | under development ssGBLUP | BATools | under development
The basic model used by BATools
is:
$$
\boldsymbol{y} = \boldsymbol{X}\cdot \boldsymbol{b}+\boldsymbol{Z} \cdot \boldsymbol{g} + \boldsymbol{e},
$$
where:
Notice that for different models, the priors on $\boldsymbol{g}_i$ are different:
rrBLUP: $\boldsymbol{g}j\sim N\left ( \boldsymbol{0},\boldsymbol{I}\sigma{g}^{2} \right )$
BayesA: $\boldsymbol{g}j\sim N\left ( \boldsymbol{0},\boldsymbol{D}\sigma{g}^{2} \right )$, where $\boldsymbol{D}={ \tau_1,\tau_2, ...,\tau_m }$ and $\tau_j \sim \chi^{-2}\left( \nu_g,\nu_g \right)$
BayesB: $\boldsymbol{g}j\sim N\left ( \boldsymbol{0},\boldsymbol{D}\sigma{g}^{2} \right )$, where $\boldsymbol{D}={ \tau_1,\tau_2, ...,\tau_m }$ and $$\tau_j={\begin{array} {rrr} 0 & \mbox{with probability} & \pi \ {\sim \chi^{-2}\left( \nu_g,\nu_g \right)} & \mbox{with probability} & 1-\pi \end{array}$$
SSVS: $\boldsymbol{g}i\sim N\left ( \boldsymbol{0},\boldsymbol{D}\sigma{g}^{2} \right )$, where $\boldsymbol{D}={ \tau_1+\frac{1-\tau_1}{c},\tau_2+\frac{1-\tau_2}{c}, ...,\tau_m+\frac{1-\tau_m}{c}}$ and $\tau_j\sim Bernoulli(\pi)$, $\tau_j=0,1$
Furthermore, the Antedepedence models specify correlation structure for $\boldsymbol{g}$ based on the relative physical location of SNP markers along the chromosome : $$g_j={\begin{array} {rrr} \delta_j & \mbox{if} & j=1 \ t_{j,j-1}\delta_{j-1}+\delta_j & \mbox{if} & 2\leq j \leq m \end{array} $$ where $t_{j,j-1}\sim N\left ( \mu_t,\sigma^2_t \right )$
In BATools
, we adhered the data structure of the object gpData
in the synbreed
package. The input and output objects are named as baData
and BAout
, which are R object class list
. Therefore, users can directly use synbreed
object as the input for BATools
, and vice versa. More detailed explanation about baData
and BAout
can be found in the package manual file.
We will use a toy dataset from the MSUPRP population to illustrate the use of BATools
. The data in the demo is explained in https://github.com/chenchunyu88/batoolsdata/blob/master/MSUPRP.ipynb.
rm(list=ls()) library(BATools) data("Pig")
We choose to demonstrate how to fit BayesA using MCMC and MAP. We start with MCMC:
#Standardize genotype matrix geno=std_geno(PigM,method="s",freq=PigAlleleFreq) init=set_init("driploss",data=PigPheno,geno=geno,"id",df=5,pi_snp=1,h2=0.5,c=NULL,model="BayesA",centered=TRUE) #or set your own starting values using #init=list(df=5,scale=0.01,pi=1) run_para=list(niter=2000,burnIn=1000,skip=10) print_mcmc=list(piter=500) update_para=list(df=FALSE,scale=TRUE,pi=FALSE) op<-create.options(model="BayesA",method="MCMC",priors=NULL,init=init, update_para=update_para,run_para=run_para,save.at="BayesA",print_mcmc=print_mcmc)
We then fit the model using MCMC for the trait driploss
with the above setups:
BA<-baFit(driploss~sex,data=PigPheno,geno=geno ,genoid = ~id,options = op,map=PigMap,GWA="Win") BA
We can obtain the traceplot for MCMC:
par(mar=c(2,2,2,2)) plot(BA,type="trace",op=op)
We can also run cross-validation study
set.seed(1234) PigPheno=createCV(data = PigPheno,k=5,"driploss") cvBA<-baFit(driploss~sex,data=PigPheno,geno=geno ,genoid = ~id,options = op, train=~cv1) cvBA par(mfrow=c(1,1)) plot(cvBA)
To use the MAP algorithm for BayesA in BATools
, we first run an analysis using GBLUP:
##################run rrBLUP REML##################### init=set_init("driploss",data=PigPheno,geno=geno,"id",df=5,pi_snp=1,h2=0.5,c=NULL,model="GBLUP",centered=TRUE) #or set your own starting values using #init=list(df=5,scale=0.01,pi=1) run_para=list(maxiter=100) update_para=list(df=FALSE,scale=TRUE,pi=FALSE) op<-create.options(model="GBLUP",method="REML",priors=NULL,init=init, update_para=update_para,run_para=run_para,save.at="GBLUP",print_mcmc=NULL) ###Tested it's the same with other REML packages using the default settings gblup<-baFit(driploss~sex,data=PigPheno,geno=geno ,genoid = ~id,options = op,map=PigMap,GWA="Win") gblup set.seed(1234) PigPheno=createCV(data = PigPheno,k=5,"driploss") cvgblup<-baFit(driploss~sex,data=PigPheno,geno=geno ,genoid = ~id,options = op, train=~cv1)
Then we use rrBLUP results as starting values for MAP BayesA:
init=set_init("driploss",data=PigPheno,geno=geno,"id", df=5,scale=gblup$hyper_est[2],vare = gblup$hyper_est[1],g=gblup$ghat, beta=gblup$betahat,pi_snp=1,h2=0.5,c=NULL,model="BayesA",centered=TRUE,from="GBLUP") run_para=list(maxiter=100) update_para=list(df=FALSE,scale=TRUE,pi=FALSE) op<-create.options(model="BayesA",method="MAP",priors=NULL,init=init, update_para=update_para,run_para=run_para,save.at="mapBayesA",print_mcmc=NULL,D="P") mapBA<-baFit(driploss~sex,data=PigPheno,geno=geno ,genoid = ~id,options = op,map = PigMap,GWA="Win") mapBA set.seed(1234) PigPheno=createCV(data = PigPheno,k=5,"driploss") cvmapBA<-baFit(driploss~sex,data=PigPheno,geno=geno ,genoid = ~id,options = op, train=~cv1) cvmapBA
Let's look at the estimated phenotypes v.s. true phenotypes for EM:
We can also compare the difference bewteen MCMC and EM:
plot(BA$ghat,mapBA$ghat,xlab="MCMC",ylab="EM",main="BayesA MCMC v.s. EM") abline(a=0,b=1)
Running SSVS is similar to running BayesA:
#This code demonstrate SSVS model init=set_init("driploss",data=PigPheno,geno=geno,"id",df=5,pi_snp=0.001,h2=0.5,c=1000,model="SSVS",centered=TRUE) #or set your own starting values using #init=list(df=5,scale=0.01,pi=1) run_para=list(niter=2000,burnIn=1000,skip=10) print_mcmc=list(piter=500) update_para=list(df=FALSE,scale=TRUE,pi=F) op<-create.options(model="SSVS",method="MCMC",seed=1,priors=NULL,init=init, update_para=update_para,run_para=run_para,save.at="SSVS",print_mcmc=print_mcmc) SSVS<-baFit(driploss~sex,data=PigPheno,geno=geno ,genoid = ~id,options = op,map=PigMap,GWA="Win") SSVS set.seed(1234) PigPheno=createCV(data = PigPheno,k=5,"driploss") cvSSVS<-baFit(driploss~sex,data=PigPheno,geno=geno ,genoid = ~id,options = op, train=~cv1) init=set_init("driploss",data=PigPheno,geno=geno,"id", df=5,scale=gblup$hyper_est[2],vare = gblup$hyper_est[1],g=gblup$ghat, beta=gblup$betahat,pi_snp=0.001,post_prob = NULL,h2=0.5,c=1000,model="SSVS",centered=TRUE,from="GBLUP") run_para=list(maxiter=100) update_para=list(df=FALSE,scale=TRUE,pi=FALSE) op<-create.options(model="SSVS",method="MAP",priors=NULL,init=init, update_para=update_para,run_para=run_para,save.at="mapSSVS",print_mcmc=NULL) mapSSVS<-baFit(driploss~sex,data=PigPheno,geno=geno ,genoid = ~id,options = op,map = PigMap,GWA="Win") mapSSVS set.seed(1234) PigPheno=createCV(data = PigPheno,k=5,"driploss") cvmapSSVS<-baFit(driploss~sex,data=PigPheno,geno=geno ,genoid = ~id,options = op, train=~cv1)
We can also compare the difference bewteen MCMC and EM for SSVS:
plot(SSVS$ghat,mapSSVS$ghat,xlab="MCMC",ylab="EM",main="SSVS MCMC v.s. EM") abline(a=0,b=1)
We can also compare the difference bewteen BayesA and SSVS for MCMC:
plot(BA$ghat,SSVS$ghat,xlab="BayesA",ylab="SSVS",main="BayesA v.s. SSVS in MCMC") abline(a=0,b=1)
We can also compare the difference bewteen BayesA and SSVS for EM:
plot(mapBA$ghat,mapSSVS$ghat,xlab="BayesA",ylab="SSVS",main="BayesA v.s. SSVS in MAP") abline(a=0,b=1)
init=set_init("driploss",data=PigPheno,geno=geno,"id",df=5,pi_snp=0.001,h2=0.5,c=NULL,model="BayesB",centered=TRUE) run_para=list(niter=2000,burnIn=1000,skip=10) print_mcmc=list(piter=500) update_para=list(df=FALSE,scale=TRUE,pi=F) op<-create.options(model="BayesB",method="MCMC",priors=NULL,init=init, update_para=update_para,run_para=run_para,save.at="BayesB",print_mcmc=print_mcmc) BB<-baFit(driploss~sex,data=PigPheno,geno=geno ,genoid = ~id,options = op,map=PigMap,GWA="Win") #### Cross-validation using BATools set.seed(1234) PigPheno=createCV(data = PigPheno,k=5,"driploss") head(PigPheno) cvBB<-baFit(driploss~sex,data=PigPheno,geno=geno ,genoid = ~id,options = op, train=~cv1)
We can use get_pvalues
function to obtain p-values from EM algorithms.And the manhattan_plot
function creates manhattan plots using those p-values.
par(mfrow=c(2,2)) man_plot_pvalue(gblup) man_plot_pvalue(gblup,type="Win") man_plot_prob(SSVS) man_plot_prob(SSVS,type="Win")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.