Bayesian linear mixed model

Description

A Gibbs sampler is used to find posterior mean estimates for the parameters of a linear mixed model.

Usage

ghap.blmm(fixed,random,weights=NULL,ordinal=FALSE,env.eff=FALSE,data,K,vu=4,vp=4,ve=4,
          R2=0.5,nchain=1,nsim=500,burnin=0,thin=1,ncores=1,verbose=TRUE)

Arguments

fixed

Formula describing the fixed effects part of the model, e.g. y ~ a + b + c ... If the model does not include any covariate simply state the response variable with an intercept, i.e. y ~ 1.

random

A character value with the name of the column in data containing the labels of the random effects.

weights

A numeric vector with residual weights. If not supplied, all residual weights are set to 1.

ordinal

A logical value indicating if responses are ordered categorical values (default=FALSE).

env.eff

A logical value indicating if permanent environmental effects should be included (default=FALSE).

data

A dataframe containing the data.

K

A covariance matrix for random effects.

vu

A numeric value specifying the prior degrees of freedom for the variance of random effects (default = 4).

vp

A numeric value specifying the prior degrees of freedom for the variance of permanent environmental effects (default = 4).

ve

A numeric value specifying the prior degrees of freedom for the residual variance (default = 4).

R2

A numeric value specifying the prior proportion of variance explained by the random effects (default = 0.5). If the covariance matrix is a genomic relationship matrix, this corresponds to the prior for the narrow sense heritability. If permanent environmental effects are included, this becomes the prior for the broad sense heritability.

nchain

A numeric value indicating the number of independent Markov chains (default = 1).

nsim

A numeric value specifying the number of simulations to be performed by the main MCMC algorithm (default = 500).

burnin

A numeric value specifying the number of simulations to be performed prior to the main MCMC algorithm (default = 0).

thin

A numeric value specifying the thinning interval (default = 1).

ncores

A numeric value specifying the number of processors to be used in parallelization of independent Markov chains (default = 1).

verbose

A logical value specifying whether log messages should be printed (default = TRUE).
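The arguments above can be illustrated with a small sketch of how the inputs might be laid out. It assumes that the labels in the column named by random are matched against the row and column names of K, which is an assumption and not something stated on this page. The object names (ids, dat, w) and all values are hypothetical, and the call to ghap.blmm is left commented out because it requires the GHap package.

```r
# Hypothetical labels for five individuals
ids <- paste0("id", 1:5)

# Toy covariance matrix (identity), with dimnames carrying the labels
# (assumption: labels are matched through the dimnames of K)
K <- diag(5)
dimnames(K) <- list(ids, ids)

# Data frame with a label column and a response; "id2" has two records
dat <- data.frame(individual = c("id1", "id2", "id2", "id4", "id5"),
                  y = c(10.2, 9.8, 10.5, 11.1, 9.4),
                  stringsAsFactors = FALSE)

# One residual weight per record (all 1, as when weights = NULL)
w <- rep(1, nrow(dat))

# Sanity checks before fitting
stopifnot(identical(rownames(K), colnames(K)))
stopifnot(all(dat$individual %in% rownames(K)))

# fit <- ghap.blmm(fixed = y ~ 1, random = "individual", data = dat, K = K,
#                  weights = w, nchain = 2, nsim = 1000, burnin = 200, thin = 2)
```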

Details

The function uses a Bayesian framework to fit the following linear mixed model:

\mathbf{y} = \mathbf{Xb} + \mathbf{Zu} + \mathbf{Zp} + \mathbf{e}

where \mathbf{X} is a design matrix relating \mathbf{y} to the vector of fixed effects \mathbf{b}, \mathbf{Z} is an incidence matrix relating \mathbf{y} to the correlated random effects \mathbf{u} and the permanent environmental effects \mathbf{p}, and \mathbf{e} is the vector of residuals. The likelihood of the data and the prior distributions of the parameters are assumed to be:

\mathbf{y} \mid \mathbf{b},\mathbf{u},\mathbf{p},σ_{u}^{2},σ_{p}^{2},σ_{e}^2 \sim N(\mathbf{Xb}+\mathbf{Zu}+\mathbf{Zp},\mathbf{W}σ_{e}^2)

\mathbf{b} \propto constant

\mathbf{u} \mid σ_{u}^2 \sim N(0,\mathbf{K}σ_{u}^2)

\mathbf{p} \mid σ_{p}^2 \sim N(0,\mathbf{I}σ_{p}^2)

σ_{u}^2 \sim χ^{-2}(ν_{u},S_u^2)

σ_{p}^2 \sim χ^{-2}(ν_{p},S_p^2)

σ_{e}^2 \sim χ^{-2}(ν_{e},S_e^2)

where \mathbf{K} is a covariance matrix for \mathbf{u}, σ_{u}^{2} and σ_{p}^{2} are the variances of \mathbf{u} and \mathbf{p}, respectively, \mathbf{W} is a residual covariance matrix and σ_{e}^{2} is the residual variance. The current implementation assumes \mathbf{W} = diag(w_i). The hyper-parameters ν_{u}, ν_{p} and ν_{e} are the prior degrees of freedom, and S_{u}^2, S_{p}^2 and S_{e}^2 the prior scale parameters, of the variances of the correlated random effects, the permanent environmental effects and the residuals, respectively.
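To make the model concrete, the following is a minimal sketch of a generic Gibbs sampler for a simplified version of it, with \mathbf{Z} = \mathbf{I}, \mathbf{W} = \mathbf{I} and no permanent environmental effects. It is not the code used by ghap.blmm. It assumes the common parameterization in which a scaled inverse chi-square draw is obtained as ν S^2 divided by a chi-square variate with ν degrees of freedom, and the toy matrix K, the simulated data and the hyper-parameter values are all hypothetical.

```r
set.seed(1)
n <- 60

# Toy covariance matrix K (hypothetical): correlations decay as 0.5^|i - j|
K <- 0.5^abs(outer(1:n, 1:n, "-"))
Kinv <- solve(K)
X <- matrix(1, n, 1)                       # intercept only

# Simulate data: b = 10, var(u) = 1, residual variance = 1
u.true <- as.vector(t(chol(K)) %*% rnorm(n))
y <- 10 + u.true + rnorm(n)

# Hyper-parameters (hypothetical values)
nu.u <- 4; S2.u <- 1
nu.e <- 4; S2.e <- 1

nsim <- 300; burnin <- 100
b <- mean(y); u <- rep(0, n); varu <- 1; vare <- 1
XtXinv <- solve(crossprod(X))
keep <- matrix(NA_real_, nsim, 3,
               dimnames = list(NULL, c("b", "varu", "vare")))

for (it in seq_len(burnin + nsim)) {
  # b | rest ~ N((X'X)^-1 X'(y - u), (X'X)^-1 * vare)
  bhat <- XtXinv %*% crossprod(X, y - u)
  b <- as.vector(bhat + t(chol(XtXinv * vare)) %*% rnorm(ncol(X)))

  # u | rest ~ N(C^-1 (y - Xb) / vare, C^-1), with C = I / vare + K^-1 / varu
  Cinv <- solve(diag(n) / vare + Kinv / varu)
  Cinv <- (Cinv + t(Cinv)) / 2             # enforce exact symmetry
  r <- y - as.vector(X %*% b)
  u <- as.vector(Cinv %*% r / vare + t(chol(Cinv)) %*% rnorm(n))

  # Variances | rest: scaled inverse chi-square full conditionals
  varu <- (nu.u * S2.u + as.numeric(t(u) %*% Kinv %*% u)) / rchisq(1, nu.u + n)
  e <- y - as.vector(X %*% b) - u
  vare <- (nu.e * S2.e + sum(e^2)) / rchisq(1, nu.e + n)

  if (it > burnin) keep[it - burnin, ] <- c(b, varu, vare)
}

# Posterior means of the intercept and the two variances
colMeans(keep)
```

Each parameter block is drawn from its full conditional distribution given the current values of all others, and posterior means are taken over the draws kept after burn-in.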

In the case of ordered categorical data, categories are assumed to emerge from thresholds of a latent normal variable (i.e., "liability"). The function uses data augmentation to sample thresholds and observations from the underlying latent variable, which are then treated as responses in the main algorithm. More details about the MCMC algorithm can be found in our vignette.
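The threshold idea can be sketched in a few lines. The example below shows how ordered categories arise from cutting a latent normal variable at fixed thresholds, and how one data augmentation step could redraw the liabilities from a normal distribution truncated to the interval implied by each observed category. The threshold values, the helper rtnorm and the zero mean are hypothetical, and this is a generic illustration, not the sampler implemented in the package.

```r
set.seed(2)
n <- 1000

# Latent liabilities and hypothetical thresholds defining three categories
liability <- rnorm(n)
thresholds <- c(-0.5, 0.8)
bounds <- c(-Inf, thresholds, Inf)
category <- cut(liability, breaks = bounds, labels = FALSE)
table(category)

# Truncated normal draw by inversion of the normal CDF (unit variance)
rtnorm <- function(mu, lower, upper) {
  p <- runif(length(mu), pnorm(lower - mu), pnorm(upper - mu))
  mu + qnorm(p)
}

# Augmentation step: given the category, redraw the liability inside the
# interval delimited by the thresholds of that category
lower <- bounds[category]
upper <- bounds[category + 1]
mu <- rep(0, n)                  # stand-in for the current linear predictor
draw <- rtnorm(mu, lower, upper)

# The redrawn liabilities reproduce the observed categories
all(cut(draw, breaks = bounds, labels = FALSE) == category)
```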

Value

The returned GHap.blmm object is a list with the following items:

nchain

Number of independent Markov chains.

nsim

Number of simulations performed by the main MCMC algorithm.

thin

Thinning interval used by the MCMC algorithm.

eff.nsim

The effective number of samples (nsim/thin) used for posterior computations.

b

A numeric vector containing the posterior means of the fixed effects.

u

A numeric vector containing the posterior means for the correlated random effects.

p

A numeric vector containing the posterior means for the permanent environmental effects. This vector is suppressed if env.eff=FALSE.

varu

A numeric value for the posterior mean of the variance of correlated random effects.

varp

A numeric value for the posterior mean of the variance of permanent environmental effects. This value is suppressed if env.eff=FALSE.

vare

A numeric value for the posterior mean of the residual variance.

h2

A numeric value for the posterior mean of the variance explained by correlated random effects only.

H2

A numeric value for the posterior mean of the variance explained by random effects. This value is suppressed if env.eff=FALSE.

k

A numeric vector containing the solutions for \mathbf{K}^{-1}\mathbf{\hat{u}}. This vector is used by the ghap.blup function.

y

A numeric vector containing the records used to fit the model.

weights

A numeric vector containing the residual weights used to fit the model.

residuals

A numeric vector containing residuals computed based on the posterior mean of the model parameters.

dev

Posterior mean of the deviance (-2*log-likelihood).

pdev

Deviance evaluated at the posterior mean of model parameters.
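Some of the items above are simple functions of the stored samples, which the following sketch illustrates. It assumes that h2 and H2 are averages of per-sample variance ratios with all variance components in the denominator, and it combines dev and pdev into the deviance information criterion using its usual definition; neither assumption is confirmed by this page, and the function is not stated to report a DIC. All draws and deviance values are hypothetical.

```r
set.seed(3)
nsim <- 500; thin <- 5

# Effective number of samples and the iterations that are kept
eff.nsim <- nsim / thin
kept <- seq(thin, nsim, by = thin)

# Hypothetical variance-component draws standing in for sampler output
varu.s <- rgamma(nsim, shape = 20, rate = 20)
varp.s <- rgamma(nsim, shape = 20, rate = 40)
vare.s <- rgamma(nsim, shape = 20, rate = 20)
total <- varu.s + varp.s + vare.s

# Posterior means of the ratios, averaged over the kept samples
h2 <- mean((varu.s / total)[kept])
H2 <- mean(((varu.s + varp.s) / total)[kept])

# Effective number of parameters and DIC from dev and pdev (hypothetical values)
dev <- 1520.4
pdev <- 1498.1
pD <- dev - pdev
DIC <- dev + pD
```

Averaging the ratio over samples is generally not the same as taking the ratio of the posterior means varu and vare, so h2 need not equal varu/(varu + vare) exactly.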

If the model is fitted to ordered categorical data, the following items are added to the object:

liability

Posterior means of liabilities.

thresholds

A numeric vector containing the posterior means of the thresholds in the liability scale.

Additionally, if multiple independent chains are run, the following items are included under inter-chain:

b.sd

A numeric vector containing standard deviations for inter-chain estimates of the fixed effects.

u.sd

A numeric vector containing standard deviations for inter-chain estimates of the correlated random effects.

p.sd

A numeric vector containing standard deviations for inter-chain estimates of the permanent environmental effects. This vector is suppressed if env.eff=FALSE.

varu.sd

A numeric value containing the standard deviation of inter-chain estimates of the variance of correlated random effects.

varp.sd

A numeric value containing the standard deviation of inter-chain estimates of the variance of permanent environmental effects. This value is suppressed if env.eff=FALSE.

vare.sd

A numeric value containing the standard deviation of inter-chain estimates of the residual variance.

h2.sd

A numeric value containing the standard deviation of inter-chain estimates of the variance explained by the correlated random effects.

H2.sd

A numeric value containing the standard deviation of inter-chain estimates of the variance explained by the correlated random effects and permanent environmental effects. This value is suppressed if env.eff=FALSE.

dev.sd

A numeric value containing the standard deviation of inter-chain estimates of the deviance.
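The inter-chain items can be read as the spread of per-chain estimates. The sketch below assumes that each chain yields its own posterior mean, that the reported estimate is the average over chains and that the .sd items are the sample standard deviations over chains; the pooling rule is an assumption, and the per-chain values are hypothetical.

```r
# Hypothetical posterior means of two fixed effects from three chains
chain.b <- rbind(chain1 = c(10.1, 0.52),
                 chain2 = c(9.9, 0.48),
                 chain3 = c(10.0, 0.50))

# Pooled estimate (assumed to be the mean over chains)
b <- colMeans(chain.b)

# Inter-chain standard deviation of each fixed effect
b.sd <- apply(chain.b, 2, sd)
```

A small inter-chain standard deviation relative to the estimate suggests that the chains agree; a large one suggests that nsim or burnin should be increased.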

Author(s)

Yuri Tani Utsunomiya <ytutsunomiya@gmail.com>

References

P. Perez and G. de Los Campos. Genome-Wide Regression and Prediction with the BGLR Statistical Package. Genetics. 2014. 198:483-495.

D. A. Sorensen et al. Bayesian inference in threshold models using Gibbs sampling. Genet Sel Evol. 1995. 27:229-249.

C. S. Wang et al. Bayesian analysis of mixed linear models via Gibbs sampling with an application to litter size in Iberian pigs. Genet Sel Evol. 1994. 26:91-115.

Examples

# #### DO NOT RUN IF NOT NECESSARY ###
# 
# # Copy the example data in the current working directory
# ghap.makefile()
# 
# # Load data
# phase <- ghap.loadphase("human.samples", "human.markers", "human.phase")
# 
# # Subset data - randomly select 3000 markers with maf > 0.02
# maf <- ghap.maf(phase, ncores = 2)
# set.seed(1988)
# markers <- sample(phase$marker[maf > 0.02], 3000, replace = FALSE)
# phase <- ghap.subsetphase(phase, unique(phase$id), markers)
# rm(maf,markers)
# 
# # Generate block coordinates based on windows of 10 markers, sliding 5 markers at a time
# blocks <- ghap.blockgen(phase, 10, 5, "marker")
# 
# # Generate matrix of haplotype genotypes
# ghap.haplotyping(phase, blocks, batchsize = 100, ncores = 2, freq = 0.05, outfile = "example")
# 
# # Load haplotype genotypes
# haplo <- ghap.loadhaplo("example.hapsamples", "example.hapalleles", "example.hapgenotypes")
# 
# # Compute kinship matrix
# K <- ghap.kinship(haplo, batchsize = 100)
# 
# # Quantitative trait with 50% heritability
# # One major haplotype accounting for 30% of the genetic variance
# sim <- ghap.simpheno(haplo = haplo, K = K, h2 = 0.5, g2 = 0.3, major = 1000,seed=1988)
# 
# # Binary trait from the previous example
# # 0 if observation is below the 70th percentile
# # 1 otherwise
# thr <- quantile(x = sim$data$phenotype, probs = 0.7)
# sim$data$phenotype2 <- sim$data$phenotype
# sim$data$phenotype2[sim$data$phenotype < thr] <- 0
# sim$data$phenotype2[sim$data$phenotype >= thr] <- 1
# 
# ### RUN ###
# 
# #Continuous model
# model <- ghap.blmm(fixed = phenotype ~ 1, random = "individual", data = sim$data, K = K)
# model$h2
# plot(model$u,sim$u, ylab="True Breeding Value", xlab="Estimated Breeding Value")
# cor(model$u,sim$u)
# 
# #Threshold model
# model <- ghap.blmm(fixed = phenotype2 ~ 1, random = "individual",
#                    ordinal = TRUE, data = sim$data, K = K)
# model$h2
# plot(model$u,sim$u, ylab="True Breeding Value", xlab="Estimated Breeding Value")
# cor(model$u,sim$u)
# model$thresholds[2]
# quantile(x = scale(sim$data$phenotype), probs = 0.7)