6. BLUP estimation from Linear Mixed Model | R Documentation |
Solves the Linear Mixed Model and calculates the Best Linear Unbiased Predictor (BLUP)
fitBLUP(y, X = NULL, Z = NULL, K = NULL, trn = NULL,
EVD = NULL, varU = NULL, varE = NULL,
ID_geno = NULL, ID_trait = NULL, intercept = TRUE,
BLUP = TRUE, method = c("REML","ML"),
interval = c(1E-9,1E9), tol = 1E-8, maxiter = 1000,
n.regions = 10, verbose = TRUE)
y |
(numeric vector) Response variable. It can be a matrix with each column representing a different response variable |
X |
(numeric matrix) Design matrix for fixed effects. When |
Z |
(numeric matrix) Design matrix for random effects. When |
K |
(numeric matrix) Kinship relationship matrix |
trn |
(integer vector) Which elements from vector |
EVD |
(list) Eigenvectors and eigenvalues from eigenvalue decomposition (EVD) of G corresponding to training data |
ID_geno |
(character/integer) For within-trait analysis only, vector with either names or indices mapping entries of the vector |
ID_trait |
(character/integer) For within-trait analysis only, vector with either names or indices mapping entries of the vector |
varU , varE |
(numeric) Genetic and residual variances. When both |
intercept |
|
BLUP |
|
method |
(character) Either 'REML' (Restricted Maximum Likelihood) or 'ML' (Maximum Likelihood) |
tol |
(numeric) Maximum error between two consecutive solutions (convergence tolerance) when finding the root of the log-likelihood's first derivative |
maxiter |
(integer) Maximum number of iterations to run before convergence is reached |
interval |
(numeric vector) Range of values in which the root is searched |
n.regions |
(numeric) Number of regions in which the searched 'interval' is divided for local optimization |
verbose |
|
The basic linear mixed model that relates phenotypes with genetic values is of the form
y = X b + Z g + e
where y is a vector with the response, b is the vector of fixed effects, u is the vector of the (random) genetic effects of the genotypes, e is the vector of environmental residuals (random error), and X and Z are design matrices for the fixed and genetic effects, respectively. Genetic effects are assumed to follow a Normal distribution as g ~ N(0,σ2uK), and the error terms are assumed e ~ N(0,σ2eI).
The vector of total genetic values u = Z g will therefore follow u ~ N(0,σ2uG) where G = Z K Z'. In the un-replicated case, Z = I is an identity matrix, and hence u = g and G = K.
The predicted values utrn = {ui}, i = 1,2,...,ntrn, corresponding to observed data (training set) are derived as
utrn = B (ytrn - Xtrnb)
where B is a matrix of weights given by
B = σ2uGtrn (σ2uGtrn + σ2eI)-1
where Gtrn is the sub-matrix corresponding to the training set. This matrix can be rewritten as
B = Gtrn (Gtrn + θI)-1
where θ = σ2e/σ2u is the residual/genetic variances ratio representing a ridge-like shrinkage parameter.
The matrix H = Gtrn + θI in the above equation can be used to obtain predictions corresponding to un-observed data (testing set), utst = {ui}, i = 1,2,...,ntst, by
B = Gtst,trn (Gtrn + θI)-1
where Gtst,trn is the sub-matrix of G corresponding to the testing set (in rows) and training set (in columns).
Solutions are found using the GEMMA (Genome-wide Efficient Mixed Model Analysis) approach (Zhou & Stephens, 2012). First, the Brent's method is implemented to solve for the genetic/residual variances ratio (i.e., 1/θ) from the first derivative of the log-likelihood (either REML or ML). Then, variances σ2u and σ2e are calculated. Finally, b is obtained using Generalized Least Squares.
Returns a list object that contains the elements:
b
: (vector) fixed effects solutions (including the intercept).
u
: (vector) total genetic values (u = Z g).
g
: (vector) genetic effects solutions.
varU
: random effect variance.
varE
: residual variance.
h2
: heritability.
convergence
: (logical) whether Brent's method converged.
method
: either 'REML' or 'ML' method used.
Zhou X, Stephens M (2012). Genome-wide efficient mixed-model analysis for association studies. Nature Genetics, 44(7), 821-824
require(SFSI)
data(wheatHTP)
index = which(Y$trial %in% 1:20) # Use only a subset of data
Y = Y[index,]
M = scale(M[index,])/sqrt(ncol(M)) # Subset and scale markers
G = tcrossprod(M) # Genomic relationship matrix
Y0 = scale(as.matrix(Y[,4:6])) # Response variable
#---------------------------------------------------
# Single-trait model
#---------------------------------------------------
y = Y0[,1]
# Training and testing sets
tst = which(Y$trial %in% 1:3)
trn = seq_along(y)[-tst]
# Kinship-based model
fm1 = fitBLUP(y, K=G, trn=trn)
head(fm1$g) # Genetic effects
plot(y[tst],fm1$yHat[tst]) # Predicted vs observed values in testing set
cor(y[tst],fm1$yHat[tst]) # Prediction accuracy in testing set
cor(y[trn],fm1$yHat[trn]) # Prediction accuracy in training set
fm1$varU # Genetic variance
fm1$varE # Residual variance
fm1$h2 # Heritability
fm1$b # Fixed effects
# Markers-based model
fm2 = fitBLUP(y, Z=M, trn=trn)
head(fm2$g) # Marker effects
all.equal(fm1$yHat, fm2$yHat)
fm2$varU # Genetic variance
fm2$varU*sum(apply(M,2,var))
#---------------------------------------------------
# Multiple response variables
#---------------------------------------------------
ID_geno = as.vector(row(Y0))
ID_trait = as.vector(col(Y0))
y = as.vector(Y0)
# Training and testing sets
tst = c(which(ID_trait==1)[Y$trial %in% 1:3],
which(ID_trait==2)[Y$trial %in% 1:3],
which(ID_trait==3)[Y$trial %in% 1:3])
trn = seq_along(y)[-tst]
# Across traits model
fm3 = fitBLUP(y, K=G, ID_geno=ID_geno, trn=trn)
plot(fm1$yHat,fm3$yHat[ID_trait==1]) # different from the single-trait model
# Within traits model: pass an index for traits
fm4 = fitBLUP(y, K=G, ID_geno=ID_geno, ID_trait=ID_trait, trn=trn)
plot(fm1$yHat,fm4$yHat[ID_trait==1]) # equal to the single-trait model
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.