fitBLUP: Fitting a Linear Mixed model to calculate BLUP

View source: R/fitBLUP.R

7. BLUP estimation from Linear Mixed ModelR Documentation

Fitting a Linear Mixed model to calculate BLUP

Description

Solves the Linear Mixed Model and calculates the Best Linear Unbiased Predictor (BLUP)

Usage

fitBLUP(y, X = NULL, Z = NULL, K = NULL, 
        U = NULL, d = NULL, varU = NULL, 
        varE = NULL, intercept = TRUE, BLUP = TRUE, 
        method = c("REML","ML"), 
        interval = c(1E-9,1E9), tol = 1E-8, 
        maxiter = 1000, n.regions = 10,
        verbose = TRUE)
            

Arguments

y

(numeric matrix) Response variable. It can contain >1 columns, each of them will be fitted separately

X

(numeric matrix) Design matrix for fixed effects. When X=NULL a vector of ones is constructed only for the intercept (default)

Z

(numeric matrix) Design matrix for random effects. When Z=NULL an identity matrix is considered (default) thus G = K; otherwise G = Z K Z' is used

K

(numeric matrix) Kinship relationship matrix

U, d

(numeric matrix/vector) Eigenvectors and eigenvalues from eigen value decomposition of G = U D U'

varU, varE

(numeric) Genetic and residual variances. When both varU and varE are not NULL they are not calculated; otherwise, the likelihood function (REML or ML) is optimized to search for the genetic/residual variances ratio

intercept

TRUE or FALSE to whether fit an intercept. When FALSE, the model assumes a null intercept

BLUP

TRUE or FALSE to whether return the random effects estimates

method

(character) Either 'REML' (Restricted Maximum Likelihood) or 'ML' (Maximum Likelihood)

tol

(numeric) Maximum error between two consecutive solutions (convergence tolerance) when finding the root of the log-likelihood's first derivative

maxiter

(integer) Maximum number of iterations to run before convergence is reached

interval

(numeric vector) Range of values in which the root is searched

n.regions

(numeric) Number of regions in which the searched 'interval' is divided for local optimization

verbose

TRUE or FALSE to whether show progress

Details

The basic linear mixed model that relates phenotypes with genetic values is of the form

y = X b + Z u + e

where y is a vector with the response, b is the vector of fixed effects, u is the vector of the (random) genetic values of the genotypes, e is the vector of environmental residuals (random error), and X and Z are design matrices conecting the fixed and genetic effects with replicates. Genetic values are assumed to follow a Normal distribution as u ~ N(02uK), and the error terms are assumed e ~ N(02eI).

The vector of genetic values g = Z u will therefore follow g ~ N(02uG) where G = Z K Z'. In the un-replicated case, Z = I is an identity matrix, and hence g = u and G = K.

The predicted values utrn = (ui), i = 1,2,...,ntrn, corresponding to observed data (training set) are derived as

utrn = B (ytrn - Xtrnb)

where B is a matrix of weights given by

B = σ2uGtrn2uGtrn + σ2eI)-1

where Gtrn is the sub-matrix corresponding to the training set. This matrix can be rewritten as

B = Gtrn (Gtrn + θI)-1

where θ = σ2e2u is the residual/genetic variances ratio representing a ridge-like shrinkage parameter.

The matrix H = Gtrn + θI in the above equation can be used to obtain predictions corresponding to un-observed data (testing set), utst, by

B = Gtst,trn (Gtrn + θI)-1

where Gtst,trn is the sub-matrix of G corresponding to the testing set (in rows) and training set (in columns).

Solutions are found using the GEMMA (Genome-wide Efficient Mixed Model Analysis) approach (Zhou & Stephens, 2012). First, the Brent's method is implemented to solve for the genetic/residual variances ratio (i.e., 1/θ) from the first derivative of the log-likelihood (either REML or ML). Then, variances σ2u and σ2e are calculated. Finally, b is obtained using Generalized Least Squares.

Value

Returns a list object that contains the elements:

  • b: (vector) fixed effects solutions (including the intercept).

  • u: (vector) random effects solutions.

  • varU: random effect variance.

  • varE: residual variance.

  • h2: heritability.

  • convergence: (logical) whether Brent's method converged.

  • method: either 'REML' or 'ML' method used.

References

VanRaden PM (2008). Efficient methods to compute genomic predictions. Journal of Dairy Science, 91(11), 4414–4423.

Zhou X, Stephens M (2012). Genome-wide efficient mixed-model analysis for association studies. Nature Genetics, 44(7), 821-824

Examples

  require(SFSI)
  data(wheatHTP)
  
  index = which(Y$trial %in% 1:10)     # Use only a subset of data
  Y = Y[index,]
  M = scale(M[index,])/sqrt(ncol(M))  # Subset and scale markers
  G = tcrossprod(M)                   # Genomic relationship matrix
  y = as.vector(scale(Y[,"E1"]))      # Scale response variable
  
  # Training and testing sets
  tst = which(Y$trial %in% 1:2)
  trn = which(!Y$trial %in% 1:2)

  yNA <- y
  yNA[tst] <- NA
  fm = fitBLUP(yNA, K=G)
  plot(y[tst],fm$u[tst])    # Predicted vs observed values in testing set
  cor(y[tst],fm$u[tst])     # Prediction accuracy in testing set
  cor(y[trn],fm$u[trn])     # Prediction accuracy in training set
  fm$varU                   # Genetic variance
  fm$varE                   # Residual variance
  fm$h2                     # Heritability
  fm$b                      # Intercept
  

SFSI documentation built on Nov. 18, 2023, 9:06 a.m.