fitBLUP: Fitting a Linear Mixed model to calculate BLUP

View source: R/fitBLUP.R

6. BLUP estimation from Linear Mixed ModelR Documentation

Fitting a Linear Mixed model to calculate BLUP

Description

Solves the Linear Mixed Model and calculates the Best Linear Unbiased Predictor (BLUP)

Usage

fitBLUP(y, X = NULL, Z = NULL, K = NULL, trn = NULL,
        EVD = NULL, varU = NULL, varE = NULL,
        ID_geno = NULL, ID_trait = NULL, intercept = TRUE,
        BLUP = TRUE, method = c("REML","ML"),
        interval = c(1E-9,1E9), tol = 1E-8, maxiter = 1000,
        n.regions = 10, verbose = TRUE)
            

Arguments

y

(numeric vector) Response variable. It can be a matrix with each column representing a different response variable

X

(numeric matrix) Design matrix for fixed effects. When X=NULL a vector of ones is constructed only for the intercept (default)

Z

(numeric matrix) Design matrix for random effects. When Z=NULL an identity matrix is considered (default) thus G = K; otherwise G = Z K Z' is used

K

(numeric matrix) Kinship relationship matrix

trn

(integer vector) Which elements from vector y are to be used for training. When trn = NULL, non-NA entries in vector y will be used as training set

EVD

(list) Eigenvectors and eigenvalues from eigenvalue decomposition (EVD) of G corresponding to training data

ID_geno

(character/integer) For within-trait analysis only, vector with either names or indices mapping entries of the vector y to rows/columns of matrix G

ID_trait

(character/integer) For within-trait analysis only, vector with either names or indices mapping entries of the vector y to different traits

varU, varE

(numeric) Genetic and residual variances. When both varU and varE are not NULL they are not calculated; otherwise, the likelihood function (REML or ML) is optimized to search for the genetic/residual variances ratio

intercept

TRUE or FALSE to whether fit an intercept. When FALSE, the model assumes a null intercept

BLUP

TRUE or FALSE to whether return the random effects estimates

method

(character) Either 'REML' (Restricted Maximum Likelihood) or 'ML' (Maximum Likelihood)

tol

(numeric) Maximum error between two consecutive solutions (convergence tolerance) when finding the root of the log-likelihood's first derivative

maxiter

(integer) Maximum number of iterations to run before convergence is reached

interval

(numeric vector) Range of values in which the root is searched

n.regions

(numeric) Number of regions in which the searched 'interval' is divided for local optimization

verbose

TRUE or FALSE to whether show progress

Details

The basic linear mixed model that relates phenotypes with genetic values is of the form

y = X b + Z g + e

where y is a vector with the response, b is the vector of fixed effects, u is the vector of the (random) genetic effects of the genotypes, e is the vector of environmental residuals (random error), and X and Z are design matrices for the fixed and genetic effects, respectively. Genetic effects are assumed to follow a Normal distribution as g ~ N(02uK), and the error terms are assumed e ~ N(02eI).

The vector of total genetic values u = Z g will therefore follow u ~ N(02uG) where G = Z K Z'. In the un-replicated case, Z = I is an identity matrix, and hence u = g and G = K.

The predicted values utrn = {ui}, i = 1,2,...,ntrn, corresponding to observed data (training set) are derived as

utrn = B (ytrn - Xtrnb)

where B is a matrix of weights given by

B = σ2uGtrn2uGtrn + σ2eI)-1

where Gtrn is the sub-matrix corresponding to the training set. This matrix can be rewritten as

B = Gtrn (Gtrn + θI)-1

where θ = σ2e2u is the residual/genetic variances ratio representing a ridge-like shrinkage parameter.

The matrix H = Gtrn + θI in the above equation can be used to obtain predictions corresponding to un-observed data (testing set), utst = {ui}, i = 1,2,...,ntst, by

B = Gtst,trn (Gtrn + θI)-1

where Gtst,trn is the sub-matrix of G corresponding to the testing set (in rows) and training set (in columns).

Solutions are found using the GEMMA (Genome-wide Efficient Mixed Model Analysis) approach (Zhou & Stephens, 2012). First, the Brent's method is implemented to solve for the genetic/residual variances ratio (i.e., 1/θ) from the first derivative of the log-likelihood (either REML or ML). Then, variances σ2u and σ2e are calculated. Finally, b is obtained using Generalized Least Squares.

Value

Returns a list object that contains the elements:

  • b: (vector) fixed effects solutions (including the intercept).

  • u: (vector) total genetic values (u = Z g).

  • g: (vector) genetic effects solutions.

  • varU: random effect variance.

  • varE: residual variance.

  • h2: heritability.

  • convergence: (logical) whether Brent's method converged.

  • method: either 'REML' or 'ML' method used.

References

Zhou X, Stephens M (2012). Genome-wide efficient mixed-model analysis for association studies. Nature Genetics, 44(7), 821-824

Examples

  require(SFSI)
  data(wheatHTP)
  
  index = which(Y$trial %in% 1:20)    # Use only a subset of data
  Y = Y[index,]
  M = scale(M[index,])/sqrt(ncol(M))  # Subset and scale markers
  G = tcrossprod(M)                   # Genomic relationship matrix
  Y0 = scale(as.matrix(Y[,4:6]))      # Response variable
  
  #---------------------------------------------------
  # Single-trait model
  #---------------------------------------------------
  y = Y0[,1]     
  
  # Training and testing sets
  tst = which(Y$trial %in% 1:3)
  trn = seq_along(y)[-tst]
  
  # Kinship-based model
  fm1 = fitBLUP(y, K=G, trn=trn)
  
  head(fm1$g)                  # Genetic effects
  plot(y[tst],fm1$yHat[tst])   # Predicted vs observed values in testing set
  cor(y[tst],fm1$yHat[tst])    # Prediction accuracy in testing set
  cor(y[trn],fm1$yHat[trn])    # Prediction accuracy in training set
  fm1$varU                     # Genetic variance
  fm1$varE                     # Residual variance
  fm1$h2                       # Heritability
  fm1$b                        # Fixed effects
  
  # Markers-based model
  fm2 = fitBLUP(y, Z=M, trn=trn)
  head(fm2$g)                   # Marker effects
  all.equal(fm1$yHat, fm2$yHat)
  fm2$varU                      # Genetic variance
  fm2$varU*sum(apply(M,2,var))
  
  #---------------------------------------------------
  # Multiple response variables
  #---------------------------------------------------
  ID_geno = as.vector(row(Y0))
  ID_trait = as.vector(col(Y0))
  y = as.vector(Y0)
  
  # Training and testing sets
  tst = c(which(ID_trait==1)[Y$trial %in% 1:3],
          which(ID_trait==2)[Y$trial %in% 1:3],
          which(ID_trait==3)[Y$trial %in% 1:3])
  trn = seq_along(y)[-tst]
  
  # Across traits model
  fm3 = fitBLUP(y, K=G, ID_geno=ID_geno, trn=trn)
  plot(fm1$yHat,fm3$yHat[ID_trait==1])  # different from the single-trait model
  
  # Within traits model: pass an index for traits
  fm4 = fitBLUP(y, K=G, ID_geno=ID_geno, ID_trait=ID_trait, trn=trn)
  plot(fm1$yHat,fm4$yHat[ID_trait==1])  # equal to the single-trait model
  

SFSI documentation built on Sept. 11, 2024, 9:11 p.m.