library(knitr)
opts_chunk$set(echo       = TRUE,
               message    = TRUE,
               warning    = TRUE,
               # comment    = NA,
               fig.width  = 4,
               fig.height = 3,
               cache      = TRUE,
               autodep    = TRUE,
               ## invalidate cache when either versions of R or breedR change
               ## http://yihui.name/knitr/demo/cache/
               cache.extra= list(R.version, packageVersion('breedR')))
library(breedR)
library(ggplot2)

Intro

What is breedR

Installation

Where to find help

License

GPL-3

Roadmap | Future developments

Functionality

Inference

Frequentist

Bayesian

Linear Mixed Models with unstructured random effects

Example dataset

knitr::kable(head(globulus))
cat(attr(globulus, 'comment'), sep ='\n')
str(globulus, give.attr = FALSE)

A simple Provenance Test

Specify the genetic group gg as an unstructured random effect using the standard formulas in R

$$ \begin{aligned} \mathrm{phe}X = & \mu + Z \mathrm{gg} + \varepsilon \ \mathrm{gg} \sim & N(0, \sigma{\mathrm{gg}}^2) \ \varepsilon \sim & N(0, \sigma_\varepsilon^2) \end{aligned} $$

res <- remlf90(fixed  = phe_X ~ 1,
               random = ~ gg,
               data   = globulus)

Initial variances specification

To avoid the notification, initial values for all the variance components must be made explicit using the argument var.ini:

res <- remlf90(fixed = phe_X ~ 1,
               random = ~ gg,
               var.ini = list(gg = 2, resid = 10),
               data = globulus)

Although in most cases the results will not change at all, we encourage to give explicit initial values for variance components. Specially when some estimate can be artifact. This is also useful for checking sensitivity to initial values.

Exploring the results

summary(res)

Further extractor functions

fixef(res)
ranef(res)

Further extractor functions

qplot(
      fitted(res),
      globulus$phe_X) +
  geom_abline(intercept = 0,
              slope = 1,
              col = 'darkgrey')
str(resid(res))
extractAIC(res)
logLik(res)

Hierarchical and Factorial models

Nested factors

|gg | mum | |---:|:----| |A | 1 | |A | 2 | |B | 3 | |B | 4 |

Crossed factors

|gg | mum | |---:|:----| |A | 1 | |A | 2 | |B | 1 | |B | 2 |

Model specification

random = ~ gg + factor(mum)  # note that mum is numeric

Interactions

random = ~ gg * factor(mum)
dat <- transform(globulus,
                 interaction = factor(gg:bl))
random = ~ gg + bl + interaction

Exercise | Hierarchical and Factorial models

  1. Use remlf90() and the globulus dataset to fit

    • a hierarchical model using mum within gg
    • a factorial model using gg and bl
  2. Explore the results with summary()

    • is the family (mum) effect relevant?
    • is there any evidence of interaction between gg and bl?

Hierarchical and Factorial models #1 | Fitting models

res.h <- remlf90(fixed = phe_X ~ 1,
                 random = ~ factor(mum) + gg,
                 data = globulus)
# Interaction variable
globulus.f <- transform(globulus,
                        gg_bl = factor(gg:bl))

res.f <- remlf90(fixed = phe_X ~ 1,
                 random = ~ gg + bl + gg_bl,
                 data = globulus.f)

Hierarchical and Factorial models #2 | Hierarchical model

summary(res)
summary(res.h)

Hierarchical and Factorial models #3 | Factorial model

summary(res.f)
## result without interaction
res.f0 <- remlf90(fixed  = phe_X ~ 1,
                  random = ~ gg + bl,
                  data = globulus)
paste('AIC:', round(extractAIC(res.f0)),
      'logLik:', round(logLik(res.f0)))

Additive Genetic Effect

pedigree

What is an additive genetic effect

  • Random effect at individual level

  • Based on a pedigree

  • BLUP of Breeding Values from own and relatives' phenotypes

  • Represents the additive component of the genetic value

  • More general:

    • family effect is a particular case
    • accounts for more than one generation
    • mixed relationships
  • More flexible: allows to select individuals within families

Specifying a pedigree

knitr::kable(head(globulus[, 1:3]))

Fitting an animal model

res.animal <- remlf90(fixed  = phe_X ~ 1,
                      random = ~ gg,
                      genetic = list(model = 'add_animal', 
                                     pedigree = globulus[, 1:3],
                                     id = 'self'), 
                      data = globulus)

Animal model: results

summary(res.animal)

Extracting Predicted Breeding Values

## Predicted Breeding Values
# for the full pedigree first, and for the observed individuals
# by matrix multiplication with the incidence matrix
PBV.full <- ranef(res.animal)$genetic
PBV <- model.matrix(res.animal)$genetic %*% PBV.full

# Predicted genetic values vs.
# phenotype.
# Note: fitted = mu + PBV
qplot(fitted(res.animal), phe_X,
      data = globulus) +
  geom_abline(intercept = 0,
              slope = 1,
              col = 'gray')

Handling pedigrees

Spatial autocorrelation

spatial

What is spatial autocorrelation

Diagnosing spatial autocorrelation | residuals spatial plot

## Since coordinates have not
## been passed before they 
## must be provided explicitly.
coordinates(res.animal) <-
  globulus[, c('x', 'y')]
plot(res.animal, 'resid')

Diagnosing spatial autocorrelation | variograms of residuals

variogram(res.animal)

Interpreting the variograms

The empirical isotropic variogram is built by aggregating all the pairs of points separated by $h$, no matter the direction.

variogram(res.animal, 
          coord = globulus[, c('x', 'y')],
          plot = 'isotropic')
ang <- seq(0, 2*pi, by = pi/6)[- (1+3*(0:4))]
# Colours from
# http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/#a-colorblind-friendly-palette
cbPalette <- c("#E69F00", "#56B4E9", "#009E73", "#F0E442")

coord <- data.frame(x = cos(ang),
                    y = sin(ang))
ggplot(coord, aes(x, y)) + 
  geom_point(size = 6, col = cbPalette[1]) + 
  geom_point(aes(x = 0, y = 0), size = 6) +
  coord_fixed() + 
  scale_x_continuous(breaks=NULL) + 
  scale_y_continuous(breaks=NULL) +
  theme_minimal() +
  theme(axis.title.x = element_blank(),
        axis.title.y = element_blank())

Interpreting the variograms

The empirical row/col variogram is built by aggregating all the pairs of points separated by exactly $x$ rows and $y$ columns.

variogram(res.animal, 
          coord = globulus[, c('x', 'y')],
          plot = 'heat')
ggplot(coord, aes(x, y)) + 
  geom_point(size = 6, col = cbPalette[rep(c(1, 2, 2, 1), 2)]) + 
  geom_point(aes(x = 0, y = 0), size = 6) +
  coord_fixed() + 
  scale_x_continuous(breaks=NULL) + 
  scale_y_continuous(breaks=NULL) +
  theme_minimal() +
  theme(axis.title.x = element_blank(),
        axis.title.y = element_blank())

Interpreting the variograms

The empirical anisotropic variogram is built by aggregating all the pairs of points in the same direction separated by $|\mathbf{x}|$.

variogram(res.animal, 
          coord = globulus[, c('x', 'y')],
          plot = 'aniso')
ggplot(coord, aes(x, y)) + 
  geom_point(size = 6, col = cbPalette[rep(1:4, 2)]) + 
  geom_point(aes(x = 0, y = 0), size = 6) +
  coord_fixed() + 
  scale_x_continuous(breaks=NULL) + 
  scale_y_continuous(breaks=NULL) +
  theme_minimal() +
  theme(axis.title.x = element_blank(),
        axis.title.y = element_blank())

Accounting for spatial autocorrelation

The blocks model

# The genetic component (DRY)
gen.globulus <- list(model    = 'add_animal', 
                     pedigree = globulus[, 1:3],
                     id       = 'self')

res.blk <- remlf90(fixed   = phe_X ~ 1,
                   random  = ~ gg,
                   genetic = gen.globulus, 
                   spatial = list(model = 'blocks', 
                                  coord = globulus[, c('x', 'y')],
                                  id = 'bl'),
                   data    = globulus)

Animal-spatial model: results

summary(res.blk)

Variogram of residuals

variogram(res.blk)

B-Splines model

## Use the `em` method! `ai` does not like splines
res.spl  <- remlf90(fixed   = phe_X ~ 1,
                    random  = ~ gg,
                    genetic = gen.globulus, 
                    spatial = list(model   = 'splines', 
                                   coord   = globulus[, c('x','y')]), 
                    data    = globulus, method  = 'em')

Autoregressive model

res.ar1  <- remlf90(fixed   = phe_X ~ 1,
                    random  = ~ gg,
                    genetic = gen.globulus, 
                    spatial = list(model = 'AR', 
                                   coord = globulus[, c('x','y')]), 
                    data    = globulus)

Change in model residuals

compare.plots(
  list(`Animal model only` = plot(res.animal, 'residuals'),
       `Animal/blocks model` = plot(res.blk, 'residuals'),
       `Animal/splines model` = plot(res.spl, 'residuals'),
       `Animal/AR1 model` = plot(res.ar1, 'residuals')))

Comparison of spatial components

compare.plots(list(Blocks  = plot(res.blk, type = 'spatial'),
                   Splines = plot(res.spl, type = 'spatial'),
                   AR1xAR1 = plot(res.ar1, type = 'spatial')))

Prediction of the spatial effect in unobserved locations

compare.plots(list(Blocks  = plot(res.blk, type = 'fullspatial'),
                   Splines = plot(res.spl, type = 'fullspatial'),
                   AR1xAR1 = plot(res.ar1, type = 'fullspatial')))

Spatial parameters | Number of knots of a splines model

ggplot(transform(data.frame(x = 10:1e3),
                 nok = breedR:::determine.n.knots(x)),
       aes(x, nok)) + 
  geom_line()

Spatial parameters | Autoregressive parameters of a AR model

Exercise | Tuning spatial parameters

rho.grid <- expand.grid(rho_r = seq(.7, .95, length = 4),
                        rho_c = seq(.7, .95, length = 4))
- What are now the most likely parameters?

Spatial #1 | B-splines model with increased nok

res.spl99  <- remlf90(fixed  = phe_X ~ 1, random = ~ gg,
                      genetic = gen.globulus,
                      spatial = list(model   = 'splines', 
                                     coord   = globulus[, c('x','y')],
                                     n.knots = c(9, 9)), 
                      data = globulus, method = 'em')
summary(res.spl)
summary(res.spl99)

Spatial #2 | Visualize log-likelihoods

qplot(rho_r, rho_c,
      fill = loglik,
      geom = 'tile',
      data = res.ar1$rho)
knitr::kable(res.ar1$rho)

Spatial #3 | Refine grid

rho.grid <- expand.grid(rho_r = seq(.7, .95, length = 4),
                        rho_c = seq(.7, .95, length = 4))
res.ar.grid  <- remlf90(fixed  = phe_X ~ gg,
                        genetic = list(model = 'add_animal', 
                                       pedigree = globulus[,1:3],
                                       id = 'self'), 
                        spatial = list(model = 'AR', 
                                       coord = globulus[, c('x','y')],
                                       rho = rho.grid), 
                        data = globulus)
summary(res.ar.grid)

Competition

Theoretical model

Competition model

  • Each individual have two (unknown) Breeding Values (BV)

  • The direct BV affects its own phenotype, while the competition BV affects its neghbours' (as the King moves)

  • The effect of the neighbouring competition BVs is given by their sum weighted by $1/d^\alpha$, where $\alpha$ is a tuning parameter called decay

  • Each set of BVs is modelled as a zero-mean random effect with structure matrix given by the pedigree and independent variances $\sigma^2_a$ and $\sigma^2_c$

  • Both random effects are modelled jointly with correlation $\rho$

Permanent Environmental Effect (pec)

Simulation of data

breedR implements a convenient dataset simulator which keeps a similar syntax.

# Simulation parameters
grid.size <- c(x=20, y=25) # cols/rows
coord <- expand.grid(sapply(grid.size,
                            seq))
Nobs <- prod(grid.size)
Nparents <- c(mum = 20, dad = 20)
sigma2_a <- 2   # direct add-gen var
sigma2_c <- 1   # compet add-gen var
rho      <- -.7 # gen corr dire-comp
sigma2_s <- 1   # spatial variance
sigma2_p <- .5  # pec variance
sigma2   <- .5  # residual variance

S <- matrix(c(sigma2_a,
              rho*sqrt(sigma2_a*sigma2_c),
              rho*sqrt(sigma2_a*sigma2_c),
              sigma2_c),
            2, 2)

set.seed(12345)
simdat <- 
  breedR.sample.phenotype(
    fixed   = c(beta = 10),
    genetic = list(model = 'competition',
                   Nparents = Nparents,
                   sigma2_a = S,
                   check.factorial=FALSE,
                   pec = sigma2_p),
    spatial = list(model = 'AR',
                   grid.size = grid.size,
                   rho   = c(.3, .8),
                   sigma2_s = sigma2_s),
    residual.variance = sigma2
    )

## Remove founders
dat <- subset(simdat,
              !(is.na(simdat$sire)
                & is.na(simdat$dam)))

Fitting a competition model

system.time(
  res.comp <- remlf90(fixed   = phenotype ~ 1,
                      genetic = list(model = 'competition',
                                     pedigree = dat[, 1:3],
                                     id = 'self',
                                     coord = dat[, c('x', 'y')],
                                     competition_decay = 1,
                                     pec = list(present = TRUE)),
                      spatial = list(model = 'AR', 
                                     coord = dat[, c('x', 'y')],
                                     rho   = c(.3, .8)),
                      data = dat,
                      method = 'em')  # AI diverges
  )

True vs. estimated parameters

var.comp <- 
  data.frame(True = c(sigma2_a, sigma2_c, rho, sigma2_s, sigma2_p, sigma2),
             Estimated = with(res.comp$var,
                              round(c(genetic[1, 1], genetic[2, 2],
                                      genetic[1, 2]/sqrt(prod(diag(genetic))),
                                      spatial, pec, Residual), digits = 2)),
             row.names = c('direct', 'compet.', 'correl.',
                          'spatial', 'pec', 'residual'))

knitr::kable(var.comp)
labels <- c(paste0(rep('sigma', 5),
                 c('[A]', '[C]', '[S]', '[P]', ''), '^2'),
            'rho')[c(1:2, 6, 3:5)]

ggplot(var.comp, aes(True, Estimated)) + 
  geom_point() +
  geom_abline(intercept = 0, slope = 1, col = 'darkgray') + 
  geom_text(label = labels, parse = TRUE, hjust = -.5) +
  expand_limits(x = 2.4, y = 2.6)

Exercise | Competition models

  1. Plot the true vs predicted:

    • direct and competition Breeding Values
    • spatial effects
    • pec effects
  2. Plot the residuals and their variogram

    • Do you think the residuals are independent?
    • How would you improve the analysis?

Competition #1 | True vs. predicted components

## compute the predicted effects for the observations
## by matrix multiplication of the incidence matrix and the BLUPs
pred <- list()
Zd <- model.matrix(res.comp)$'genetic_direct'
pred$direct <- Zd %*% ranef(res.comp)$'genetic_direct'

## Watch out! for the competition effects you need to use the incidence
## matrix of the direct genetic effect, to get their own value.
## Otherwise, you get the predicted effect of the neighbours on each
## individual.
pred$comp <- Zd %*% ranef(res.comp)$'genetic_competition'
pred$pec  <- model.matrix(res.comp)$pec %*% ranef(res.comp)$pec

Competition #1 | True vs. predicted components

comp.pred <-
  rbind(
    data.frame(
      Component = 'direct BV',
      True = dat$BV1,
      Predicted = pred$direct),
    data.frame(
      Component = 'competition BV',
      True = dat$BV2,
      Predicted = pred$comp),
    data.frame(
      Component = 'pec',
      True      = dat$pec,
      Predicted = as.vector(pred$pec)))

ggplot(comp.pred,
       aes(True, Predicted)) +
  geom_point() + 
  geom_abline(intercept = 0, slope = 1,
              col = 'darkgray') +
  facet_grid(~ Component)

The predition of the Permanent Environmental Competition effect is not precisely great...

Competition #2 | Map of residuals and their variogram

plot(res.comp, type = 'resid')
variogram(res.comp)

Generic component

The Generic model

This additional component allows to introduce a random effect $\psi$ with arbitrary incidence and covariance matrices $Z$ and $\Sigma$:

$$ \begin{aligned} y = & \mu + X \beta + Z \psi + \varepsilon \ \psi \sim & N(0, \sigma_\psi^2 \Sigma_\psi) \ \varepsilon \sim & N(0, \sigma_\varepsilon^2) \end{aligned} $$

Implementation of the generic component

## Fit a blocks effect using generic
inc.mat <- model.matrix(~ 0 + bl, globulus)
cov.mat <- diag(nlevels(globulus$bl))
res.blg <- remlf90(fixed  = phe_X ~ gg,
                   generic = list(block = list(inc.mat,
                                               cov.mat)),
                   data   = globulus)

Example of result

summary(res.blg)

Prediction

Predicting values for unobserved trees

Leave-one-out cross-validation

rm.idx <- 8
rm.exp <- with(dat[rm.idx, ],
               phenotype - resid)
dat.loo <- dat
dat.loo[rm.idx, 'phenotype'] <- NA
res.comp.loo <- remlf90(fixed   = phenotype ~ 1,
                        genetic = list(model = 'competition',
                                       pedigree = dat[, 1:3],
                                       id = 'self',
                                       coord = dat[, c('x', 'y')],
                                       competition_decay = 1,
                                       pec = list(present = TRUE)),
                        spatial = list(model = 'AR', 
                                       coord = dat[, c('x', 'y')],
                                       rho   = c(.3, .8)),
                        data = dat.loo,
                        method = 'em')  
## compute the predicted effects for the observations
## by matrix multiplication of the incidence matrix and the BLUPs
Zd <- model.matrix(res.comp)$'genetic_direct'
pred.BV.loo.mat <- with(ranef(res.comp.loo), 
                        cbind(`genetic_direct`, `genetic_competition`))
pred.genetic.loo <- Zd[rm.idx, ] %*% pred.BV.loo.mat

valid.pred <- 
  data.frame(True = with(dat[rm.idx, ],
                         c(BV1, BV2, rm.exp)),
             Pred.loo = c(pred.genetic.loo,
                          fitted(res.comp.loo)[rm.idx]),
             row.names = c('direct BV', 'competition BV', 'exp. phenotype'))

knitr::kable(round(valid.pred, 2))

Exercise | Cross validation

  1. Extend the last table to include the predicted values with the full dataset

  2. Remove 1/10th of the phenotypes randomly, and predict their expected phenotype

    • Have the parameter estimations changed too much?
  3. Compute the Root Mean Square Error (RMSE) of Prediction with respect to the true values

Cross-validation #1 | Include prediction with full data

pred.BV.mat <- with(ranef(res.comp), 
                    cbind(`genetic_direct`, `genetic_competition`))

valid.pred$Pred.full <- c(Zd[rm.idx, ] %*% pred.BV.mat,
                          fitted(res.comp)[rm.idx])

knitr::kable(round(valid.pred[,c(1,3,2)], 2))

Cross-validation #2 | Perform cross-validation on 1/10th of the observations

rm.idx <- sample(nrow(dat), nrow(dat)/10)
dat.cv <- dat
dat.cv[rm.idx, 'phenotype'] <- NA
## Re-fit the model and build table
res.comp.cv <-
  remlf90(fixed   = phenotype ~ 1,
          genetic = list(model = 'competition',
                         pedigree = dat[, 1:3],
                         id = 'self',
                         coord = dat[, c('x', 'y')],
                         competition_decay = 1,
                         pec = list(present = TRUE)),
          spatial = list(model = 'AR', 
                         coord = dat[, c('x', 'y')],
                         rho   = c(.3, .8)),
          data = dat.cv,
          method = 'em')

var.comp <- 
  data.frame(
    'Fully estimated' = with(res.comp$var,
                             round(c(genetic[1, 1], genetic[2, 2],
                                     genetic[1, 2]/sqrt(prod(diag(genetic))),
                                     spatial, pec, Residual), digits = 2)),
    'CV estimated' = with(res.comp.cv$var,
                          round(c(genetic[1, 1], genetic[2, 2],
                                  genetic[1, 2]/sqrt(prod(diag(genetic))),
                                  spatial, pec, Residual), digits = 2)),
    row.names = c('direct', 'compet.', 'correl.',
                  'spatial', 'pec', 'residual')
    )

knitr::kable(var.comp)

Cross-validation #3 | MSE of Prediction

true.exp.cv <- with(dat[rm.idx, ], phenotype - resid)
round(sqrt(mean((fitted(res.comp.cv)[rm.idx] - true.exp.cv)^2)), 2)

Multiple traits

breedR provides a basic interface for multi-trait models which only requires specifying the different traits in the main formula using cbind().

## Filter site and select relevant variables
dat <- 
  droplevels(
    douglas[douglas$site == "s3",
            names(douglas)[!grepl("H0[^4]|AN|BR|site", names(douglas))]]
  )

res <- 
  remlf90(
    fixed = cbind(H04, C13) ~ orig,
    genetic = list(
      model = 'add_animal', 
      pedigree = dat[, 1:3],
      id = 'self'),
    data = dat
  )

A full covariance matrix across traits is estimated for each random effect, and all results, including heritabilities, are expressed effect-wise:

summary(res)

Although the results are summarized in tabular form, the covariance matrices can be recovered directly:

res$var[["genetic", "Estimated variances"]]

## Use cov2cor() to compute correlations
cov2cor(res$var[["genetic", "Estimated variances"]])

Estimates of fixed effects and BLUPs of random effects can be recovered with fixef() and ranef() as usual. The only difference is that they will return a list of matrices rather than vectors, with one column per trait.

The standard errors are given as attributes, and are displayed in tabular form whenever the object is printed.

fixef(res)       ## printed in tabular form, but...
unclass(fixef(res))  ## actually a matrix of estimates with attribute "se"

str(ranef(res))
head(ranef(res)$genetic)

Recovering the breeding values for each observation in the original dataset follows the same procedure as for one trait: multiply the incidence matrix by the BLUP matrix. The result, however, will be a matrix with one column per trait.

head(model.matrix(res)$genetic %*% ranef(res)$genetic)

Initial (co)variance specification

breedR will use the empirical variances and covariances to compute initial covariance matrices. But you can specify your own. This is particularly interesting for setting some initial covariances to 0, which indicates that you don't want that component to be estimated, and thus reducing the dimension of the model.

Typical cases are Multi-Environment Trials (MET, e.g. multiple sites, or years) where you don't really want to estimate the residual covariances, or when you know a priori that two traits are little correlated.

Specify the initial covariance values in matrix form.

initial_covs <- list(
  genetic = 1e3*matrix(c(1, .5, .5, 1), nrow = 2),
  residual = diag(2)   # no residual covariances
)
res <- 
  remlf90(
    fixed = cbind(H04, C13) ~ orig,
    genetic = list(
      model = 'add_animal', 
      pedigree = dat[, 1:3],
      id = 'self',
      var.ini = initial_covs$genetic),
    data = dat,
    var.ini = list(residual = initial_covs$residual)
  )

Some more features

Metagene interface

Simulation framework

Remote computation

If you have access to a Linux server through SSH, you can perform computations remotely

Package options

breedR.getOption()


famuvie/breedR documentation built on Sept. 6, 2021, 4:50 a.m.