met$set_this_rmd_file(ps_this_rmd_file = ifelse(rstudioapi::isAvailable(), rstudioapi::getSourceEditorContext()$path, whereami::thisfile()))
Mixed linear effects models are a very useful tool in the analysis of data with some dependencies. In all statistical analyses that we have seen so far the assumption of independence between observations was central. One way of expressing this independence assumption is via the variance-covariance matrix ($var(\mathbf{e})$) of the vector ($\mathbf{e}$) of residuals. In mathematical terms this can be written as
\begin{equation} var(\mathbf{e}) = \mathbf{I} * \sigma_e^2 (#eq:varemlem) \end{equation}
which means that the variance-covariance matrix ($var(\mathbf{e})$) is proportional to the identity matrix $\mathbf{I}$ with the variance component $\sigma_e^2$ as proportionality factor.
In what follows, the models that account for different dependency structures are described.
It is quite common to have repeated observations of the same traits or characteristics from a group of animals. Observing the same characteristic of the same animal multiple times is expected to yield a more accurate description of any relationship between different traits such as body weight and breast circumference. If we apply that line of thought to the example data used in chapter \@ref(asm-regr), we would have repeated measurements of breast circumference and body weight of the same animals. Such a dataset is shown in Table \@ref(tab:rep-obs-bw-bc-tab) for a selected number of animals.
s_rep_obs_path <- "https://charlotte-ngs.github.io/asmss2022/data/asm_bw_bc_rep_obs.csv" tbl_rep_obs <- readr::read_csv(file = s_rep_obs_path) knitr::kable(tbl_rep_obs, booktabs = TRUE, longtable = TRUE, caption = "Repeated Observations for Body Weight and Breast Circumference")
In Table \@ref(tab:rep-obs-bw-bc-tab), the column entitiled Animal
is no longer a running counter which enumerates the observation records. In this repeated observation dataset, the column Animal
denotes for which animal the measurements was observed. The association between observations and animals is shown in Figure \@ref(fig:rep-obs-bw-bc-fig).
tbl_rep_obs$Animal <- as.factor(tbl_rep_obs$Animal) ggplot2::ggplot(data = tbl_rep_obs, mapping = ggplot2::aes(x = `Breast Circumference`, y = `Body Weight`, color = Animal)) + ggplot2::geom_point()
The color codes in Figure \@ref(fig:rep-obs-bw-bc-fig) identify the observations for the same animal. This shows that observations for the same animal tend to be grouped together. This grouping has to be considered in the staistical analysis of such a dataset.
In principle, the dataset shown in Table \@ref(tab:rep-obs-bw-bc-tab) can be analysed with a linear regression model. But from the plot (Figure \@ref(fig:rep-obs-bw-bc-res-plot)) of the residuals versus the fitted values, it becomes clear that the residuals are grouped according to the animals from which the measurement was taken. Due to the small size of the dataset, the grouping effect according to the animal does not show up as clearly as intended. But never the less, this grouping indicates that the assumption of independent residuals is violated.
lm_rep_obs <- lm(`Body Weight` ~ `Breast Circumference`, data = tbl_rep_obs) tbl_rep_obs$Animal <- as.factor(tbl_rep_obs$Animal) ggplot2::ggplot(data = tibble::tibble(Animal = tbl_rep_obs$Animal, Fitted = fitted(lm_rep_obs), Residuals = residuals(lm_rep_obs)), ggplot2::aes(x = Fitted, y = Residuals, color = Animal)) + ggplot2::geom_point()
Traditionally repeated measurement data have been analyzed using a statistical technique referred to as analysis of variance (ANOVA). ANOVA is a general method that has been used for a long time to assess the variability of different factors in a dataset. This is done by constructing a specific type of table (ANOVA-table) which presents the essential features of a given dataset (see r met$add("Searle1992")
for more details). The structure and the properties of an ANOVA-table can best be demonstrated by an analysis of a dataset that shows the influence of the factor breed on body weight of animals shown in Table \@ref(tab:bw-breed-tab).
s_rep_obs_path <- "https://charlotte-ngs.github.io/asmss2022/data/asm_bw_flem.csv" tbl_bw_breed <- readr::read_csv(file = s_rep_obs_path) tbl_bw_breed <- dplyr::select(tbl_bw_breed, Animal, `Body Weight`, Breed) tbl_bw_breed$Breed <- as.factor(tbl_bw_breed$Breed) knitr::kable(tbl_bw_breed, booktabs = TRUE, longtable = TRUE, caption = paste0("Body Weight and Breed for ", nrow(tbl_bw_breed), " Beef Animals", collapse = ""))
A one-factor analysis of variance of the data shown in Table \@ref(tab:bw-breed-tab) can answer the question, whether the factor Breed
has an influence on the response variable Body Weight
. An ANOVA in R can be constructed by the function aov()
as follows.
aov_bw_breed <- aov(`Body Weight` ~ Breed, data = tbl_bw_breed) (smry_aov_bw_breed <- summary(aov_bw_breed))
The result of the one-way ANOVA of Body Weight
ond Breed
shows that it is very unlikely that Breed
does not have any influence on Body Weight
. The presented test-statistic from an F-Test is the same that is also shown by the summary results of a result from the lm()
function. The ANOVA table which is presented by the summary()
function applied to the aov
-object contains also an estimate ($\widehat{\sigma_e^2})$ of the residual variance component ($\sigma_e^2$). The estimate corresponds to the mean sum of squares for the component Residuals
. For our dataset the estimate is r round(smry_aov_bw_breed[[1]]["Residuals", "Mean Sq"], digits=1)
. Taking the square root of this value results in the Residual standard error
shown in the summary output of an lm()
-analysis.
Extending the dataset shown in Table \@ref(tab:bw-breed-tab) to multiple observations for a selected number of animals results in the dataset given in Table \@ref(tab:bw-breed-rep-obs-tab).
s_bw_breed_rep_obs_path <- "https://charlotte-ngs.github.io/asmss2022/data/asm_bw_breed_rep_obs.csv" tbl_rep_obs_breed <- readr::read_csv(file = s_bw_breed_rep_obs_path) tbl_rep_obs_breed <- dplyr::select(tbl_rep_obs_breed, Animal, `Body Weight`, Breed) tbl_rep_obs_breed$Animal <- as.factor(tbl_rep_obs_breed$Animal) knitr::kable(tbl_rep_obs_breed, booktabs = TRUE, longtable = TRUE, caption = "Repeated Observations of Body Weight and Breed for Beef Cattle Animals")
Applying an ANOVA on the dataset given in Table \@ref(tab:bw-breed-rep-obs-tab) allows to check whether there is variation between measurements of the same animal.
aov_bw_breed_rep <- aov(`Body Weight` ~ Breed + Error(Animal), data = tbl_rep_obs_breed) summary(aov_bw_breed_rep)
The above ANOVA results show that taking into account the repeated measurement structure of the data greatly reduces the mean squared residuals. On the other hand due to the low number of animals in the dataset, the null-hypothesis of the factor Breed
having no effect on Body Weight
could not be rejected.
While ANOVA is a widely used method and the above results show that we were able to correctly separate the variation between breeds and within a series of observation for the same animal, it has a major disadvantage. ANOVA cannot handle so-called unbalanced data very well. Unbalanced data means that the number of observations per factor level or per animal is not the same. Because the problem of unbalanced data occurs quite frequently even in planned experiments, ANOVA is not used that often nowadays. The problems of unbalanced data can be addressed by a different class of models, called the mixed linear effects models.
Before, we introduce the mixed linear effects model, we first have a look at how the repeated measurements data can be modelled with a random effects model. For the demonstration of the random effects model, we use the dataset in Table \@ref(tab:bw-breed-rep-obs-tab), but we are ignoring the factor Breed
for a moment. Then this dataset just looks like a repeated measurement of the body weight of some beef cattle animals (see Table \@ref(tab:bw-rep-meas-tab)).
tbl_rep_obs_no_breed <- dplyr::select(tbl_rep_obs_breed, Animal, `Body Weight`) knitr::kable(tbl_rep_obs_no_breed, booktabs = TRUE, longtable = TRUE, caption = "Repeated Measurements of Body Weight for Beef Cattle Animals")
In a random effects model with repeated observations, the expected value $E(y_{ij})$ for body weight $y_{ij}$ of animal $i$ with the $j^{th}$ observation can be written as
\begin{equation} E(y_{ij}) = \mu + \alpha_i (#eq:repobsbwremmlem) \end{equation}
Algebraically the expression for $E(y_{ij})$ given in \@ref(eq:repobsbwremmlem) is not different from what we have seen for the fixed linear effects model in chapter \@ref(asm-flem). But the assumptions are different. In \@ref(eq:repobsbwremmlem), $\alpha_i$ is the effect of animal $i$ on the observed body weight. Because the animals in the dataset (Table \@ref(tab:bw-rep-meas-tab)) is a random sample of a large population of animals, the effect $\alpha_i$ is a so-called random effect. A random effect in a model is to be treated as a random variable for which, we have to specify its distributional properties such as expected value and variance. For our example of the repeated measurements data, we assume the following three properties for the $\alpha_i$ effects
A further consequence of choosing $\alpha_i$ as a random effect is that, the expected value in \@ref(eq:repobsbwremmlem) must be considered a second time and must be specified with more details. Assuming that $\alpha^$ denotes the general random animal effect on the observed body weight. For a given animal $i$, the effect is then $\alpha_i$ which is a realized but unobservable value of the distribution of the $\alpha^$ effects. Therefore in \@ref(eq:repobsbwremmlem) the expected value of $y_{ij}$ is conditional on the fact that the random variable $\alpha^*$ takes the value $\alpha_i$. Hence \@ref(eq:repobsbwremmlem) is a conditional mean
\begin{equation} E(y_{ij} | \alpha^* = \alpha_i) = \mu + \alpha_i (#eq:repobsbwcondexpmlem) \end{equation}
For notational simplicity, the $\alpha^$ is often ommitted. Taking expectation over $\alpha^$ leads to
\begin{equation} E_{\alpha^*}\left[E(y_{ij} | \alpha_i) \right] = E(y_{ij}) = \mu (#eq:repobsbwcondexpalphamlem) \end{equation}
The residuals are defined as
\begin{equation} e_{ij} = y_{ij} - E(y_{ij} | \alpha_i) = y_{ij} - (\mu + \alpha_i) (#eq:repobsbwresmlem) \end{equation}
With that definition, we can establish the model equation for an observation $y_{ij}$ as
\begin{equation} y_{ij} = \mu + \alpha_i + e_{ij} (#eq:repobsbwmodeleqmlem) \end{equation}
The properties of the residuals are assumed analogously to the fixed effects model. In summary, the properties are listed as
Together with \@ref(eq:repobsbwmodeleqmlem), we can establish the total variance of all observations $y_{ij}$ as
\begin{equation} var(y_{ij}) = var(\mu + \alpha_i + e_{ij}) = \sigma_{\alpha}^2 + \sigma_e^2 = \sigma_y^2 (#eq:repobsbwvarymlem) \end{equation}
This shows that the variance ($\sigma_y^2$) can be decomposed into the two variance components $\sigma_{\alpha}^2$ and $\sigma_e^2$. It is also noted that the intra-class covariance which corresponds to the covariance between body weights for the same animal can be written as
\begin{equation} cov(y_{ij}, y_{ij'}) = cov(\mu + \alpha_i + e_{ij}, \mu + \alpha_i + e_{ij'}) = \sigma_{\alpha}^2 \quad \text{ for } j \ne j' (#eq:repobsbwintraclasscovmlem) \end{equation}
In R, one of the packages that can handle random effects models is the package lme4
. For the dataset in Table \@ref(tab:bw-rep-meas-tab), this can be done as follows
library(lme4) lmer_bw_rep <- lmer(`Body Weight` ~ (1 | Animal), data = tbl_rep_obs_no_breed) summary(lmer_bw_rep)
Linear mixed effects models or just mixed models are a combination or a merger of fixed linear effects models and random models. That means mixed models contain both fixed effects and random effects. A first example of a dataset which can be modelled with a mixed model is shown in Table \@ref(tab:bw-breed-rep-obs-tab). In this dataset, the factor Breed
is regarded as a fixed effect whereas the influence of the animal on a single measurement is considered as a random effect. Hence, body weight $y_{ijk}$ which corresponds to repeated observation $k$ of animal $j$ from breed $i$ can be written as
\begin{equation} y_{ijk} = b_0 + b_i + \alpha_j + e_{ijk} (#eq:singleobsmlem) \end{equation}
where $b_0$ is the intercept, $b_i$ is the fixed effect of breed $i$, $\alpha_j$ is the random effect of animal $j$ and $e_{ijk}$ is the random residual. In matrix-vector notation equation \@ref(eq:singleobsmlem) takes the form
\begin{equation} \mathbf{y} = \mathbf{Xb} + \mathbf{Z}\alpha + \mathbf{e} (#eq:matvecnotmlem) \end{equation}
where $\mathbf{y}$ is the vector of length $n$ containing responses, $\mathbf{b}$ the vector of length $p$ with covariates, $\alpha$ the vector of length $q$ with random effects related to the repeated observations for an animal and $\mathbf{e}$ is the vector of length $n$ with random residuals. The matrices $\mathbf{X}$ and $\mathbf{Z}$ relate the different effects to the observations.
From equation \@ref(eq:singleobsmlem), we cannot really tell any difference to a fixed linear effects model. Only with the specification of the distributional properties of all the model components, it becomes clear that equation \@ref(eq:singleobsmlem) specifies a mixed model. The mixed model is characterized by two random variables
The datasets to be analysed with mixed models contain observations, denoted by the vector $\mathbf{y}$. Values $\alpha$ of $\alpha^$ are not observed and hence are unknown. When specifying the distributional properties of a mixed model, the unconditional distribution of $\alpha^$ and the conditional distribution of $(\mathbf{\mathcal{Y}}^|\alpha^)$ are given. The description of these distributions involve the form of the distribution and the values of the distributional parameters. The observations of the responses and of the covariates are used to estimate these parameters. The unconditional distribution of $\alpha^$ and the conditional distribution of $(\mathbf{\mathcal{Y}}^|\alpha^*)$ are both assumed to be multivariate normal distributions.
\begin{align} (\mathbf{\mathcal{Y}}^|\alpha^) & \sim \mathcal{N}(\mathbf{Xb} + \mathbf{Z}\alpha, \sigma^2 * I) \notag \ \alpha^* & \sim \mathcal{N}(\mathbf{0},\Sigma) (#eq:distalphaygivenalphamlem) \end{align}
The analysis of the dataset shown in Table \@ref(tab:bw-breed-rep-obs-tab) can be analysed using the function lmer()
of package lme4
.
mlem__rep_obs_breed <- lme4::lmer(`Body Weight` ~ Breed + (1|Animal), data = tbl_rep_obs_breed) summary(mlem__rep_obs_breed)
The variance components obtained by lme4::lmer()
are the same as what we have seen before as results of ANOVA. This is because, we are looking at balanced data.
In livestock breeding, linear mixed effects models are of interest when it comes to the evaluation of the genetic potential of selection candidates. From quantitative genetics, we know that parents with a superior genetic potential produce offspring which are better on average compared to the mean performance of animals from the same generation. The genetic potential of an animal is quantified by a concept which is referred to as breeding value. In what follows, we describe how breeding values for animals can be predicted using mixed models.
In a first application of mixed models for predicting breeding values, observation of daughter performance records were used to predict breeding values for sires. Assuming that sires are unrelated, i.e. they do not share any common ancestors, sire breeding values can be predicted similarly to the analysis of the repeated observations dataset. A possible mixed model for such an analysis might look as follows
\begin{equation} \mathbf{y} = \mathbf{Xb} + \mathbf{Zs} + \mathbf{e} (#eq:sireunrelatedmlem) \end{equation}
where $\mathbf{y}$ is the vector of length $n$ with responses, $\mathbf{b}$ is the vector of length $p$ with fixed effects, $\mathbf{s}$ is the vector of length $q$ with sire breeding values and $\mathbf{e}$ is the vector of length $n$ with random residuals. Matrices $\mathbf{X}$ and $\mathbf{Z}$ are design matrices which relate the observations to the respective effects.
A dataset that can be used to be analysed with a model such as shown in equation \@ref(eq:sireunrelatedmlem) is given in by the milk
dataset of the package pedigreemm
. The first six lines of this dataset are shown in Table \@ref(tab:milkdatapedigreemmtabmlem).
tbl_milk_pedigreemm_head <- tibble::as.tibble(head(pedigreemm::milk)) knitr::kable(tbl_milk_pedigreemm_head, booktabs = TRUE, longtable = TRUE, caption = "First six lines of milk dataset from package pedigreemm")
In Table \@ref(tab:milkdatapedigreemmtabmlem) the columns milk
, fat
, prot
and scs
stand for milk yield, fat yield, protein yield and somatic cell score for a given lactation of a cow, respectively and can be selected as response variables. The columns lact
, herd
and dim
are lactation, herd number and days in milk, respectively and these columns can be used as fixed effects or covariates. The sire column is used for the random sire breeding value effect in the model.
As already stated, if we assume that these sires are unrelated, the dataset can be analysed as shown above using the function lme4::lmer
.
In real datasets, the assumption of unrelated sires is unrealistic, because the selection process favors that male offspring of a given sire will be selected as sires again. As a consequence, the random sire effects are not independent. The dependence structure between the sire effects must be considered in the analysis. The sire model can still be written as shown in equation \@ref(eq:sireunrelatedmlem), but the variance-covariance matrix of the random sire effects ($\mathbf{s}$) is no longer an identity matrix $\mathbf{I}$ times a common variance component $\sigma_s^2$, but it can be written as
\begin{equation} var(\mathbf{s}) = \mathbf{A}_s * \sigma_s^2 (#eq:sirevarcovmlem) \end{equation}
where $\mathbf{A}_s$ is the sire relationship matrix. In a dataset with $q$ sires, the matrix $\mathbf{A}_s$ has dimensions $q\times q$ and it contains the proportions of sire effects that are passed from father to son. Together with the sire variance $\sigma_s^2$ this defines the variance-covariance structure of all sire effects. As an example, we can express the covariance $cov(s_i, s_k)$ of the sire effects between son $i$ and its sire $k$, as
\begin{equation} cov(s_i, s_k) = 1/2 * \sigma_s^2 (#eq:sirevarcovmlem) \end{equation}
where the factor $1/2$ stems from the fact that sire $k$ passes half of its genetic potential to its son $i$. Relating this single covariance back to the variance-covariance matrix $A_s$ means that elements $(i,k)$ and $(k,i)$ are both $1/2$. The diagonal elements of the matrix $\mathbf{A}_s$ are all equal to $1$. This means the variance $var(s_i)$ of each sire effect $s_i$ corresponds to the sire variance componentn $\sigma_s^2$.
An application of the sire model is shown in the dataset given in Table \@ref(tab:siremodeldatamlem) which is taken from r met$add("Mrode2005")
sigma_s2 <- 5 sigma_e2 <- 55 sigma_p2 <- sigma_s2 + sigma_e2 tbl_sire_model <- tibble::tibble(Animal = c(4:8), Sire = c(1,3,1,4,3), Sex = c("M","F","F","M","M"), WWG = c(4.5, 2.9, 3.9, 3.5, 5.0)) knitr::kable(tbl_sire_model, booktabs = TRUE, longtable = TRUE, caption = "Pre-weaning Gain in kg for five beef animals")
The objective is to predict breeding values for sires $1$, $3$ and $4$ based on the above dataset. The trait pre-weaning gain (WWG
) is taken as response and Sex
is assumed to be the only fixed effect. The following values for the variance components are assumed: $\sigma_s^2 = r sigma_s2
$ and $\sigma_e^2 = r sigma_e2
$.
This type of model where the structure of the variance-covariance matrix of the random effect is given by a pedigree cannot be fit by the package lme4
. An extension of lme4
is given in the R-package pedigreemm
. In pedigreemm
it is possible to specify the variance-covariance structure via a pedigree. For our example of the dataset in Table \@ref(tab:siremodeldatamlem), this can be done as follows
library(pedigreemm) ped_sire <- pedigree(sire = c(rep(NA,2), 1), dam = rep(NA,3), label = as.character(c(1,3,4))) lmem_sire <- pedigreemm( formula = WWG ~ Sex + (1 | Sire), data = tbl_sire_model, pedigree = list(Sire = ped_sire) ) (smry_lmem_sire <- summary(lmem_sire))
The output of pedigreemm::pedigreemm()
is equivalent to the one given by lme4::lmer()
. From this we can see that the sire variance results in an estimate of r attr(smry_lmem_sire$varcor[[1]], "stddev")[[1]]
. This does not correspond to the value that, we specified as an assumption. This has two reasons. The assumed sire variance was not estimated from the small dataset in Table \@ref(tab:siremodeldatamlem), but from a larger dataset not shown here. The second reason is that it is not possible with pedigreemm::pedigreemm()
to use an assumed variance component as input. The breeding values for the sires can be obtained by the function call
ranef(lmem_sire)
The predicted sire breeding values are also all equal to $0$. The reason for this is that the sire variance component was estimated to be $0$. Hence for small datasets, pedigreemm::pedigreemm()
cannot be used.
In a series of papers (r met$add("Henderson1953")
, r met$add("Henderson1963")
and r met$add("Henderson1975")
) which are summarized in r met$add("Henderson1982")
, a technique called mixed model equations was developed to solve for solutions of predicted values in a linear mixed effects model. For a given linear mixed effects model with $var(e) = I*\sigma_e^2$ and $var(s) = A_s * \sigma_s^2$
\begin{equation} \mathbf{y} = \mathbf{Xb} + \mathbf{Zs} + \mathbf{e} (#eq:sireunrelatedtwomlem) \end{equation}
the solutions for the fixed effects estimates and the predicted values of the random effects can be obtained by solving the following set of equations.
\begin{equation} \left[ \begin{array}{cc} X^TX & X^TZ \ Z^TX & Z^TZ + \lambda A_s^{-1} \end{array} \right] \left[ \begin{array}{c} \hat{b} \ \hat{s} \end{array} \right] = \left[ \begin{array}{c} X^Ty \ Z^Ty \end{array} \right] (#eq:siremmemlem) \end{equation}
where $\lambda = \sigma_e^2 / \sigma_s^2$. For our example the matrices $X$ and $Z$ are defined as
mat_X <- model.matrix(lm(formula = WWG ~ 0 + Sex, data = tbl_sire_model)) attr(mat_X, "assign") <- NULL attr(mat_X, "contrasts") <- NULL colnames(mat_X) <- NULL cat(paste0(rmdhelp::bmatrix(pmat = mat_X, ps_name = "X", ps_env = "$$"), collapse = "\n"), "\n")
and
mat_Z <- matrix(c(1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0), ncol = 3, byrow = TRUE) cat(paste0(rmdhelp::bmatrix(pmat = mat_Z, ps_name = "Z", ps_env = "$$"), collapse = "\n"), "\n")
The matrix $A_s^{-1}$ can be obtained by calling the function pedigreemm::getAinv()
on the defined sire pedigree.
mat_Ainv <- pedigreemm::getAInv(ped = ped_sire)
That results in
mat_Ainv <- as.matrix(mat_Ainv) cat(paste0(rmdhelp::bmatrix(pmat = mat_Ainv, ps_name = "A_s^{-1}", ps_env = "$$"), collapse = "\n"), "\n")
The vector $y$ corresponds to the vector of observations. With that we can solve the mixed model equations.
lambda <- sigma_e2 / sigma_s2 mat_xtx <- crossprod(mat_X) mat_xtz <- crossprod(mat_X, mat_Z) mat_ztx <- crossprod(mat_Z, mat_X) mat_ztzainv <- crossprod(mat_Z) + lambda * mat_Ainv mat_lhs <- rbind(cbind(mat_xtx,mat_xtz),cbind(mat_ztx,mat_ztzainv)) #rhs vec_y <- tbl_sire_model$WWG mat_xty <- crossprod(mat_X, vec_y) mat_zty <- crossprod(mat_Z, vec_y) mat_rhs <- rbind(mat_xty, mat_zty) # sol mat_sol <- solve(mat_lhs, mat_rhs) vec_fix <- mat_sol[1:2,] vec_rand <- mat_sol[3:(nrow(mat_sol)),]
For the fixed effects we get
tbl_fix_eff_sol <- tibble::tibble(Sex = c("F","M"), Solution = vec_fix) knitr::kable(tbl_fix_eff_sol, booktabs = TRUE, longtable = TRUE, caption = "Solutions for fixed Effect of Sex")
For the random sire breeding values, we get
tbl_rand_eff_sol <- tibble::tibble(Sire = c(1,3,4), Solution = vec_rand) knitr::kable(tbl_rand_eff_sol, booktabs = TRUE, longtable = TRUE, caption = "Solutions for random breeding values of sires")
\pagebreak
An extension of the sire model is called animal model. In an animal model not only the sires get predicted breeding values but all animals in the pedigree will be assigned a predicted breeding value. That is only possible, if we extend our dataset by the column of the dam of each animal for which we have observations. That leads to the following table.
sigma_u2 <- 4 * sigma_s2 sigma_e2 <- sigma_p2 - sigma_u2 tbl_animal_model <- tibble::tibble(Animal = c(4:8), Sire = c(1,3,1,4,3), Dam = c(NA, 2, 2, 5, 6), Sex = c("M","F","F","M","M"), WWG = c(4.5, 2.9, 3.9, 3.5, 5.0)) n_nr_founder <- min(tbl_animal_model$Animal) - 1 n_nr_animal <- max(tbl_animal_model$Animal) n_nr_rec <- nrow(tbl_animal_model) knitr::kable(tbl_animal_model, booktabs = TRUE, longtable = TRUE, caption = "Pre-weaning Gain in kg for five beef animals")
In Table \@ref(tab:animalmodeldatamlem) the Dam of animal $4$ is noted as NA
which stands for not available
. This means that the dam of animal $4$ is not known.
Because in an animal model all animals in the pedigree will get a predicted breeding value, we write the vector of breeding values as $\mathbf{u}$ and the complete animal model can be written as follows
\begin{equation} \mathbf{y} = \mathbf{Xb} + \mathbf{Zu} + \mathbf{e} (#eq:animalmodelmlem) \end{equation}
All other components have the same meaning as in the sire model. For the vector $\mathbf{u}$, we also have to define the expeted value ($E(\mathbf{u})$) and the variance-covariance matrix $var(\mathbf{u})$. Breeding values, by their very nature, are deviations from a population mean. This leads to the definition of the expected value of $\mathbf{u}$ to be $E(\mathbf{u}) = \mathbf{0}$. The variance-covariance matrix $var(\mathbf{u})$ of the vector of breeding values $\mathbf{u}$ is similar to the sire model given by the product of a matrix $\mathbf{A}$ and the variance component $\sigma_u^2$. The matrix $\mathbf{A}$ is called numerator relationship matrix. The diagonal elements $(\mathbf{A})_{ii}$ of the matrix $\mathbf{A}$ are computed as
\begin{equation} (\mathbf{A})_{ii} = 1 + F_i (#eq:diagnumrelmatmlem) \end{equation}
where $F_i$ is the inbreeding coefficient of animal $i$ which corresponds to half of the relationship coefficient of the parents $s$ and $d$ of animal $i$. As a formula this can be written as
\begin{equation} F_i = {1 \over 2} * (\mathbf{A})_{sd} (#eq:inbreedingcoeffmlem) \end{equation}
The offdiagonal elements of $\mathbf{A}$ are the proportionality constants which together with $\sigma_u^2$ form the covariance of the breeding values of two animals. If we look at two animals $i$ and $j$, the covariance $cov(u_i, u_j)$ can be written as
\begin{equation} cov(u_i, u_j) = (\mathbf{A})_{ij} * \sigma_u^2 (#eq:offdiagnumrelmatmlem) \end{equation}
The coefficients $(\mathbf{A}){ij}$ can be determined by decomposing both breeding values $u_i$ and $u_j$ recursively into the breeding of their parents until some common ancestors are found in the pedigree. Based on this decomposition, the covariance $cov(u_i,u_j)$ and with that the coefficient $(\mathbf{A}){ij}$ can be computed. If no common ancestors of $i$ and $j$ can be found in the pedigree the covariance $cov(u_i,u_j)$ is zero.
The example dataset shown in Table \@ref(tab:animalmodeldatamlem) cannot be analysed with the package pedigreemm
. The problem is that pedigreemm
does not allow to specify given variance components, but it wants to estimate the variance components from the dataset specified. In the small dataset with only one observation per animal, pedigreemm
cannot estimate both variance components $\sigma_e^2$ and $\sigma_u^2$.
But as already shown with the sire model, it is possible to get estimates of the fixed effects and predicted breeding values for all animals using the solutions to the following mixed model equations.
\begin{equation} \left[ \begin{array}{cc} X^TX & X^TZ \ Z^TX & Z^TZ + \lambda * A^{-1} \end{array} \right] \left[ \begin{array}{c} \hat{b} \ \hat{u} \end{array} \right] = \left[ \begin{array}{c} X^Ty \ Z^Ty \end{array} \right] (#eq:animalmmemlem) \end{equation}
where $\lambda = \sigma_e^2 / \sigma_u^2$. For our example the matrix $X$ is the same as for the sire model. The matrix $Z$ is defined as
mat_Z <- cbind(matrix(0, nrow = n_nr_rec, ncol = n_nr_founder), diag(1, nrow = n_nr_rec)) cat(paste0(rmdhelp::bmatrix(pmat = mat_Z, ps_name = "Z", ps_env = "$$"), collapse = "\n"), "\n")
The inverse $\mathbf{A}^1$ of the numerator relationship matrix can be computed by the function pedigreemm::getAinv()
library(pedigreemm) ped_ani <- pedigree(sire = c(rep(NA, n_nr_founder),1,3,1,4,3), dam = c(rep(NA, n_nr_founder),NA,2,2,5,6), label = as.character(1:n_nr_animal)) mat_Ainv <- getAInv(ped = ped_ani)
mat_Ainv <- as.matrix(mat_Ainv) cat(paste0(rmdhelp::bmatrix(pmat = round(mat_Ainv, digits = 4), ps_name = "\\mathbf{A}^1", ps_env = "$$"), collapse = "\n"), "\n")
Assuming that variance components were estimated from a different dataset, the following values can be used, $\sigma_u^2 = r sigma_u2
$ and $\sigma_e^2 = r sigma_e2
$. With all this information, mixed model equations can be solved.
lambda <- sigma_e2 / sigma_u2 mat_xtx <- crossprod(mat_X) mat_xtz <- crossprod(mat_X, mat_Z) mat_ztx <- crossprod(mat_Z, mat_X) mat_ztzainv <- crossprod(mat_Z) + lambda * mat_Ainv mat_lhs <- rbind(cbind(mat_xtx,mat_xtz),cbind(mat_ztx,mat_ztzainv)) #rhs vec_y <- tbl_sire_model$WWG mat_xty <- crossprod(mat_X, vec_y) mat_zty <- crossprod(mat_Z, vec_y) mat_rhs <- rbind(mat_xty, mat_zty) # sol mat_sol <- solve(mat_lhs, mat_rhs) vec_fix <- mat_sol[1:2,] vec_rand <- mat_sol[3:(nrow(mat_sol)),]
For the fixed effects we get
tbl_fix_eff_sol <- tibble::tibble(Sex = c("F","M"), Solution = vec_fix) knitr::kable(tbl_fix_eff_sol, booktabs = TRUE, longtable = TRUE, caption = "Solutions for fixed Effect of Sex")
For the random animal breeding values, we get
tbl_rand_eff_sol <- tibble::tibble(Animal = c(1:n_nr_animal), Solution = vec_rand) knitr::kable(tbl_rand_eff_sol, booktabs = TRUE, longtable = TRUE, caption = "Solutions for random breeding values of all animals")
Comparing the order of the breeding values of sires $1$, $3$ and $4$, it can be seen that they are not the same for the sire model and the animal model. Although, it has to be noted that the differences are small, but the fact that in the animal model all available information are considered for the prediction of the breeding values, can make a difference when it comes to the ranking of animals as potential parents according to their predicted breeding values.
Prediction of genomic breeding values can be done with two different modelling approaches.
In MEM random effects of markers are directly included in the model. For an idealized data set we can write
\begin{equation} y = 1_n \mu + Wq + e (#eq:asmgblupmem) \end{equation}
\begin{tabular}{lll} where & & \ & $y$ & vector of length $n$ with observations \ & $\mu$ & general mean denoting fixed effects \ & $1_n$ & vector of length $n$ of all ones \ & $q$ & vector of length $m$ of random SNP effects \ & $W$ & design matrix relating SNP-genotypes to observations \ & $e$ & vector of length $n$ of random error terms \end{tabular}
The vector $q$ contains a separate random effect for each SNP. Because the SNP effects are random, the expected value $E\left[q\right]$ and the variance $var(q)$ must be specified. In general, the random effects are defined as deviations and hence their expected value is $0$. This means $E\left[q\right] = 0$. The variance explained by each SNP corresponds to $\sigma_q^2$ and is assumed to be constant. The variance $var(e)$ of the random error terms is taken to be $var(e) = I * \sigma_e^2$ where $I$ is the identity matrix and $\sigma_e^2$ is the error variance.
The random marker effects can be predicted using the following mixed model equations.
\begin{equation} \left[ \begin{array}{cc} 1_n^T1_n & 1_n^TW \ W^T1_n & W^TW + \lambda_q * I \end{array} \right] \left[ \begin{array}{c} \hat{\mu} \ \hat{q} \end{array} \right] = \left[ \begin{array}{c} 1_n^Ty \ W^Ty \end{array} \right] (#eq:memmmemlem) \end{equation}
with $\lambda_q = \sigma_e^2 / \sigma_q^2$.
The genomic breeding value for a given animal $i$ with given genotypes at all SNP-marker positions is computed by summing over the appropriate predicted marker effects solutions $\hat{q}$ determined by the genotypes of animal $i$.
In a breeding value model a linear combination of all SNP effects are combined into a random genomic breeding value. This approach is meant when animal breeders are talking about Genomic BLUP (GBLUP). The mixed linear effects model in GBLUP corresponds to
\begin{equation} y = Xb + Zg + e (#eq:asmgblupbvm) \end{equation}
\begin{tabular}{lll} where & & \ & $y$ & vector of length $n$ with observations \ & $b$ & vector of length $r$ with fixed effects \ & $X$ & incidence matrix linking elements in $b$ to observations \ & $g$ & vector of length $t$ with random genomic breeding values \ & $Z$ & incidence matrix linking elements in $g$ to observations \ & $e$ & vector of length $n$ of random error terms \end{tabular}
The vector $g$ contains the genetic effects of all animals that are genotyped which means that they have genomic information based on SNP genotypes available. The expected values of all random effects is assumed to be $0$. The variance $var(g)$ of the random genomic breeding values is given by $var(g) = G * \sigma_g^2$. This expression looks very similar to the variance of the breeding values in the traditional BLUP animal model. The matrix $G$ is called genomic relationship matrix (GRM). The variance $var(e)$ of the random error terms is given by $var(e) = I * \sigma_e^2$.
Mostly the older animals for which SNP information is available may have observations ($y$) in the dataset. The younger animals may have SNP information but in most cases no information is available for them. The goal of GBLUP is to predict genomic breeding values for these animals. Depending on the number of genotyped animals which is in most cases smaller compared to the number of SNP loci, the BVM model has the following advantages over the MEM model
More recently with the number of genotyped animals growing very fast, these advantages are no longer as important as they used to be.
Genomic breeding values from a BVM can be predicted by solving the following mixed model equations.
\begin{equation} \left[ \begin{array}{cc} X^TX & X^TZ \ Z^TX & Z^TZ + \lambda_g * G^{-1} \end{array} \right] \left[ \begin{array}{c} \hat{b} \ \hat{g} \end{array} \right] = \left[ \begin{array}{c} X^Ty \ Z^Ty \end{array} \right] (#eq:bvmmmemlem) \end{equation}
with $\lambda_g = \sigma_e^2 / \sigma_g^2$.
The variance-covariance matrix between the genetic effects $g$ in model \@ref(eq:asmgblupbvm) is proportional to the genomic relationship matrix $G$. Analogously to the traditional BLUP animal model where the variance-covariance matrix of the random breeding values is proportional to the numerator relationship matrix $A$.
Because the traditional pedigree-based BLUP animal model is very well respected in animal breeding and the defined model \@ref(eq:asmgblupbvm) produces an analogy of the genomic evaluation model to the already known animal model the following properties of $g$ and the genomic relationship matrix $G$ are essential.
The matrix $G$ can be computed based on SNP genotypes. In what follows the material of r met$add("VanRaden2008")
and r met$add("Gianola2009")
is used to derive the genomic relationship matrix.
Based on the SNP marker information the marker effects in the vector $q$ can be estimated. Hence, we assume that the vector $q$ is known. The property that $g$ should be a linear combination of the effects in $q$ means that there exists a matrix $U$ for which we can write
\begin{equation} g = U \cdot q (#eq:asm-gblup-gfromq) \end{equation}
The matrix $U$ is determined based on the desired properties described above.
The genetic effects $g$ should be defined as deviation from a common basis. Due to this definition the expected value of the genetic effect is determined by $E\left[g \right] = 0$. This requirement has the following consequences for the matrix $U$.
Let us have a look at the random variable $w$ which takes the SNP-genotype codes in the matrix $W$ in the MEM model given in \@ref(eq:asmgblupmem). Let us further assume that the SNP loci are in Hardy-Weinberg equilibrium. Then $w$ can take the following values
\begin{equation} w = \left{ \begin{array}{lll} -1 & \text{with probability} & (1-p)^2\ 0 & \text{with probability} & 2p(1-p) \ 1 & \text{with probability} & p^2 \end{array} \right. (#eq:asm-gblup-RandVarGenotypesW) \end{equation}
The expected value of $w$ corresponds to
\begin{equation} E\left[w \right] = (-1) * (1-p)^2 + 0 * 2p(1-p) + 1 * p^2 = -1 + 2p - p^2 + p^2 = 2p - 1 (#eq:asm-gblup-exptectedw) \end{equation}
The matrix $U$ is computed as the difference between the matrix $W$ and the matrix $P$ where the matrix $P$ corresponds to column vectors which have elements corresponding to $2p_j-1$ where $p_j$ corresponds to the allele frequency of the positive allele at SNP locus $j$. The following table gives an overview of the elements of matrix $U$ for the different genotypes at SNP locus $j$.
\vspace{2ex} \begin{center} \begin{tabular}{|l|l|l|} \hline Genotype & Genotypic Value & Coding in Matrix $U$\ \hline $(G_2G_2)_j$ & $-2p_jq_j$ & $-1-2(p_j-0.5) = -2p_j$\ $(G_1G_2)_j$ & $(1-2p_j)q_j$ & $-2(p_j-0.5) = 1-2p_j$\ $(G_1G_1)_j$ & $(2-2p_j)q_j$ & $1-2(p_j-0.5) = 2 - 2p_j$\ \hline \end{tabular} \end{center}
Here we assume that for a locus $G_j$, the allele $(G_1)_j$ has a positive effect and occurs with frequency $p_j$. We can now verify that with this definition of $U$, the expected value for a genetic effect determined by the locus $j$ corresponds to
\begin{align} E\left[g \right]_j &= \left[(1-p_j)^2 * (-2p_j) + 2p_j(1-p_j)(1-2p_j) + p_j^2(2 - 2p_j) \right]q_j \notag \ &= 0 \end{align}
As already postulated the variance-covariance matrix of the genetic effects should be proportional to the genomic relationship matrix $G$.
\begin{equation} var(g) = G * \sigma_g^2 (#eq:asm-gblup-varcovargeneffect) \end{equation}
Computing the same variance-covariance matrix based on equation \@ref(eq:asm-gblup-gfromq)
\begin{equation} var(g) = U \cdot var(q) \cdot U^T (#eq:asm-gblup-uvarqut) \end{equation}
The variance-covariance matrix of the SNP effects is $var(q) = I * \sigma_q^2$. Inserting this into \@ref(eq:asm-gblup-uvarqut) we get $var(g) = UU^T \sigma_q^2$.
In [@Gianola2009] the variance component $\sigma_g^2$ was derived from $\sigma_q^2$ leading to
\begin{equation} \sigma_g^2 = 2 \sum_{j=1}^m p_j(1-p_j)\sigma_q^2 (#eq:asm-gblup-defsigmag2) \end{equation}
Now we combine all relationships for $var(g)$ leading to
\begin{equation} var(g) = G * \sigma_g^2 = UU^T\sigma_q^2 (#eq:asm-gblup-derivegstep1) \end{equation}
In \@ref(eq:asm-gblup-derivegstep1), $\sigma_g^2$ is replaced by the result of \@ref(eq:asm-gblup-defsigmag2).
\begin{equation} G * 2 \sum_{j=1}^m p_j(1-p_j)\sigma_q^2 = UU^T\sigma_q^2 (#eq:asm-gblup-derivegstep2) \end{equation}
Dividing both sides of \@ref(eq:asm-gblup-derivegstep2) by $\sigma_q^2$ and solving for $G$ gives us a formula for the genomic relationship matrix $G$
\begin{equation} G = \frac{UU^T}{2 \sum_{j=1}^m p_j(1-p_j)} (#eq:asm-gblup-derivegresult) \end{equation}
The genomic relationship matrix $G$ allows to predict genomic breeding values for animals with SNP-Genotypes without any observation in the dataset. This fact is the basis of the large benefit of genomic selection. As soon as a young animal is born, its SNP genotypes can be determined and a genomic breeding value can be predicted. This genomic breeding value is much more accurate then the traditional breeding value based only on ancestral information.
The BVM model given in \@ref(eq:asmgblupbvm) is a mixed linear effects model. The solution for the unknown parameters can be obtained by solving the mixed model equations shown in \@ref(eq:asm-gblup-gblupmme). In this form the Inverse $G^{-1}$ of $G$ and the vector $\hat{g}$ of predicted genotypic breeding values are split into one part corresponding to the animals with observations and a second part for the animals without phenotypic information.
matCoeff <- matrix(c("X^TX","X^TZ","0", "Z^TX","Z^TZ + G^{(11)}","G^{(12)}", "0","G^{(21)}","G^{(22)}"), ncol = 3, byrow = TRUE) vecSol <- c("\\hat{b}","\\hat{g}_1","\\hat{g}_2") vecRhs <- c("X^Ty","Z^Ty","0") ### # show mme cat("\\begin{equation}\n") cat(paste0(rmdhelp::bmatrix(pmat = matCoeff), collapse = '\n'), '\n') cat(paste0(rmdhelp::bcolumn_vector(pvec = vecSol), collapse = '\n'), '\n') cat(" = \n") cat(paste0(rmdhelp::bcolumn_vector(pvec = vecRhs), collapse = '\n'), '\n') cat("(\\#eq:asm-gblup-gblupmme)") cat("\\end{equation}\n")
The matrix $G^{(11)}$ denotes the part of $G^{-1}$ corresponding to the animals with phenotypic observations. Similarly, $G^{(22)}$ stands for the part of the animals without genotypic observations. The matrices $G^{(12)}$ and $G^{(21)}$ are the parts of $G^{-1}$ which link the two groups of animals. The same partitioning holds for the vector of predicted breeding values. The vector $\hat{g}_1$ contains the predicted breeding values for the animals with observations and the vector $\hat{g}_2$ contains the predicted breeding values of all animals without phenotypic observations.
Based on the last line of \@ref(eq:asm-gblup-gblupmme) the predicted breeding values $\hat{g}_2$ of all animals without phenotypic observations can be computed from the predicted breeding values $\hat{g}_1$ from the animals with observations.
\begin{equation} \hat{g}_2 = -\left( G^{22}\right)^{-1}G^{21}\hat{g}_1 (#eq:GenomicBvAnimalNoPhen) \end{equation}
Equation \@ref(eq:GenomicBvAnimalNoPhen) is referred to as genomic regression of predicted breeding values of animals without observation on the predicted genomic breeding values of animals with observations.
In real-world livestock breeding datasets not all animals are genotyped. But we want to have predicted breeding values for all animals in a population. Futhermore, the genomic information of the genotyped animals should also give more accurate predicted breeding values for related animals without genomic information.
The single step genomic BLUP model can be specified as
\begin{equation} y = Xb + Zg + e (#eq:singlestepgblupbvm) \end{equation}
with $var(g) = H * \sigma_g^2$ and $var(e) = I * \sigma_e^2$. At this point it is important to note that the vector $g$ of genomic breeding values can be split into two parts
$$g = \left[\begin{array}{c} g_1 \ g_2 \end{array} \right]$$
where $g_1$ is the vector of breeding values for non-genotyped animals and $g_2$ is the vector of genotyped animals.
\begin{equation} \left[ \begin{array}{cc} X^TX & X^TZ \ Z^TX & Z^TZ + \lambda * H^{-1} \end{array} \right] \left[ \begin{array}{c} \hat{b} \ \hat{g} \end{array} \right] = \left[ \begin{array}{c} X^Ty \ Z^Ty \end{array} \right] (#eq:singlestepmmemlem) \end{equation}
where here $\lambda = \sigma_e^2 / \sigma_g^2$.
The above required inverse matrix $H^{-1}$ can be shown (e.g. in r met$add("Legarra2014")
) to correspond to
$$H^{-1} = A^{-1} + \left(\begin{array}{cc} 0 & 0 \ 0 & G^{-1} - A_{22}^{-1} \end{array} \right) $$
where $A^{-1}$ is the inverse numerator relationship matrix and $A_22$ corresponds to the part of the numerator relationship matrix containing all genotyped animals.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.