knitr::opts_chunk$set(echo=TRUE)
Linear mixed model for breeding values:
\begin{align} \boldsymbol{y} = X \boldsymbol{\beta} + Z_a \boldsymbol{a} + \boldsymbol{\epsilon} \end{align}
where:
$\boldsymbol{y}$: $n \times 1$ vector of observations
$X$: $n \times p$ design matrix for fixed-effect factors, of rank $\nu$
$\boldsymbol{\beta}$: $p \times 1$ vector of fixed effects
$Z_a$: $n \times q$ design matrix for $q$ breeding values
$\boldsymbol{a}$: $q \times 1$ vector of random breeding values
$\boldsymbol{\epsilon}$: vector of $n$ residuals
Distributions of the random effects:
$\boldsymbol{\epsilon} \sim \mathcal{N}(\boldsymbol{0}, R)$ with $R$ diagonal
$\boldsymbol{a} \sim \mathcal{N}(\boldsymbol{0}, \sigma_a^2 G)$ and $\tau_a := 1 / \sigma_a^2$
Linear mixed model for SNP effects:
\begin{align} \boldsymbol{y} = X \boldsymbol{\beta} + Z_a Z \boldsymbol{u} + \boldsymbol{\epsilon} \end{align}
where:
$Z$: $n \times m$ design matrix for $m$ markers
$\boldsymbol{u}$: $m \times 1$ vector of random marker effects
Distributions of the random effects:
This model is also known as RRBLUP where "RR" stands for "ridge regression".
For GBLUP and SNPBLUP to be "equivalent", $\boldsymbol{a}$ and $Z \boldsymbol{u}$ must have the same probability distribution: $\boldsymbol{a} \overset{d}{=} Z \boldsymbol{u}$.
Let $M$ be the $n \times m$ matrix with $M_{ij} \in {0,1,2}$ so that it contains the number of copies of an arbitrary allele used as reference for SNP $j$ in individual $i$. More generally, if SNP genotypes are imputed, then $M_{ij} \in [0,2]$.
A first condition for equivalence between GBLUP and SNPBLUP is that $Z$ corresponds to $M$ after centering the columns, so that the sample mean of the breeding values equals zero ($\bar{a} = 0$). It ensures the uniqueness of the definition of $\boldsymbol{a}$; otherwise, a different coding of the SNP genotypes in $M$ would lead to different breeding values $\boldsymbol{a}$. This can typically be achieved using the frequencies of each reference allele, noted $p_j$'s, as computed from the sample at hand:
$Z_{ij} = M_{ij} - 2 p_j$
A second condition is that $G$ is estimated using $Z$ such that:
$G = \frac{Z Z^T}{2 \sum_j p_j(1-p_j)}$
As a result: $\sigma_a^2 = \sigma_u^2 \, \times \, 2 \sum_j p_j(1-p_j)$
$(Q + B S C^T)^{-1} = Q^{-1} - Q^{-1} B (S^{-1} + C^T Q^{-1} B)^{-1} C^T Q^{-1}$
Applying the formula for the BLUP:
$\hat{\boldsymbol{u}} := \text{BLUP}(\boldsymbol{u}) = (Z^T R^{-1} Z + \tau_u \text{I})^{-1} Z^T R^{-1} (\boldsymbol{y} - X \hat{\boldsymbol{\beta}})$
$\hat{\boldsymbol{a}} = Z \hat{\boldsymbol{u}} \, \Rightarrow \hat{\boldsymbol{a}} = Z (Z^T R^{-1} Z + \tau_u \text{I})^{-1} Z^T R^{-1} (\boldsymbol{y} - X \hat{\boldsymbol{\beta}})$
... see Strandén and Garrick (2009) ...
$\hat{\boldsymbol{a}} = (R^{-1} + \tau_a G^{-1})^{-1} R^{-1} (\boldsymbol{y} - X \hat{\boldsymbol{\beta}})$
Moving $R^{-1}$ inside the parentheses:
$\hat{\boldsymbol{a}} = (\text{I} + R \tau_a G^{-1})^{-1} (\boldsymbol{y} - X \hat{\boldsymbol{\beta}})$
And now moving $G^{-1}$ outside of the parentheses gives equation 5 of Strandén and Garrick (2009):
$\hat{\boldsymbol{a}} = G (G + R \tau_a)^{-1} (\boldsymbol{y} - X \hat{\boldsymbol{\beta}})$
TODO
... see Strandén and Garrick (2009) ...
$\hat{\boldsymbol{u}} = \tau_a \sigma_u^2 Z^T (G + R \tau_a)^{-1} (\boldsymbol{y} - X \hat{\boldsymbol{\beta}})$
Hence: $\hat{\boldsymbol{u}} = \tau_a \sigma_u^2 Z^T G^{-1} \hat{\boldsymbol{a}}$. First, we fit the GBLUP model by ReML, then we compute $\hat{\boldsymbol{a}}$, and use it to compute $\hat{\boldsymbol{u}}$.
Even though the SNPBLUP model assumes a single variance for the SNP effects, the contribution of the jth SNP to the overall genetic variance can be deduced as in Zhang et al (2010):
$\hat{\sigma}_{u,j}^2 = \hat{u}_j^2 \, 2 p_j (1 - p_j)$
In a GWAS, it is more common to model the effect of each SNP as fixed, and to fit the model SNP by SNP.
For each SNP $j$: $\boldsymbol{y} = \boldsymbol{1} \mu + \boldsymbol{z}_j b_j + \boldsymbol{a} + \boldsymbol{e}$ with $\boldsymbol{a} \sim \mathcal{N}(\boldsymbol{0}, \sigma_a^2 G)$ and $\boldsymbol{e} \sim \mathcal{N}(\boldsymbol{0}, \sigma_e^2 Id)$, where $\boldsymbol{z}_j$ contains centered gene contents at SNP $j$.
Equivalently: $\boldsymbol{y} = \boldsymbol{z}_j b_j + \boldsymbol{\epsilon}$ with $\boldsymbol{\epsilon} \sim \mathcal{N}(\boldsymbol{0}, V)$ where $V = \sigma_a^2 G + \sigma_e^2 Id$.
The BLUE of $b_j$ is: $\hat{b}_j = (\boldsymbol{z}_j^T V^{-1} \boldsymbol{z}_j)^{-1} \boldsymbol{z}_j^T V^{-1} \boldsymbol{y}$ and $Var(\hat{b}_j) = (\boldsymbol{z}_j^T V^{-1} \boldsymbol{z}_j)^{-1}$.
The $t$-statistic for "$H_0: b_j = 0$" then is: $tb_j := \hat{b}_j / \sqrt{Var(\hat{b}_j)}$.
In the SNPBLUP model, the BLUPs of the SNP effects, $\hat{\boldsymbol{u}}$, as well as their variance, $Var(\hat{\boldsymbol{u}})$, can be obtained by inverting the coefficient matrix of the MMEs.
The $t$-statistic to test the null hypothesis for a given SNP, "$H_0: u_j = 0$", then is: $tu_j := \hat{u}_j / \sqrt{Var(\hat{u}_j)}$.
Quite remarkably, it can be shown that $tb_j = tu_j$.
TODO: suppl of Bernal Rubio et al 2016
Hence, the GBLUP-SNPBLUP model can be used to fit the data, and then a GWAS can be performed without having to fit the SNP-by-SNP model.
Caution however, it is not because the test statistics are equal that the estimates of the SNP effects are equal. Because GWAS is often used to prioritize which SNP(s) should be investigated further and functionally validated, the goal is not simply to detect them, but also to estimate their effects as accurately as possible.
See Maeir et al (2015) and Lu et al (2018) for the usage of backsolving in a multivariate setting.
Strandén I, Garrick DJ. 2009. Technical note: Derivation of equivalent computing algorithms for genomic predictions and reliabilities of animal merit. Journal of Dairy Science. 92(6):2971–2975. doi:10.3168/jds.2008-1929.
Zhang Z, Liu J, Ding X, Bijma P, De Koning D-J, Zhang Q. 2010. Best Linear Unbiased Prediction of Genomic Breeding Values Using a Trait-Specific Marker-Derived Relationship Matrix. PLoS ONE. 5(9):e12648. doi:10.1371/journal.pone.0012648
Wang H, Misztal I, Aguilar I, Legarra A, Muir WM. 2012. Genome-wide association mapping including phenotypes from relatives without genotypes. Genetics Research. 94(02):73–83. doi:10.1017/S0016672312000274.
Gualdrón Duarte JL, Cantet RJ, Bates RO, Ernst CW, Raney NE, Steibel JP. 2014. Rapid screening for phenotype-genotype associations by linear transformations of genomic evaluations. BMC Bioinformatics. 15(1):246. doi:10.1186/1471-2105-15-246.
Bernal Rubio YL, Gualdrón Duarte JL, Bates RO, Ernst CW, Nonneman D, Rohrer GA, King A, Shackelford SD, Wheeler TL, Cantet RJC, et al. 2016. Meta‐analysis of genome‐wide association from genomic prediction models. Animal Genetics. 47(1):36–48. doi:10.1111/age.12378.
Chen C, Steibel JP, Tempelman RJ. 2017. Genome-Wide Association Analyses Based on Broadly Different Specifications for Prior Distributions, Genomic Windows, and Estimation Methods. Genetics. 206(4):1791–1806. doi:10.1534/genetics.117.202259.
Lu Y, Vandehaar MJ, Spurlock DM, Weigel KA, Armentano LE, Connor EE, Coffey M, Veerkamp RF, De Haas Y, Staples CR, et al. 2018. Genome-wide association analyses based on a multiple-trait approach for modeling feed efficiency. Journal of Dairy Science. 101(4):3140–3154. doi:10.3168/jds.2017-13364.
Aguilar I, Legarra A, Cardoso F, Masuda Y, Lourenco D, Misztal I. 2019. Frequentist p-values for large-scale-single step genome-wide association, with an application to birth weight in American Angus cattle. Genet Sel Evol. 51(1):28. doi:10.1186/s12711-019-0469-3.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.