knitr::opts_chunk$set(echo=TRUE, warning=FALSE, message=FALSE)
Genetic connectedness quantities the extent of linkage between individuals across management units. The concept of genetic connectedness can be extended to measure a connectedness level between training and testing sets in the whole-genome prediction area. However, there is no user-friendly R package that offers a comprehensive list of connectedness metrics. In this project, we aim to develop an R package, which utilizes either pedigree or genomic data to measure the connectedness between individuals across units.
This package implements three connectedness metrics which are functions of prediction error variance (PEV) matrix, including prediction error variance of difference (PEVD), coefficient of determination (CD) and prediction error correlation (r), coupled with three summary methods for each statistic. For example, the PEVD across units is summarized as 1) average PEV of all pairwise differences between individuals across units; 2) average PEV within and across units; 3) use a contrast vector. Analogous summary methods will be also applied to CD and r statistics. Three additional metrics approximating connectedness using variance of estimates of unit effects (VE) are included, such as variance of estimates of units effects difference (VED), coefficient of determination of VED (CDVED), and connectedness rating (CR). Within each metric, three different methods are names according to the number of corrected factors (e.g., 0, 1 and 2) for fixed effects, such as VED0, VED1 and VED2. Similar corrected function is also applied for metrics of CDVED and CR.
This R package is hosting on a GitHub page accompanied with a detailed vignette document. The core functions of the R code are rewritten with C++ to improve computational speed by taking the advantage of the Rcpp package. We expect this R package will provide a comprehensive tool for genetic connectedness analysis and whole genome prediction.
A PEV matrix typically serves as the kernel of connectedness measures, which can be obtained from Henderson's mixed model equations (MME) [@Henderson1984]. Assume a standard linear mixed model $\mathbf{y} = \mathbf{Xb} + \mathbf{Zu} + \boldsymbol{\epsilon}$, where $\mathbf{y}$, $\mathbf{b}$, $\mathbf{u}$ and $\boldsymbol{\epsilon}$ refer to a vector of phenotypes, effects of management units, random additive genetic effects and residuals, respectively. The $\mathbf{X}$ and $\mathbf{Z}$ are incidence matrices which associate management units effects and individuals to phenotypes, respectively. The variance-covariance structure of this mixed model follows
\begin{align}
\begin{pmatrix}
\mathbf{y} \
\mathbf{u} \
\mathbf{\boldsymbol{\epsilon}}
\end{pmatrix}
&\sim
N \left [
\begin{pmatrix}
\mathbf{Xb}\
0 \
0
\end{pmatrix},
\begin{pmatrix}
\mathbf{ZK}\sigma^2_u \mathbf{Z}' + \mathbf{I}\sigma^2_{\epsilon} & \mathbf{ZK}\sigma^2_u & \mathbf{I}\sigma^2_{\epsilon} \
\mathbf{KZ'}\sigma^2_u & \mathbf{K}\sigma^2_u & 0 \
\mathbf{I}\sigma^2_{\epsilon} & 0 & \mathbf{I}\sigma^2_{\epsilon}
\end{pmatrix}
\right ] .
\end{align}
where $\sigma^2_u$ and $\sigma^2_e$ refer to additive genetic variance and residual variance, respectively. The $\mathbf{K}$ is a positive (semi)definite relationship matrix. Following this, the mixed model equation [@Henderson1984] can be written as
\begin{align}
\begin{bmatrix}
\mathbf{X'X} & \mathbf{X'Z} \
\mathbf{Z'X} & \mathbf{Z'Z + K^{-1}\lambda}
\end{bmatrix}
\begin{bmatrix}
\mathbf{\widehat{b}}\
\mathbf{\widehat{u}}
\end{bmatrix}
\mathbf{=}
\begin{bmatrix}
\mathbf{X'y} \
\mathbf{Z'y}
\end{bmatrix}
\label{MMEeq}
\end{align}
where $\lambda$ is a ratio of variance components which equals to $\frac{\sigma^2_{\epsilon}}{\sigma^2_u}$.
The inverse of the MME coefficient matrix derived from this model is
\begin{align}
\mathbf{C}^{-1} &=
\begin{bmatrix}
\mathbf{X'X} & \mathbf{X'Z} \
\mathbf{Z'X} & \mathbf{Z'Z} + \mathbf{K}^{-1} \lambda
\end{bmatrix}^{-1} \
&=
\begin{bmatrix}
\mathbf{C}^{11} & \mathbf{C}^{12} \
\mathbf{C}^{21} & \mathbf{C}^{22}
\end{bmatrix}.
\end{align}
With this, the PEV of $u$ is defined as [@Henderson1984]
\begin{align}
\text{PEV}(u) &= \text{Var}(\widehat{u} - u) \
&= \text{Var}(u | \widehat{u}) \
&= (\mathbf{Z'MZ} + \mathbf{K}^{-1} \lambda)^{-1} \sigma^2_{\epsilon} \
&= \mathbf{C}^{22} \sigma^2_{\epsilon},
\end{align}
where $\mathbf{M} = \mathbf{I} - \mathbf{X}(\mathbf{X}' \mathbf{X})^{-}\mathbf{X}'$ is the absorption (projection) matrix for fixed effects.
The derivation of PEV oftentimes results in a heavy computational demanding for the high-dimensional dataset. @kennedy1993considerations proposed a direct approximation of mean PEV over unit (PEV_Mean) based on VE, that is
\begin{align} \text{Var}(\widehat{b}) = [\mathbf{X}' \mathbf{X} - \mathbf{X}' \mathbf{Z}(\mathbf{Z'Z} + \mathbf{K}^{-1} \lambda)^{-1}\mathbf{Z}' \mathbf{X}]^{-1}\sigma^2_{\epsilon}. \end{align}
However, @holmes2017 pointed out this equation is not an exact approximation of PEV_Mean, and an according correction is suggested to achieve the equality. When unit effect is the only fixed effect including in the model, an approximate mean of PEV over unit is \begin{equation} \text{PEV_Mean} = \text{Var}(\widehat{b}) - \sigma^2_{\epsilon} (\mathbf{X}' \mathbf{X})^{-1}, \end{equation} where $\sigma^2_{\epsilon} (\mathbf{X}' \mathbf{X})^{-1}$ corrects the number of records within management units. An additional correction was added to derive the mean of PEV over unit which accounts for the additional fixed effects excepts unit effect in the model as [@holmes2017] \begin{align} \text{PEV_Mean} &= \text{Var}(\widehat{b_1}) - \sigma^2_{\epsilon} (\mathbf{X_1}' \mathbf{X_1})^{-1} \ & + (\mathbf{X_1}'\mathbf{X_1})^{-1}\mathbf{X_1}'\mathbf{X_2}\text{Var}(\widehat{b_2})\mathbf{X_2}'\mathbf{X_1}(\mathbf{X_1}'\mathbf{X_1})^{-1} \ & + (\mathbf{X_1}'\mathbf{X_1})^{-1}\mathbf{X_1}'\mathbf{X_2}\text{Cov}(\widehat{b_2}, \widehat{b_1}) \ & + \text{Cov}(\widehat{b_1}, \widehat{b_2})\mathbf{X_2}'\mathbf{X_1}(\mathbf{X_1}'\mathbf{X_1})^{-1} \end{align} where the $\mathbf{X_1}$ and $\mathbf{X_2}$ represent incidence matrix for management units and other fixed effects, respectively. The $\widehat{b_1}$ refers to the estimates of management units effects, and $\widehat{b_2}$ indicates the estimates of other fixed effects.
In general, connectedness metrics can be grouped into two sets, which are the function of PEV and VE, respectively.
The PEVD [@kennedy1993considerations] measures the prediction error variance difference of breeding values between individuals from different management units. The PEVD between two individuals of $i$ and $j$ can be expressed as \begin{align} \text{PEVD}(\widehat{u_i} - \widehat{u_j}) &= [ \text{PEV}(\widehat{u_i}) + \text{PEV}(\widehat{u_j}) - 2 \text{PEC}(\widehat{u_i}, \widehat{u_j})] \ &= (\mathbf{C}^{22}{ii} - \mathbf{C}^{22}{ij} - \mathbf{C}^{22}{ji} + \mathbf{C}^{22}{jj}) \sigma^2_{\epsilon} \ &= (\mathbf{C}^{22}{ii} + \mathbf{C}^{22}{jj} - 2\mathbf{C}^{22}{ij}) \sigma^2{\epsilon} , \end{align} where PEC$_{ij}$ is the off-diagonal element of the PEV matrix which indicates the prediction error covariance or covariance between errors of genetic values.
Individual Average PEVD A calculation of summary PEVD can be traced back to @kennedy1993considerations as average PEV of all pairwise differences between individuals between two management units \begin{align} \text{PEVD}{i'j'} &= \frac{1}{n{i'} \cdot n_{j'}} \sum \text{PEVD}_{i'j'} , \end{align} $n_{i'}$ and $n_{j'}$ are the total number of records in units $i'$ and $j'$, accordingly. The $\sum \text{PEVD}_{i'j'}$ was the sums of all pairwise differences between two units.
Group Average PEVD We can also summarize PEVD using average PEV within and across management units \begin{align} \text{PEVD}{i'j'} &= \overline{\text{PEV}}{i'i'} + \overline{\text{PEV}}{j'j'} -2\overline{\text{PEC}}{i'j'}, \end{align} where the $\overline{\text{PEV}}{i'i'}$, $\overline{\text{PEV}}{i'i'}$ and $\overline{\text{PEC}}_{i'j'}$ indicate the mean PEV in $i'$ management unit, $j'$ mamagement unit, and mean prediction error covariance between $i'$ and $j'$ units.
Contrast PEVD
Alternatively, a summary of PEVD across any pair of management units under a contrast form is [@laloe1993; @yu2017]
\begin{align}
\text{PEVD}(\mathbf{x}) &= \mathbf{x}' \mathbf{C}^{22} \mathbf{x} \sigma^2_{\epsilon},
\end{align}
where the sum of elements in $\mathbf{x}$ equals to zero.
| The coefficient of determination (CD) measures the precision of the estimates breeding values. The pairwise CD between individual $i$ and $j$ is defined as [@laloe1996] \begin{align} \text{CD}{ij} &= \frac{\text{Var}(\mathbf{u}) - \text{Var}(\mathbf{u}|\mathbf{\widehat{u}})}{\text{Var}(\mathbf{u})} \ &= 1 - \frac{\text{Var}(\mathbf{u}|\mathbf{\widehat{u}})}{\text{Var}(\mathbf{u})} \ &= 1 - \lambda \frac{\mathbf{C}^{22}{ii} + \mathbf{C}^{22}{jj} - 2\mathbf{C}^{22}{ij} }{\mathbf{K}{ii} + \mathbf{K}{jj} - 2\mathbf{K}_{ij}}, \end{align} The $ii$ and $jj$ indicate the diagonal elements of the $\mathbf{K}$ matrix for $ith$ and $jth$ individuals, accordingly. The $ij$ refers to the off-diagonal elements of the K matrix, and $\lambda$ is variance ration of $\frac{\sigma^2_e}{\sigma^2_a}$.
Individual Average CD The CD between individuals across two management units can be also derived with average PEV of all pairwise differences \begin{align} \text{CD}{i'j'} &= 1 - \frac{\frac{1}{n{i'} \cdot n_{j'}} \cdot \sigma^2_e \cdot \sum \mathbf{(C^{22}}{i'i'} + \mathbf{C^{22}}{j'j'} - 2\mathbf{C^{22}}{i'j'})}{\frac{1}{n{i'} \cdot n_{j'}} \cdot \sigma^2_u \cdot \sum (\mathbf{K}{i'i'} + \mathbf{K}{j'j'} - 2\mathbf{K}{i'j'})}\ &= 1 - \frac{\frac{1}{n{i'} \cdot n_{j'}}\sum \text{PEVD}{i' j'}}{\frac{1}{n{i'} \cdot n_{j'}} \cdot \sigma^2_u \cdot \sum (\mathbf{K}{i'i'} + \mathbf{K}{j'j'} - 2\mathbf{K}{i'j'})}\ &= 1 - \frac{\sum \text{PEVD}{i'j'}}{\sigma^2_u \cdot \sum (\mathbf{K}{i'i'} + \mathbf{K}{j'j'} - 2\mathbf{K}_{i'j'})}. \end{align}
Group Average CD Similar, average PEV within and across management units can be used to calculate the CD summary statistic \begin{align} \text{CD}{i'j'} &= 1 - \frac{\sigma^2_e \cdot \overline{\mathbf{C^{22}}}{i'i'} + \overline{\mathbf{C^{22}}}{j'j'} -2\overline{\mathbf{C^{22}}}{i'j'}}{\sigma^2_u \cdot (\overline{\mathbf{K}}{i'i'} + \overline{\mathbf{K}}{j'j'} - 2\overline{\mathbf{K}}{i'j'})}\ &= 1 - \frac{\overline{\text{PEV}}{i'i'} + \overline{\text{PEV}}{j'j'} -2\overline{\text{PEC}}{i'j'}}{\sigma^2_u \cdot (\overline{\mathbf{K}}{i'i'} + \overline{\mathbf{K}}{j'j'} - 2\overline{\mathbf{K}}{i'j'})}\ &= 1 - \frac{\text{PEVD}{i'j'}}{\sigma^2_u \cdot (\overline{\mathbf{K}}{i'i'} + \overline{\mathbf{K}}{j'j'} - 2\overline{\mathbf{K}}_{i'j'})}. \end{align} The $\overline{\mathbf{K}}{i'i'}$, $\overline{\mathbf{K}}{j'j'}$ and $\overline{\mathbf{K}}_{i'j'}$ refer to the mean of relationship coefficients in the management unit $i'$, $j'$, and mean relationship coefficients between two units.
Contrast CD A contrast notation of summary CD between any pair of management units is [@laloe1996] \begin{align} \text{CD}(\mathbf{x}) &= 1 - \frac{\text{Var}(\mathbf{x}'\mathbf{u}|\mathbf{\widehat{u}})}{\text{Var}(\mathbf{x}'\mathbf{u})} \ &= 1- \lambda \frac{\mathbf{x}' \mathbf{C}^{22} \mathbf{x} }{ \mathbf{x}' \mathbf{K} \mathbf{x}}\ &= 1- \frac{\mathbf{x}' \mathbf{C}^{22} \mathbf{x} \cdot \sigma^2_e }{ \mathbf{x}' \mathbf{K} \mathbf{x} \cdot \sigma^2_u}\ &= 1- \frac{\text{PEVD(x)}}{ \mathbf{x}' \mathbf{K} \mathbf{x} \cdot \sigma^2_u}, \end{align} where $\mathbf{x}$ and $\mathbf{K}$ is a contrast vector and a relationship matrix defined above.
| The pairwise r statistic [@lewis1999] between individual $i$ and $j$ can be derived from PEV matrix to a predictive error correlation matrix as \begin{align} \text{r}_{ij} = \frac{\text{PEC}(\widehat{u_i}, \widehat{u_j})}{\sqrt{\text{PEV}(\widehat{u_i}) \cdot \text{PEV}(\widehat{u_j})}}. \end{align}
Individual connectedness The summary of individual connectedness calculates pairwise r for all pair of individuals followed by averaging all r across management units: \begin{align} \text{r}{i'j'} = \frac{1}{n{i'} \cdot n_{j'}} \cdot \sum{\frac{\text{PEC}(\widehat{u_{i'}}, \widehat{u_{j'}})}{\sqrt{\text{PEV}(\widehat{u_{i'}}) \cdot \text{PEV}(\widehat{u_{j'}})}}}. \end{align}
Group connectedness
One of the summary statistics of r is flock connectedness, which takes the average of PEV matrix by management units followed by a ratio to r. The Flock r between two management units $i'$ and $j'$ is given by [@kuehn2008]
\begin{align}
\text{r}{i'j'} &= \frac{\overline{\text{PEC}}{i'j'}}{\sqrt{\overline{\text{PEV}}{i'i'} \cdot \overline{\text{PEV}}{j'j'}}}\
&=\frac{ 1/n_{i'} \sum \text{PEC}{i' j'} 1/n{j'}}{ \sqrt{ (1/n_{i'})^2 \sum \text{PEV}{i' i'} \cdot (1/n{j'})^2 \sum \text{PEV}{j' j'} } } \
&= \frac{ \sum \text{PEC}{i' j'} }{ \sqrt{ \sum \text{PEV}{i' i'} \cdot \sum \text{PEV}{j' j'} } },
\end{align}
where $n_{i'}$ and $n_{j'}$ indcate the number of individuals in units $i'$ and $j'$, respectively.
Contrast r
We can also summarize r under a contrast format
\begin{align}
\text{r}(\mathbf{x}) &= \mathbf{x}' \mathbf{rx}.
\end{align}
The $\mathbf{r}$ is a correlation matrix with the element of $r_{ij} = \frac{\text{PEC}(\widehat{u_i}, \widehat{u_j})}{\sqrt{\text{PEV}(\widehat{u_i}) \cdot \text{PEV}(\widehat{u_j})}}$, which indicates the paiwise prdiction error correlation between individual $i$ and $j$.
VED0: Using the summary method of the group average PEVD, the VED0 [@kennedy1993considerations] statistic estimates connectedness with VE rather than PEV_Mean. The VED0 between two units $i'$ and $j'$ is defined as \begin{align} \text{VED0}{i'j'} &= \text{Var}(\widehat{b}){i'i'} + \text{Var}(\widehat{b}){j'j'} - 2\text{Cov}(\widehat{b}){i'j'}, \end{align} where the definition of $\text{Var}(\widehat{b})$, Cov$(\widehat{b})$ and $\overline{\mathbf{K}}$ are same as above.
VED1 & VED2: @holmes2017 suggested quantifying group average PEVD using PEV_Mean \begin{align} \text{PEVD}{i'j'} &= \text{PEV_Mean}{i'i'} + \text{PEV_Mean}{j'j'} -2\text{PEV_Mean}{i'j'}, \end{align} , where PEV_Mean could be directly derived from $\text{Var}(\widehat{b})$ by adding correction factors.
CDVED0 We can also use the variance of estimates management units effects to approximate CD as \begin{align} \text{CD}{i'j'} &= 1 - \frac{\text{Var}(\widehat{b}){i'i'} + \text{Var}(\widehat{b}){j'j'} - 2\text{Cov}(\widehat{b}){i'j'}}{\sigma^2_u \cdot (\overline{\mathbf{K}}{i'i'} + \overline{\mathbf{K}}{j'j'} - 2\overline{\mathbf{K}}_{i'j'})}. \end{align}
CDVED1 & CDVED2 Analogously, CD could be calculated with a correction of $\text{Var}(\widehat{b})$ (@holmes2017) as \begin{align} \text{CD}{i'j'} &= 1 - \frac{\sigma^2_e \cdot \overline{\mathbf{C^{22}}}{i'i'} + \overline{\mathbf{C^{22}}}{j'j'} -2\overline{\mathbf{C^{22}}}{i'j'}}{\sigma^2_u \cdot (\overline{\mathbf{K}}{i'i'} + \overline{\mathbf{K}}{j'j'} - 2\overline{\mathbf{K}}{i'j'})}\ &= 1 - \frac{\text{PEV_Mean}{i'i'} + \text{PEV_Mean}{j'j'} -2\text{PEV_Mean}{i'j'}}{\sigma^2_u \cdot (\overline{\mathbf{K}}{i'i'} + \overline{\mathbf{K}}{j'j'} - 2\overline{\mathbf{K}}_{i'j'})}. \end{align}
The $\text{PEV_Mean}{i'i'}$, $\text{PEV_Mean}{j'j'}$ and $\text{PEV_Mean}_{i'j'}$ are average PEV over management unit $i'$, $j'$ and average of prediction error covariance between management units $i'$ and $j'$, respectively.
CRO @mathur2002 proposed using CR approximates flock connectedness by replacing PEV matrix with VE. The CR of two management units $i'$ and $j'$ is given by \begin{align} \text{CR}{i'j'} &= \frac{\text{Cov}(\widehat{b}){i'j'} }{ \sqrt{ \text{Var}(\widehat{b}){i'i'} \cdot \text{Var}(\widehat{b}){j'j'}}}. \end{align}
CR1 & CR2 The prediction error correlation can be derived with aforementioned correction of variance of estimates of management units effects [@holmes2017] \begin{align} \text{r}{i'j'} &= \frac{\text{PEV_Mean}{i'j'}}{\sqrt{\text{PEV_Mean}{i' i'} \cdot \text{PEV_Mean}{j' j'}}}. \end{align}
install.packages("devtools") library(devtools) install_github('HaipengU/GCA') library(GCA)
The cattle dataset of GCcattle
contains two files of cattle.pheno
and cattle.W
, which include phenotype and marker information, respectively. The details can be checked with ?GCcattle
.
library(GCA) data(GCcattle) dim(cattle.pheno) # phenotype dim(cattle.W) # marker matrix
X <- model.matrix(~ -1 + factor(cattle.pheno$Unit)) # design matrix of unit effect with intercept excluded G <- computeG(cattle.W, maf = 0.05, type = 'G2') # genomic relationship matrix (VanRaden) EVD <- eigen(G) # eigendecomposition var <- varcomp(y = cattle.pheno$Phenotype, Evector = EVD$vectors, Evalue = EVD$values) # variance component var$Vu # variance of additive genetic effects var$Ve # variance of residual CD_IdAve <- gcm(Kmatrix = G, Xmatrix = X, sigma2a = var$Vu, sigma2e = var$Ve, MUScenario = as.factor(cattle.pheno$Unit), statistic = 'CD_IdAve', NumofMU = 'Pairwise') CD_IdAve
The above example showed the individual average CD, which is a function of PEV and returned a pairwise connectedness between all units. Alternative, it can return an overall connectedness which averages all pairwise CD by setting 'NumofMU' to 'Overall' as
CD_IdAve <- gcm(Kmatrix = G, Xmatrix = X, sigma2a = var$Vu, sigma2e = var$Ve, MUScenario = as.factor(cattle.pheno$Unit), statistic = 'CD_IdAve', NumofMU = 'Overall') CD_IdAve
We showed the CDVED1 in the following example, which is a function of VE.
CDVED1 <- gcm(Kmatrix = G, Xmatrix = X, sigma2a = var$Vu, sigma2e = var$Ve, MUScenario = as.factor(cattle.pheno$Unit), statistic = 'CDVED1', NumofMU = 'Pairwise') CDVED1
The above example only considered one fixed effect of unit in the model. When more than one fixed effect is included, we can account for these additional fixed effects using statistic 'CDVED2'
X2 <- model.matrix(~ -1 + factor(cattle.pheno$Unit) + factor(cattle.pheno$Sex)) CDVED2 <- gcm(Kmatrix = G, Uidx = 5, Xmatrix = X2, sigma2a = var$Vu, sigma2e = var$Ve, MUScenario = as.factor(cattle.pheno$Unit), statistic = 'CDVED2', NumofMU = 'Pairwise') CDVED2
Notice that, we added additional argument of 'Uidx' in the above example, which is a interger to indicate the last column of unit effects in X matrix. More details can be checked with ?gcm
The following section lists all available metrics in GCA package, which can be called by setting the argument of 'statistic'.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.