knitr::opts_chunk$set(echo=TRUE, warning=FALSE, message=FALSE)

Introduction

Genetic connectedness quantities the extent of linkage between individuals across management units. The concept of genetic connectedness can be extended to measure a connectedness level between training and testing sets in the whole-genome prediction area. However, there is no user-friendly R package that offers a comprehensive list of connectedness metrics. In this project, we aim to develop an R package, which utilizes either pedigree or genomic data to measure the connectedness between individuals across units.

This package implements three connectedness metrics which are functions of prediction error variance (PEV) matrix, including prediction error variance of difference (PEVD), coefficient of determination (CD) and prediction error correlation (r), coupled with three summary methods for each statistic. For example, the PEVD across units is summarized as 1) average PEV of all pairwise differences between individuals across units; 2) average PEV within and across units; 3) use a contrast vector. Analogous summary methods will be also applied to CD and r statistics. Three additional metrics approximating connectedness using variance of estimates of unit effects (VE) are included, such as variance of estimates of units effects difference (VED), coefficient of determination of VED (CDVED), and connectedness rating (CR). Within each metric, three different methods are names according to the number of corrected factors (e.g., 0, 1 and 2) for fixed effects, such as VED0, VED1 and VED2. Similar corrected function is also applied for metrics of CDVED and CR.

This R package is hosting on a GitHub page accompanied with a detailed vignette document. The core functions of the R code are rewritten with C++ to improve computational speed by taking the advantage of the Rcpp package. We expect this R package will provide a comprehensive tool for genetic connectedness analysis and whole genome prediction.

Core functions of connectedness metrics

Prediction Error Variance

A PEV matrix typically serves as the kernel of connectedness measures, which can be obtained from Henderson's mixed model equations (MME) [@Henderson1984]. Assume a standard linear mixed model $\mathbf{y} = \mathbf{Xb} + \mathbf{Zu} + \boldsymbol{\epsilon}$, where $\mathbf{y}$, $\mathbf{b}$, $\mathbf{u}$ and $\boldsymbol{\epsilon}$ refer to a vector of phenotypes, effects of management units, random additive genetic effects and residuals, respectively. The $\mathbf{X}$ and $\mathbf{Z}$ are incidence matrices which associate management units effects and individuals to phenotypes, respectively. The variance-covariance structure of this mixed model follows \begin{align} \begin{pmatrix} \mathbf{y} \ \mathbf{u} \ \mathbf{\boldsymbol{\epsilon}} \end{pmatrix} &\sim N \left [ \begin{pmatrix} \mathbf{Xb}\ 0 \ 0 \end{pmatrix}, \begin{pmatrix} \mathbf{ZK}\sigma^2_u \mathbf{Z}' + \mathbf{I}\sigma^2_{\epsilon} & \mathbf{ZK}\sigma^2_u & \mathbf{I}\sigma^2_{\epsilon} \ \mathbf{KZ'}\sigma^2_u & \mathbf{K}\sigma^2_u & 0 \ \mathbf{I}\sigma^2_{\epsilon} & 0 & \mathbf{I}\sigma^2_{\epsilon} \end{pmatrix} \right ] . \end{align} where $\sigma^2_u$ and $\sigma^2_e$ refer to additive genetic variance and residual variance, respectively. The $\mathbf{K}$ is a positive (semi)definite relationship matrix. Following this, the mixed model equation [@Henderson1984] can be written as \begin{align} \begin{bmatrix} \mathbf{X'X} & \mathbf{X'Z} \ \mathbf{Z'X} & \mathbf{Z'Z + K^{-1}\lambda} \end{bmatrix} \begin{bmatrix} \mathbf{\widehat{b}}\ \mathbf{\widehat{u}} \end{bmatrix} \mathbf{=} \begin{bmatrix} \mathbf{X'y} \ \mathbf{Z'y} \end{bmatrix} \label{MMEeq} \end{align} where $\lambda$ is a ratio of variance components which equals to $\frac{\sigma^2_{\epsilon}}{\sigma^2_u}$. The inverse of the MME coefficient matrix derived from this model is \begin{align} \mathbf{C}^{-1} &= \begin{bmatrix} \mathbf{X'X} & \mathbf{X'Z} \ \mathbf{Z'X} & \mathbf{Z'Z} + \mathbf{K}^{-1} \lambda
\end{bmatrix}^{-1} \ &= \begin{bmatrix} \mathbf{C}^{11} & \mathbf{C}^{12} \ \mathbf{C}^{21} & \mathbf{C}^{22} \end{bmatrix}. \end{align
} With this, the PEV of $u$ is defined as [@Henderson1984] \begin{align} \text{PEV}(u) &= \text{Var}(\widehat{u} - u) \ &= \text{Var}(u | \widehat{u}) \ &= (\mathbf{Z'MZ} + \mathbf{K}^{-1} \lambda)^{-1} \sigma^2_{\epsilon} \ &= \mathbf{C}^{22} \sigma^2_{\epsilon}, \end{align} where $\mathbf{M} = \mathbf{I} - \mathbf{X}(\mathbf{X}' \mathbf{X})^{-}\mathbf{X}'$ is the absorption (projection) matrix for fixed effects.

Variance of estimates of management units effects

The derivation of PEV oftentimes results in a heavy computational demanding for the high-dimensional dataset. @kennedy1993considerations proposed a direct approximation of mean PEV over unit (PEV_Mean) based on VE, that is

\begin{align} \text{Var}(\widehat{b}) = [\mathbf{X}' \mathbf{X} - \mathbf{X}' \mathbf{Z}(\mathbf{Z'Z} + \mathbf{K}^{-1} \lambda)^{-1}\mathbf{Z}' \mathbf{X}]^{-1}\sigma^2_{\epsilon}. \end{align}

However, @holmes2017 pointed out this equation is not an exact approximation of PEV_Mean, and an according correction is suggested to achieve the equality. When unit effect is the only fixed effect including in the model, an approximate mean of PEV over unit is \begin{equation} \text{PEV_Mean} = \text{Var}(\widehat{b}) - \sigma^2_{\epsilon} (\mathbf{X}' \mathbf{X})^{-1}, \end{equation} where $\sigma^2_{\epsilon} (\mathbf{X}' \mathbf{X})^{-1}$ corrects the number of records within management units. An additional correction was added to derive the mean of PEV over unit which accounts for the additional fixed effects excepts unit effect in the model as [@holmes2017] \begin{align} \text{PEV_Mean} &= \text{Var}(\widehat{b_1}) - \sigma^2_{\epsilon} (\mathbf{X_1}' \mathbf{X_1})^{-1} \ & + (\mathbf{X_1}'\mathbf{X_1})^{-1}\mathbf{X_1}'\mathbf{X_2}\text{Var}(\widehat{b_2})\mathbf{X_2}'\mathbf{X_1}(\mathbf{X_1}'\mathbf{X_1})^{-1} \ & + (\mathbf{X_1}'\mathbf{X_1})^{-1}\mathbf{X_1}'\mathbf{X_2}\text{Cov}(\widehat{b_2}, \widehat{b_1}) \ & + \text{Cov}(\widehat{b_1}, \widehat{b_2})\mathbf{X_2}'\mathbf{X_1}(\mathbf{X_1}'\mathbf{X_1})^{-1} \end{align} where the $\mathbf{X_1}$ and $\mathbf{X_2}$ represent incidence matrix for management units and other fixed effects, respectively. The $\widehat{b_1}$ refers to the estimates of management units effects, and $\widehat{b_2}$ indicates the estimates of other fixed effects.

Connectedness Metrics

In general, connectedness metrics can be grouped into two sets, which are the function of PEV and VE, respectively.

Function of PEV

Prediction Error Variance of Difference

The PEVD [@kennedy1993considerations] measures the prediction error variance difference of breeding values between individuals from different management units. The PEVD between two individuals of $i$ and $j$ can be expressed as \begin{align} \text{PEVD}(\widehat{u_i} - \widehat{u_j}) &= [ \text{PEV}(\widehat{u_i}) + \text{PEV}(\widehat{u_j}) - 2 \text{PEC}(\widehat{u_i}, \widehat{u_j})] \ &= (\mathbf{C}^{22}{ii} - \mathbf{C}^{22}{ij} - \mathbf{C}^{22}{ji} + \mathbf{C}^{22}{jj}) \sigma^2_{\epsilon} \ &= (\mathbf{C}^{22}{ii} + \mathbf{C}^{22}{jj} - 2\mathbf{C}^{22}{ij}) \sigma^2{\epsilon} , \end{align} where PEC$_{ij}$ is the off-diagonal element of the PEV matrix which indicates the prediction error covariance or covariance between errors of genetic values.

Coefficient of Determination

| The coefficient of determination (CD) measures the precision of the estimates breeding values. The pairwise CD between individual $i$ and $j$ is defined as [@laloe1996] \begin{align} \text{CD}{ij} &= \frac{\text{Var}(\mathbf{u}) - \text{Var}(\mathbf{u}|\mathbf{\widehat{u}})}{\text{Var}(\mathbf{u})} \ &= 1 - \frac{\text{Var}(\mathbf{u}|\mathbf{\widehat{u}})}{\text{Var}(\mathbf{u})} \ &= 1 - \lambda \frac{\mathbf{C}^{22}{ii} + \mathbf{C}^{22}{jj} - 2\mathbf{C}^{22}{ij} }{\mathbf{K}{ii} + \mathbf{K}{jj} - 2\mathbf{K}_{ij}}, \end{align} The $ii$ and $jj$ indicate the diagonal elements of the $\mathbf{K}$ matrix for $ith$ and $jth$ individuals, accordingly. The $ij$ refers to the off-diagonal elements of the K matrix, and $\lambda$ is variance ration of $\frac{\sigma^2_e}{\sigma^2_a}$.

Prediction Error Correlation

| The pairwise r statistic [@lewis1999] between individual $i$ and $j$ can be derived from PEV matrix to a predictive error correlation matrix as \begin{align} \text{r}_{ij} = \frac{\text{PEC}(\widehat{u_i}, \widehat{u_j})}{\sqrt{\text{PEV}(\widehat{u_i}) \cdot \text{PEV}(\widehat{u_j})}}. \end{align}

Function of VE

Variance of estimates of unit effects differences

Coefficient of Determination of VED

The $\text{PEV_Mean}{i'i'}$, $\text{PEV_Mean}{j'j'}$ and $\text{PEV_Mean}_{i'j'}$ are average PEV over management unit $i'$, $j'$ and average of prediction error covariance between management units $i'$ and $j'$, respectively.

Connectedness Rating (CR)

Application of GCA package

Install GCA from Github

install.packages("devtools")
library(devtools)
install_github('HaipengU/GCA')
library(GCA)

Load example cattle dataset in GCA package.

The cattle dataset of GCcattle contains two files of cattle.pheno and cattle.W, which include phenotype and marker information, respectively. The details can be checked with ?GCcattle.

library(GCA)
data(GCcattle)
dim(cattle.pheno) # phenotype
dim(cattle.W) # marker matrix

Example of connectedness across units.

Individual average CD

X <- model.matrix(~ -1 + factor(cattle.pheno$Unit)) # design matrix of unit effect with intercept excluded
G <- computeG(cattle.W, maf = 0.05, type = 'G2') # genomic relationship matrix (VanRaden)
EVD <- eigen(G) # eigendecomposition 
var <- varcomp(y = cattle.pheno$Phenotype, Evector = EVD$vectors, Evalue = EVD$values) # variance component
var$Vu # variance of additive genetic effects
var$Ve # variance of residual
CD_IdAve <- gcm(Kmatrix = G, Xmatrix = X, sigma2a = var$Vu, sigma2e = var$Ve, 
                MUScenario = as.factor(cattle.pheno$Unit), statistic = 'CD_IdAve', NumofMU = 'Pairwise')
CD_IdAve

The above example showed the individual average CD, which is a function of PEV and returned a pairwise connectedness between all units. Alternative, it can return an overall connectedness which averages all pairwise CD by setting 'NumofMU' to 'Overall' as

CD_IdAve <- gcm(Kmatrix = G, Xmatrix = X, sigma2a = var$Vu, sigma2e = var$Ve, 
                MUScenario = as.factor(cattle.pheno$Unit), statistic = 'CD_IdAve', NumofMU = 'Overall')
CD_IdAve

Coefficient of determination of VED

We showed the CDVED1 in the following example, which is a function of VE.

CDVED1 <- gcm(Kmatrix = G, Xmatrix = X, sigma2a = var$Vu, sigma2e = var$Ve, 
                MUScenario = as.factor(cattle.pheno$Unit), statistic = 'CDVED1', NumofMU = 'Pairwise')
CDVED1

The above example only considered one fixed effect of unit in the model. When more than one fixed effect is included, we can account for these additional fixed effects using statistic 'CDVED2'

X2 <- model.matrix(~ -1 + factor(cattle.pheno$Unit)
                        + factor(cattle.pheno$Sex))
CDVED2 <- gcm(Kmatrix = G, Uidx = 5, Xmatrix = X2, sigma2a = var$Vu, sigma2e = var$Ve,
              MUScenario = as.factor(cattle.pheno$Unit), statistic = 'CDVED2', NumofMU = 'Pairwise')
CDVED2

Notice that, we added additional argument of 'Uidx' in the above example, which is a interger to indicate the last column of unit effects in X matrix. More details can be checked with ?gcm

Available connectedness metrics in GCA package.

The following section lists all available metrics in GCA package, which can be called by setting the argument of 'statistic'.

Authors

References



HaipengU/GCA_r documentation built on July 3, 2019, 11:30 a.m.