library(knitr) opts_chunk$set(echo = TRUE,message=FALSE,warning=FALSE,comment="##", fig.width=4,fig.height=4,dpi=150) opts_knit$set(root.dir="~/Box Sync/Endelman/Software/StageWise") load("~/Software/StageWise/data/wheat.rda")
When estimating BLUEs in Stage1, an unstructured trait covariance matrix for the residuals is included. The phenotypes for one environment are denoted $y_{ikl}$, where $i$ represents genotype, $k$ is trait, and $l$ represents one or more indices for treatments, covariates, etc.
$$ y_{ikl} = \mu + g_{ik} + \dots + \epsilon_{ikl} $$ The genotype effects $g_{ik}$ are fixed, and the residuals are multivariate normal with $var[\boldsymbol{\epsilon}] = \mathbf{I} \otimes \boldsymbol\Sigma_\epsilon$, where $\boldsymbol\Sigma_\epsilon$ is an unstructured var-cov matrix for traits.
The response variable for Stage2 is $BLUE[g_{ijk}]$, where the subscript $j$ is now included to represent environment. The Stage2 model with multivariate normal additive $\boldsymbol{a}$ and dominance $\boldsymbol{d}$ effects is
$$BLUE[g_{ijk}] = E_{jk} + a_{ik} + d_{ik} - b_k F_i + gE_{ijk} + s_{ijk}$$
where $E_{jk}$ is the fixed effect for environment, $F_i$ is the genomic inbreeding coefficient, and its regression coefficient $b_k$ represents baseline heterosis, i.e., the difference between the population at panmictic equilibrium vs. fully inbred. The other effects are multivariate normal with zero mean and $var[\boldsymbol{a}]=\mathbf{G} \otimes \boldsymbol\Sigma_a$, $var[\boldsymbol{d}]=\mathbf{D} \otimes \boldsymbol\Sigma_d$, and $var[\boldsymbol{gE}]=\mathbf{I} \otimes \boldsymbol\Sigma_{gE}$. The preceding $\boldsymbol\Sigma$ are unstructured var-cov matrices for traits. The Stage1 error term $\boldsymbol{s}$ is also multivariate normal, with var-cov matrix equal to the direct sum of the var-cov matrices of the Stage1 BLUEs for each environment.
In Vignette 1, three traits were analyzed independently: yield, maturity, and fry color. The syntax for analyzing multiple traits is very similar.
library(StageWise) pheno.file <- system.file("vignette_data", "pheno1a.csv", package = "StageWise") ans1 <- Stage1(filename=pheno.file,traits=c("total.yield","vine.maturity","fry.color"), effects <- data.frame(name=c("block","stand.count"), fixed=c(FALSE,TRUE), factor=c(TRUE,FALSE))) names(ans1)
As with the single trait analysis, Stage1
returns a data frame of BLUEs and a list of their var-cov matrices. Instead of residual diagnostic plots, however, the residual covariance matrices are returned in "resid".
As of v1.04 of the package, the Stage2
command has an additional argument "pairwise", which indicates whether all traits should be analyzed at once, or if the traits should be analyzed in pairs to build up the full covariance matrix. The pairwise approach may be needed with many traits to get convergence in ASReml-R. Let's compare the two approaches with this dataset.
geno.file <- system.file("vignette_data", "geno1.csv", package = "StageWise") geno <- read_geno(geno.file,ploidy=4,map=TRUE,dominance = TRUE) ans2 <- Stage2(data=ans1$blue, vcov=ans1$vcov, geno=geno, non.add="dom", pairwise=FALSE) ans2p <- Stage2(data=ans1$blue, vcov=ans1$vcov, geno=geno, non.add="dom", pairwise=TRUE) summary(ans2$vars) summary(ans2p$vars)
When "pairwise=FALSE", the summary
command shows the variances and proportion of variation explained (PVE) as separate tables, as well as the additive genetic correlation between traits. When "pairwise=TRUE", only the additive genetic correlation matrix is available. The correlation matrices are similar and show that later maturity is correlated with higher yield, which will need to be addressed when constructing a selection index.
The next step in the workflow is blup_prep
, which has exactly the same syntax as the single trait analysis.
prep1 <- blup_prep(ans1$blues, vcov=ans1$vcov, geno=geno, vars=ans2$vars)
For multi-trait analyses, the blup
command was designed to predict total genetic merit, not individual traits, so the user must supply index coefficients.
The gain
command allows breeders to compare the expected response for each trait under different indices. The argument "merit" specifies the contribution of each standardized trait for total genetic merit. For example, if genetic gains for fry color and yield are equally valuable, and maturity is ignored, the code is
merit.coeff=c("total.yield"=1, "vine.maturity"=0, "fry.color"=1) gain1 <- gain(prep1, merit=merit.coeff) kable(gain1$table)
In the table returned by gain
, because multi-trait BLUPs are used in the selection index, the index coefficients have the same ratio as the merit coefficients (1:0:1)--they have just been rescaled to have unit norm. The new information is the expected response for each trait, in units of intensity x genetic standard deviation ($i\sigma$). By default breeding values are used for $\sigma$, but this can be controlled with argument "gamma" (see reference manual).
For a given selection intensity, the set of all possible responses is an ellipsoid. The argument "traits" in gain
can be used to see the ellipse for exactly 2 traits, which illustrates selection tradeoffs.
gain1 <- gain(input=prep1, merit=merit.coeff, traits=c("total.yield","vine.maturity")) gain1$plot
The dashed red line shows the direction of the merit vector, while the blue line segment is the projection of the point on the ellipsoid that maximizes genetic merit.
Because our maturity trait scale ranges from 1 = early to 9 = late, the positive response implies selection for later maturity, which is undesirable. The "restricted" argument in the gain
command can be used to restrict the response for particular traits while maximizing genetic merit for other traits. The form of the argument is a data frame with two columns: "trait" and "sign". The "sign" column can have one of three symbols: "=", "<", ">", which indicate whether the response is $= 0$, $\leq 0$, or $\geq 0$, respectively. In this case, we want the maturity response to be less than or equal to zero:
gain2 <- gain(input=prep1, merit=merit.coeff, restricted=data.frame(trait="vine.maturity", sign="<")) kable(gain2$table)
Compared with the response table for the unrestricted index, the yield response decreased from $0.44i\sigma$ to $0.29i\sigma$, while the response for fry color slightly increased. Multiplying the response vector by the merit coefficients and summing generates the response for total genetic merit, which is necessarily lower for the restricted index:
sum(merit.coeff*gain1$table$response) sum(merit.coeff*gain2$table$response)
One other option in gain
is "desired", which allows the user to specify the desired gains for each trait (in units of $i\sigma$), rather than specifying their contribution to genetic merit. Either "desired" or "merit" can be specified, not both. To achieve equal gains for yield and fry color, while restricting maturity, the desired gains vector is numerically equal to the previous merit coefficient vector:
gain3 <- gain(input=prep1, desired=merit.coeff) kable(gain3$table) sum(merit.coeff*gain3$table$response)
The above result shows the response for total merit is slightly less under the desired gains index, if we continue to assume equal merit for yield and fry color. But if yield is actually more important than fry color, this solution seems desirable.
Let's use the desired gains index to compute BLUPs.
index.coeff <- gain3$table$index names(index.coeff) <- gain3$table$trait GEBV <- blup(prep1, geno, what="BV", index.coeff=index.coeff) head(GEBV)
As explained in Vignette 1, the dominance
function expresses the inbreeding depression and dominance variance estimates relative to the additive SD and variance estimates, respectively. This also works for multi-trait models once the index coefficients are specified. To be consistent with the above gain
calculation, gamma=1/3 is needed for tetraploid breeding values:
dominance(ans2$params, index.coeff=index.coeff, gamma=1/3)
One last comment: groups of unrelated traits can be analyzed independently through blup_prep
, and the outputs from this command can be combined as a list in blup
to make the predictions. This assumes zero correlation between traits in the different groups and requires the same genetic model to have been used.
The “mask” argument for blup_prep
makes it easy to investigate the potential benefit of using a correlated, secondary trait to improve genomic selection. For example, many plant breeding programs are exploring the use of spectral measurements from high-throughput phenotyping platforms to improve selection for yield. The following example is based on data from Rutkoski et al. (2016), who showed that canopy temperature (CT) during grain fill was predictive of yield in wheat. The G matrix and Stage 1 BLUEs from the drought and extreme drought environments are distributed with the package. As with the potato dataset in Vignette 1, including the Stage 1 errors in Stage 2 lowers the AIC substantially.
data(wheat) #load the wheat data head(wheat.blues) ans2a <- Stage2(data=wheat.blues, vcov=wheat.vcov, geno=wheat.geno, non.add="none") ans2b <- Stage2(data=wheat.blues, geno=wheat.geno, non.add="none") data.frame(vcov=c(TRUE,FALSE), AIC=c(ans2a$aic,ans2b$aic))
Because the wheat lines are inbred, the genetic residual option in StageWise could be used for modeling non-additive values, but the AIC was lower without it (non.add="none"). Genomic heritability was 0.45-0.50 for yield and canopy temperature, with an additive genetic correlation of -0.81.
summary(ans2a$vars)
For genomic predictions, first we will do a tenfold cross validation without using CT data for the selection candidates, which can be called marker-based selection (MBS, see Vignette 1). Since the goal is yield prediction, the index coefficients are 1 and 0 for GY and CT, respectively.
id <- unique(wheat.blues$id) N <- length(id) folds <- split(sample(id),cut(1:N,10)) MBS <- NULL for (i in 1:10) { prep <- blup_prep(wheat.blues, wheat.vcov, wheat.geno, ans2a$vars, mask=data.frame(id=folds[[i]])) pred <- blup(prep, geno=wheat.geno, what="BV", index.coeff=c(GY=1, CT=0)) MBS <- rbind(MBS, pred[pred$id %in% folds[[i]],]) }
In the above code, the "mask" argument for blup_prep
only has the variable "id", which means that all Stage 1 BLUEs for those individuals are masked. To only mask grain yield and use CT as a secondary trait for marker-assisted selection (MAS), a second variable named "trait" is used.
MAS <- NULL for (i in 1:10) { prep <- blup_prep(wheat.blues, wheat.vcov, wheat.geno, ans2a$vars, mask=data.frame(id=folds[[i]], trait="GY")) pred <- blup(prep, geno=wheat.geno, what="BV", index.coeff=c(GY=1, CT=0)) MAS <- rbind(MAS, pred[pred$id %in% folds[[i]],]) } ans <- merge(MBS,MAS,by="id") library(ggplot2) ggplot(ans,aes(x=r2.x, y=r2.y)) + geom_hex() + coord_fixed(ratio=1) + geom_line(data=data.frame(x=c(0.2,0.8),y=c(0.2,0.8)),mapping=aes(x=x,y=y),linetype=2) + ggtitle("Reliability") + xlab("MBS") + ylab("MAS")
The above figure shows that using CT increased the reliability of genomic prediction, by 0.2 on average.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.