In INRA/rgs3: R package wrapping the GS3 program for genomic selection

library(knitr)
knitr::opts_chunk$set(fig.align="center")

Preamble

This document is an R vignette available under the CC BY-SA 4.0 license. It is part of the R package rgs3, a free software available under the GPL version 3 or later, with copyright from the Institut National de la Recherche Agronomique (INRA).

The rgs3 package can be downloaded from GitHub. To install it, we can follow the indications in the README.md file. For the rgs3 package to work successfully, the GS3 program should be already installed (some binaries are available from GitHub).

Once the installation is completed, we can load the package into R:

if("package:rgs3" %in% search())
  detach("package:rgs3", unload=TRUE)

suppressPackageStartupMessages(library(rgs3))
packageVersion("rgs3")

This R chunk is used to assess how much time it takes to execute the R code in this document until the end:

t0 <- proc.time()

Model

To know exactly what the GS3 program can do, it is strongly advised to also read the GS3 manual (the latest version should be on GitHub) as this introduction doesn't show all possible ways to use it.

Notations:

$N$: number of trait measurements (observations)
$Q$: number of generic covariates or cross-classified factors (modeled with fixed effects)
$I$: number of individuals
$P$: number of SNPs
$J$: number of permanent environmental covariates (modeled with random effects)
$\boldsymbol{y}$: $N \times 1$ vector of trait measurements
$X$: $N \times Q$ design matrix of covariates or cross-classified factors
$\boldsymbol{b}$: $Q \times 1$ vector of fixed effects
$Z$: $N \times P$ design matrix of SNP genotypes coded additively
$\boldsymbol{a}$: $P \times 1$ vector of random additive marker locus effects (SNP-based)
$W$: $N \times P$ design matrix of SNP genotypes coded for the dominance
$\boldsymbol{d}$: $P \times 1$ vector of random dominant marker locus effects (SNP-based)
$T$: $N \times I$ design matrix of individuals
$\boldsymbol{g}$: $I \times 1$ vector of random polygenic infinitesimal effects (pedigree-based), such as $\boldsymbol{g} \sim \mathcal{N}_I(\boldsymbol{0}, \sigma_g^2 \, A)$ where $A$ is the $I \times I$ matrix of additive genetic relationships (also called the "numerator relationship matrix")
$S$: $N \times J$ design matrix of permanent environmental covariates
$\boldsymbol{p}$: $J \times 1$ vector of random permanent environmental effects
$\boldsymbol{e}$: $N \times 1$ vector of errors
$\sigma_a^2, \sigma_d^2, \sigma_g^2, \sigma_p^2, \sigma_e^2$: variance components of each random effect
$R$: $N \times N$ variance-covariance matrix of the errors where $R_{n,n} = \frac{\sigma_e^2}{w_n}$ with $w_n$ the weight of the $n$-th observation

Likelihood:

$\boldsymbol{y} = X \boldsymbol{b} + Z \boldsymbol{a} + W \boldsymbol{d} + T \boldsymbol{g} + S \boldsymbol{p} + \boldsymbol{e}$ where $\boldsymbol{e} \sim \mathcal{N}_N(\boldsymbol{0}, \, R)$

Priors: see the GS3 manual

Estimate of the "generalized" genomic breeding value of individual $i$:

$EBV_i = \hat{g}_i + \boldsymbol{z}_i \hat{\boldsymbol{a}} + \boldsymbol{w}_i \hat{\boldsymbol{d}}$

Simulation

A data set was simulated to illustrate how the rgs3 package works (caution, the notations for the simulation model are different than for the inference model above):

SNP genotypes were simulated for $I$ individuals via the sequential coalescent with recombination model from Staab et al (2015);
$P$ SNPs were encoded as allele dose (in ${0,1,2}$), filtered out if they had a minor allele frequency below 1%, and gathered into a matrix used to estimate the $I \times I$ additive genomic relationships matrix $G_{\text{VR1}}$ using the first estimator from VanRaden (2008);
finally, the following model was used to simulate $N$ phenotypes over $Q$ years, with $N = Q \times I$ (each genotype is phenotyped each year):

$\boldsymbol{y} = X \boldsymbol{\beta} + Z \boldsymbol{u} + \boldsymbol{\epsilon}$ where $\boldsymbol{u} \sim \mathcal{N}I(\boldsymbol{0}, \sigma_u^2 \, G{\text{VR1}})$ and $\boldsymbol{\epsilon} \sim \mathcal{N}N(\boldsymbol{0}, \sigma\epsilon^2 \, \text{Id})$ with $\sigma_u^2 = 15$ and $\sigma_\epsilon^2 = 5$, where the narrow-sense heritability here is $h^2 = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_\epsilon^2}$.

Useful approximation (see Gianola et al, 2009): $\sigma_u^2 \approx \sigma_a^2 \times 2 \sum_p f_p (1 - f_p)$, where $f_p$ is the allele frequency of the $p$-th SNP.

Moreover, with such an experimental design, the broad-sense heritability (especially used in plant breeding, see Piepho and Mohring, 2007) here is: $H^2 = \frac{\sigma_u^2}{\sigma_u^2 + \frac{\sigma_\epsilon^2}{Q}}$.

Load the data into R

Phenotypes

Load the phenotypes:

phenos.file <- system.file("extdata", "phenos_df.txt.gz", package="rgs3")
tools::md5sum(path.expand(phenos.file))
phenos <- read.table(phenos.file, header=TRUE)
phenos$year <- as.factor(phenos$year)
str(phenos)
head(phenos)
phenos[phenos$geno == levels(phenos$geno)[1],]
(I <- nlevels(phenos$geno))
(Q <- nlevels(phenos$year))
(N <- nrow(phenos))
sigma.u2 <- 15
sigma.epsilon2 <- 5
(h2 <- sigma.u2 / (sigma.u2 + sigma.epsilon2))
(H2 <- sigma.u2 / (sigma.u2 + (sigma.epsilon2 / Q)))

As can be seen above, the phenotype file contains r ncol(phenos) columns. The first column corresponds to a covariate indicating the year at which the trait was measured. This covariate has $Q=$ r Q levels. The second column corresponds to the $I=$ r I individual identifiers. The third column corresponds to the $N=$ r N trait measurements.

A quick look at the data is always helpful:

par(mar=c(4,4,2,1))
boxplot(phenos$response1 ~ phenos$year, notch=TRUE, horizontal=TRUE, las=1,
        main="Phenotypes", xlab="response", ylab="year")

We can see at least a "year" effect, as well as substantial within-year variation likely coming from both genetic and environmental sources. The goal will be to quantify their respective contributions.

Genotypes

Load the genotypes (only SNPs with a minor allele frequency above 0.01 are available):

genos.file <- system.file("extdata", "genos_mat.txt.gz", package="rgs3")
tools::md5sum(path.expand(genos.file))
genos <- as.matrix(read.table(genos.file))
dim(genos)
genos[1:3,1:6]
(P <- ncol(genos))

As can be seen above, the genotype file contains the matrix of genotypes for each individual at the $P=$ r P SNPs, encoded as allele dose (0/1/2). Individuals are in rows and SNPs in columns.

We can have a quick look at the allele frequencies:

afs <- colMeans(genos, na.rm=TRUE) / 2
summary(afs) # mean should be 0.5 as reference alleles are chosen at random
par(mar=c(2,4,2,1))
hist(afs, breaks="FD", main="Allele frequencies",
     xlab="", ylab="Number of SNPs",
     col="grey", border="white", xlim=c(0, 1), las=1)

We can also have a quick look at minor allele frequencies:

mafs <- apply(genos, 2, function(x){
  x <- x[complete.cases(x)]
  tmp <- sum(x) / (2 * length(x))
  ifelse(tmp <= 0.5, tmp, 1 - tmp)
})
par(mar=c(2,4,2,1))
hist(mafs, breaks="FD", main="Minor allele frequencies",
     xlab="", ylab="number of SNPs",
     col="grey", border="white", xlim=c(0, 0.5), las=1)

We can also have a quick look at the additive genomic relationships:

M <- genos - 1
Pmat <- matrix(rep(1,nrow(genos))) %*% (2 * (afs - 0.5))
Z <- M - Pmat
G.VR1 <- tcrossprod(Z, Z) / (2 * sum(afs * (1 - afs)))
G.VR1[1:3,1:3]
summary(diag(G.VR1)) # under HWE, average should be 1
summary(G.VR1[upper.tri(G.VR1)]) # under HWE, average should be 0
par(mar=c(1,1,2,1))
image(t(G.VR1)[,nrow(G.VR1):1], axes=FALSE, main="G.VR1")

Quick check with another package

Load the true genotypic values:

genovals.file <- system.file("extdata", "genovals.txt.gz", package="rgs3")
tools::md5sum(path.expand(genovals.file))
genovals <- as.matrix(read.table(genovals.file, row.names=1))[,1]
head(genovals)
summary(genovals)

Fit the "rrBLUP/GBLUP" model via ReML (see Habier et al, 2007):

if(require(rrBLUP)){
  st <- system.time(
      fit.rrBLUP.g <- mixed.solve(y=phenos$response1,
                                  Z=model.matrix(~ phenos$geno - 1),
                                  K=G.VR1,
                                  X=model.matrix(~ phenos$year),
                                  method="REML", SE=TRUE, return.Hinv=TRUE))
  print(st)
  message(paste0("estimate of sigma.epsilon2 = ", fit.rrBLUP.g$Ve))
  message(paste0("estimate of sigma.u2 = ", fit.rrBLUP.g$Vu))
  message(paste0("cor(true.u, pred.u) = ", cor(genovals,
                                               fit.rrBLUP.g$u[names(genovals)])))
  print(accuracy.rrBLUP <- lm(fit.rrBLUP.g$u[names(genovals)] ~ genovals))
  plot(x=genovals, y=fit.rrBLUP.g$u[names(genovals)], xlab="True breeding values",
       ylab="Predicted breeding values", main="Check with rrBLUP", las=1, asp=1)
  abline(v=0, h=0, a=0, b=1, lty=2); abline(accuracy.rrBLUP, col="red")
  st <- system.time(
      fit.rrBLUP.mrk <- mixed.solve(y=phenos$response1,
                                    Z=model.matrix(~ phenos$geno - 1) %*% genos,
                                    X=model.matrix(~ phenos$year),
                                    method="REML", SE=TRUE, return.Hinv=TRUE))
  print(st)
  message(paste0("estimate of sigma.epsilon2 = ", fit.rrBLUP.mrk$Ve))
  message(paste0("estimate of sigma.a2 = ", fit.rrBLUP.mrk$Vu))
  sigma.u2.hat <- fit.rrBLUP.mrk$Vu * 2 * sum(afs * (1 - afs))
  message(paste0("estimate of sigma.u2 = ", sigma.u2.hat))
  u.hat <- genos[names(genovals),] %*% fit.rrBLUP.mrk$u[colnames(genos)]
  message(paste0("cor(true.u, pred.u) = ", cor(genovals,
                                               u.hat)))
  y.hat <- model.matrix(~ phenos$geno - 1) %*% u.hat
  message(paste0("cor(y, y-hat) = ", cor(phenos$response1,
                                         y.hat)))
  h2.hat <- sigma.u2.hat / (sigma.u2.hat + fit.rrBLUP.mrk$Ve)
  message(paste0("estimate of h2 = ", h2.hat))
  message(paste0("cor(y, y-hat) / h = ", cor(phenos$response1,
                                             y.hat) /
                                         sqrt(h2.hat)))
}

The estimates are quite close from the "true" values (i.e. those used to simulate the phenotypes, see section "Simulation" above).

Write the data into files formatted for GS3

Let us choose an identifier for our analysis, which will be used for all files, so that it will be easy to remove them at the end:

task.id <- "task-execGS3_intro-rgs3"

Encode the individual identifiers as consecutive numbers from $1$ to $I$:

inds <- setNames(object=1:nlevels(phenos$geno),
                 nm=levels(phenos$geno))
head(inds)

Write the phenotype file in the GS3 format:

phenos.file.gs3 <- paste0(getwd(), "/", task.id, "_phenos.txt")
writeDataForGs3(x=phenos, file=phenos.file.gs3, inds=inds, col.id=2,
                col.traits=3, binary.traits=FALSE)

Write the genotype file in the GS3 format:

genos.file.gs3 <- paste0(getwd(), "/", task.id, "_genos.txt")
writeGenosForGs3(x=genos, file=genos.file.gs3, inds=inds)

Estimate parameters with GS3

The data set simulated as described above will now be analyzed with GS3 via the following model:

$\boldsymbol{y} = X \boldsymbol{b} + Z \boldsymbol{a} + \boldsymbol{e}$ where $\boldsymbol{a} \sim \mathcal{N}_P(\boldsymbol{0}, \sigma_a^2 \, \text{Id})$ and $\boldsymbol{e} \sim \mathcal{N}_N(\boldsymbol{0}, \sigma_e^2 \, \text{Id})$

Write the configuration file for GS3 (you can also use the helper function getDefaultConfig):

(ptl <- data.frame(position=c(which(colnames(phenos) == "year"),
                              ncol(phenos) + 1),
                   type=c("cross",
                          "add_SNP"),
                   nlevels=c(Q,
                             0),
                   stringsAsFactors=FALSE))
config <- list(data.file=phenos.file.gs3,
               genos.file=genos.file.gs3,
               ped.file="",
               num.loci=P,
               method="VCE",
               simul="F",
               ## niter=2000, burnin=200, thin=2,     # debug
               niter=1*10^4, burnin=1*10^3, thin=10,
               conv.crit="1d-8",
               correct=1000,
               vcs.file=paste0(getwd(), "/", task.id, "var.txt"),
               sol.file=paste0(getwd(), "/", task.id, "sol.txt"),
               twc=c(which(colnames(phenos) == "response1"), 0),
               num.eff=nrow(ptl),
               ptl=ptl,
               vc=data.frame(var=c("vara", "vard", "varg", "varp", "vare"),
                             exp=c("2.52d-04","1.75d-06","3.56","2.15","0.19"),
                             df=rep("-2", 5),
                             stringsAsFactors=FALSE),
               rec.id=which(colnames(phenos) == "geno"),
               cont="F",
               mod=rep("T", nrow(ptl)),
               ap=c(1,10),
               dp=c(1,1),
               use.mix="F",
               blasso=FALSE)
isValidConfig(config=config)
config.file <- writeConfigForGs3(config=config,
                                 task.id=task.id)

Here is how the resulting configuration file looks like:

readLines(config.file)

Execute GS3:

system.time(
    stdouterr.file <- execGs3(config.file, task.id))

The GS3 program outputs some messages (too long to display here):

stdouterr <- readLines(stdouterr.file)
head(stdouterr)
length(stdouterr)

Assess estimated variances

If the coda package is installed, we can use the vcs2mcmc function from rgs3 to read the output file from GS3 into an mcmc.list object. But we won't use this function in this vignette as coda may not be available.

Load the variance components' samples:

vcs <- read.table(config$vcs.file, header=TRUE, check.names=FALSE)
str(vcs)

Assess convergence visually, at least for some parameters:

par(mfrow=c(1,2), mar=c(4,4,2,1))
for(vc in c("vara","vare")){
  plot(vcs[,vc], las=1,
       xlab="iterations",
       ylab="samples",
       main=vc)
  abline(h=mean(vcs[,vc]), col="red")
  abline(v=101, lty=2)
  legend(ifelse(vc == "vara", "bottomright", "topright"),
         legend="posterior mean", col="red", lty=1, bty="n")
}

It looks like the chain converged, even though, to be more thorough, one should only declare convergence after checking it for all parameters.

Moreover, as indicated by the vertical, dotted line, we can discard more samples at the beginning:

idx.tokeep <- 101:nrow(vcs)

Look at posterior mean and variance of parameters of interest:

message(paste0("posterior mean of sigma.e2 = ",
               mean(vcs[idx.tokeep, "vare"])))
message(paste0("posterior sd of sigma.e2 = ",
               sd(vcs[idx.tokeep, "vare"])))
message(paste0("posterior mean of sigma.a2 = ",
               mean(vcs[idx.tokeep, "vara"])))
message(paste0("posterior sd of sigma.a2 = ",
               sd(vcs[idx.tokeep, "vara"])))
message(paste0("posterior mean of '2varapqpi' = ",
               mean(vcs[idx.tokeep, "2varapqpi"])))
message(paste0("posterior sd of '2varapqpi' = ",
               sd(vcs[idx.tokeep, "2varapqpi"])))

The estimates are quite close from the "true" values (i.e. those used to simulate the phenotypes, see section "Simulation" above).

Look at the heritability posteriors:

vcs$varu <- vcs$vara * 2 * sum(afs * (1 - afs))
vcs$h2 <- vcs$varu / (vcs$varu + vcs$vare)
vcs$H2 <- vcs$varu / (vcs$varu + (vcs$vare / Q))
message(paste0("posterior mean of 'varu' = ",
               mean(vcs[idx.tokeep, "varu"])))
message(paste0("posterior sd of 'varu' = ",
               sd(vcs[idx.tokeep, "varu"])))
h2.hat <- mean(vcs[idx.tokeep, "h2"])
message(paste0("posterior mean of 'h2' = ",
               h2.hat))
message(paste0("posterior sd of 'varu' = ",
               sd(vcs[idx.tokeep, "h2"])))
message(paste0("posterior mean of 'H2' = ",
               mean(vcs[idx.tokeep, "H2"])))
message(paste0("posterior sd of 'H2' = ",
               sd(vcs[idx.tokeep, "H2"])))

Assess estimated effects

The fixed and various SNP effects are found in the file specified via the variable sol.file (see above):

sols <- read.table(file=config$sol.file, header=TRUE)
str(sols)
table(sols$effect)

The first r Q rows correspond to the r Q fixed effects.
The remaining r nrow(sols)-Q rows correspond to the additive effects of the r P SNP genotypes.

summary(sols$solution[sols$effect == 2])
lim <- max(abs(sols$solution[sols$effect == 2]))
par(mar=c(2,4,2,1))
hist(sols$solution[sols$effect == 2], breaks="FD", xlim=c(-lim, lim),
     main="Additive effects of the SNPs",
     xlab="", ylab="number of SNPs",
     col="grey", border="white", las=1)
abline(v=0, lty=2)

Assess predicted genotypic values

The various estimated genotypic values are saved in the file <config.file>_EBVs:

ebvs <- read.table(file=paste0(config.file, "_EBVs"), header=TRUE)
ebvs <- ebvs[! duplicated(ebvs$id),]
str(ebvs)
rownames(ebvs) <- names(inds)
head(ebvs)

The first column, id, corresponds to the identifiers of the individuals defined in the vector inds above.
The second column, g_aSNP, corresponds to the sum of the posterior means of the additive effects of the SNP genotypes, $\sum_p \hat{a}_p = \boldsymbol{z}_i \hat{\boldsymbol{a}}$.
The third column, g_dSNP, corresponds to the sum of the posterior means of the dominant effects of the SNP genotypes, $\sum_p \hat{d}_p = \boldsymbol{w}_i \hat{\boldsymbol{d}}$.
The fourth column, poly_anim, corresponds to the posterior mean of the polygenic infinitesimal effect (pedigree-based), $\hat{g}_i$.
The last column, g_overall, corresponds to the estimated "generalized" genomic breeding value, that is, the sum of the three previous columns.

We can have a quick look at the distribution of these gEBVs:

summary(ebvs$g_overall)
lim <- max(abs(ebvs$g_overall))
par(mar=c(2,4,2,1))
hist(ebvs$g_overall, breaks="FD", xlim=c(-lim, lim),
     main="gEBVs (with uncentered SNP genotypes)",
     xlab="", ylab="number of individuals",
     col="grey", border="white", las=1)
abline(v=0, lty=2)

The bias is explained by the fact that GS3 returns (biological) "genotypic" values instead of (statistical) "breeding" values (EBVs), despite of the output file name (see Vitezica et al (2013) for a thorough explanation):

"genotypic" values: $u = Z a$ where $\forall i,j, \; Z_{ij} \in {-1,0,1}$ and $v = W d$ where $\forall i,j, \; W_{ij} \in {0,1}$;
"breeding" values: $u_{ebv} = Z^\star \alpha$ where $\alpha = a + (1 - f_p) d$ is the additive substitution effect and $\forall i,j, \; Z_{ij}^\star \in {-2 f_p, 1 - f_p, 2 - 2 f_p}$, i.e. $Z$ is centered.

The difference between the true breeding values and the raw output from GS3 hence comes from the way SNP genotypes are encoded in $Z$ versus $Z^\star$:

Z <- genos - 1
a.hat <- sols$solution[sols$effect == 2]
u.hat <- Z %*% a.hat
u.hat <- setNames(as.vector(u.hat), rownames(u.hat))
all.equal(u.hat, ebvs$g_aSNP, tolerance=10^(-6), check.attributes=FALSE)
alpha.hat <- sols$solution[sols$effect == 2] # all dominance effects are null in this analysis
Z.star <- Z - matrix(rep(1,nrow(genos))) %*% (2 * (afs - 0.5))
u.ebv <- Z.star %*% alpha.hat
u.ebv <- setNames(as.vector(u.ebv), rownames(u.ebv))
summary(u.ebv)

Check with the genotypic values used to simulate the phenotypes:

cor(x=genovals, ebvs[names(genovals), "g_overall"], method="pearson")
cor(x=genovals, u.ebv[names(genovals)], method="pearson")
(accuracy.GS3 <- lm(u.ebv[names(genovals)] ~ genovals))
plot(x=genovals, y=u.ebv[names(genovals)], asp=1, las=1,
     xlab="True genotypic values", ylab="Predicted genotypic values",
     main="Check with GS3")
abline(h=0, v=0, a=0, b=1, lty=2); abline(accuracy.GS3, col="red")

With real data sets, the genotypic values are usually unknown. A common practice hence is to look at the correlation between individual phenotypes and those predicted based on the markers, divided by the square root of the estimated narrow-sense heritability (see Daetwyler et al, 2013):

y.hat <- model.matrix(~ phenos$geno - 1) %*%
  genos %*% sols$solution[sols$effect == 2]
cor(x=phenos$response1, y=y.hat)
cor(x=phenos$response1, y=y.hat) / sqrt(h2.hat)

Remove output files

cleanGs3(config, config.file, task.id)
for(f in c(phenos.file.gs3, genos.file.gs3))
  if(file.exists(f))
    file.remove(f)

Acknowledgments

Andrés Legarra

Appendix

t1 <- proc.time(); t1 - t0
print(sessionInfo(), locale=FALSE)

INRA/rgs3 documentation built on May 20, 2019, 5:24 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

INRA/rgs3
R package wrapping the GS3 program for genomic selection

In INRA/rgs3: R package wrapping the GS3 program for genomic selection

Preamble

Model

Simulation

Load the data into R

Phenotypes

Genotypes

Quick check with another package

Write the data into files formatted for GS3

Estimate parameters with GS3

Assess estimated variances

Assess estimated effects

Assess predicted genotypic values

Remove output files

Acknowledgments

Appendix

R Package Documentation

Browse R Packages

We want your feedback!

INRA/rgs3 R package wrapping the GS3 program for genomic selection

In INRA/rgs3: R package wrapping the GS3 program for genomic selection

Preamble

Model

Simulation

Load the data into R

Phenotypes

Genotypes

Quick check with another package

Write the data into files formatted for GS3

Estimate parameters with GS3

Assess estimated variances

Assess estimated effects

Assess predicted genotypic values

Remove output files

Acknowledgments

Appendix

R Package Documentation

Browse R Packages

We want your feedback!

INRA/rgs3
R package wrapping the GS3 program for genomic selection