library(knitr) knitr::opts_chunk$set(fig.align="center")
This document is an R vignette available under the CC BY-SA 4.0 license.
It is part of the R package rgs3
, a free software available under the GPL version 3 or later, with copyright from the Institut National de la Recherche Agronomique (INRA).
The rgs3
package can be downloaded from GitHub.
To install it, we can follow the indications in the README.md
file.
For the rgs3
package to work successfully, the GS3 program should be already installed (some binaries are available from GitHub).
Once the installation is completed, we can load the package into R:
if("package:rgs3" %in% search()) detach("package:rgs3", unload=TRUE)
suppressPackageStartupMessages(library(rgs3)) packageVersion("rgs3")
This R chunk is used to assess how much time it takes to execute the R code in this document until the end:
t0 <- proc.time()
To know exactly what the GS3 program can do, it is strongly advised to also read the GS3 manual (the latest version should be on GitHub) as this introduction doesn't show all possible ways to use it.
Notations:
$N$: number of trait measurements (observations)
$Q$: number of generic covariates or cross-classified factors (modeled with fixed effects)
$I$: number of individuals
$P$: number of SNPs
$J$: number of permanent environmental covariates (modeled with random effects)
$\boldsymbol{y}$: $N \times 1$ vector of trait measurements
$X$: $N \times Q$ design matrix of covariates or cross-classified factors
$\boldsymbol{b}$: $Q \times 1$ vector of fixed effects
$Z$: $N \times P$ design matrix of SNP genotypes coded additively
$\boldsymbol{a}$: $P \times 1$ vector of random additive marker locus effects (SNP-based)
$W$: $N \times P$ design matrix of SNP genotypes coded for the dominance
$\boldsymbol{d}$: $P \times 1$ vector of random dominant marker locus effects (SNP-based)
$T$: $N \times I$ design matrix of individuals
$\boldsymbol{g}$: $I \times 1$ vector of random polygenic infinitesimal effects (pedigree-based), such as $\boldsymbol{g} \sim \mathcal{N}_I(\boldsymbol{0}, \sigma_g^2 \, A)$ where $A$ is the $I \times I$ matrix of additive genetic relationships (also called the "numerator relationship matrix")
$S$: $N \times J$ design matrix of permanent environmental covariates
$\boldsymbol{p}$: $J \times 1$ vector of random permanent environmental effects
$\boldsymbol{e}$: $N \times 1$ vector of errors
$\sigma_a^2, \sigma_d^2, \sigma_g^2, \sigma_p^2, \sigma_e^2$: variance components of each random effect
$R$: $N \times N$ variance-covariance matrix of the errors where $R_{n,n} = \frac{\sigma_e^2}{w_n}$ with $w_n$ the weight of the $n$-th observation
Likelihood:
$\boldsymbol{y} = X \boldsymbol{b} + Z \boldsymbol{a} + W \boldsymbol{d} + T \boldsymbol{g} + S \boldsymbol{p} + \boldsymbol{e}$ where $\boldsymbol{e} \sim \mathcal{N}_N(\boldsymbol{0}, \, R)$
Priors: see the GS3 manual
Estimate of the "generalized" genomic breeding value of individual $i$:
$EBV_i = \hat{g}_i + \boldsymbol{z}_i \hat{\boldsymbol{a}} + \boldsymbol{w}_i \hat{\boldsymbol{d}}$
A data set was simulated to illustrate how the rgs3
package works (caution, the notations for the simulation model are different than for the inference model above):
SNP genotypes were simulated for $I$ individuals via the sequential coalescent with recombination model from Staab et al (2015);
$P$ SNPs were encoded as allele dose (in ${0,1,2}$), filtered out if they had a minor allele frequency below 1%, and gathered into a matrix used to estimate the $I \times I$ additive genomic relationships matrix $G_{\text{VR1}}$ using the first estimator from VanRaden (2008);
finally, the following model was used to simulate $N$ phenotypes over $Q$ years, with $N = Q \times I$ (each genotype is phenotyped each year):
$\boldsymbol{y} = X \boldsymbol{\beta} + Z \boldsymbol{u} + \boldsymbol{\epsilon}$ where $\boldsymbol{u} \sim \mathcal{N}I(\boldsymbol{0}, \sigma_u^2 \, G{\text{VR1}})$ and $\boldsymbol{\epsilon} \sim \mathcal{N}N(\boldsymbol{0}, \sigma\epsilon^2 \, \text{Id})$ with $\sigma_u^2 = 15$ and $\sigma_\epsilon^2 = 5$, where the narrow-sense heritability here is $h^2 = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_\epsilon^2}$.
Useful approximation (see Gianola et al, 2009): $\sigma_u^2 \approx \sigma_a^2 \times 2 \sum_p f_p (1 - f_p)$, where $f_p$ is the allele frequency of the $p$-th SNP.
Moreover, with such an experimental design, the broad-sense heritability (especially used in plant breeding, see Piepho and Mohring, 2007) here is: $H^2 = \frac{\sigma_u^2}{\sigma_u^2 + \frac{\sigma_\epsilon^2}{Q}}$.
Load the phenotypes:
phenos.file <- system.file("extdata", "phenos_df.txt.gz", package="rgs3") tools::md5sum(path.expand(phenos.file)) phenos <- read.table(phenos.file, header=TRUE) phenos$year <- as.factor(phenos$year) str(phenos) head(phenos) phenos[phenos$geno == levels(phenos$geno)[1],] (I <- nlevels(phenos$geno)) (Q <- nlevels(phenos$year)) (N <- nrow(phenos)) sigma.u2 <- 15 sigma.epsilon2 <- 5 (h2 <- sigma.u2 / (sigma.u2 + sigma.epsilon2)) (H2 <- sigma.u2 / (sigma.u2 + (sigma.epsilon2 / Q)))
As can be seen above, the phenotype file contains r ncol(phenos)
columns.
The first column corresponds to a covariate indicating the year at which the trait was measured.
This covariate has $Q=$ r Q
levels.
The second column corresponds to the $I=$ r I
individual identifiers.
The third column corresponds to the $N=$ r N
trait measurements.
A quick look at the data is always helpful:
par(mar=c(4,4,2,1)) boxplot(phenos$response1 ~ phenos$year, notch=TRUE, horizontal=TRUE, las=1, main="Phenotypes", xlab="response", ylab="year")
We can see at least a "year" effect, as well as substantial within-year variation likely coming from both genetic and environmental sources. The goal will be to quantify their respective contributions.
Load the genotypes (only SNPs with a minor allele frequency above 0.01 are available):
genos.file <- system.file("extdata", "genos_mat.txt.gz", package="rgs3") tools::md5sum(path.expand(genos.file)) genos <- as.matrix(read.table(genos.file)) dim(genos) genos[1:3,1:6] (P <- ncol(genos))
As can be seen above, the genotype file contains the matrix of genotypes for each individual at the $P=$ r P
SNPs, encoded as allele dose (0/1/2).
Individuals are in rows and SNPs in columns.
We can have a quick look at the allele frequencies:
afs <- colMeans(genos, na.rm=TRUE) / 2 summary(afs) # mean should be 0.5 as reference alleles are chosen at random par(mar=c(2,4,2,1)) hist(afs, breaks="FD", main="Allele frequencies", xlab="", ylab="Number of SNPs", col="grey", border="white", xlim=c(0, 1), las=1)
We can also have a quick look at minor allele frequencies:
mafs <- apply(genos, 2, function(x){ x <- x[complete.cases(x)] tmp <- sum(x) / (2 * length(x)) ifelse(tmp <= 0.5, tmp, 1 - tmp) }) par(mar=c(2,4,2,1)) hist(mafs, breaks="FD", main="Minor allele frequencies", xlab="", ylab="number of SNPs", col="grey", border="white", xlim=c(0, 0.5), las=1)
We can also have a quick look at the additive genomic relationships:
M <- genos - 1 Pmat <- matrix(rep(1,nrow(genos))) %*% (2 * (afs - 0.5)) Z <- M - Pmat G.VR1 <- tcrossprod(Z, Z) / (2 * sum(afs * (1 - afs))) G.VR1[1:3,1:3] summary(diag(G.VR1)) # under HWE, average should be 1 summary(G.VR1[upper.tri(G.VR1)]) # under HWE, average should be 0 par(mar=c(1,1,2,1)) image(t(G.VR1)[,nrow(G.VR1):1], axes=FALSE, main="G.VR1")
Load the true genotypic values:
genovals.file <- system.file("extdata", "genovals.txt.gz", package="rgs3") tools::md5sum(path.expand(genovals.file)) genovals <- as.matrix(read.table(genovals.file, row.names=1))[,1] head(genovals) summary(genovals)
Fit the "rrBLUP/GBLUP" model via ReML (see Habier et al, 2007):
if(require(rrBLUP)){ st <- system.time( fit.rrBLUP.g <- mixed.solve(y=phenos$response1, Z=model.matrix(~ phenos$geno - 1), K=G.VR1, X=model.matrix(~ phenos$year), method="REML", SE=TRUE, return.Hinv=TRUE)) print(st) message(paste0("estimate of sigma.epsilon2 = ", fit.rrBLUP.g$Ve)) message(paste0("estimate of sigma.u2 = ", fit.rrBLUP.g$Vu)) message(paste0("cor(true.u, pred.u) = ", cor(genovals, fit.rrBLUP.g$u[names(genovals)]))) print(accuracy.rrBLUP <- lm(fit.rrBLUP.g$u[names(genovals)] ~ genovals)) plot(x=genovals, y=fit.rrBLUP.g$u[names(genovals)], xlab="True breeding values", ylab="Predicted breeding values", main="Check with rrBLUP", las=1, asp=1) abline(v=0, h=0, a=0, b=1, lty=2); abline(accuracy.rrBLUP, col="red") st <- system.time( fit.rrBLUP.mrk <- mixed.solve(y=phenos$response1, Z=model.matrix(~ phenos$geno - 1) %*% genos, X=model.matrix(~ phenos$year), method="REML", SE=TRUE, return.Hinv=TRUE)) print(st) message(paste0("estimate of sigma.epsilon2 = ", fit.rrBLUP.mrk$Ve)) message(paste0("estimate of sigma.a2 = ", fit.rrBLUP.mrk$Vu)) sigma.u2.hat <- fit.rrBLUP.mrk$Vu * 2 * sum(afs * (1 - afs)) message(paste0("estimate of sigma.u2 = ", sigma.u2.hat)) u.hat <- genos[names(genovals),] %*% fit.rrBLUP.mrk$u[colnames(genos)] message(paste0("cor(true.u, pred.u) = ", cor(genovals, u.hat))) y.hat <- model.matrix(~ phenos$geno - 1) %*% u.hat message(paste0("cor(y, y-hat) = ", cor(phenos$response1, y.hat))) h2.hat <- sigma.u2.hat / (sigma.u2.hat + fit.rrBLUP.mrk$Ve) message(paste0("estimate of h2 = ", h2.hat)) message(paste0("cor(y, y-hat) / h = ", cor(phenos$response1, y.hat) / sqrt(h2.hat))) }
The estimates are quite close from the "true" values (i.e. those used to simulate the phenotypes, see section "Simulation" above).
Let us choose an identifier for our analysis, which will be used for all files, so that it will be easy to remove them at the end:
task.id <- "task-execGS3_intro-rgs3"
Encode the individual identifiers as consecutive numbers from $1$ to $I$:
inds <- setNames(object=1:nlevels(phenos$geno), nm=levels(phenos$geno)) head(inds)
Write the phenotype file in the GS3 format:
phenos.file.gs3 <- paste0(getwd(), "/", task.id, "_phenos.txt") writeDataForGs3(x=phenos, file=phenos.file.gs3, inds=inds, col.id=2, col.traits=3, binary.traits=FALSE)
Write the genotype file in the GS3 format:
genos.file.gs3 <- paste0(getwd(), "/", task.id, "_genos.txt") writeGenosForGs3(x=genos, file=genos.file.gs3, inds=inds)
The data set simulated as described above will now be analyzed with GS3 via the following model:
$\boldsymbol{y} = X \boldsymbol{b} + Z \boldsymbol{a} + \boldsymbol{e}$ where $\boldsymbol{a} \sim \mathcal{N}_P(\boldsymbol{0}, \sigma_a^2 \, \text{Id})$ and $\boldsymbol{e} \sim \mathcal{N}_N(\boldsymbol{0}, \sigma_e^2 \, \text{Id})$
Write the configuration file for GS3 (you can also use the helper function getDefaultConfig
):
(ptl <- data.frame(position=c(which(colnames(phenos) == "year"), ncol(phenos) + 1), type=c("cross", "add_SNP"), nlevels=c(Q, 0), stringsAsFactors=FALSE)) config <- list(data.file=phenos.file.gs3, genos.file=genos.file.gs3, ped.file="", num.loci=P, method="VCE", simul="F", ## niter=2000, burnin=200, thin=2, # debug niter=1*10^4, burnin=1*10^3, thin=10, conv.crit="1d-8", correct=1000, vcs.file=paste0(getwd(), "/", task.id, "var.txt"), sol.file=paste0(getwd(), "/", task.id, "sol.txt"), twc=c(which(colnames(phenos) == "response1"), 0), num.eff=nrow(ptl), ptl=ptl, vc=data.frame(var=c("vara", "vard", "varg", "varp", "vare"), exp=c("2.52d-04","1.75d-06","3.56","2.15","0.19"), df=rep("-2", 5), stringsAsFactors=FALSE), rec.id=which(colnames(phenos) == "geno"), cont="F", mod=rep("T", nrow(ptl)), ap=c(1,10), dp=c(1,1), use.mix="F", blasso=FALSE) isValidConfig(config=config) config.file <- writeConfigForGs3(config=config, task.id=task.id)
Here is how the resulting configuration file looks like:
readLines(config.file)
Execute GS3:
system.time( stdouterr.file <- execGs3(config.file, task.id))
The GS3 program outputs some messages (too long to display here):
stdouterr <- readLines(stdouterr.file) head(stdouterr) length(stdouterr)
If the coda package is installed, we can use the vcs2mcmc
function from rgs3
to read the output file from GS3 into an mcmc.list
object.
But we won't use this function in this vignette as coda
may not be available.
Load the variance components' samples:
vcs <- read.table(config$vcs.file, header=TRUE, check.names=FALSE) str(vcs)
Assess convergence visually, at least for some parameters:
par(mfrow=c(1,2), mar=c(4,4,2,1)) for(vc in c("vara","vare")){ plot(vcs[,vc], las=1, xlab="iterations", ylab="samples", main=vc) abline(h=mean(vcs[,vc]), col="red") abline(v=101, lty=2) legend(ifelse(vc == "vara", "bottomright", "topright"), legend="posterior mean", col="red", lty=1, bty="n") }
It looks like the chain converged, even though, to be more thorough, one should only declare convergence after checking it for all parameters.
Moreover, as indicated by the vertical, dotted line, we can discard more samples at the beginning:
idx.tokeep <- 101:nrow(vcs)
Look at posterior mean and variance of parameters of interest:
message(paste0("posterior mean of sigma.e2 = ", mean(vcs[idx.tokeep, "vare"]))) message(paste0("posterior sd of sigma.e2 = ", sd(vcs[idx.tokeep, "vare"]))) message(paste0("posterior mean of sigma.a2 = ", mean(vcs[idx.tokeep, "vara"]))) message(paste0("posterior sd of sigma.a2 = ", sd(vcs[idx.tokeep, "vara"]))) message(paste0("posterior mean of '2varapqpi' = ", mean(vcs[idx.tokeep, "2varapqpi"]))) message(paste0("posterior sd of '2varapqpi' = ", sd(vcs[idx.tokeep, "2varapqpi"])))
The estimates are quite close from the "true" values (i.e. those used to simulate the phenotypes, see section "Simulation" above).
Look at the heritability posteriors:
vcs$varu <- vcs$vara * 2 * sum(afs * (1 - afs)) vcs$h2 <- vcs$varu / (vcs$varu + vcs$vare) vcs$H2 <- vcs$varu / (vcs$varu + (vcs$vare / Q)) message(paste0("posterior mean of 'varu' = ", mean(vcs[idx.tokeep, "varu"]))) message(paste0("posterior sd of 'varu' = ", sd(vcs[idx.tokeep, "varu"]))) h2.hat <- mean(vcs[idx.tokeep, "h2"]) message(paste0("posterior mean of 'h2' = ", h2.hat)) message(paste0("posterior sd of 'varu' = ", sd(vcs[idx.tokeep, "h2"]))) message(paste0("posterior mean of 'H2' = ", mean(vcs[idx.tokeep, "H2"]))) message(paste0("posterior sd of 'H2' = ", sd(vcs[idx.tokeep, "H2"])))
The fixed and various SNP effects are found in the file specified via the variable sol.file
(see above):
sols <- read.table(file=config$sol.file, header=TRUE) str(sols) table(sols$effect)
The first r Q
rows correspond to the r Q
fixed effects.
The remaining r nrow(sols)-Q
rows correspond to the additive effects of the r P
SNP genotypes.
summary(sols$solution[sols$effect == 2]) lim <- max(abs(sols$solution[sols$effect == 2])) par(mar=c(2,4,2,1)) hist(sols$solution[sols$effect == 2], breaks="FD", xlim=c(-lim, lim), main="Additive effects of the SNPs", xlab="", ylab="number of SNPs", col="grey", border="white", las=1) abline(v=0, lty=2)
The various estimated genotypic values are saved in the file <config.file>_EBVs
:
ebvs <- read.table(file=paste0(config.file, "_EBVs"), header=TRUE) ebvs <- ebvs[! duplicated(ebvs$id),] str(ebvs) rownames(ebvs) <- names(inds) head(ebvs)
The first column, id
, corresponds to the identifiers of the individuals defined in the vector inds
above.
The second column, g_aSNP
, corresponds to the sum of the posterior means of the additive effects of the SNP genotypes, $\sum_p \hat{a}_p = \boldsymbol{z}_i \hat{\boldsymbol{a}}$.
The third column, g_dSNP
, corresponds to the sum of the posterior means of the dominant effects of the SNP genotypes, $\sum_p \hat{d}_p = \boldsymbol{w}_i \hat{\boldsymbol{d}}$.
The fourth column, poly_anim
, corresponds to the posterior mean of the polygenic infinitesimal effect (pedigree-based), $\hat{g}_i$.
The last column, g_overall
, corresponds to the estimated "generalized" genomic breeding value, that is, the sum of the three previous columns.
We can have a quick look at the distribution of these gEBVs:
summary(ebvs$g_overall) lim <- max(abs(ebvs$g_overall)) par(mar=c(2,4,2,1)) hist(ebvs$g_overall, breaks="FD", xlim=c(-lim, lim), main="gEBVs (with uncentered SNP genotypes)", xlab="", ylab="number of individuals", col="grey", border="white", las=1) abline(v=0, lty=2)
The bias is explained by the fact that GS3 returns (biological) "genotypic" values instead of (statistical) "breeding" values (EBVs), despite of the output file name (see Vitezica et al (2013) for a thorough explanation):
"genotypic" values: $u = Z a$ where $\forall i,j, \; Z_{ij} \in {-1,0,1}$ and $v = W d$ where $\forall i,j, \; W_{ij} \in {0,1}$;
"breeding" values: $u_{ebv} = Z^\star \alpha$ where $\alpha = a + (1 - f_p) d$ is the additive substitution effect and $\forall i,j, \; Z_{ij}^\star \in {-2 f_p, 1 - f_p, 2 - 2 f_p}$, i.e. $Z$ is centered.
The difference between the true breeding values and the raw output from GS3 hence comes from the way SNP genotypes are encoded in $Z$ versus $Z^\star$:
Z <- genos - 1 a.hat <- sols$solution[sols$effect == 2] u.hat <- Z %*% a.hat u.hat <- setNames(as.vector(u.hat), rownames(u.hat)) all.equal(u.hat, ebvs$g_aSNP, tolerance=10^(-6), check.attributes=FALSE) alpha.hat <- sols$solution[sols$effect == 2] # all dominance effects are null in this analysis Z.star <- Z - matrix(rep(1,nrow(genos))) %*% (2 * (afs - 0.5)) u.ebv <- Z.star %*% alpha.hat u.ebv <- setNames(as.vector(u.ebv), rownames(u.ebv)) summary(u.ebv)
Check with the genotypic values used to simulate the phenotypes:
cor(x=genovals, ebvs[names(genovals), "g_overall"], method="pearson") cor(x=genovals, u.ebv[names(genovals)], method="pearson") (accuracy.GS3 <- lm(u.ebv[names(genovals)] ~ genovals)) plot(x=genovals, y=u.ebv[names(genovals)], asp=1, las=1, xlab="True genotypic values", ylab="Predicted genotypic values", main="Check with GS3") abline(h=0, v=0, a=0, b=1, lty=2); abline(accuracy.GS3, col="red")
With real data sets, the genotypic values are usually unknown. A common practice hence is to look at the correlation between individual phenotypes and those predicted based on the markers, divided by the square root of the estimated narrow-sense heritability (see Daetwyler et al, 2013):
y.hat <- model.matrix(~ phenos$geno - 1) %*% genos %*% sols$solution[sols$effect == 2] cor(x=phenos$response1, y=y.hat) cor(x=phenos$response1, y=y.hat) / sqrt(h2.hat)
cleanGs3(config, config.file, task.id) for(f in c(phenos.file.gs3, genos.file.gs3)) if(file.exists(f)) file.remove(f)
t1 <- proc.time(); t1 - t0 print(sessionInfo(), locale=FALSE)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.