inferenceBGLR: Function to make inference on a cross validation analysis and...
In digiYozhik/msc_thesis: Functions to support master thesis


Function to make inference on a cross validation analysis and a multi-location trial data set using BGLR

inferenceBGLR(P, id = "GERMPLASM", factor = "LOCATION", trait = "YIELD",
  CVscheme = NULL, modelName = c("model1", "model2", "model0"), G = G,
  outputDir = getwd(), verbose = TRUE, replications = 3, ...)

`P`	a data frame that holds the design information and phenotypes for the data to be modeled. Its columns should include those named by the id, factor and trait arguments detailed below. For inference on the thesis data set, use the data set P obtained by typing data(P) in the console.
`id`	character describing the column name for the names of the observations. Default is GERMPLASM.
`factor`	character describing the column name for the factor that describes the geographic location in the multi-location trial. Default is LOCATION.
`trait`	character describing the column name for the phenotype to be modeled. Default is YIELD.
`CVscheme`	data frame returned by the crossValidate function, reflecting the sampling strategy the user chose for the cross-validation. Default is NULL, which is meant to apply prediction to the full data set. This is not yet implemented, so the argument must currently be specified.
`modelName`	character name describing the model used for modeling. One of:
  `modelG`: Model where entries and locations are treated as random terms. The G-matrix is used to include the genetic relatedness between the entries. See reference 4 for more detail.
  `modelGE`: Model where entries and locations are treated as random terms, and where a GxE interaction term is included. The G-matrix is used to include the genetic relatedness between the entries. The GxE interactions are modeled following Jarquin et al. (2014). See references 3 and 4 for more detail.
  `modelL`: Model where entries and locations are treated as random terms, and where no information about the relatedness between the entries is included. This model is included for didactic and testing purposes.
`G`	matrix containing the realized G-matrix obtained for the entries in the dataset specified in P.
`outputDir`	character specifying the name of the directory where to output the files used in the modeling and inference. Default is the working directory.
`verbose`	logical indicating whether to output information about the progress of the cross-validation. Default is TRUE.
`replications`	numeric defining the number of replications of the cross-validation. Default is 3.
`...`	additional arguments for the BGLR function. Of interest are nIter for the number of iterations and burnIn specifying the burn-in used in MCMC analysis.
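
The role of the CVscheme argument can be illustrated with a minimal sketch of a fold assignment by entry. This is not the crossValidate implementation: the toy data frame P_toy, the number of folds and the fold column are assumptions made for illustration only. Masking test-set phenotypes with NA is a common way to obtain test-set predictions from BGLR, which predicts records with a missing response; whether inferenceBGLR does exactly this is not stated in this page.

```r
# Hypothetical toy data: 6 entries, each observed at 2 locations.
set.seed(1)
P_toy <- data.frame(
  GERMPLASM = rep(paste0("g", 1:6), each = 2),
  LOCATION  = rep(c("locA", "locB"), times = 6),
  YIELD     = rnorm(12, mean = 10)
)

k <- 3  # number of folds (assumed for illustration)

# Assign every unique entry to exactly one fold, so that all records
# of an entry end up in the same fold.
ids <- unique(as.character(P_toy$GERMPLASM))
fold_of_id <- setNames(sample(rep(seq_len(k), length.out = length(ids))), ids)
P_toy$fold <- unname(fold_of_id[as.character(P_toy$GERMPLASM)])

# Use fold 1 as the test set: its phenotypes are masked with NA,
# the remaining records form the training set.
test_rows <- P_toy$fold == 1
y_train <- P_toy$YIELD
y_train[test_rows] <- NA
```

Here every fold holds two entries, so each test set contains the four records of those two entries.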

The function uses the cross-validation scheme (CVscheme argument) to split the data into training and test sets. For each replication (replications argument) and fold, the model specified in the modelName argument is fitted with BGLR following the specifications in Appendix B of reference 4. After the model fit, a series of metrics is calculated to support inference, as further detailed in reference 4. These include the predictive ability, the mean squared prediction error (MSPE), and the bias, which is calculated using a linear model (lm function) of the observed phenotype values on the predicted phenotype values of the test set under evaluation. The returned information is detailed in the Value section. The files used for inference are stored in a folder named BGLR, a subdirectory of the directory specified in the outputDir argument.
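
The metrics named above can be sketched in base R for a single test set. The vectors y_obs and y_pred are simulated stand-ins, not output of inferenceBGLR, and the variable names are chosen for illustration only.

```r
# Simulated observed and predicted phenotypes for one test set (assumption).
set.seed(42)
y_obs  <- rnorm(50, mean = 10, sd = 2)
y_pred <- 10 + 0.5 * (y_obs - 10) + rnorm(50, sd = 1)

# Predictive ability: Pearson correlation between observed and predicted values.
pred_abi <- cor(y_obs, y_pred)

# Spearman's rank correlation between observed and predicted values.
rank_cor <- cor(y_obs, y_pred, method = "spearman")

# Mean squared prediction error (MSPE).
mspe <- mean((y_obs - y_pred)^2)

# Bias: slope of the regression of observed on predicted values.
fit  <- lm(y_obs ~ y_pred)
bias <- unname(coef(fit)["y_pred"])
```

A slope below 1 indicates that the predictions are more spread out than the observations warrant (inflation), a slope above 1 the opposite (deflation).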

A list with the following slots, where TS stands for test set.

n.SNP Number of SNPs used in analysis. Not relevant here, put to zero.
n.T Matrix with number of entries in the test set for each fold (rows) by replications (columns).
n.DS Matrix with the number of observations in the total dataset for each fold (rows) by replications (columns).
id.TS List of IDs of each test set within a list of each replication.
bu Estimated fixed and random effects of each fold within each replication (see crossVal function)
y.TS Predicted values of all test sets within each replication.
PredAbi Predictive ability of each fold within each replication calculated as correlation coefficient r(y_{TS},\hat y_{TS}).
rankCor Spearman's rank correlation of each fold within each replication calculated between y_{TS} and \hat y_{TS}.
bias Regression coefficients of a regression of the observed values on the predicted values in the TS. A regression coefficient < 1 implies inflation of the predicted values, and a coefficient > 1 implies deflation of the predicted values.
k Integer defining the number of folds.
Rep Numeric defining the number of replications.
sampling Character defining the sampling method.
Seed Seed for set.seed()
rep.seed vector with the values for the seeds used for each replication
nr.ranEff Number of random effects used (see crossVal function)
VC.est.method Method for the variance components (committed or re-estimated with ASReml/BRR/BL), see crossVal function. We recommend the default, BGLR.
m10 Mean of observed values for the 10% best predicted of each replication. The k test sets are pooled within each replication.
mse Mean squared error of each fold within each replication, calculated between y_{TS} and \hat y_{TS}. Reference 4 refers to this as MSPE, the mean squared prediction error.
topRecovery Array of top-x recovery of entries across the different locations. The array contains one matrix for every fold in the cross-validation. Every matrix holds as many rows as there are replications. The columns of each matrix hold the values for the different top-x recoveries, where x is an element of (10, 20, 30, 40, 50, 100, 200). The elements are calculated as the percentage of entries shared between the top x of the raw and the top x of the predicted test set under consideration.
residualErrors Matrix of residual error variances, one value per fold and replication, with the folds of the cross-validation as columns. The variance is taken from the varE component of the fitted BGLR object.
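
The m10 and topRecovery slots can be illustrated with a minimal sketch on toy data. The helper top_recovery, the toy vectors and the rounding used for the 10% cut-off are assumptions for illustration. They are not taken from the package source.

```r
# Toy test set of 10 entries with observed and predicted values (assumption).
ids  <- paste0("e", 1:10)
obs  <- c(10, 9, 8, 7, 6, 5, 4, 3, 2, 1)
pred <- c(10, 3, 9, 8, 7, 6, 5, 4, 2, 1)

# Hypothetical helper: percentage of entries shared between the top x
# by observed value and the top x by predicted value.
top_recovery <- function(obs, pred, ids, x) {
  top_obs  <- ids[order(obs,  decreasing = TRUE)][seq_len(x)]
  top_pred <- ids[order(pred, decreasing = TRUE)][seq_len(x)]
  100 * length(intersect(top_obs, top_pred)) / x
}

rec_top3 <- top_recovery(obs, pred, ids, x = 3)
rec_top5 <- top_recovery(obs, pred, ids, x = 5)

# m10-style summary: mean observed value of the 10% best predicted entries.
# Rounding down with a minimum of one entry is an assumption.
n_top <- max(1, length(pred) %/% 10)
m10   <- mean(obs[order(pred, decreasing = TRUE)][seq_len(n_top)])
```

In this toy case two of the three best observed entries are also among the three best predicted, so rec_top3 is two thirds expressed as a percentage.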

Ruud Derijcker

1. Albrecht, T., et al. (2011). Genome-based prediction of testcross values in maize. Theor Appl Genet 123:339-350.
2. De Los Campos, G., Perez, P. (2014). BGLR: Bayesian Generalized Linear Regression. Version 1.0.3. (http://CRAN.R-project.org/package=BGLR).
3. Jarquin, D. et al. (2014). A reaction norm model for genomic selection using high-dimensional genomic and environmental data. Theor Appl Genet 127(3):595-607.
4. Derijcker, R. (2015). Investigating incorporation of genotype x environment interaction (G x E) for genomic selection in a practical setting. Unpublished M.Sc. thesis. University of Ghent: Belgium.

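# Load the realized G-matrix and the phenotype data shipped with the package
data(G)
data(P)
# Build a 5-fold cross-validation scheme with 2 replications, sampled by entry ID
scheme <- crossValidate(x=P, id="GERMPLASM", factor="LOCATION", k=5,
                        replication=2, seed=NULL, exclusive=TRUE,
                        sampling="randomByID",verbose=TRUE)
# Fit modelG on every fold and replication; nIter and burnIn are passed on to BGLR
output <- inferenceBGLR(P, CVscheme=scheme, modelName="modelG", id="GERMPLASM",
                       G=G, factor="LOCATION", trait="YIELD", nIter=1500, burnIn=250,
                       replications=2)
# Inspect the structure of the returned list
str(output)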