crossValidate: Cross-validation function
In digiYozhik/msc_thesis: Functions to support master thesis

Cross-validation function used in combination with BGLR

crossValidate(x, id = "GERMPLASM", factor = "LOCATION", k = 5,
  replication = 3, seed = NULL, exclusive = TRUE,
  sampling = c("randomByID", "randomAccrossFactor", "randomByFactor",
  "randomWithinFactor", "popStructureAccrossFactor", "popStructureWithinFactor",
  "commit", "incompleteTrial"), trainingSet = NULL, validationSet = NULL,
  populationStructure = NULL, verbose = FALSE)

`x`	a data frame with at least the following information: `GERMPLASM`: Name of the entries. `LOCATION`: Name of the geographic locations of the multi-field trial. In this function we assume that the factor used in setting up the cross-validation schemes is the geographic location. However, this can be any field design factor that is adequate as an analysis factor.
`id`	character specifying the column name of the entries IDs in x. Default is GERMPLASM.
`factor`	character specifying the column name of the factor to use in the cross-validation in x. Default is LOCATION, referring to the geographic locations of a multi-location field trial.
`k`	integer defining the number of folds for k-fold cross validation, thus k should be in [2,nrow(y)], where y is the vector of phenotypic values. The default is 5.
`replication`	numeric defining the number of replications of the cross-validation. Default is 3.
`seed`	numeric value passed to the set.seed function, so that the randomization can be reproduced by the user. Default is NULL, in which case 123 is used as the seed.
`exclusive`	logical indicating whether sampling should be done without replacement. The argument is passed, negated, to the replace argument of the sample.int function, i.e. exclusive = TRUE means replace = FALSE, such that the probability of choosing the next item is proportional to the weights amongst the remaining items. Default is TRUE.
`sampling`	character specifying which sampling strategy to use in the cross-validation. The different sampling strategies are described below:
  `randomByID`: Random sampling by name of the observations.
  `randomAccrossFactor`: Random sampling by name of the entries, taking into account randomization across a defined factor.
  `randomByFactor`: Random sampling by name of the entries, using the factor to define the sets.
  `randomWithinFactor`: Random sampling by name of the entries, taking into account randomization within a defined factor.
  `popStructureAccrossFactor`: Accounts for across-population structure information, e.g. test and training sets each contain a set of complete families.
  `popStructureWithinFactor`: Accounts for within-population structure information, e.g. each family is split into k subsets.
  `commit`: Sampling done using defined test and training sets.
  `incompleteTrial`: Random sampling that takes into account an incomplete field trial setup.
  If sampling is "commit", the sets of names have to be specified in the trainingSet and validationSet arguments.
`trainingSet`	character vector of the observations in the training set.
`validationSet`	character vector of the observations in the specified test set.
`populationStructure`	vector of length nrow(y) assigning individuals to a population structure, where y refers to the vector of phenotypes. This argument is only required for the options sampling="popStructureAccrossFactor" or sampling="popStructureWithinFactor".
`verbose`	logical whether to output information about the progress of the cross-validation. Default is FALSE.
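
The way `seed` and `exclusive` are described above can be illustrated with base R alone. This is a minimal sketch, not the function's actual code: the variables `seed`, `exclusive` and `n` are stand-ins for illustration, and only `set.seed` and `sample.int` are real calls.

```r
# Stand-in values mirroring the documented defaults (illustration only)
seed <- NULL
exclusive <- TRUE
n <- 10

# A NULL seed falls back to 123, as stated in the description of `seed`
set.seed(if (is.null(seed)) 123 else seed)

# exclusive = TRUE is negated to replace = FALSE,
# so each index can be drawn at most once
idx <- sample.int(n, size = n, replace = !exclusive)
print(idx)
```

With `exclusive <- FALSE` the same call samples with replacement, so some indices may repeat and others may be missing.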

In cross-validation (CV), the data set is split into a training set and a validation (or test) set. For sampling into the sets, k-fold cross-validation is applied: the data set is split into k subsets, of which k-1 form the training set and 1 forms the test set, and this is repeated so that each subset serves once as the test set. The function is based on the crossVal function from the synbreed package. We made the function more flexible by separating out the cross-validation scheme functionality, which allows easy plug-in of further user-defined CV schemes. Further, the function was adjusted to work with the BGLR framework.
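
The k-fold idea described here can be sketched in a few lines of base R. This is a generic illustration of k-fold assignment under made-up entry names, not the sampling code used inside crossValidate.

```r
# Toy entry names, made up for illustration
ids <- paste0("G", 1:23)
k <- 5

set.seed(123)
# Shuffle the entries, then deal out fold labels 1..k in turn,
# so that fold sizes differ by at most one
fold <- integer(length(ids))
fold[sample.int(length(ids))] <- rep_len(seq_len(k), length(ids))
names(fold) <- ids

# Each fold serves once as the test set; the other k-1 folds form the training set
for (i in seq_len(k)) {
  test  <- ids[fold == i]
  train <- ids[fold != i]
  cat(sprintf("fold %d: %d training, %d test\n", i, length(train), length(test)))
}
```

Repeating the shuffle with a different random state gives one such assignment per replication.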

A data frame with the result of sampling the entries into k folds for the user-defined number of replications. The table includes the following columns:

`ID`	The names of the observations.
`Rep[x]`	[x] columns of numeric scores giving the assignment of the observations to folds 1...k, where [x] is set by the replication argument.
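
One possible way to use such a table is to mask the phenotypes of one fold before model fitting, since BGLR predicts entries whose response is NA. The sketch below builds a toy table by hand: the column names `Rep1` to `Rep3` are an assumption derived from the `Rep[x]` description, and the phenotype vector `y` is made up.

```r
# Toy table in the documented shape; Rep1..Rep3 are assumed column names
set.seed(123)
ids <- paste0("G", 1:10)
scheme <- data.frame(ID = ids,
                     Rep1 = sample(rep_len(1:5, 10)),
                     Rep2 = sample(rep_len(1:5, 10)),
                     Rep3 = sample(rep_len(1:5, 10)),
                     stringsAsFactors = FALSE)

# Made-up phenotypes, one per entry
y <- setNames(rnorm(10), ids)

# Mask fold 1 of replication 1: the masked entries form the validation set,
# the remaining entries form the training set
validation_ids <- scheme$ID[scheme$Rep1 == 1]
y_masked <- y
y_masked[validation_ids] <- NA
print(y_masked)
```

Looping over the folds 1...k and over the replication columns yields one masked response vector per model fit.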

1. Albrecht T, Wimmer V, Auinger HJ, Erbe M, Knaak C, Ouzunova M, Simianer H, Schoen CC (2011) Genome-based prediction of testcross values in maize. Theor Appl Genet 123:339-350.
2. Gustavo de los Campos and Paulino Perez Rodriguez (2014). BGLR: Bayesian Generalized Linear Regression. R package version 1.0.3. http://CRAN.R-project.org/package=BGLR

data(exampleCV)
scheme1 <- crossValidate(x=exampleCV, id="GERMPLASM", factor="LOCATION",
                         k=5, replication=3, seed=NULL, exclusive=TRUE,
                         sampling="randomByFactor",verbose=TRUE)
scheme2 <- crossValidate(x=exampleCV, id="GERMPLASM", factor="LOCATION",
                         k=5, replication=3, seed=NULL, exclusive=TRUE,
                         sampling="incompleteTrial",verbose=TRUE)
scheme3 <- crossValidate(x=exampleCV, id="GERMPLASM", factor="LOCATION",
                         k=5, replication=3, seed=NULL, exclusive=TRUE,
                         sampling="randomAccrossFactor",verbose=TRUE)
scheme4 <- crossValidate(x=exampleCV, id="GERMPLASM", factor="LOCATION",
                         k=5, replication=3, seed=NULL, exclusive=TRUE,
                         sampling="randomWithinFactor",verbose=TRUE)
scheme5 <- crossValidate(x=exampleCV, id="GERMPLASM", factor="LOCATION",
                         k=5, replication=3, seed=NULL, exclusive=TRUE,
                         sampling="randomByID",verbose=TRUE)
head(scheme1)
head(scheme2)
head(scheme3)
head(scheme4)
head(scheme5)