crossValidate: Cross-validation function

Description

Cross-validation function used in combination with BGLR

Usage

crossValidate(x, id = "GERMPLASM", factor = "LOCATION", k = 5,
  replication = 3, seed = NULL, exclusive = TRUE,
  sampling = c("randomByID", "randomAccrossFactor", "randomByFactor",
  "randomWithinFactor", "popStructureAccrossFactor", "popStructureWithinFactor",
  "commit", "incompleteTrial"), trainingSet = NULL, validationSet = NULL,
  populationStructure = NULL, verbose = FALSE)

Arguments

x

a data frame with at least the following information:

GERMPLASM:

Name of the entries.

LOCATION:

Name of the geographic locations of the multi-location field trial. This function assumes that the factor used to set up the cross-validation schemes is the geographic location. However, any field design factor that is adequate as an analysis factor can be used.

id

character specifying the column name of the entry IDs in x. Default is GERMPLASM.

factor

character specifying the column name in x of the factor to use in the cross-validation. Default is LOCATION, referring to the geographic locations of a multi-location field trial.

k

integer defining the number of folds for k-fold cross-validation; k should be in [2, nrow(y)], where y is the vector of phenotypic values. The default is 5.

replication

numeric defining the number of replications of the cross-validation. Default is 3.

seed

numeric seed passed to the set.seed function, so that the randomization can be reproduced by the user. Default is NULL, in which case 123 is used as the seed.

exclusive

logical indicating whether sampling should be done without replacement. The argument is passed, negated, to the replace argument of the sample.int function, i.e. exclusive = TRUE means replace = FALSE, such that the probability of choosing the next item is proportional to the weights amongst the remaining items.

sampling

character specifying which sampling strategy to use in the cross-validation. The different sampling strategies are described below:

randomByID:

Random sampling by name of the observations

randomAccrossFactor:

Random sampling by name of the entries taking into account randomization across a defined factor

randomByFactor:

Random sampling by name of the entries using the factor to define the sets

randomWithinFactor:

Random sampling by name of the entries taking into account randomization within a defined factor

popStructureAccrossFactor:

Accounts for population structure across groups, e.g. test and training sets each contain a set of complete families

popStructureWithinFactor:

Accounts for population structure within groups, e.g. each family is split into k subsets

commit:

Sampling done using defined test and training sets

incompleteTrial:

Random sampling by taking into account an incomplete field trial setup

If sampling is "commit", the sets of names have to be specified in the trainingSet and validationSet arguments.

trainingSet

character vector of the observations in the training set.

validationSet

character vector of the observations in the specified test set.

populationStructure

vector of length nrow(y) assigning individuals to a population structure, where y refers to the vector of phenotypes. This argument is only required for the options sampling="popStructureAccrossFactor" or sampling="popStructureWithinFactor".

verbose

logical whether to output information about the progress of the cross-validation. Default is FALSE.
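
The interplay of the seed and exclusive arguments described above can be sketched in base R. This is an illustration only, not the package's code: the helper name draw_ids is hypothetical, and the fallback seed of 123 is taken from the description of the seed argument.

```r
# Hypothetical helper, not part of the package: draws n entry IDs.
draw_ids <- function(ids, n, seed = NULL, exclusive = TRUE) {
  # Documented default: a NULL seed falls back to 123.
  if (is.null(seed)) seed <- 123
  set.seed(seed)
  # exclusive = TRUE maps to replace = FALSE (sampling without replacement).
  idx <- sample.int(length(ids), size = n, replace = !exclusive)
  ids[idx]
}

ids <- paste0("G", 1:10)
first  <- draw_ids(ids, 5)
second <- draw_ids(ids, 5)
identical(first, second)    # the same seed reproduces the same draw
anyDuplicated(first) == 0   # no ID is drawn twice when exclusive = TRUE
```

With exclusive = FALSE the same helper may return an ID more than once, and n may then exceed the number of available IDs.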

Details

In cross-validation (CV) the data set is split into a training set and a validation (test) set. For sampling into the sets, k-fold cross-validation is applied: the data set is split into k subsets, of which k-1 form the training set and the remaining one is the test set, and this is repeated so that each subset serves once as the test set. The function is based on the crossVal function from the synbreed package. We made the function more flexible by separating out the cross-validation schemes, which allows additional user-defined CV schemes to be plugged in easily. Further, the function was adjusted to work with the BGLR framework.
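
The k-fold procedure described above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the implementation used by crossValidate: the function make_folds and the entry names G1 to G23 are hypothetical, and entries are simply shuffled and dealt into folds without regard to any factor or population structure.

```r
# Hypothetical sketch of k-fold assignment for unique entry IDs (base R only).
make_folds <- function(ids, k = 5, seed = 123) {
  ids <- unique(ids)
  stopifnot(k >= 2, k <= length(ids))
  set.seed(seed)
  # Shuffle the IDs, then deal them into k folds of near-equal size.
  shuffled <- sample(ids)
  fold <- rep_len(seq_len(k), length(shuffled))
  data.frame(id = shuffled, fold = fold, stringsAsFactors = FALSE)
}

folds <- make_folds(paste0("G", 1:23), k = 5)

# Each fold serves once as the test set; the other k - 1 folds form the training set.
for (i in seq_len(5)) {
  test_ids  <- folds$id[folds$fold == i]
  train_ids <- folds$id[folds$fold != i]
  cat("fold", i, ": train =", length(train_ids), ", test =", length(test_ids), "\n")
}
```

Repeating the call with different seeds would give independent replications of the same scheme, which is roughly what the replication argument refers to.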

Value

data frame with the result of sampling the entries into k folds for the user-defined number of replications. The table includes the following columns:

References

1:

Albrecht T, Wimmer V, Auinger HJ, Erbe M, Knaak C, Ouzunova M, Simianer H, Schoen CC (2011) Genome-based prediction of testcross values in maize. Theor Appl Genet 123:339-350.

2:

de los Campos G, Perez Rodriguez P (2014) BGLR: Bayesian Generalized Linear Regression. R package version 1.0.3. http://CRAN.R-project.org/package=BGLR

Examples

data(exampleCV)
scheme1 <- crossValidate(x=exampleCV, id="GERMPLASM", factor="LOCATION",
                         k=5, replication=3, seed=NULL, exclusive=TRUE,
                         sampling="randomByFactor",verbose=TRUE)
scheme2 <- crossValidate(x=exampleCV, id="GERMPLASM", factor="LOCATION",
                         k=5, replication=3, seed=NULL, exclusive=TRUE,
                         sampling="incompleteTrial",verbose=TRUE)
scheme3 <- crossValidate(x=exampleCV, id="GERMPLASM", factor="LOCATION",
                         k=5, replication=3, seed=NULL, exclusive=TRUE,
                         sampling="randomAccrossFactor",verbose=TRUE)
scheme4 <- crossValidate(x=exampleCV, id="GERMPLASM", factor="LOCATION",
                         k=5, replication=3, seed=NULL, exclusive=TRUE,
                         sampling="randomWithinFactor",verbose=TRUE)
scheme5 <- crossValidate(x=exampleCV, id="GERMPLASM", factor="LOCATION",
                         k=5, replication=3, seed=NULL, exclusive=TRUE,
                         sampling="randomByID",verbose=TRUE)
head(scheme1)
head(scheme2)
head(scheme3)
head(scheme4)
head(scheme5)

digiYozhik/msc_thesis documentation built on May 14, 2019, 5:16 p.m.