crossValidation: Conduct cross-validation


View source: R/functions.R

Description

Conduct a cross-validation for a given classification/regression model and output the prediction results collected over the cross-validation loop. The cross-validation can be done in two ways: normal k-fold cross-validation (batch=NULL) or batch-wise cross-validation (batch!=NULL). The latter is particularly useful in the presence of significant intra-group heterogeneity.

Usage

crossValidation(data, label, batch = NULL, 
                method = lda, pred = predict, classify = TRUE, 
                folds = NULL, nBatch = 0, nFold = 10, 
                verbose = TRUE, seed = NULL, ...)

Arguments

data

a data matrix, with samples saved in rows and features in columns.

label

a vector of response variables (i.e., group/concentration info); its length must equal the number of samples.

batch

a vector of sample identifiers (e.g., batch/patient ID); its length must equal the number of samples. Ideally, this should identify the samples at the highest level of the hierarchy (e.g., the patient ID rather than the spectrum ID). If missing, a normal k-fold cross-validation will be performed (i.e., the data is split randomly into k folds). Ignored if folds is given.

method

the name of the function to be applied to the training data (this can be any model-based procedure, such as classification/regression or even pre-processing). A user-defined function is possible; see fnPcaLda as an example.

pred

the name of the function to be applied to the testing data (e.g., new substances) based on the model built by method. A user-defined function is possible; see predPcaLda as an example.

classify

a boolean value; classify=TRUE indicates a classification task, otherwise a regression task. It is used in the function predSummary.

folds

a list of indices specifying the samples to be used in each fold; can be the output of the function dataSplit. If missing, a data split will be performed before the cross-validation.

nBatch

an integer, the number of data folds in case of batch-wise cross-validation (if nBatch=0, each batch is used as one fold). Ignored if folds is given or if batch is missing.

nFold

an integer, the value of k in case of normal k-fold cross-validation. Ignored if folds or batch is given.

verbose

a boolean value, whether or not to print logging information.

seed

an integer, if given, will be used as the random seed to split the data in case of k-fold cross-validation. Ignored if batch or folds is given.

...

further parameters to be passed to method.

Details

The cross-validation is conducted based on the data partition folds: each fold is predicted once using the model built on the remaining folds. If folds is missing, a data split is performed first (see dataSplit for details).

The procedure to be performed within the cross-validation is given by the function method, for example fnPcaLda. A user-defined function is possible, as long as it follows the same structure as fnPcaLda. A two-layer cross-validation (see References) can be done by using a tuning function as method, such as tunePcaLda (see Examples). In this case, the parameters of a classifier are optimized using the training data within tunePcaLda, and the optimal model is tested on the testing data. The parameters of pre-processing can be optimized in a similar way by incorporating the pre-processing steps into the function method.

NOTE: It is recommended to specify seed for a normal k-fold cross-validation in order to get the same results from repeated runs.
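As an illustration of a user-defined method/pred pair, the sketch below assumes the same calling convention as fnPcaLda/predPcaLda, i.e. that method receives the training data and labels and returns a fitted model object, and that pred receives that model together with the test data. The function and argument names here are hypothetical; the authoritative interface is defined by fnPcaLda and predPcaLda in the package.

```r
### Hypothetical user-defined model/prediction pair (assumed
### fnPcaLda/predPcaLda calling convention; names are illustrative only).
fnScaledLda <- function(data, label, ...) {
  data <- scale(data)                         # pre-processing inside the model function
  model <- MASS::lda(data, grouping = label, ...)
  attr(model, "center") <- attr(data, "scaled:center")
  attr(model, "scale")  <- attr(data, "scaled:scale")
  model
}

predScaledLda <- function(model, data, ...) {
  data <- scale(data,                         # re-use the training-set scaling
                center = attr(model, "center"),
                scale  = attr(model, "scale"))
  predict(model, data)$class
}
```

Such a pair would then be supplied as method=fnScaledLda and pred=predScaledLda, so that the scaling is learned on the training folds only and applied unchanged to the testing fold.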

Value

A list with elements

Fold

a list, each giving the sample indices of a fold

True

a vector of characters, the ground-truth response variables, collected from each fold when it is used as testing data

Pred

a vector of characters, the prediction results, collected from each fold when it is used as testing data

Summ

a list, the output of the function predSummary: a confusion matrix (if classify=TRUE) computed by confusionMatrix, or the RMSE (if classify=FALSE) calculated over the predicted folds.

Author(s)

Shuxia Guo, Thomas Bocklitz, Juergen Popp

References

S. Guo, T. Bocklitz, et al., Common mistakes in cross-validating classification models. Analytical methods 2017, 9 (30): 4410-4417.

See Also

dataSplit

Examples

  data(DATA)
  ### perform batch-wise cross-validation using the function fnPcaLda
  RES3 <- crossValidation(data=DATA$spec
                          ,label=DATA$labels
                          ,batch=DATA$batch
                          ,method=fnPcaLda
                          ,pred=predPcaLda
                          ,folds=NULL 
                          ,nBatch=0
                          ,nFold=3
                          ,verbose=TRUE     
                          ,seed=NULL
                          
                          ### parameters to be passed to fnPcaLda
                          ,center=TRUE
                          ,scale=FALSE
   )


   ### perform a two-layer cross-validation using the function tunePcaLda,
   ### where the number of principal components used for LDA is optimized 
   ### (i.e., internal cross-validation).
   RES4 <- crossValidation(data=DATA$spec
                          ,label=DATA$labels				    
                          ,batch=DATA$batch	
                          ,method=tunePcaLda 
                          ,pred=predPcaLda     
                          ,folds=NULL      
                          ,nBatch=0			    
                          ,nFold=3					
                          ,verbose=TRUE     
                          ,seed=NULL
                          
                          ### parameters to be passed to tunePcaLda
                          ,nPC=2:4
                          ,cv=c('CV', 'BV')[2]
                          ,nPart=0
                          ,optMerit=c('Accuracy', 'Sensitivity')[2]
                          ,center=TRUE
                          ,scale=FALSE
  )
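The elements documented under Value can be inspected directly after a run; for instance, with RES3 from the first example:

```r
  ### inspect the collected cross-validation results
  str(RES3$Fold)                 # sample indices used in each fold
  table(RES3$True, RES3$Pred)    # confusion table built from the collected predictions
  RES3$Summ                      # summary produced by predSummary
```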

rModeling documentation built on March 26, 2020, 7:48 p.m.