tdmModCreateCVindex: Create and return a training-validation-set index vector.


View source: R/tdmModelingUtils.r


Depending on the value of member TST.kind in list opts, the returned index cvi is

  1. TST.kind="cv": a random cross-validation index P([111...222...333...]), where each value is a fold number - or -

  2. TST.kind="rand": a random index P([00...11...-1-1...]) marking training (0), validation (1) and disregard (-1) cases - or -

  3. TST.kind="col": the index is taken from column dset[,opts$TST.COL], which contains the division into training (0), validation (1) and disregard (-1) records. All records with a value <0 in column TST.COL are disregarded.

Here P(.) denotes random permutation of the sequence.
The disregard set is optional, i.e. cvi may contain only 0 and 1, if desired.
Special case TST.kind="cv" and TST.NFOLD=1: make *every* record a training record, i.e. index [000...].
In case TST.kind="rand" and stratified=TRUE, a stratified sample is drawn: the strata in the training set reflect the relative frequency of each level of the **1st** response variable, and each stratum is guaranteed to contain at least one record.
In summary, TST.kind="cv" means cross validation (TST.NFOLD models are built on TST.NFOLD different training-validation data sets), while TST.kind="rand" or "col" means that a single model is built with a random ("rand") or user-defined ("col") training-validation split.
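
The "cv" and "rand" index layouts described above can be sketched in base R as follows. This is only an illustration of the idea, not the TDMR implementation: the helper names make_cv_index and make_rand_index are hypothetical, and the way fold sizes and fractions are rounded is an assumption.

```r
# Illustrative helpers (hypothetical names, not part of TDMR).

# "cv": fold numbers 1..nfold in random order; nfold = 1 yields all zeros.
make_cv_index <- function(n, nfold) {
  if (nfold == 1) return(rep(0, n))            # special case: every record is a training record
  folds <- rep(seq_len(nfold), length.out = n) # folds of (nearly) equal size
  folds[sample.int(n)]                         # P(.): random permutation
}

# "rand": 0 = training, 1 = validation, -1 = disregard, in random order.
# Assumption: record counts are obtained by rounding fraction * n.
make_rand_index <- function(n, vali_frac, trn_frac = 1 - vali_frac) {
  n_vali <- round(vali_frac * n)
  n_trn  <- min(round(trn_frac * n), n - n_vali)
  idx <- c(rep(0, n_trn), rep(1, n_vali), rep(-1, n - n_trn - n_vali))
  idx[sample.int(n)]                           # P(.): random permutation
}

set.seed(1)
cvi_cv   <- make_cv_index(10, nfold = 3)
cvi_rand <- make_rand_index(10, vali_frac = 0.2, trn_frac = 0.5)
print(table(cvi_cv))    # how many records fall into each fold
print(table(cvi_rand))  # how many records are -1, 0 and 1
```

If the training and validation fractions sum to less than 1, the remaining records end up in the disregard set (-1) in this sketch.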


tdmModCreateCVindex(dset, response.variables, opts, stratified = FALSE)



dset: the data frame for which cvi is needed


response.variables: the response variable(s). A warning is issued if length(response.variables)>1; the first response variable is used for determining strata size.


opts: a list from which the following entries are needed here

  • TST.kind: ["cv"|"rand"|"col"]

  • TST.NFOLD: number of CV folds (only relevant in case TST.kind=="cv")

  • TST.COL: column of dset containing the (0/1/<0) index (only relevant in case TST.kind=="col") or NULL if no such column exists

  • TST.valiFrac: fraction of records to set aside for validation (only relevant in case TST.kind=="rand")

  • TST.trnFrac: [1-opts$TST.valiFrac] fraction of records to use for training (only relevant in case TST.kind=="rand")
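
As a rough illustration of the entries listed above, an opts list for a random split might be assembled as shown here. The values are made up, and a real opts list will likely carry further entries that this page does not describe.

```r
# Hypothetical settings; only the entries documented above are shown.
opts <- list(
  TST.kind     = "rand",  # one of "cv", "rand", "col"
  TST.NFOLD    = 3,       # only relevant for TST.kind == "cv"
  TST.COL      = NULL,    # no user-defined index column in dset
  TST.valiFrac = 0.2      # 20% of the records set aside for validation
)
# Documented default: the training fraction is the complement of the validation fraction.
opts$TST.trnFrac <- 1 - opts$TST.valiFrac

# Basic sanity checks before passing opts on.
stopifnot(opts$TST.kind %in% c("cv", "rand", "col"))
stopifnot(opts$TST.valiFrac + opts$TST.trnFrac <= 1)
```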


stratified: [FALSE] if TRUE, do stratified sampling for TST.kind="rand" with at least one training record for each level of the response variable (classification)
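
The idea of stratified sampling with at least one training record per level can be sketched in base R as follows. The function stratified_rand_index is a hypothetical name, and the per-level rounding is an assumption; the actual TDMR code may differ in detail.

```r
# Illustrative sketch (hypothetical helper, not part of TDMR).
# y: response values of the first response variable; vali_frac: validation fraction.
stratified_rand_index <- function(y, vali_frac) {
  y   <- as.factor(y)
  cvi <- rep(1, length(y))                 # start with every record in the validation set
  for (lev in levels(y)) {
    members <- which(y == lev)
    if (length(members) == 0) next         # skip unused factor levels
    # Training share per level, but never fewer than one training record.
    n_trn <- max(1, round((1 - vali_frac) * length(members)))
    trn   <- members[sample.int(length(members), n_trn)]
    cvi[trn] <- 0                          # mark the sampled records as training
  }
  cvi
}

set.seed(42)
y   <- c(rep("a", 20), rep("b", 5), "c")
cvi <- stratified_rand_index(y, vali_frac = 0.8)
print(table(y, cvi))  # level "c" keeps its single record for training
```

Without the max(1, ...) guard, a rare level such as "c" could end up with no training record at all, which is the situation the stratified option is meant to prevent.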


cvi: training-validation-set (0/>0) index vector. All records with cvi<0 (e.g. from column TST.COL) are disregarded.
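
A returned index of this kind could be used to split a data frame roughly as sketched here. The helper split_by_cvi is hypothetical. For the cross-validation case it assumes that fold k serves as the validation set and all other non-negative folds as the training set; that convention is an assumption, not something stated on this page.

```r
# Illustrative sketch (hypothetical helper, not part of TDMR).
split_by_cvi <- function(dset, cvi, fold = NULL) {
  if (is.null(fold)) {
    # "rand"/"col" style index: 0 = training, >0 = validation, <0 = disregarded
    list(train = dset[cvi == 0, , drop = FALSE],
         vali  = dset[cvi > 0, , drop = FALSE])
  } else {
    # "cv" style index (assumed convention): fold k validates, the rest trains
    list(train = dset[cvi >= 0 & cvi != fold, , drop = FALSE],
         vali  = dset[cvi == fold, , drop = FALSE])
  }
}

dset <- data.frame(x = 1:6, y = c("a", "b", "a", "b", "a", "b"))

parts <- split_by_cvi(dset, cvi = c(0, 1, 0, -1, 1, 0))
print(nrow(parts$train))  # records with cvi == 0
print(nrow(parts$vali))   # records with cvi > 0

cv_parts <- split_by_cvi(dset, cvi = c(1, 2, 3, 1, 2, 3), fold = 2)
```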


Currently, stratified sampling in case TST.kind="rand" works correctly only for one response variable. If there is more than one, the right fraction of validation records is taken, but the strata are drawn w.r.t. the first response variable. (For multiple response variables we would have to return a list of cvi's or call tdmModCreateCVindex anew for each response variable.)

TDMR documentation built on March 3, 2020, 1:06 a.m.