View source: R/tdmOptsDefaults.r

Description
Set up and return a list opts with default settings. The list opts contains all DM-related settings which are needed by main_<TASK>. For better readability, most elements of opts are arranged in groups:
dir.*  | path-related settings
READ.* | data-reading-related settings
TST.*  | resampling-related settings (training, validation and test set, CV)
PRE.*  | preprocessing parameters
SRF.*  | several parameters for tdmModSortedRFimport
MOD.*  | general settings for models and model building
RF.*   | several parameters for model RF (Random Forest)
SVM.*  | several parameters for model SVM (Support Vector Machines)
ADA.*  | several parameters for model ADA (AdaBoost)
CLS.*  | classification-related settings
GD.*   | settings for the graphic devices
Usage

tdmOptsDefaultsSet(opts = NULL, path = ".")
Arguments

opts | (optional) the options already set
path | ["."] where to find everything for the DM task
Details

The path-related settings are relative to opts$path, if it is defined, else relative to the current directory. Finally, the function tdmOptsDefaultsFill(opts) is called to fill in further details, depending on the current settings of opts.
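A minimal usage sketch (assumptions: the TDMR package is loaded; the task directory "./myTask" and the data file "myData.csv" are hypothetical; the overridden options are documented in the Value list below):

    library(TDMR)

    ## build the default option list for a task located in "./myTask" (hypothetical path)
    opts <- tdmOptsDefaultsSet(path = "./myTask")

    ## override a few of the documented defaults before running main_<TASK>
    opts$filename  <- "myData.csv"   # hypothetical task data file (default: "default.txt")
    opts$READ.NROW <- 1000           # read only the first 1000 rows (-1 = read all rows)
    opts$TST.kind  <- "cv"           # cross-validation instead of the default "rand"
    opts$TST.NFOLD <- 5              # number of CV folds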
Value

A list opts, with defaults set for all options relevant for a DM task, containing the following elements:
path | ["."] where to find everything for the DM task
dir.txt | [data] where to find .txt/.csv files
dir.data | [data] where to find other data files, including .Rdata
dir.output | [Output] where to put output files
filename | ["default.txt"] the task data
filetest | [NULL] the test data, only relevant for READ.TstFn!=NULL
data.title | ["Default Data"] title for plots
READ.TXT | [T] =T: read data from .csv and save as .Rdata, =F: read from .Rdata
READ.NROW | [-1] read this number of rows, or -1 for 'read all rows'
READ.TrnFn | function to be passed into
READ.TstFn | [NULL] function to be passed into
READ.INI | [TRUE] read the task data initially, i.e. prior to tuning, using
TST.kind | ["rand"] one of the choices from {"cv","rand","col"}, see
TST.COL | ["TST.COL"] name of the column with the train/test/disregard flag
TST.NFOLD | [3] number of CV folds (only for TST.kind=="cv")
TST.valiFrac | [0.1] set this fraction of the train-validation data aside for validation (only for TST.kind=="rand")
TST.testFrac | [0.1] set this fraction of the data aside for testing prior to tuning (if tdm$umode=="SP_T" and opts$READ.INI==TRUE), or set this fraction aside for testing after tuning (if tdm$umode=="RSUB" or =="CV")
TST.trnFrac | [NULL] train set fraction, if NULL then
TST.SEED | [NULL] a seed for the random test set selection (
PRE.PCA | ["none" (default) |"linear"] PCA preprocessing: [don't | do normal PCA (prcomp)]
PRE.PCA.REPLACE | [T] =T: replace the original numerical columns with the PCA columns, =F: add the PCA columns
PRE.PCA.npc | [0] if >0: add monomials of degree 2 from the first PRE.PCA.npc columns (PCs) (only active if opts$PRE.PCA!="none")
PRE.SFA | ["none" (default) |"2nd"] SFA preprocessing (see package
PRE.SFA.REPLACE | [F] =T: replace the original numerical columns with the SFA columns; =F: add the SFA columns
PRE.SFA.npc | [0] if >0: add monomials of degree 2 from the first PRE.SFA.npc columns (only active if opts$PRE.SFA!="none")
PRE.SFA.PPRANGE | [11] number of inputs after SFA preprocessing; only these inputs enter the SFA expansion
PRE.SFA.ODIM | [5] number of SFA output dimensions (slowest signals) to return
PRE.SFA.doPB | [T] =F|T: don't | do parametric bootstrap for SFA in case of marginal training data
PRE.SFA.fctPB | [sfaPBootstrap] the function to call in case of parametric bootstrap, see
PRE.allNonVali | [F] if =T, use all non-validation data in the training-validation set for PCA or SFA preprocessing; if =F, use only the training set for PCA or SFA preprocessing (only relevant if opts$PRE.PCA!="none" or opts$PRE.SFA!="none")
PRE.Xpgroup | [0.99] bind the fraction 1-PRE.Xpgroup in column OTHER (see
PRE.MaxLevel | [32] bind the N-32+1 least frequent cases in column OTHER (see
SRF.kind | ["xperc" (default) |"ndrop" |"nkeep" |"none"] the method used for feature selection, see
SRF.ndrop | [0] how many variables to drop (only relevant if SRF.kind=="ndrop")
SRF.nkeep | [NULL] how many variables to keep, NULL = "keep all" (only relevant if SRF.kind=="nkeep")
SRF.XPerc | [0.95] if >=0, keep that importance percentage, starting with the most important variables (if SRF.kind=="xperc")
SRF.calc | [T] =T: calculate importance & save it on SRF.file, =F: load it from srfFile (srfFile = Output/<confFile>.SRF.Rdata)
SRF.ntree | [50] number of RF trees
SRF.samp | sampsize for RF in importance estimation; see RF.samp for further info on sampsize
SRF.verbose | [2]
SRF.maxS | [40] how many variables to show in the plot
SRF.minlsi | [1] a lower bound for the length of SRF$input.variables
SRF.method | ["RFimp"]
SRF.scale | [TRUE] option 'scale' for the call to importance() in
MOD.SEED | [NULL] a seed for the random model initialization (if the model is non-deterministic). If NULL, use
MOD.method | ["RF" (default) |"MC.RF" |"SVM" |"NB"]: use [RF | MetaCost-RF | SVM | Naive Bayes] in
RF.ntree | [500]
RF.samp | [1000] sampsize for RF in model training. If RF.samp is a scalar, it specifies the total size of the sample. For classification, it can also be a vector of length n.class (= number of levels in the response variable); then it specifies the size of each stratum, and the sum of the vector is the total sample size. If NULL, RF.samp will be replaced by 3000 later in tdmModAdjustSampsize*.
RF.mtry | [NULL]
RF.nodesize | [1]
RF.OOB | [TRUE] if =T, return the OOB training-set error as tuning measure; if =F, return the validation-set error
RF.p.all | [FALSE]
SVM.kernel | [3] =1: linear, =2: polynomial, =3: RBF, =4: sigmoid
SVM.epsilon | [0.005] needed only for regression
SVM.gamma | [0.005]
SVM.coef0 | [0.0] (needed only for opts$SVM.kernel=="polynomial" or =="sigmoid")
SVM.degree | [3] (needed only for opts$SVM.kernel=="polynomial")
SVM.tolerance | [0.008]
ADA.coeflearn | [1] =1: "Breiman", =2: "Freund", =3: "Zhu" as value for boosting(...,coeflearn,...) (AdaBoost)
ADA.mfinal | [10] number of trees in AdaBoost (= argument mfinal of boosting(...,mfinal,...))
ADA.rpart.minsplit | [20] minimum number of observations in a node for a split to be attempted
CLS.cutoff | [NULL] vote fractions for the classes (vector of length n.class = number of levels in the response variable). The class i with the maximum ratio (% votes)/CLS.cutoff[i] wins. If NULL, each class gets the cutoff 1/n.class (i.e. the majority vote wins). The smaller CLS.cutoff[i], the more likely class i is to win. (See the sketch after this list.)
CLS.CLASSWT | [NULL] class weights for the n.class classes, e.g.
CLS.gainmat | [NULL] (n.class x n.class) gain matrix. If NULL, CLS.gainmat will be set to the unit matrix in
rgain.type | ["rgain" (default) |"meanCA" |"minCA"] in case of
ncopies | [0] if >0, activate
fct.postproc | [NULL] name of a function with signature
GD.DEVICE | ["win"] ="win": all graphics to (several) windows (
GD.RESTART | [T] =T: restart the graphics device (i.e. close all 'old' windows or re-open the multi-page pdf) in each call to
GD.CLOSE | [T] =T: close graphics devices "png", "pdf" at the end of main_*.r (suitable for main_*.r solo) or
NRUN | [2] how many runs with different train & test samples - or - how many CV runs, if
APPLY_TIME | [FALSE]
test2.show | [FALSE]
test2.string | ["default cutoff"]
VERBOSE | [2] =2: print much output, =1: less, =0: none
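To make the classification-related sampling and voting options more concrete, here is a small sketch; the 3-class setup and all numeric values are illustrative assumptions, not defaults:

    ## hypothetical 3-class task (n.class = 3); values are illustrative only
    opts$RF.samp    <- c(300, 300, 400)  # per-class strata sizes; total sample size = 1000
                                         # (a scalar would specify the total size directly)
    opts$CLS.cutoff <- c(0.5, 0.3, 0.2)  # class i wins if (% votes)/CLS.cutoff[i] is maximal;
                                         # smaller cutoff[i] favors class i; NULL = majority vote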
Note

The variables opts$PRE.PCA.numericV and opts$PRE.SFA.numericV (string vectors of the numeric input columns to be used for PCA or SFA) are not set by tdmOptsDefaultsSet or tdmOptsDefaultsFill. Either they are supplied by the user or, if NULL, TDMR will set them to input.variables in tdmClassifyLoop, assuming that all columns are numeric.
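A sketch of how such a vector might be supplied by the user; the column names "x1", "x2", "x3" and the choice of PCA are hypothetical:

    opts$PRE.PCA          <- "linear"             # switch on PCA preprocessing
    opts$PRE.PCA.numericV <- c("x1", "x2", "x3")  # hypothetical numeric input columns
    ## if left at NULL, TDMR sets it to input.variables in tdmClassifyLoop,
    ## assuming all columns are numeric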
Author(s)

Wolfgang Konen, THK, 2013 - 2018
See Also

tdmOptsDefaultsFill, tdmDefaultsFill