tdmOptsDefaultsSet: Default values for list 'opts'.

View source: R/tdmOptsDefaults.r

Description

Set up and return a list opts with default settings. The list opts contains all data mining (DM) settings needed by main_<TASK>.

For better readability, most elements of opts are arranged in groups:

dir.* path-related settings
READ.* data-reading-related settings
TST.* resampling-related settings (training, validation and test set, CV)
PRE.* preprocessing parameters
SRF.* several parameters for tdmModSortedRFimport
MOD.* general settings for models and model building
RF.* several parameters for model RF (Random Forest)
SVM.* several parameters for model SVM (Support Vector Machines)
ADA.* several parameters for model ADA (AdaBoost)
CLS.* classification-related settings
GD.* settings for the graphic devices

Usage

tdmOptsDefaultsSet(opts = NULL, path = ".")

Arguments

opts

(optional) the options already set

path

["."] where to find everything for the DM task.

Details

The path-related settings are relative to opts$path if it is defined, otherwise relative to the current directory.
Finally, the function tdmOptsDefaultsFill(opts) is called to fill in further details, depending on the current settings of opts.
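The following base-R sketch illustrates the general pattern described here: defaults are added to a (possibly partial) list, and path-related settings are resolved relative to opts$path. It is not the TDMR implementation. The helper name opts_defaults_sketch and the file name "sonar.txt" are hypothetical, only a handful of the options are shown, and the assumption that values already present in opts are left untouched is inferred from the description of the opts argument.

```r
# Hypothetical helper, NOT part of TDMR: shows "fill in only what is unset".
opts_defaults_sketch <- function(opts = NULL, path = ".") {
  if (is.null(opts)) opts <- list()
  # A small subset of the defaults listed under 'Value'.
  defaults <- list(
    path = path,
    dir.txt = "data",
    dir.output = "Output",
    filename = "default.txt",
    READ.NROW = -1,
    TST.kind = "rand",
    NRUN = 2
  )
  # Assumption: elements the caller already set are kept as they are.
  for (name in setdiff(names(defaults), names(opts))) {
    opts[[name]] <- defaults[[name]]
  }
  opts
}

# "sonar.txt" is a made-up file name used only for illustration.
opts <- opts_defaults_sketch(list(filename = "sonar.txt", NRUN = 5))

# Path-related settings are interpreted relative to opts$path.
txt_file <- file.path(opts$path, opts$dir.txt, opts$filename)
```

With the real package, the corresponding call would be tdmOptsDefaultsSet(opts, path) as shown under Usage.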

Value

a list opts, with defaults set for all options relevant for a DM task, containing the following elements

path

["."] where to find everything for the DM task

dir.txt

[data] where to find .txt/.csv files

dir.data

[data] where to find other data files, including .Rdata

dir.output

[Output] where to put output files

filename

["default.txt"] the task data

filetest

[NULL] the test data, only relevant for READ.TstFn!=NULL

data.title

["Default Data"] title for plots

READ.TXT

[T] =T: read data from .csv and save as .Rdata, =F: read from .Rdata

READ.NROW

[-1] number of rows to read, or -1 for 'read all rows'

READ.TrnFn

function to be passed into tdmReadDataset. Signature: function(opts) returning a data frame. It reads the train-validation data.

READ.TstFn

[NULL] function to be passed into tdmReadDataset. Signature: function(opts) returning a data frame. It reads a separate test data file. If NULL, this reading step is skipped.

READ.INI

[TRUE] read the task data initially, i.e. prior to tuning, using tdmReadDataset. If =FALSE, the data are read anew in each pass through main_TASK, i.e. in each tuning step (deprecated).

TST.kind

["rand"] one of the choices from {"cv","rand","col"}, see tdmModCreateCVindex for details

TST.COL

["TST.COL"] name of column with train/test/disregard-flag

TST.NFOLD

[3] number of CV-folds (only for TST.kind=="cv")

TST.valiFrac

[0.1] set this fraction of the train-validation data aside for validation (only for TST.kind=="rand")

TST.testFrac

[0.1] fraction of the data set aside for testing, either prior to tuning (if tdm$umode=="SP_T" and opts$READ.INI==TRUE) or after tuning (if tdm$umode=="RSUB" or =="CV")

TST.trnFrac

[NULL] train set fraction, if NULL then tdmModCreateCVindex will set it to 1 - opts$TST.valiFrac.

TST.SEED

[NULL] a seed for the random test set selection (tdmRandomSeed) and random validation set selection (tdmClassifyLoop). If NULL, use tdmRandomSeed.
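As a rough illustration of how TST.valiFrac, TST.trnFrac and TST.SEED interact for TST.kind=="rand", the base-R sketch below draws a random train/validation split. It is not the code of tdmModCreateCVindex. The function name split_sketch, the flag labels and the rounding rule are assumptions made for this example, and an unset seed simply leaves the RNG state alone here instead of calling tdmRandomSeed.

```r
# Hypothetical helper, NOT part of TDMR: random train/validation split.
split_sketch <- function(n, vali_frac = 0.1, trn_frac = NULL, seed = NULL) {
  # Mirrors the documented rule: a NULL training fraction becomes 1 - valiFrac.
  if (is.null(trn_frac)) trn_frac <- 1 - vali_frac
  # Assumption: a NULL seed leaves the random number generator untouched.
  if (!is.null(seed)) set.seed(seed)
  perm <- sample(n)                       # random permutation of 1..n
  n_vali <- round(vali_frac * n)
  n_trn <- min(round(trn_frac * n), n - n_vali)
  flag <- rep("unused", n)
  flag[perm[seq_len(n_vali)]] <- "vali"
  flag[perm[n_vali + seq_len(n_trn)]] <- "train"
  flag
}

flag <- split_sketch(100, vali_frac = 0.1, seed = 42)
table(flag)
```

Passing the same seed reproduces the same split, which is the purpose of a fixed TST.SEED.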

PRE.PCA

["none" (default)|"linear"] PCA preprocessing: [don't | do normal PCA (prcomp) ]

PRE.PCA.REPLACE

[T] =T: replace the original numerical columns with the PCA columns, =F: add the PCA columns

PRE.PCA.npc

[0] if >0: add monomials of degree 2 from the first PRE.PCA.npc columns (PCs) (only active, if opts$PRE.PCA!="none")

PRE.SFA

["none" (default)|"2nd"] SFA preprocessing (see package rSFA-package): [don't | do normal SFA with 2nd degree expansion ]

PRE.SFA.REPLACE

[F] =T: replace the original numerical columns with the SFA columns; =F: add the SFA columns

PRE.SFA.npc

[0] if >0: add monomials of degree 2 from the first PRE.SFA.npc columns (only active, if opts$PRE.SFA!="none")

PRE.SFA.PPRANGE

[11] number of inputs after SFA preprocessing, only those inputs enter into SFA expansion

PRE.SFA.ODIM

[5] number of SFA output dimensions (slowest signals) to return

PRE.SFA.doPB

[T] =F|T: don't | do parametric bootstrap for SFA in case of marginal training data

PRE.SFA.fctPB

[sfaPBootstrap] the function to call in case of parametric bootstrap, see sfaPBootstrap in package rSFA-package for its interface description

PRE.allNonVali

[F] if =T, then use all non-validation data in the training-validation set for PCA or SFA preprocessing. If =F, use only the training set for PCA or SFA processing (only relevant if opts$PRE.PCA!="none" or opts$PRE.SFA!="none").

PRE.Xpgroup

[0.99] bind the fraction 1-PRE.Xpgroup in column OTHER (see tdmPreGroupLevels)

PRE.MaxLevel

[32] bind the N-32+1 least frequent cases in column OTHER (see tdmPreGroupLevels)
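The PRE.MaxLevel rule can be pictured with a small base-R sketch: if a factor has N levels and N exceeds the limit, the N-limit+1 least frequent levels are merged into a single level OTHER, so that exactly limit levels remain. This is only an illustration of the rule as stated, not the code of tdmPreGroupLevels. The name group_levels_sketch is hypothetical, and ties between equally frequent levels are broken arbitrarily.

```r
# Hypothetical helper, NOT part of TDMR: merge rare levels into "OTHER".
group_levels_sketch <- function(x, max_level = 32) {
  x <- as.character(x)
  freq <- sort(table(x), decreasing = TRUE)   # most frequent level first
  if (length(freq) > max_level) {
    # Keep the (max_level - 1) most frequent levels, bind the rest.
    keep <- names(freq)[seq_len(max_level - 1)]
    x[!(x %in% keep)] <- "OTHER"
  }
  factor(x)
}

# Four levels with a limit of 3: the 4-3+1 = 2 rarest levels are merged.
res <- group_levels_sketch(c("a", "a", "a", "b", "b", "c", "d"), max_level = 3)
table(res)
```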

SRF.kind

["xperc" (default) |"ndrop" |"nkeep" |"none" ] the method used for feature selection, see tdmModSortedRFimport

SRF.ndrop

[0] how many variables to drop (only relevant if SRF.kind=="ndrop")

SRF.nkeep

[NULL] how many variables to keep, NULL="keep all" (only relevant if SRF.kind=="nkeep")

SRF.XPerc

[0.95] if >=0, keep that importance percentage, starting with the most important variables (if SRF.kind=="xperc")

SRF.calc

[T] =T: calculate importance & save on SRF.file, =F: load from srfFile (srfFile = Output/<confFile>.SRF.Rdata)

SRF.ntree

[50] number of RF trees

SRF.samp

sampsize for RF in importance estimation. See RF.samp for further info on sampsize.

SRF.verbose

[2]

SRF.maxS

[40] how many variables to show in plot

SRF.minlsi

[1] a lower bound for the length of SRF$input.variables

SRF.method

["RFimp"]

SRF.scale

[TRUE] option 'scale' for call importance() in tdmModSortedRFimport
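For SRF.kind=="xperc", the idea behind SRF.XPerc can be sketched as follows: sort the variables by importance and keep the most important ones until their cumulative share of the total importance reaches the threshold. This is a sketch under assumptions, not the logic of tdmModSortedRFimport. The name select_xperc_sketch and the importance values are invented, and whether the variable that crosses the threshold is itself kept is a choice made for this example.

```r
# Hypothetical helper, NOT part of TDMR: keep a given share of importance.
select_xperc_sketch <- function(importance, xperc = 0.95) {
  imp <- sort(importance, decreasing = TRUE)
  cum_frac <- cumsum(imp) / sum(imp)       # cumulative importance share
  # Assumption: the variable that reaches the threshold is included.
  n_keep <- which(cum_frac >= xperc)[1]
  names(imp)[seq_len(n_keep)]
}

# Made-up importance values for four variables.
imp <- c(x1 = 30, x2 = 60, x3 = 2, x4 = 8)
kept <- select_xperc_sketch(imp, xperc = 0.95)
kept
```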

MOD.SEED

[NULL] a seed for the random model initialization (if model is non-deterministic). If NULL, use tdmRandomSeed.

MOD.method

["RF" (default) |"MC.RF" |"SVM" |"NB" ]: use [RF | MetaCost-RF | SVM | Naive Bayes ] in tdmClassify
["RF" (default) |"SVM" |"LM" ]: use [RF | SVM | linear model ] in tdmRegress

RF.ntree

[500]

RF.samp

[1000] sampsize for RF in model training. If RF.samp is a scalar, then it specifies the total size of the sample. For classification, it can also be a vector of length n.class (= # of levels in response variable), then it specifies the size of each stratum. The sum of the vector is the total sample size. If NULL, RF.samp will be replaced by 3000 later in tdmModAdjustSampsize*.

RF.mtry

[NULL]

RF.nodesize

[1]

RF.OOB

[TRUE] if =T, return OOB-training set error as tuning measure; if =F, return validation set error

RF.p.all

[FALSE]

SVM.kernel

[3] =1: linear, =2: polynomial, =3: RBF, =4: sigmoid

SVM.epsilon

[0.005] needed only for regression

SVM.gamma

[0.005]

SVM.coef0

[0.0] (needed only for opts$SVM.kernel=="polynomial" or =="sigmoid")

SVM.degree

[3] (needed only for opts$SVM.kernel=="polynomial")

SVM.tolerance

[0.008]

ADA.coeflearn

[1] =1: "Breiman", =2: "Freund", =3: "Zhu" as value for boosting(...,coeflearn,...) (AdaBoost)

ADA.mfinal

[10] number of trees in AdaBoost = mfinal boosting(...,mfinal,...)

ADA.rpart.minsplit

[20] minimum number of observations in a node in order for a split to be attempted

CLS.cutoff

[NULL] vote fractions for the classes (vector of length n.class = # of levels in response variable). The class i with maximum ratio (% votes)/CLS.cutoff[i] wins. If NULL, then each class gets the cutoff 1/n.class (i.e. majority vote wins). The smaller CLS.cutoff[i], the more likely class i will win.

CLS.CLASSWT

[NULL] class weights for the n.class classes, e.g.
c(A=10,B=20) for a 2-class problem with classes A and B
(the higher, the more costly is a misclassification of that real class). It should be a named vector with the same length and names as the levels of the response variable. If no names are given, the levels of the response variables in lexicographical order will be attached in tdmClassify. CLS.CLASSWT=NULL for no weights.

CLS.gainmat

[NULL] (n.class x n.class) gain matrix. If NULL, CLS.gainmat will be set to unit matrix in tdmClassify
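The CLS.cutoff rule stated above (the class with the maximum ratio of vote fraction to cutoff wins) can be checked with a few lines of base R. The sketch below is an illustration only, not the prediction code used by tdmClassify. The function name cutoff_winner_sketch and the vote fractions are made up.

```r
# Hypothetical helper, NOT part of TDMR: apply per-class cutoffs to votes.
cutoff_winner_sketch <- function(votes, cutoff = NULL) {
  n_class <- length(votes)
  # Documented default: each class gets cutoff 1/n.class (majority vote).
  if (is.null(cutoff)) cutoff <- rep(1 / n_class, n_class)
  ratio <- votes / cutoff
  names(votes)[which.max(ratio)]
}

votes <- c(A = 0.6, B = 0.4)             # made-up vote fractions

# Default cutoff: plain majority vote, class A wins.
winner_default <- cutoff_winner_sketch(votes)

# Lowering the cutoff of B makes B more likely to win:
# ratios are 0.6/0.7 for A and 0.4/0.3 for B.
winner_shifted <- cutoff_winner_sketch(votes, cutoff = c(A = 0.7, B = 0.3))
```

This matches the remark that a smaller CLS.cutoff[i] makes class i more likely to win.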

rgain.type

["rgain" (default) |"meanCA" |"minCA" ] in case of tdmClassify: For classification, the measure Rgain returned from tdmClassifyLoop in result$R_* is [relative gain (i.e. gain/gainmax) | mean class accuracy | minimum class accuracy | minus Y ]. The goal is to maximize Rgain.
For binary classification there are the additional measures [ "arROC" | "arLIFT" | "arPRE" | "bYouden" ], see 'Value' in tdmModConfmat.
For regression, the goal is to minimize result$R_* returned from tdmRegress. In this case, possible values are rgain.type = ["rmae" (default) |"rmse" | "made" ] which stand for [ relative mean absolute error | root mean squared error | mean absolute deviation ].

ncopies

[0] if >0, activate tdmParaBootstrap in tdmClassify

fct.postproc

[NULL] name of a function with signature (pred, dframe, opts) where pred is the prediction of the model on the data frame dframe and opts is this list. This function may do some postprocessing on pred and it returns a (potentially modified) pred. This function will be called in tdmClassify if it is not NULL.

GD.DEVICE

["win"] ="win": all graphics to (several) windows (windows or X11 in package grDevices)
="rstudio": same as "win", but all graphics go to the RStudio device
="pdf": all graphics to one multi-page PDF
="png": all graphics in separate PNG files in opts$GD.PNGDIR
="non": no graphics at all
This concerns the TDMR graphics, not the SPOT (or other tuner) graphics. If running R from RStudio (if there is a device with name "RStudioGD") then the default "win" is changed to "rstudio" automatically.

GD.RESTART

[T] =T: restart the graphics device (i.e. close all 'old' windows or re-open multi-page pdf) in each call to tdmClassify or tdmRegress, resp.
=F: leave all windows open (suitable for calls from SPOT) or write more pages in same pdf.

GD.CLOSE

[T] =T: close graphics device "png", "pdf" at the end of main_*.r (suitable for main_*.r solo) or
=F: do not close (suitable for call from tdmStartSpot2, where all windows should remain open)

NRUN

[2] how many runs with different train & test samples - or - how many CV-runs, if opts$TST.kind=="cv"

APPLY_TIME

[FALSE]

test2.show

[FALSE]

test2.string

["default cutoff"]

VERBOSE

[2] =2: print much output, =1: less, =0: none

Note

The variables opts$PRE.PCA.numericV and opts$PRE.SFA.numericV (string vectors of numeric input columns to be used for PCA or SFA) are not set by tdmOptsDefaultsSet or tdmOptsDefaultsFill. Either they are supplied by the user or, if NULL, TDMR will set them to input.variables in tdmClassifyLoop, assuming that all columns are numeric.

Author(s)

Wolfgang Konen, THK, 2013 - 2018

See Also

tdmOptsDefaultsFill tdmDefaultsFill


TDMR documentation built on March 3, 2020, 1:06 a.m.