tdmClassify: Core classification function of TDMR.
In TDMR: Tuned Data Mining in R


tdmClassify is called by tdmClassifyLoop and returns an object of class tdmClass.
It trains a model on training set d_train and evaluates it on test set d_test. If this function is used for tuning, the test set d_test plays the role of a validation set.

tdmClassify(
  d_train,
  d_test,
  d_dis,
  d_preproc,
  response.variables,
  input.variables,
  opts,
  tsetStr = c("Validation", "validation")
)

`d_train`	training set
`d_test`	validation set, same columns as training set
`d_dis`	'disregard set', i.e. everything that is neither train nor test. The model is applied to all records in d_dis (needed for active learning, see ssl_methods.r)
`d_preproc`	data used for preprocessing. May be NULL, if no preprocessing is done (opts$PRE.SFA=="none" and opts$PRE.PCA=="none"). If preprocessing is done, then d_preproc is usually all non-validation data.
`response.variables`	name of column which carries the target variable - or - vector of names specifying multiple target columns (these columns are not used during prediction, only for evaluation)
`input.variables`	vector with names of input columns
`opts`	additional parameters [defaults in brackets]:
  `SRF.*`	several parameters for `tdmModSortedRFimport`
  `RF.*`	several parameters for RF (Random Forest; defaults are set if omitted)
  `SVM.*`	several parameters for SVM (Support Vector Machines; defaults are set if omitted)
  `filename`
  `data.title`
  `MOD.method`	["RF"] the main training method ["RF"\|"MC.RF"\|"SVM"\|"NB"]: use [Random forest\| MetaCost-RF\| SVM\| Naive Bayes] for the main model
  `MOD.SEED`	=NULL: get a new random number seed with `tdmRandomSeed` (different RF trainings). =any value: set the random number seed to this value (+i) to get reproducible random numbers. In this way, the model training part (RF, NNET, ...) always gets a fixed seed (see also TST.SEED in `tdmClassifyLoop`)
  `CLASSWT`	class weights (NULL if all classes should have the same weight); currently used only by methods RF, MC.RF and by `tdmModSortedRFimport`
  `fct.postproc`	[NULL] name of a user-defined function for postprocessing of predicted output
  `GD.DEVICE`	if !="non", then make a pairs-plot of the 5 most important variables and make a true-false bar plot
  `VERBOSE`	[2] =2: most printed output, =1: less, =0: no output
`tsetStr`	[c("Validation", "validation")]
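
The elements of `opts` are ordinary list entries and can be overridden with `$` before calling tdmClassify. A minimal sketch, assuming the TDMR package is installed and that `tdmOptsDefaultsSet()` (used in the example at the end of this page) returns a list carrying these fields; the chosen values are illustrative only, not recommendations:

```r
library(TDMR)                      # assumption: the TDMR package is installed

opts <- tdmOptsDefaultsSet()       # start from the package defaults
opts$MOD.method <- "RF"            # main model: one of "RF", "MC.RF", "SVM", "NB"
opts$MOD.SEED   <- 42              # fixed seed for reproducible training (arbitrary value)
opts$VERBOSE    <- 0               # 0: no printed output
```

Note that assigning NULL to a list element in R removes that element, so options documented with a NULL default are best left untouched rather than set explicitly.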

Currently d_dis is allowed to be a 0-row data frame, but d_train and d_test must have at least one record.
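
A 0-row data frame that still carries all columns of the data can be built by indexing with an empty vector, as the example at the end of this page does. A minimal base-R sketch, with iris as a stand-in data set:

```r
data(iris)

# Index with an empty numeric vector: all columns are kept, no rows are selected
d_dis <- iris[numeric(0), ]

nrow(d_dis)                              # number of records in the disregard set
identical(names(d_dis), names(iris))     # column names are unchanged
```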

res, an object of class tdmClass. This is a list containing:

`d_train`	training set + predicted class column(s)
`d_test`	test set + predicted class column(s)
`d_dis`	disregard set + predicted class column(s)
`avgEVAL`	list with evaluation measures, averaged over all response variables
`allEVAL`	data frame with evaluation measures, one row for each response variable
`lastCmTrain`	a list with evaluation info for training set (confusion matrix, gain, class errors, ...)
`lastCmVali`	a list with evaluation info for validation set (confusion matrix, gain, class errors, ...)
`lastModel`	the last model built (i.e. for the last response variable)
`lastProbs`	a list with three probability matrices (row: records, col: classes) v_train, v_test, v_dis, if the model provides probabilities; NULL otherwise.
`lastPred`	name of the column where the prediction of the last model is appended to the datasets d_train, d_test and d_dis
`predProb`	a list with two data frames Trn and Val. They contain at least a column IND.dset (index of each train / validation record into data frame dset). If the model has probabilities, then they contain in addition a column for each response variable with the prediction probabilities.
`opts`	parameter list from input, some default values might have been added

The 9 evaluation measures in avgEVAL and allEVAL are cerr.* (misclassification error), gain.* (total gain) and rgain.* (relative gain, i.e. total gain divided by the maximum achievable gain in *), where * = [trn | tst | tst2] stands for [training set | test set | test set with special treatment]. The special treatment is either opts$test2.string = "no postproc" or opts$test2.string = "default cutoff".
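
To make the three kinds of measures concrete, here is a small base-R sketch that computes a misclassification error, a total gain and a relative gain from a confusion matrix. It is not TDMR's implementation: the labels are made up, the gain matrix is assumed to be the identity (gain 1 for each correctly classified record), and the maximum achievable gain is assumed to be the gain obtained when every record receives its best possible prediction.

```r
# Hypothetical true and predicted labels, for illustration only
truth <- factor(c("a", "a", "b", "b", "b", "c"), levels = c("a", "b", "c"))
pred  <- factor(c("a", "b", "b", "b", "c", "c"), levels = c("a", "b", "c"))

cm <- table(truth, pred)                  # confusion matrix (rows: true, cols: predicted)

# ASSUMPTION: identity gain matrix, i.e. gain 1 per correct record, 0 otherwise
gainmat <- diag(nlevels(truth))

cerr     <- 1 - sum(diag(cm)) / sum(cm)   # misclassification error
gain     <- sum(cm * gainmat)             # total gain
# ASSUMPTION: max. achievable gain = every record gets the best gain of its true class
max_gain <- sum(rowSums(cm) * apply(gainmat, 1, max))
rgain    <- gain / max_gain               # relative gain

print(c(cerr = cerr, gain = gain, rgain = rgain))
```

Under the identity gain matrix assumed here, the relative gain is simply 1 minus the misclassification error; with a non-trivial gain matrix the two measures differ.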

The five items lastCmTrain, lastCmVali, lastModel, lastProbs, lastPred are specific to the *last* model (the one built for the last response variable in the last run and last fold).

Wolfgang Konen, THK, 2013

print.tdmClass, tdmClassifyLoop, tdmRegressLoop

#*# This demo shows a simple data mining process (phase 1 of TDMR) for classification on
#*# dataset iris.
#*# The data mining process in tdmClassify calls randomForest as the prediction model.
#*# It is called once (opts$NRUN=1) with one random train-validation split.
#*# Therefore data frame res$allEVAL has one row.
#*#
opts=tdmOptsDefaultsSet()                       # set all defaults for data mining process
gdObj <- tdmGraAndLogInitialize(opts);          # init graphics and log file

data(iris)
response.variables="Species"                    # names, not data (!)
input.variables=setdiff(names(iris),"Species")
opts$NRUN=1

idx_train = sample(nrow(iris))[1:110]
d_train=iris[idx_train,]
d_vali=iris[-idx_train,]
d_dis=iris[numeric(0),]
res <- tdmClassify(d_train,d_vali,d_dis,NULL,response.variables,input.variables,opts)

cat("\n")
print(res$allEVAL)