tdmClassify: Core classification function of TDMR.


View source: R/tdmClassify.r

Description

tdmClassify is called by tdmClassifyLoop and returns an object of class tdmClass.
It trains a model on training set d_train and evaluates it on test set d_test. If this function is used for tuning, the test set d_test plays the role of a validation set.

Usage

tdmClassify(
  d_train,
  d_test,
  d_dis,
  d_preproc,
  response.variables,
  input.variables,
  opts,
  tsetStr = c("Validation", "validation")
)

Arguments

d_train

training set

d_test

validation set, same columns as training set

d_dis

'disregard set', i.e. everything that is neither train nor test. The model is applied to all records in d_dis (needed for active learning, see ssl_methods.r)

d_preproc

data used for preprocessing. May be NULL, if no preprocessing is done (opts$PRE.SFA=="none" and opts$PRE.PCA=="none"). If preprocessing is done, then d_preproc is usually all non-validation data.

response.variables

name of column which carries the target variable - or - vector of names specifying multiple target columns (these columns are not used during prediction, only for evaluation)

input.variables

vector with names of input columns

opts

additional parameters [defaults in brackets]

SRF.*

several parameters for tdmModSortedRFimport

RF.*

several parameters for RF (Random Forest, defaults are set, if omitted)

SVM.*

several parameters for SVM (Support Vector Machines, defaults are set, if omitted)

filename
data.title
MOD.method

["RF"] the main training method ["RF"|"MC.RF"|"SVM"|"NB"]: use [Random forest| MetaCost-RF| SVM| Naive Bayes] for the main model

MOD.SEED

=NULL: get a new random number seed with tdmRandomSeed (different RF trainings).
=any value: set the random number seed to this value (+i) to get reproducible random numbers. In this way, the model training part (RF, NNET, ...) always gets a fixed seed (see also TST.SEED in tdmClassifyLoop)

CLASSWT

class weights (NULL, if all classes should have the same weight) (currently used only by methods RF, MC.RF and by tdmModSortedRFimport)

fct.postproc

[NULL] name of user-defined function for postprocessing of predicted output

GD.DEVICE

if !="non", then make a pairs-plot of the 5 most important variables and make a true-false bar plot

VERBOSE

[2] =2: most printed output, =1: less, =0: no output

tsetStr

[c("Validation", "validation")]
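
The MOD.SEED behaviour described above (a fixed seed offset by a run index i) can be illustrated in base R. This is only a sketch of the idea, not TDMR's internal code: the helper names seed_for_run and draw are hypothetical, and the clock-based fallback merely stands in for tdmRandomSeed.

```r
# Hypothetical helper (not part of TDMR): derive the seed for run i.
seed_for_run <- function(mod_seed, i) {
  if (is.null(mod_seed)) {
    # stand-in for tdmRandomSeed(): a fresh seed taken from the clock
    as.integer(as.numeric(Sys.time()) %% 100000) + i
  } else {
    mod_seed + i
  }
}

# Hypothetical helper: draw three random numbers for run i
draw <- function(mod_seed, i) {
  set.seed(seed_for_run(mod_seed, i))
  runif(3)
}

same_run  <- identical(draw(42, 1), draw(42, 1))  # same seed and run: identical numbers
other_run <- identical(draw(42, 1), draw(42, 2))  # different run index: different numbers
```

With a fixed value, repeating the same run reproduces the same random numbers, while different runs still receive different seeds.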

Details

Currently d_dis is allowed to be a 0-row data frame, but d_train and d_test must have at least one record.
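
A 0-row data frame that still carries all columns can be obtained by indexing with an empty vector, as the example at the end of this page does. The sketch below uses base R only; the odd/even split and the row-count check are illustrative assumptions, not something tdmClassify performs in this form.

```r
data(iris)

# Zero rows, but the same columns (and factor levels) as the full data
d_dis <- iris[numeric(0), ]

# Illustrative split: odd rows for training, even rows for testing
d_train <- iris[seq(1, nrow(iris), by = 2), ]
d_test  <- iris[seq(2, nrow(iris), by = 2), ]

# Guard one might place before the call: both sets need at least one record
stopifnot(nrow(d_train) >= 1, nrow(d_test) >= 1)
```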

Value

res, an object of class tdmClass. This is a list containing

d_train

training set + predicted class column(s)

d_test

test set + predicted class column(s)

d_dis

disregard set + predicted class column(s)

avgEVAL

list with evaluation measures, averaged over all response variables

allEVAL

data frame with evaluation measures, one row for each response variable

lastCmTrain

a list with evaluation info for training set (confusion matrix, gain, class errors, ...)

lastCmVali

a list with evaluation info for validation set (confusion matrix, gain, class errors, ...)

lastModel

the last model built (i.e. for the last response variable)

lastProbs

a list with three probability matrices (row: records, col: classes) v_train, v_test, v_dis, if the model provides probabilities; NULL otherwise.

lastPred

name of the column where the prediction of the last model is appended to the datasets d_train, d_test and d_dis

predProb

a list with two data frames Trn and Val. They contain at least a column IND.dset (index of each train / validation record into data frame dset). If the model has probabilities, then they contain in addition a column for each response variable with the prediction probabilities.

opts

parameter list from input, some default values might have been added

The 9 evaluation measures in avgEVAL and allEVAL are cerr.* (misclassification error), gain.* (total gain) and rgain.* (relative gain, i.e. total gain divided by max. achievable gain in *) where * = [trn | tst | tst2 ] stands for [ training set | test set | test set with special treatment ] and the special treatment is either opts$test2.string = "no postproc" or = "default cutoff".

The five items lastCmTrain, lastCmVali, lastModel, lastProbs, lastPred are specific to the *last* model (the one built for the last response variable in the last run and last fold)
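
To make the three kinds of evaluation measures concrete, the following base R sketch computes a misclassification error, a total gain and a relative gain from a toy confusion matrix. The data are made up, and the gain matrix (gain 1 for each correct prediction, 0 otherwise) is an assumption for illustration; TDMR's own computation may differ in detail.

```r
# Toy true and predicted labels (made-up data for illustration)
truth <- factor(c("a", "a", "b", "b", "b", "c"), levels = c("a", "b", "c"))
pred  <- factor(c("a", "b", "b", "b", "c", "c"), levels = c("a", "b", "c"))

cm <- table(actual = truth, predicted = pred)   # confusion matrix

# Assumed gain matrix: 1 for a correct prediction, 0 otherwise
gainmat <- diag(nlevels(truth))

cerr <- 1 - sum(diag(cm)) / sum(cm)             # misclassification error
gain <- sum(cm * gainmat)                       # total gain

# Maximum achievable gain: every record receives the best gain of its true class
max_gain <- sum(rowSums(cm) * apply(gainmat, 1, max))
rgain <- gain / max_gain                        # relative gain
```

With this identity gain matrix the relative gain equals the accuracy, i.e. 1 - cerr; a gain matrix with unequal entries breaks that equivalence.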

Author(s)

Wolfgang Konen, THK, 2013

See Also

print.tdmClass tdmClassifyLoop tdmRegressLoop

Examples

#*# This demo shows a simple data mining process (phase 1 of TDMR) for classification on
#*# dataset iris.
#*# The data mining process in tdmClassify calls randomForest as the prediction model.
#*# It is called opts$NRUN=1 time with one random train-validation set split.
#*# Therefore data frame res$allEVAL has one row
#*#
opts=tdmOptsDefaultsSet()                       # set all defaults for data mining process
gdObj <- tdmGraAndLogInitialize(opts);          # init graphics and log file

data(iris)
response.variables="Species"                    # names, not data (!)
input.variables=setdiff(names(iris),"Species")
opts$NRUN=1

idx_train = sample(nrow(iris))[1:110]
d_train=iris[idx_train,]
d_vali=iris[-idx_train,]
d_dis=iris[numeric(0),]
res <- tdmClassify(d_train,d_vali,d_dis,NULL,response.variables,input.variables,opts)

cat("\n")
print(res$allEVAL)
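
The split in the example above is drawn at random, so the records in d_train and d_vali differ from run to run. A minimal sketch of a reproducible split in base R follows; the seed value 42 is an arbitrary assumption and not a TDMR default.

```r
data(iris)

set.seed(42)                            # arbitrary seed, chosen for illustration
idx_train <- sample(nrow(iris), 110)    # 110 distinct row indices
d_train <- iris[idx_train, ]            # training records
d_vali  <- iris[-idx_train, ]           # remaining 40 records for validation
```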

TDMR documentation built on March 3, 2020, 1:06 a.m.