wrapper: Method selection wrapper


Description

The wrapper method is a feature selection strategy for data mining models. It can be used with a range of statistical methods, such as linear discriminant analysis.

Usage

wrapper(dataset, learner="svm", train=0.7, algorithm="forward", k.lim=2,
        target=NULL, features=NULL, plot.err=TRUE, ...)

Arguments

dataset

numeric dataset to learn from

learner

classifier used to learn the model. Default: svm

train

train/test split of the dataset: a proportion, a size or a vector of row indices (see Details). Default: 0.7

algorithm

wrapper algorithm used to select variables. Default: forward

k.lim

number of iterations to continue searching without improving the error rate. Default: 2

target

index/name of the target column of the dataset that we want to learn. Default: the last column when the parameter is NULL

features

index/name of the feature columns of the dataset to learn from. Default: all columns except the target when the parameter is NULL

plot.err

boolean indicating whether to plot the error rate of all tested models. Default: TRUE (the plot is displayed)

...

additional parameters passed on to the classifier

Details

The backward, forward and genetic algorithms are implemented, and any of these approaches can be chosen. Forward performs an additive selection of features, from none to all of them. Backward starts with all features and eliminates those whose removal does not decrease the error rate of the model. The genetic algorithm is stochastic, less sensitive to local optima and better suited to large datasets.
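
Each approach is selected by name. A minimal sketch, assuming the diabete dataset prepared as in the Examples section; "forward" and "backward" appear in Usage and Examples, while the exact name of the genetic option ("genetic") is an assumption:

data("diabete")
diabete <- diabete[, -c(6,7)]            # drop the factor columns, as in Examples
wrapper(diabete, algorithm="forward")    # additive selection (the default)
wrapper(diabete, algorithm="backward")   # start from all features and eliminate
wrapper(diabete, algorithm="genetic")    # stochastic search (option name assumed)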

The following learning methods can be used: logistic regression (logit), linear discriminant analysis (lda), quadratic discriminant analysis (qda), linear regression (lm), support vector machine (svm), naive Bayes classification (naiveBayes) and random forest (randomForest). Logistic regression is performed with glm and the family="binomial" option. Furthermore, you can supply your own functions (see Examples).
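
Switching learners only changes the learner argument. A sketch on the same prepared data, using the learner names listed above:

wrapper(diabete, learner="logit")         # logistic regression via glm, family="binomial"
wrapper(diabete, learner="naiveBayes")    # naive Bayes classification
wrapper(diabete, learner="randomForest")  # random forest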

The train parameter can be the proportion of the dataset used for training (train < 1), the size of the training set (train > 1) or a vector of the row indices used for training (length(train) > 1).
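
The three forms, sketched on the same data (the sizes are illustrative):

wrapper(diabete, train=0.8)                          # proportion: 80% of the rows
wrapper(diabete, train=500)                          # size: 500 training rows
wrapper(diabete, train=sample(nrow(diabete), 500))   # explicit row indices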

As forward and backward are sensitive to local minima, they can keep searching for k.lim iterations after reaching a local minimum without finding a better model. The higher the value of k.lim, the more flexible these methods become; on the other hand, they test more models and the execution time can increase greatly. This parameter is unused by the genetic algorithm.
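
For example, a more tolerant forward search that keeps exploring five iterations past a local minimum (the value 5 is illustrative):

wrapper(diabete, algorithm="forward", k.lim=5)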

... passes named parameters to the learning function. This is useful for naive Bayes, random forest and svm. Unfortunately, this parameter cannot be used with the genetic algorithm.
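
For example, named arguments are forwarded to the classifier; kernel and cost below are standard e1071::svm parameters:

wrapper(diabete, learner="svm", kernel="linear", cost=10)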

The best model is selected by minimising the error rate (except for lm, which uses the BIC).

Value

wrapper returns an S3 object (~= list) built around the called learner (logit, lda, qda, lm, svm, naiveBayes or randomForest), containing the history of tested models, the history of error rates, the best formula and some parameters such as which learner and which algorithm were used. Its components are listed below; a short access sketch follows the list.

algorithm

name of the wrapper algorithm used

classifier

name of the learning method

formula

formula from which the model can be rebuilt (object of type "language", obtained with as.formula)

best.feats

list of the best features

hist.mods

history of the feature subsets tested (as a list)

hist.errors

vector of the error rates of the tested models

error.rate

error rate of the best model

train.sample

vector of the row indices used for training

para.sup

list of additional parameters
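
Since the result behaves like a list, its components can be read directly. A minimal sketch using the components documented above:

wrapped <- wrapper(diabete)
wrapped$classifier                     # name of the learning method
wrapped$best.feats                     # best feature subset
wrapped$error.rate                     # error rate of the best model
plot(wrapped$hist.errors, type="l")    # error rates of every tested model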

Author(s)

Implementation by Pierre Lejeail, Maite Garcia & Jean-Luc Yongwe (2016), based on "Wrappers for performance enhancement and oblivious decision graphs" by Ron Kohavi (1995).

Examples

data("diabete")
head(diabete)
# We don't want the 6th and 7th variables as they are factors
diabete <- diabete[, -c(6,7)]

# By default, the svm method and the forward approach are applied
wrapper(diabete)

# We can manually select the features to test and the target
# For example, with lda and backward
wrapper(diabete, learner="lda", target=7, features=1:6, algorithm="backward")


# Using own function with cross-validation
evalFUN <- function(features, ...) {  # the parameters must be the same
  # every other variable you'll need to access must be defined outside
  k_folds <- 8
  folds_i <- sample(rep(1:k_folds, length.out = nrow(diabete)))
  # Evaluate one fold
  error_rate <- function(k) {
    test_i <- which(folds_i == k)
    model <- e1071::svm(as.formula(paste("Outcome~", unpack(features))),
                               data=diabete[-test_i, ])
    prediction <- predict(model, diabete[test_i, ])
    tt <- table(diabete[test_i, "Outcome"], prediction)
    return(sum(tt[row(tt) != col(tt)]) / sum(tt))  # return the error rate
    # any other evaluation measure can be returned;
    # if the measure needs to be maximized, multiply it by -1
  }
  # mean error over all train/test folds generated by cross-validation
  return(mean(sapply(1:k_folds, error_rate)))
}

wrapped <- wrapper(dataset = diabete,
                   learner = evalFUN,
                   algorithm = "forward",
                   features = 1:6,
                   target = "Outcome"  # we can also select variables with names
)
