Feature Selection

Data sets often include a large number of variables, and you may want to reduce them. Selecting a subset of relevant variables is called feature selection (or variable selection). By removing irrelevant variables, it can make the model easier to interpret, speed up the learning process, and help the fitted model generalize better. Different approaches exist to identify the relevant variables. mlr supports filters and wrappers.


Filters are the simplest approach to finding variables that contain little additional information and can therefore be left out. Several methods are available through mlr's function getFeatureFilterValues(), all of which use filter algorithms from the package FSelector. The function takes a task and returns a vector of importance values, one per feature.

library(mlr)
# Score every feature of the iris task with the information gain filter
task = makeClassifTask(data = iris, target = "Species")
importance = getFeatureFilterValues(task, method = "information.gain")
sort(importance, decreasing = TRUE)

According to this filter, Petal.Width and Petal.Length contain the most information. With mlr's function filterFeatures() you can now filter the task, keeping only a set number of features with the highest importance.

filtered.task = filterFeatures(task, method = "information.gain", n = 2)

Other filter options in filterFeatures() are to keep a given percentage of the features instead, or to set a threshold on the numerical importance values.
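To make the three selection rules concrete, here is a minimal base R sketch. It assumes a simple stand-in score, the one-way ANOVA F statistic of each feature against Species, rather than the information gain filter used above. The helper anova_score and the cutoff values are illustrative choices and not part of mlr.

```r
# Stand-in filter score (assumption): one-way ANOVA F statistic of a feature
# against the class label. Larger values mean the classes differ more.
anova_score = function(x, y) {
  anova(lm(x ~ y))[["F value"]][1]
}

features = iris[, setdiff(names(iris), "Species")]
scores = sapply(features, anova_score, y = iris$Species)
ranked = sort(scores, decreasing = TRUE)

# Rule 1: keep a fixed number of top-ranked features
top_n = names(ranked)[1:2]

# Rule 2: keep a percentage of the features (rounding up is a choice made here)
top_perc = names(ranked)[seq_len(ceiling(0.5 * length(ranked)))]

# Rule 3: keep every feature whose score exceeds a threshold (illustrative value)
above = names(ranked)[ranked > 100]
```

All three rules operate on the same ranked vector; they differ only in how the cutoff is expressed.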

In a proper experimental setup you might want to automate the selection of variables so that it becomes part of the validation method of your choice. We will use standard 10-fold cross-validation.

learner = makeLearner("classif.fnn")
learnerFiltered = makeFilterWrapper(learner = learner, fw.method = "information.gain", fw.percentage = 0.7)
rdesc = makeResampleDesc("CV", iters = 10)
rsres = resample(learner = learnerFiltered, task = task, resampling = rdesc, show.info = FALSE, models = TRUE)

Now you might want to know which features have been used. Luckily, we called resample() with the argument models = TRUE, which means that rsres$models contains a list of the models fitted in the individual folds. In this case the learner is also of class FilterWrapper, so we can call getFilteredFeatures() on each model.

sfeats = sapply(rsres$models, getFilteredFeatures)

The selection of features seems to be very stable. Sepal.Width was not selected in any fold.
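What happens inside the resampling can be sketched in base R: the filter is recomputed on the training part of every fold, and the selections are tabulated afterwards. This is a simplified illustration that assumes an ANOVA F score as the filter and a fixed count of three features per fold. It does not reproduce what makeFilterWrapper() does internally.

```r
set.seed(1)
features = setdiff(names(iris), "Species")

# Assign every row to one of 10 folds at random
folds = sample(rep(1:10, length.out = nrow(iris)))

# Stand-in filter score (assumption): one-way ANOVA F statistic
score = function(x, y) anova(lm(x ~ y))[["F value"]][1]

# For each fold, rank the features on the training rows only and keep the top 3
selected = sapply(1:10, function(k) {
  train = iris[folds != k, ]
  s = sapply(train[features], score, y = train$Species)
  names(sort(s, decreasing = TRUE))[1:3]
})

# How often was each feature selected across the 10 folds?
counts = table(factor(selected, levels = features))
counts
```

A feature with a count equal to the number of folds was selected every time; a count of zero means it was never selected.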

Tuning the threshold

data(Sonar, package = "mlbench")  # the Sonar data ships with the mlbench package
task = makeClassifTask(data = Sonar, target = "Class", positive = "M")
lrn = makeLearner("classif.rpart")
lrnFiltered = makeFilterWrapper(learner = lrn, fw.method = "chi.squared", fw.threshold = 0)
# Candidate thresholds for the filter values: 0.2, 0.25, ..., 0.4
ps = makeParamSet(makeDiscreteParam("fw.threshold", values = seq(from = 0.2, to = 0.4, by = 0.05)))
tuneRes = tuneParams(lrnFiltered, task = task, resampling = makeResampleDesc("CV", iters = 5),
  par.set = ps, control = makeTuneControlGrid())


Unlike filters, wrappers make use of the performance a learner can achieve on a given subset of the features in the data.

Quick start

Classification example

Let's train a decision tree on the iris data and use a sequential forward search to find the best group of features with respect to the mmce (mean misclassification error).

task = makeClassifTask(data = iris, target = "Species")
lrn = makeLearner("classif.rpart")
rdesc = makeResampleDesc("Holdout")

ctrlSeq = makeFeatSelControlSequential(method = "sfs")
sfSeq = selectFeatures(learner = lrn, task = task, resampling = rdesc, control = ctrlSeq)
analyzeFeatSelResult(sfSeq, reduce = FALSE)
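The idea behind a sequential forward search can be sketched in base R. The sketch below rests on several assumptions: a nearest-centroid classifier stands in for the decision tree, the holdout split is fixed, and the search stops as soon as no candidate feature lowers the error. The helper holdout_error is hypothetical, and mlr's own stopping rule may differ.

```r
set.seed(1)
idx = sample(nrow(iris), 100)   # holdout split: 100 training rows, 50 test rows
train = iris[idx, ]
test = iris[-idx, ]

# Holdout misclassification error of a nearest-centroid classifier
# that only sees the features named in `feats`
holdout_error = function(feats) {
  # Class means: one row per class, one column per feature
  centroids = sapply(feats, function(f) tapply(train[[f]], train$Species, mean))
  x = as.matrix(test[, feats, drop = FALSE])
  # Squared Euclidean distance from every test row to every class centroid
  d = sapply(rownames(centroids), function(cl) rowSums(sweep(x, 2, centroids[cl, ])^2))
  pred = colnames(d)[apply(d, 1, which.min)]
  mean(pred != as.character(test$Species))
}

candidates = setdiff(names(iris), "Species")
selected = character(0)
best_err = Inf

repeat {
  remaining = setdiff(candidates, selected)
  if (length(remaining) == 0) break
  # Try adding each remaining feature to the current set
  errs = sapply(remaining, function(f) holdout_error(c(selected, f)))
  # Stop when no single addition improves the holdout error
  if (min(errs) >= best_err) break
  selected = c(selected, names(errs)[which.min(errs)])
  best_err = min(errs)
}

selected
best_err
```

Each pass evaluates the learner once per remaining feature, so the search needs far fewer fits than trying every possible subset, at the cost of possibly missing the best combination.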

Regression example

We fit a simple linear regression model to the BostonHousing data set and use a genetic algorithm to find a feature set that reduces the mse (mean squared error).


data(BostonHousing, package = "mlbench")  # the data set ships with the mlbench package
task = makeRegrTask(data = BostonHousing, target = "medv")
lrn = makeLearner("regr.lm")
rdesc = makeResampleDesc("Holdout")

ctrlGA = makeFeatSelControlGA(maxit = 10)
sfGA = selectFeatures(learner = lrn, task = task, resampling = rdesc, control = ctrlGA, show.info = FALSE)
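As a rough illustration of how a genetic algorithm searches over feature subsets, here is a toy version in base R. It is not mlr's implementation: the population size, the uniform crossover, the 10% mutation rate and the rule of keeping the better half are all assumptions made for this sketch, and the built-in mtcars data (target mpg) stands in for BostonHousing so that no extra package is needed.

```r
set.seed(1)
target = "mpg"
features = setdiff(names(mtcars), target)
idx = sample(nrow(mtcars), 22)   # holdout split: 22 training rows, 10 test rows
train = mtcars[idx, ]
test = mtcars[-idx, ]

# Holdout mse of a linear model using only the features flagged TRUE in `bits`
fitness = function(bits) {
  if (!any(bits)) return(Inf)    # an empty feature set is not allowed
  fit = lm(reformulate(features[bits], response = target), data = train)
  mean((test[[target]] - predict(fit, newdata = test))^2)
}

pop_size = 10
n_feat = length(features)
# Each row is one individual: a logical vector marking the included features
pop = matrix(runif(pop_size * n_feat) < 0.5, nrow = pop_size)

for (gen in 1:10) {
  mse = apply(pop, 1, fitness)
  # Keep the better half of the population as parents
  parents = pop[order(mse)[1:(pop_size / 2)], , drop = FALSE]
  children = t(sapply(seq_len(pop_size / 2), function(i) {
    pair = parents[sample(nrow(parents), 2), , drop = FALSE]
    # Uniform crossover: take each bit from one of the two parents at random
    child = ifelse(runif(n_feat) < 0.5, pair[1, ], pair[2, ])
    # Mutation: flip each bit with probability 0.1
    xor(child, runif(n_feat) < 0.1)
  }))
  pop = rbind(parents, children)
}

mse = apply(pop, 1, fitness)
best = features[pop[which.min(mse), ]]
best_mse = min(mse)
```

Because the parents are carried over unchanged, the best holdout mse in the population cannot get worse from one generation to the next.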

