Often, data sets contain a large number of variables and you want to reduce them. The technique of selecting a subset of relevant variables is called variable selection. Variable selection can make the model more interpretable, the learning process faster, and the fitted model more general by removing irrelevant variables. Different approaches exist to figure out which variables are relevant. mlr supports filters and wrappers.
Filters are the simplest approach to finding variables that do not contain much additional information and thus can be left out. Different methods are built into mlr's function getFeatureFilterValues(), all accessing filter algorithms from the package FSelector. The function is given a task and simply returns an importance vector.
```r
library("mlr")
task = makeClassifTask(data = iris, target = "Species")
importance = getFeatureFilterValues(task, method = "information.gain")
sort(importance, decreasing = TRUE)
```
So according to this filter, Petal.Width and Petal.Length contain the most information.
With mlr's function filterFeatures() you can now filter the task, leaving out all but a set number of features with the highest feature importance.
```r
filtered.task = filterFeatures(task, method = "information.gain", n = 2)
```
Other options in filterFeatures() are to keep a percentage of the features instead, or to set a threshold on the numerical importance values.
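As a sketch, these two variants might look like the following; note that the argument names `perc` and `threshold` are assumptions based on mlr's filter interface and may differ between package versions:

```r
# Keep the top 25% of features by importance (argument name assumed).
filtered.perc = filterFeatures(task, method = "information.gain", perc = 0.25)

# Keep every feature whose importance exceeds 0.5 (argument name assumed).
filtered.thresh = filterFeatures(task, method = "information.gain", threshold = 0.5)
```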
In a proper experimental setup you might want to automate the variable selection so that it can be part of the validation method of your choice. We will use standard 10-fold cross-validation.
```r
learner = makeLearner("classif.fnn")
learnerFiltered = makeFilterWrapper(learner = learner,
  fw.method = "information.gain", fw.percentage = 0.7)
rdesc = makeResampleDesc("CV", iters = 10)
rsres = resample(learner = learnerFiltered, task = task, resampling = rdesc,
  show.info = FALSE, models = TRUE)
rsres$aggr
```
Now you might want to know which features have been used.
Luckily, we called resample with the argument models = TRUE, which means that rsres$models contains a list of the models fitted in each fold. In this case the learner is also of class FilterWrapper, so we can call getFilteredFeatures() on each model.
```r
sfeats = sapply(rsres$models, getFilteredFeatures)
table(sfeats)
```
The selection of features seems to be very stable; Sepal.Width did not make it into a single fold.
You can also tune the filter threshold as part of the learner. Here we wrap a decision tree with a chi-squared filter on the Sonar data and use grid search to tune fw.threshold.

```r
library("mlbench")
data(Sonar)
task = makeClassifTask(data = Sonar, target = "Class", positive = "M")
lrn = makeLearner("classif.rpart")
lrnFiltered = makeFilterWrapper(learner = lrn, fw.method = "chi.squared",
  fw.threshold = 0)
ps = makeParamSet(makeDiscreteParam("fw.threshold",
  values = seq(from = 0.2, to = 0.4, by = 0.05)))
tuneRes = tuneParams(lrnFiltered, task = task,
  resampling = makeResampleDesc("CV", iters = 5),
  par.set = ps, control = makeTuneControlGrid())
```
Unlike filters, wrappers make use of the performance a learner can achieve on a given subset of the features in the data.
Let's train a decision tree on the iris data and use a sequential forward search to find the best group of features w.r.t. the mmce (mean misclassification error).
```r
library("mlr")
task = makeClassifTask(data = iris, target = "Species")
lrn = makeLearner("classif.rpart")
rdesc = makeResampleDesc("Holdout")
ctrlSeq = makeFeatSelControlSequential(method = "sfs")
sfSeq = selectFeatures(learner = lrn, task = task, resampling = rdesc,
  control = ctrlSeq)
sfSeq
analyzeFeatSelResult(sfSeq, reduce = FALSE)
```
Next, we fit a simple linear regression model to the BostonHousing data set and use a genetic algorithm to find a feature set that reduces the mse (mean squared error).
```r
library("mlbench")
data(BostonHousing)
task = makeRegrTask(data = BostonHousing, target = "medv")
lrn = makeLearner("regr.lm")
rdesc = makeResampleDesc("Holdout")
ctrlGA = makeFeatSelControlGA(maxit = 10)
sfGA = selectFeatures(learner = lrn, task = task, resampling = rdesc,
  control = ctrlGA, show.info = FALSE)
sfGA
```