GOFAST = FALSE library("ggplot2") knitr::opts_knit$set(root.dir = "~/lmu/master/testcars") knitr::opts_chunk$set(message = FALSE, fig.align = "center") configureMlr(show.info = FALSE) options(width=80) myrepcv = function(..., folds, reps) { rdesc = makeResampleDesc("CV", iters = folds) args = list(...) args$resampling = rdesc vals = parallelMap(function(dummy) do.call(resample, args)$aggr, seq_len(reps), show.info = FALSE, simplify = TRUE) aggr = mean(vals) se = sd(vals) / sqrt(reps) cat(sprintf("MMCE Mean %.4f, SE %.4f\n", aggr, se)) } if (GOFAST) { cv10 = mlr::hout repcv = function(...) { args = list(...) args$reps = 3 args$folds = 2 do.call(myrepcv, args) } } else { cv10 = makeResampleDesc("RepCV", folds = 10, reps = 10) repcv = function(...) { args = list(...) args$reps = 50 args$folds = 10 do.call(myrepcv, args) } } library("parallelMap") parallelStartSocket(parallel::detectCores()) parallelLibrary("stringr") parallelLibrary("mlr") head = function(x, ...) { kableExtra::kable_styling(knitr::kable(utils::head(x), "html")) }
mlrCPO
When applying a machine learning algorithm (i.e. a Learner
in mlr
) to a dataset, it is often beneficial to do certain operations to the data (e.g. removal of constant columns, principal component analysis) in a step called "preprocessing". mlr
already provides some basic preprocessing functionality, but it is limited in scope and quite inflexible. A much better toolset is now provided by mlrCPO.
The central objects provided by mlrCPO are "composable preprocessing operators", called CPOs. They make it possible to handle a preprocessing operation as an R object, to combine multiple operations (sequentially, or in parallel) to form new ones, and to attach preprocessing objects to Learner
objects to form complete machine learning pipelines. Just like a Learner
, a CPO
can have hyperparameters, which can also be tuned using mlr
's tuning functionality.
The possibilities offered by mlrCPO
can best be demonstrated using example. The following is just giving an overview and not an in-depth tutorial; to go deeper into mlrCPO
, consult the package's vignettes.
library("mlrCPO")
A dataset that can make good use of preprocessing operations is the "Titanic" dataset on Kaggle. We can use feature engineering on some of the more awkward character string features, and for some features we need to do missing value imputation. One issue with the dataset is that the baseline performance using just the "Sex" and "Pclass" columns is already quite high, which makes improvements of algorithm performance that result from preprocessing hard to measure. For demonstration purposes we are therefore going to simulate a "harder" dataset by removing the "Sex" and "Pclass" columns. If the train.csv
file from Kaggle is in the working directory, it should be loaded into an mlr
-compatible Task
using
titanic.table = read.csv("train.csv", na.str = "") titanic.table$Pclass = NULL titanic.table$Sex = NULL titanic.task = makeClassifTask("titanic", titanic.table, target = "Survived", positive = 1) test.table = read.csv("test.csv", na.str = "") test.table$Pclass = NULL test.table$Sex = NULL
To use a CPO
, it first needs to be created. mlrCPO
offers a wide range of "CPO Constructors" for this: They are called like R functions, with parameters that determine the behaviour of a CPO
. Once a CPO
is created, it can be used to manipulate data using the "%>>%
" operator. In our example, this can be used for feature engineering. It is a common operation to add a "Title" feature to the Titanic dataset, which consists of the title part of a passenger's name. To add a new column to a dataset from existing columns, use the cpoAddCols
CPO
. (In its default setting, cpoAddCols
automatically converts strings to factor features.)
library("stringr") titleadder = cpoAddCols(Title = str_match(Name, "\\w+\\.")) working.data = titanic.task %>>% titleadder
head(getTaskData(working.data))
After extracting the title from the name, we may decide that the "Name" column is not useful any more. The cpoSelect
CPO
can take care of it. It is used to include or exclude features from a dataset by different criteria. We build a CPO
that excludes the "Name" column like so:
remove.superfluous = cpoSelect(names = "Name", invert = TRUE, export = "names")
head(getTaskData(working.data %>>% remove.superfluous))
Note the export
parameter. It is a special parameter of every CPO
constructor that controls which hyperparameters, i.e. parameters that control the operation of a CPO
, to be exported. An exported hyperparameter can be changed later using the setHyperPars
function; inspect the set of exported hyperparameters using the getParamSet()
function, or using the verbose printing operator !
:
!remove.superfluous
We may, for example, decide that we also want to remove the "PassengerId
", "Ticket
", and "Cabin
" parameters. This can be done as the following:
remove.superfluous = setHyperPars(remove.superfluous, select.names = c("Name", "PassengerId", "Ticket", "Cabin")) head(getTaskData(working.data %>>% remove.superfluous))
An alternative to the data = data %>>% CPO
construct is the %<>>%
operator. It applies one or several CPO
s and assigns the result to the operand on the left. The following removes the Name
and superfluous columns from working.data
:
working.data %<>>% remove.superfluous
The %>>%
operator can not only be used to apply a CPO
to data, but also to concat multiple CPO
s in a row to form one combined CPO
. Suppose, for example, we want to build a CPO
that uses different imputation techniques on different column types. We may want to use this CPO
on multiple datasets without always having to write out the operations separately. The following constructs a composite CPO
which first operates on the ordered and factorial columns (always creating a new factor level specifically for missing values), and then on the numeric columns (using predictions made by the regr.cforest
Learner
). Which columns are seen by a CPO
is controlled by the "affect.type
" parameter on construction.
impute.fact = cpoImputeConstant("MISSING", affect.type = c("ordered", "factor"), export = character(0)) impute.num = cpoImputeLearner(makeLearner("regr.cforest", ntree = 50, mtry = 3), affect.type = "numeric", export = character(0)) impute.all = impute.fact %>>% impute.num
impute.all
After applying this CPO
to new data, we see that the Missings
property is FALSE
.
working.data %>>% impute.all
Sometimes we want to apply different operations on a dataset, but then go on working with a combination of the results of all these operations. Consider, for example, cpoMissingIndicators
, which replaces the input data with columns indicating whether data is missing. It can not be used after imputation (because no missing values remain), but using it before imputation would cause the imputation CPO
s to also create imputation models for the missing value indicating columns.
To perform several different actions in parallel and combine the result, use the cpoCbind
CPO
. It is named after the R cbind()
operation, which works in a similar way. cpoCbind
is constructed with different CPO
s as its arguments, named by a prefix to add to the resulting columns. To add missing value dummy columns, we do
impute.and.dummy = cpoCbind(impute.all, missing = cpoMissingIndicators(export = character(0)))
Detailed printing of impute.and.dummy
using !
gives a graphical representation of the operations being performed: One data stream does two imputation operations, the other adds the missing indicators:
!impute.and.dummy
The missing indicators are the names of the original columns prefixed with "missing.
", since that was the name given to the argument to cpoCbind
.
working.data %<>>% impute.and.dummy head(getTaskData(working.data))
After we train a model on the data we manipulated, we want to be able to make predictions on new data. This new data needs to be modified in the same way as the training data with one caveat: Some preprocessing operations, in our example the imputation of numeric columns, depend on the training data. We must not naively use cpoImputeLearner
on new prediction data, since a missing value should be imputed depending on training data, not depending on other prediction data.
To solve this problem, mlrCPO
always attaches a "CPOTrained
" object to preprocessed data, which can be retrieved using the retrafo()
function. It behaves similarly to a CPO
; in particular, it can be applied to new data. Printing it gives information about the operations that it has cached.
cpo.trained = retrafo(working.data) cpo.trained
If we wanted to perform prediction with new data ("test.table
"), we could do the following:
working.test = test.table %>>% cpo.trained head(working.test)
A common way to evaluate different machine learning algorithms is to (often repeatedly) train a model on a subset of available data, make a prediction on the remaining data, and compare that prediction with the known target values of that data. This process is called resampling, and mlr
offers the function resample()
which does exactly this. However, it is important that the entire machine learning process should happen inside the resampling, i.e. that not only the machine learning model, but also the preprocessing operations, get trained only on the training subset of the resampling step. For example, at this point it would be tempting to do resample(learner, working.data, cv10)
to do five fold crossvalidation using any "learner
" we might think of. The resampling would evaluate models trained on subsets of working.data
, but this dataset was already preprocessed using information from the whole available data, so information from the resampling test subset would sneak in. It might not give a good estimate of real world performance, since a model trained on (the whole of) working.data
does not contain any information about data still in the wild (e.g. test.table
). Luckily mlrCPO
makes it very easy to perform "proper" resampling by performing another simplification: attaching a CPO
to a Learner
.
CPO
s to Learner
sIt is possible (and encouraged) to build entire machine learning pipelines encompassing preprocessing and model fitting by attaching CPO
s to Learner
s. This is also done using the %>>%
operator and results in a CPOLearner
. This object behaves just like a normal mlr
Learner
and is very similar to the process of "wrapping" Learner
s in mlr
. Hyperparameters that were exported by the CPO
become part of the resulting Learner
and can be changed using setHyperPars()
and tuned using tuneParams()
.
The following combines all operations done so far with a randomForest
Learner
. The data that the randomForest
is trained on will look like the working.test
task we constructed above, but will only contain information of the resampling subset that it is trained on.
(When just adding the CPO
s from above, we get frequent errors during resampling because titleadder
sometimes creates new factor levels (from rare titles) during prediction that were not seen during training. To fix this, we add a cpoFixFactors()
after it, which turns unseen titles into NA
s.)
preproc.pipeline = titleadder %>>% cpoFixFactors(export = character(0)) %>>% remove.superfluous %>>% impute.and.dummy preproc.learner = preproc.pipeline %>>% makeLearner("classif.randomForest") preproc.learner
This Learner
can be used for resampling on the (original!) data without bad conscience, and can be used on the test data without needing to worry about retrafo()
s.
repcv(preproc.learner, titanic.task)
model = train(preproc.learner, titanic.task) predict(model, newdata = test.table)
A nice aspect of attaching CPO
s to Learner
s is that the hyperparameters of the CPO
s can be tuned using mlr
's useful tuneParams()
method. For the following example, we are going to consider a few more columns of the data that might be useful. Then we use principal component analysis on a model matrix of these columns. For the PCA, we need to choose the number of principal components to include, which we are going decide by tuning.
We are going to use the Name
column in the original data to create a feature NameLength
that indicates how many words the last name of a person is made out of. The deck and room number are extracted from the Cabin
feature. The numerical part of the Ticket
feature is also used; we take the logarithm because it has a large range. Note how we suffix every new column with .topca
, which makes it easy later to only apply PCA to these new columns.
coladder = cpoAddCols( NameLength.topca = { lastname = str_replace(Name, ".*\\.", "") lastname = str_replace(lastname, '[("].*', "") sapply(str_split(str_trim(lastname), "[ -]"), length) }, Deck.topca = as.ordered(str_sub(Cabin, 1, 1)), Roomno.topca = suppressWarnings(as.numeric(str_sub(Cabin, 2, 3))), ticketno.topca = { tno = suppressWarnings(as.numeric(str_replace(Ticket, ".* ", ""))) log(tno) })
We are also going to collapse the least common factors of the Title
feature using cpoCollapseFact
. Because the missing indicators of Roomno
and Deck
are the same (since both features are extracted from the Cabin
feature), one of them is removed. The output shown is only of the new columns added.
preproc.pipeline = titleadder %>>% coladder %>>% cpoCollapseFact(0.05, affect.names = "Title", export = character(0)) %>>% cpoFixFactors(export = character(0)) %>>% remove.superfluous %>>% impute.and.dummy %>>% cpoSelect(names = "missing.Roomno.topca", invert = TRUE) head(getTaskData(titanic.task %>>% preproc.pipeline %>>% cpoSelect(pattern = "\\.topca")))
In the following we first create the CPO
that first turns all columns into numerics (using cpoModelMatrix
) and then applies the PCA. To apply this CPO
only to columns with .topca
in their name, we use cpoCbind
in combination with cpoSelect
.
pca.cpo = cpoModelMatrix(~0 + ., export = character(0)) %>>% cpoDropConstants(export = character(0)) %>>% cpoScale(export = character(0)) %>>% cpoPca(export = "rank") pca.restricted = cpoCbind( cpoSelect(pattern = "\\.topca$", invert = TRUE, id = "restrict"), cpoSelect(pattern = "\\.topca$") %>>% pca.cpo) tune.learner = preproc.pipeline %>>% pca.restricted %>>% makeLearner("classif.randomForest") getParamSet(tune.learner)
Because we only do tuning in low dimensions, it is sensible to use grid tuning.
ctrl = makeTuneControlGrid() ps = makeParamSet(makeNumericParam("pca.rank", lower = 1, upper = 10)) tuneres = tuneParams(tune.learner, titanic.task, cv10, par.set = ps, control = ctrl) tuneres
The result seems to indicate that the performance is similar for the first five added columns, after which it degrades.
effects = generateHyperParsEffectData(tuneres) qplot(pca.rank, mmce.test.mean, data = effects$data)
(Note that all tuning result graphs shown were created using 10-fold repeated crossvalidation with 10 iterations, which reduces noise but takes a lot longer to run than the simple 10-fold crossvalidation used in the code.)
Making a comparison of CPO
using three PCA
components, using no PCA
, or not using the added columns altogether, shows that the performance is similar, although not doing PCA has a slight edge.
# using PCA repcv(setHyperPars(tune.learner, pca.rank = 3), titanic.task)
# the columns with no PCA repcv(preproc.pipeline %>>% makeLearner("classif.randomForest"), titanic.task)
# not using any columns used by the PCA repcv(preproc.pipeline %>>% cpoSelect(pattern = "\\.topca$", invert = TRUE) %>>% makeLearner("classif.randomForest"), titanic.task)
Instead of just tuning over the number of PCA columns to include, we may also wonder whether using the similar ICA method is a better choice in this case. It would be useful to have a hyperparameter that controls which CPO
gets applied: cpoPca
or cpoIca
. This is exactly what cpoMultiplex
does: It is created with several candidate CPO
s and exports a selected.cpo
parameter. All hyperparameters of the included CPO
s are exported as well, which makes it possible to not only tune over which CPO
is used, but also over their parameters.
The following is a combination of the cpoPca
and cpoIca
CPO
s. Looking at the resulting CPO
's parameter set shows that all the parameters of the included CPO
s are exported.
mplx = cpoMultiplex(list(cpoPca(scale = TRUE, export = "rank"), cpoIca(export = "n.comp"))) getParamIds(getParamSet(mplx))
We could tune this CPO
as we do the one above, but it would be nicer if the different parameters controling number of components could be pulled together into a single parameter. This is done using cpoTransformParams
, which wraps a CPO
, creates new hyperparameters, and makes it possible to express the wrapped CPO
's hyperparameters in terms of the new hyperparameters.
mplx.transformed = cpoTransformParams(mplx, transformations = alist(pca.rank = rank, ica.n.comp = rank), additional.parameters = makeParamSet(makeIntegerParam("rank", 1))) getParamSet(mplx.transformed)
This newly created CPO
can now be included in the preprocessing pipeline.
general.ca = cpoModelMatrix(~0 + ., export = character(0)) %>>% cpoDropConstants(export = character(0)) %>>% cpoScale() %>>% mplx.transformed ca.restricted = cpoCbind( cpoSelect(pattern = "\\.topca$", invert = TRUE, id = "restrict"), cpoSelect(pattern = "\\.topca$") %>>% general.ca)
Using different settings for rank
and selected.cpo
will now select the CPO
to use, while independently setting the fraction of columns to return. To get the results of selecting one column of PCA as above, we use:
head(getTaskData(titanic.task %>>% preproc.pipeline %>>% setHyperPars(selected.cpo = "pca", ca.restricted, rank = 1)))
Using ICA gives a different result, but the same number of columns:
head(getTaskData(titanic.task %>>% preproc.pipeline %>>% setHyperPars(ca.restricted, selected.cpo = "ica", rank = 1)))
We are now ready to tune this CPO
instead of the one previously used. We re-define tune.learner
and create a new tuning parameter set and are good to go.
tune.learner = preproc.pipeline %>>% ca.restricted %>>% makeLearner("classif.randomForest") ps = makeParamSet( makeDiscreteParam("selected.cpo", values = c("pca", "ica")), makeIntegerParam("rank", lower = 1, upper = 10)) tuneres = tuneParams(tune.learner, titanic.task, cv10, par.set = ps, control = ctrl) tuneres
effects = generateHyperParsEffectData(tuneres) qplot(rank, mmce.test.mean, color = selected.cpo, data = effects$data)
The difference seems to be minute, but for most values PCA seems to outperform ICA.
repcv(setHyperPars(tune.learner, rank = 3, selected.cpo = "ica"), titanic.task)
Even though mlrCPO
offers many versatile CPO
s, it may happen that a method that is not present should be implemented. For this case, it is possible to create new CPOConstructor
s using makeCPO()
. To go deeper into the topic of creating custom CPOs, you can query the respective vignette using vignette("a_4_custom_CPOs", package="mlrCPO")
, but the following gives a useful and relatively general example.
When looking at the distribution of the ticket numbers, we notice a few accumulation of points and might wonder whether having a ticket from one of these "clusters" gives any information about a passenger.
ggplot(getTaskData(titanic.task %>>% preproc.pipeline), aes(ticketno.topca, 0)) + geom_jitter(height = 1) + coord_fixed()
We choose to use the dbscan
library for clustering, even though there is no CPO
available that implements it. The wrong approach would be to use cpoAddCols
to add a column from the result of running dbscan
on the input, because this would give different clusters during training and prediction. Instead, we need to implement a new CPO
that assigns the correct clusters to new data values.
makeCPO()
splits the preprocessing operation up into two functions: a cpo.train
function that is applied only to the training data and returns a "control" object which contains all the trained information--in our case, the result from the dbscan()
call--, and a cpo.retrafo
function that is applied to both training and prediction data and does the actual data transformation--assigning the cluster index to every data point.
Our CPO
will take only numeric values, and will return a single factorial column indicating cluster membership. We use the properties.*
parameters of makeCPO()
to indicate:
properties.data = c("numerics", "missings")
CPO
to a Learner
, it must be able to handle "factors": properties.needed = "factors"
Learner
does not need to handle numeric values (since we are converting numerics to factors). Therefore, the "numerics" property is "added" to the Learner
when the CPO
gets attached: properties.adding= "numerics"
Our newly created CPO
exports the "eps" parameter of dbscan
, indicating the necessary proximity for cluster membership. This is declared by the par.set
parameter of makeCPO
.
cpoDbscan = makeCPO(cpo.name = "dbscan", par.set = makeParamSet(makeNumericParam("eps", 0)), properties.data = c("numerics", "missings"), properties.needed = "factors", properties.adding = "numerics", packages = "dbscan", cpo.train = function(data, target, eps) { data = as.matrix(data) list(cluster = dbscan::dbscan(data, eps), data = data) }, cpo.retrafo = function(data, control, eps) { prediction = predict(control$cluster, as.matrix(data), control$data) unique.clusters = sort(unique(control$cluster$cluster)) result = data.frame(factor(prediction, levels = unique.clusters)) names(result) = str_c(c("dbscan", colnames(data)), collapse = ".") result })
Inspecting the clustering done on both titanic.table
and test.table
shows that clustering work well:
clustered.train = titanic.task %>>% preproc.pipeline %>>% cpoSelectFreeProperties(names = "ticketno.topca") %>>% cpoCbind(NULLCPO, cpoDbscan(0.1)) ggplot(getTaskData(clustered.train, target.extra = TRUE)$data, aes(ticketno.topca, 0, color = dbscan.ticketno.topca)) + geom_jitter(height = 1, show.legend = FALSE) + coord_fixed()
clustered.predict = test.table %>>% retrafo(clustered.train) ggplot(clustered.predict, aes(ticketno.topca, 0, color = dbscan.ticketno.topca)) + geom_jitter(height = 1, show.legend = FALSE) + coord_fixed()
We can use this to cluster ticket numbers (and drop the other .topca
columns) and evaluate the resulting Learner.
cluster.learner = preproc.pipeline %>>% cpoCbind(cpoSelect(pattern = "\\.topca$", invert = TRUE), cpoSelectFreeProperties(names = "ticketno.topca") %>>% cpoDbscan(0.1)) %>>% makeLearner("classif.randomForest")
repcv(cluster.learner, titanic.task)
parallelStop()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.