## Setup chunk: options and helper functions used throughout this document.
GOFAST = FALSE  # set to TRUE for a fast but imprecise run
library("ggplot2")
library("mlr")
knitr::opts_knit$set(root.dir = "~/lmu/master/testcars")
knitr::opts_chunk$set(message = FALSE, fig.align = "center")
configureMlr(show.info = FALSE)
options(width = 80)
# Helper: repeated cross-validation. Runs `reps` independent `folds`-fold CV
# evaluations in parallel and reports the mean MMCE and its standard error.
myrepcv = function(..., folds, reps) {
  rdesc = makeResampleDesc("CV", iters = folds)
  args = list(...)
  args$resampling = rdesc
  vals = parallelMap(function(dummy) do.call(resample, args)$aggr,
    seq_len(reps), show.info = FALSE, simplify = TRUE)
  aggr = mean(vals)
  se = sd(vals) / sqrt(reps)
  cat(sprintf("MMCE Mean %.4f, SE %.4f\n", aggr, se))
}
# GOFAST trades precision for speed: holdout and 3x2 repeated CV instead of
# repeated 10-fold cross-validation.
if (GOFAST) {
  cv10 = mlr::hout
  repcv = function(...) {
    args = list(...)
    args$reps = 3
    args$folds = 2
    do.call(myrepcv, args)
  }
} else {
  cv10 = makeResampleDesc("RepCV", folds = 10, reps = 10)
  repcv = function(...) {
    args = list(...)
    args$reps = 50
    args$folds = 10
    do.call(myrepcv, args)
  }
}
library("parallelMap")
parallelStartSocket(parallel::detectCores())
parallelLibrary("stringr")
parallelLibrary("mlr")

# Override head() to render data frames as styled HTML tables in the output.
head = function(x, ...) {
  kableExtra::kable_styling(knitr::kable(utils::head(x), "html"))
}

Composable Preprocessing Operators: mlrCPO

When applying a machine learning algorithm (i.e. a Learner in mlr) to a dataset, it is often beneficial to perform certain operations on the data (e.g. removal of constant columns, principal component analysis) in a step called "preprocessing". mlr already provides some basic preprocessing functionality, but it is limited in scope and quite inflexible. A much more powerful toolset is now provided by mlrCPO.

The central objects provided by mlrCPO are "composable preprocessing operators", called CPOs. They make it possible to handle a preprocessing operation as an R object, to combine multiple operations (sequentially, or in parallel) to form new ones, and to attach preprocessing objects to Learner objects to form complete machine learning pipelines. Just like a Learner, a CPO can have hyperparameters, which can also be tuned using mlr's tuning functionality.

Example Application: The Kaggle "Titanic" Dataset

The possibilities offered by mlrCPO are best demonstrated using an example. The following gives only an overview, not an in-depth tutorial; to go deeper into mlrCPO, consult the package's vignettes.

library("mlrCPO")

A dataset that can make good use of preprocessing operations is the "Titanic" dataset on Kaggle. We can use feature engineering on some of the more awkward character string features, and for some features we need to do missing value imputation. One issue with the dataset is that the baseline performance using just the "Sex" and "Pclass" columns is already quite high, which makes improvements of algorithm performance that result from preprocessing hard to measure. For demonstration purposes we therefore simulate a "harder" dataset by removing the "Sex" and "Pclass" columns. If the train.csv file from Kaggle is in the working directory, it can be loaded into an mlr-compatible Task using:

titanic.table = read.csv("train.csv", na.strings = "")
titanic.table$Pclass = NULL
titanic.table$Sex = NULL
titanic.table$Survived = factor(titanic.table$Survived)  # classification target must be a factor

titanic.task = makeClassifTask("titanic", titanic.table,
  target = "Survived", positive = "1")

test.table = read.csv("test.csv", na.strings = "")
test.table$Pclass = NULL
test.table$Sex = NULL

CPO Creation and Feature Engineering

To use a CPO, it first needs to be created. mlrCPO offers a wide range of "CPO constructors" for this: they are called like R functions, with parameters that determine the behaviour of the resulting CPO. Once a CPO is created, it can be used to manipulate data using the "%>>%" operator. In our example, this can be used for feature engineering. It is a common operation to add a "Title" feature to the Titanic dataset, consisting of the title part of a passenger's name. To add a new column derived from existing columns, use the cpoAddCols CPO. (In its default setting, cpoAddCols automatically converts strings to factor features.)

library("stringr")
titleadder = cpoAddCols(Title = str_match(Name, "\\w+\\."))
working.data = titanic.task %>>% titleadder
head(getTaskData(working.data))

CPO Hyperparameters and Manual Feature Selection

After extracting the title from the name, we may decide that the "Name" column is not useful any more. The cpoSelect CPO can take care of that: it includes or excludes features from a dataset according to different criteria. We build a CPO that excludes the "Name" column like so:

remove.superfluous = cpoSelect(names = "Name", invert = TRUE, export = "names")
head(getTaskData(working.data %>>% remove.superfluous))

Note the export parameter. It is a special parameter of every CPO constructor that controls which hyperparameters, i.e. the parameters that control the operation of a CPO, are exported. An exported hyperparameter can be changed later using the setHyperPars function; the set of exported hyperparameters can be inspected using the getParamSet() function, or using the verbose printing operator !:

!remove.superfluous

We may, for example, decide that we also want to remove the "PassengerId", "Ticket", and "Cabin" columns. This can be done as follows:

remove.superfluous = setHyperPars(remove.superfluous,
  select.names = c("Name", "PassengerId", "Ticket", "Cabin"))
head(getTaskData(working.data %>>% remove.superfluous))

An alternative to the data = data %>>% CPO construct is the %<>>% operator. It applies one or several CPOs and assigns the result back to the operand on the left. The following removes the superfluous columns from working.data:

working.data %<>>% remove.superfluous

CPO Composition and Feature Imputation

The %>>% operator can not only be used to apply a CPO to data, but also to concatenate multiple CPOs into one combined CPO. Suppose, for example, we want to build a CPO that uses different imputation techniques on different column types. We may want to use this CPO on multiple datasets without having to write out the operations separately each time. The following constructs a composite CPO which first operates on the ordered and factor columns (always creating a new factor level specifically for missing values), and then on the numeric columns (using predictions made by the regr.cforest Learner). Which columns are seen by a CPO is controlled by the "affect.type" parameter on construction.

impute.fact = cpoImputeConstant("MISSING", affect.type = c("ordered", "factor"),
  export = character(0))
impute.num = cpoImputeLearner(makeLearner("regr.cforest", ntree = 50, mtry = 3),
  affect.type = "numeric", export = character(0))
impute.all = impute.fact %>>% impute.num
impute.all

Applying this CPO to our data, we see that the resulting Task's Missings property is FALSE.

working.data %>>% impute.all

Parallel CPO Application and Missing Value Indicators

Sometimes we want to apply different operations to a dataset, but then continue working with a combination of the results of all these operations. Consider, for example, cpoMissingIndicators, which replaces the input data with columns indicating where data is missing. It cannot be used after imputation (because no missing values remain), but using it before imputation would cause the imputation CPOs to also build imputation models for the missing value indicator columns.

To perform several different actions in parallel and combine the results, use the cpoCbind CPO. It is named after the R cbind() function, which works in a similar way. cpoCbind is constructed with different CPOs as its arguments, each named by the prefix to add to the resulting columns. To add missing value dummy columns, we do:

impute.and.dummy = cpoCbind(impute.all, missing = cpoMissingIndicators(export = character(0)))

Detailed printing of impute.and.dummy using ! gives a graphical representation of the operations being performed: one data stream performs the two imputation operations, the other adds the missing indicators:

!impute.and.dummy

The missing indicator columns carry the names of the original columns prefixed with "missing.", since that was the name given to the corresponding argument of cpoCbind.

working.data %<>>% impute.and.dummy
head(getTaskData(working.data))

Retrafo and New Data

After we train a model on the manipulated data, we want to be able to make predictions on new data. This new data needs to be modified in the same way as the training data, with one caveat: some preprocessing operations, in our example the imputation of numeric columns, depend on the training data. We must not naively use cpoImputeLearner on new prediction data, since missing values should be imputed based on the training data, not on other prediction data.

To solve this problem, mlrCPO always attaches a "CPOTrained" object to preprocessed data, which can be retrieved using the retrafo() function. It behaves similarly to a CPO; in particular, it can be applied to new data. Printing it shows the operations it has cached.

cpo.trained = retrafo(working.data)
cpo.trained

If we wanted to perform prediction with new data ("test.table"), we could do the following:

working.test = test.table %>>% cpo.trained
head(working.test)

How Not to Resample

A common way to evaluate machine learning algorithms is to (often repeatedly) train a model on a subset of the available data, make a prediction on the remaining data, and compare that prediction with the known target values of that data. This process is called resampling, and mlr offers the function resample() which does exactly this. However, it is important that the entire machine learning process happens inside the resampling, i.e. that not only the machine learning model, but also the preprocessing operations, are trained only on the training subset of each resampling step.

At this point it would be tempting to call resample(learner, working.data, cv10) to do cross-validation with any "learner" we might think of. This resampling would evaluate models trained on subsets of working.data, but that dataset was already preprocessed using information from the whole available data, so information from the resampling test subsets would sneak in. The result would not be a good estimate of real world performance, since a model trained on (the whole of) working.data does not get to use any information about data still in the wild (e.g. test.table). Luckily, mlrCPO makes it very easy to perform "proper" resampling through another simplification: attaching a CPO to a Learner.

Attaching CPOs to Learners

It is possible (and encouraged) to build entire machine learning pipelines, encompassing preprocessing and model fitting, by attaching CPOs to Learners. This is also done using the %>>% operator and results in a CPOLearner. This object behaves just like a normal mlr Learner, very much like the "wrapped" Learners known from mlr. Hyperparameters that were exported by the CPO become part of the resulting Learner and can be changed using setHyperPars() and tuned using tuneParams().

The following combines all operations done so far with a randomForest Learner. The data that the random forest is trained on will look like the working.test data we constructed above, but will only contain information from the resampling subset it is trained on.

(When just chaining the CPOs from above, we get frequent errors during resampling, because titleadder sometimes creates factor levels (from rare titles) during prediction that were not seen during training. To fix this, we add a cpoFixFactors() after it, which turns unseen titles into NAs.)

preproc.pipeline = titleadder %>>% cpoFixFactors(export = character(0)) %>>%
  remove.superfluous %>>% impute.and.dummy

preproc.learner = preproc.pipeline %>>% makeLearner("classif.randomForest")
preproc.learner

This Learner can be resampled on the (original!) data with a clear conscience, and can be applied to the test data without having to worry about retrafo()s.

repcv(preproc.learner, titanic.task)
model = train(preproc.learner, titanic.task)
predict(model, newdata = test.table)

CPO Parameter Tuning

A nice aspect of attaching CPOs to Learners is that the hyperparameters of the CPOs can be tuned using mlr's tuneParams() function. For the following example, we are going to consider a few more columns of the data that might be useful. We then use principal component analysis on a model matrix of these columns. For the PCA, we need to choose the number of principal components to keep, which we are going to decide by tuning.

We use the Name column in the original data to create a feature NameLength that indicates how many words a person's last name consists of. The deck and room number are extracted from the Cabin feature. The numerical part of the Ticket feature is also used; we take its logarithm because it has a large range. Note how we suffix every new column with .topca, which later makes it easy to apply PCA only to these new columns.

coladder = cpoAddCols(
    NameLength.topca = {
      lastname = str_replace(Name, ".*\\.", "")
      lastname = str_replace(lastname, '[("].*', "")
      sapply(str_split(str_trim(lastname), "[ -]"), length)
    },
    Deck.topca = as.ordered(str_sub(Cabin, 1, 1)),
    Roomno.topca = suppressWarnings(as.numeric(str_sub(Cabin, 2, 3))),
    ticketno.topca = {
      tno = suppressWarnings(as.numeric(str_replace(Ticket, ".* ", "")))
      log(tno)
    })

We are also going to collapse the rarest levels of the Title feature using cpoCollapseFact. Because the missing indicators of Roomno and Deck are identical (both features are extracted from the Cabin feature), one of them is removed. Only the newly added columns are shown in the output.

preproc.pipeline = titleadder %>>% coladder %>>%
  cpoCollapseFact(0.05, affect.names = "Title", export = character(0)) %>>%
  cpoFixFactors(export = character(0)) %>>%
  remove.superfluous %>>% impute.and.dummy %>>%
  cpoSelect(names = "missing.Roomno.topca", invert = TRUE)

head(getTaskData(titanic.task %>>% preproc.pipeline %>>%
    cpoSelect(pattern = "\\.topca")))

In the following, we create a CPO that first turns all columns into numerics (using cpoModelMatrix) and then applies the PCA. To apply this CPO only to columns with .topca in their name, we use cpoCbind in combination with cpoSelect.

pca.cpo = cpoModelMatrix(~0 + ., export = character(0)) %>>%
  cpoDropConstants(export = character(0)) %>>%
  cpoScale(export = character(0)) %>>% cpoPca(export = "rank")

pca.restricted = cpoCbind(
    cpoSelect(pattern = "\\.topca$", invert = TRUE, id = "restrict"),
    cpoSelect(pattern = "\\.topca$") %>>% pca.cpo)

tune.learner = preproc.pipeline %>>% pca.restricted %>>%
  makeLearner("classif.randomForest")

getParamSet(tune.learner)

Because we are only tuning a single parameter over a small range, it is sensible to use grid search.

ctrl = makeTuneControlGrid()
ps = makeParamSet(makeIntegerParam("pca.rank", lower = 1, upper = 10))
tuneres = tuneParams(tune.learner, titanic.task,
  cv10, par.set = ps, control = ctrl)
tuneres

The result seems to indicate that the performance is similar for the first five principal components, after which it degrades.

effects = generateHyperParsEffectData(tuneres)
qplot(pca.rank, mmce.test.mean, data = effects$data)

(Note that all tuning result graphs shown here were created using 10 times repeated 10-fold cross-validation, which reduces noise but takes much longer to run than the simple 10-fold cross-validation used in the code.)

Comparing the pipeline using three PCA components, using no PCA, and dropping the added columns altogether shows that the performance is similar, although not doing PCA has a slight edge.

# using PCA
repcv(setHyperPars(tune.learner, pca.rank = 3),
  titanic.task)
# the added columns, without PCA
repcv(preproc.pipeline %>>% makeLearner("classif.randomForest"),
  titanic.task)
# dropping the columns that would enter the PCA
repcv(preproc.pipeline %>>% cpoSelect(pattern = "\\.topca$", invert = TRUE) %>>%
      makeLearner("classif.randomForest"),
  titanic.task)

CPO Multiplexing and Tuning over Different Methods

Instead of just tuning over the number of PCA components to include, we may also wonder whether the similar ICA method is a better choice in this case. It would be useful to have a hyperparameter that controls which CPO gets applied: cpoPca or cpoIca. This is exactly what cpoMultiplex does: it is created with several candidate CPOs and exports a selected.cpo parameter. All hyperparameters of the included CPOs are exported as well, which makes it possible to tune not only over which CPO is used, but also over their parameters.

The following is a combination of the cpoPca and cpoIca CPOs. Looking at the resulting CPO's parameter set shows that all the parameters of the included CPOs are exported.

mplx = cpoMultiplex(list(cpoPca(scale = TRUE, export = "rank"),
  cpoIca(export = "n.comp")))

getParamIds(getParamSet(mplx))

We could tune this CPO as we did the one above, but it would be nicer if the different parameters controlling the number of components were pulled together into a single parameter. This is done using cpoTransformParams, which wraps a CPO, creates new hyperparameters, and makes it possible to express the wrapped CPO's hyperparameters in terms of the new ones.

mplx.transformed = cpoTransformParams(mplx,
  transformations = alist(pca.rank = rank, ica.n.comp = rank),
  additional.parameters = makeParamSet(makeIntegerParam("rank", 1)))

getParamSet(mplx.transformed)

This newly created CPO can now be included in the preprocessing pipeline.

general.ca = cpoModelMatrix(~0 + ., export = character(0)) %>>%
  cpoDropConstants(export = character(0)) %>>%
  cpoScale() %>>%
  mplx.transformed

ca.restricted = cpoCbind(
    cpoSelect(pattern = "\\.topca$", invert = TRUE, id = "restrict"),
    cpoSelect(pattern = "\\.topca$") %>>% general.ca)

Setting rank and selected.cpo now determines which CPO is used and, independently, how many components are returned. To get the result of selecting one PCA component, as above, we use:

head(getTaskData(titanic.task %>>% preproc.pipeline %>>%
                 setHyperPars(ca.restricted, selected.cpo = "pca", rank = 1)))

Using ICA gives a different result, but the same number of columns:

head(getTaskData(titanic.task %>>% preproc.pipeline %>>%
                 setHyperPars(ca.restricted, selected.cpo = "ica", rank = 1)))

We are now ready to tune this CPO instead of the one used previously. We re-define tune.learner, create a new tuning parameter set, and are good to go.

tune.learner = preproc.pipeline %>>%
  ca.restricted %>>%
  makeLearner("classif.randomForest")

ps = makeParamSet(
    makeDiscreteParam("selected.cpo", values = c("pca", "ica")),
    makeIntegerParam("rank", lower = 1, upper = 10))

tuneres = tuneParams(tune.learner, titanic.task, cv10, par.set = ps, control = ctrl)
tuneres
effects = generateHyperParsEffectData(tuneres)
qplot(rank, mmce.test.mean, color = selected.cpo, data = effects$data)

The difference seems to be minute, but for most values of rank, PCA seems to outperform ICA.

repcv(setHyperPars(tune.learner, rank = 3, selected.cpo = "ica"),
  titanic.task)

Custom CPOs and Feature Clusters

Even though mlrCPO offers many versatile CPOs, it may happen that a method one needs is not provided. For this case, it is possible to create new CPOConstructors using makeCPO(). To go deeper into the topic of creating custom CPOs, consult the respective vignette using vignette("a_4_custom_CPOs", package = "mlrCPO"); the following gives a useful and relatively general example.

Looking at the distribution of the ticket numbers, we notice a few accumulations of points and might wonder whether having a ticket from one of these "clusters" gives any information about a passenger.

ggplot(getTaskData(titanic.task %>>% preproc.pipeline),
  aes(ticketno.topca, 0)) + geom_jitter(height = 1) + coord_fixed()

We choose to use the dbscan package for clustering, even though there is no CPO available that implements it. The wrong approach would be to use cpoAddCols to add a column from the result of running dbscan on the input, because this would give different clusters during training and prediction. Instead, we need to implement a new CPO that assigns the clusters found during training to new data values.

makeCPO() splits the preprocessing operation into two functions: a cpo.train function that is applied only to the training data and returns a "control" object containing all the trained information (in our case, the result of the dbscan() call), and a cpo.retrafo function that is applied to both training and prediction data and performs the actual data transformation (assigning a cluster index to every data point).

Our CPO will take only numeric values and return a single factor column indicating cluster membership; we use the properties.* parameters of makeCPO() to declare this. It exports the "eps" parameter of dbscan, indicating the necessary proximity for cluster membership; this is declared via the par.set argument of makeCPO().

cpoDbscan = makeCPO(cpo.name = "dbscan",
  par.set = makeParamSet(makeNumericParam("eps", 0)),
  properties.data = c("numerics", "missings"),  # input: numeric columns
  properties.needed = "factors",    # the output contains a factor column
  properties.adding = "numerics",   # the numeric input column is consumed
  packages = "dbscan",
  cpo.train = function(data, target, eps) {
    data = as.matrix(data)
    # keep both the clustering and the training data: dbscan's predict()
    # assigns new points to clusters by proximity to the training points
    list(cluster = dbscan::dbscan(data, eps), data = data)
  },
  cpo.retrafo = function(data, control, eps) {
    prediction = predict(control$cluster, as.matrix(data), control$data)
    unique.clusters = sort(unique(control$cluster$cluster))
    result = data.frame(factor(prediction, levels = unique.clusters))
    # name the result column "dbscan.<original column name>"
    names(result) = str_c(c("dbscan", colnames(data)), collapse = ".")
    result
  })

Inspecting the clustering done on both titanic.table and test.table shows that the clustering works well:

clustered.train = titanic.task %>>% preproc.pipeline %>>%
  cpoSelectFreeProperties(names = "ticketno.topca") %>>%
  cpoCbind(NULLCPO, cpoDbscan(0.1))

ggplot(getTaskData(clustered.train, target.extra = TRUE)$data,
  aes(ticketno.topca, 0, color = dbscan.ticketno.topca)) +
  geom_jitter(height = 1, show.legend = FALSE) + coord_fixed()
clustered.predict = test.table %>>% retrafo(clustered.train)
ggplot(clustered.predict,
  aes(ticketno.topca, 0, color = dbscan.ticketno.topca)) +
  geom_jitter(height = 1, show.legend = FALSE) + coord_fixed()

We can use this CPO to cluster the ticket numbers (dropping the other .topca columns) and evaluate the resulting Learner.

cluster.learner = preproc.pipeline %>>%
  cpoCbind(cpoSelect(pattern = "\\.topca$", invert = TRUE),
    cpoSelectFreeProperties(names = "ticketno.topca") %>>% cpoDbscan(0.1)) %>>%
  makeLearner("classif.randomForest")
repcv(cluster.learner, titanic.task)
parallelStop()

