knitr::opts_chunk$set(
  cache = FALSE,
  collapse = TRUE,
  comment = "#>"
)
set.seed(8008135)

This tutorial focuses on the more technical concepts underlying mlr3pipelines. The core building blocks of a Pipeline are PipeOps (pipe operators) and Graphs.

PipeOp: A PipeOp is a node that transforms data flowing through it.


Graph: A Graph is a concatenation of one or several PipeOps.

By connecting several PipeOps we can build a Graph. We will use the terms Pipeline and Graph interchangeably in this section, as a Pipeline is inherently a Directed Acyclic Graph. Before diving deeper into the concept of Graphs, we will quickly look into their basic building blocks: PipeOps.

PipeOp

In order to construct complicated pipelines, we require different PipeOps that help organize how the data flows through the Graph. In mlr3pipelines we consider three basic types of PipeOps: linear, broadcast, and aggregate:

| Type | Input Dim | Output Dim | Examples |
| --- | --- | --- | --- |
| linear | 1 | 1 | PipeOpPCA |
| linear | 1 | 1 | PipeOpLearner |
| broadcast | 1 | n | PipeOpCopy |
| broadcast | 1 | n | PipeOpBranch |
| aggregate | n | 1 | PipeOpFeatureUnion |
| aggregate | n | 1 | PipeOpUnbranch |

A deeper dive into PipeOps

In order to get a better understanding, we focus on an exemplary PipeOp: PipeOpLearner. A PipeOp is an R6 class, so we create an instance of it by calling the $new() method, here with a learner from mlr3.

  library(mlr3)
  library(mlr3pipelines)
  lrn = mlr_learners$get("classif.rpart")
  op = PipeOpLearner$new(lrn)

The following slots (and more) are contained in each PipeOp:

- $train(): A function used to train the PipeOp.
- $predict(): A function used to predict with the PipeOp.
- $id: Allows us to set or get the id of the PipeOp.
- $param_set: The set of all exposed parameters of the PipeOp.
- $values: The current hyperparameter settings.
- $is_trained: Is the PipeOp already trained?

We can check these properties by accessing the respective slots:

  op$id
  op$is_trained

The $param_set and its $values are required if a PipeOp contains hyperparameters we want to set. See [paradox::ParamSet] for a quick introduction to how ParamSets work.
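
For example, we can inspect and set the hyperparameters of the wrapped rpart learner through the PipeOp. A quick sketch (minsplit is one of rpart's parameters):

  op$param_set                       # all exposed hyperparameters
  op$param_set$values$minsplit = 10  # set rpart's minsplit through the PipeOp
  op$param_set$values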

The $train() and $predict() functions define the core functionality of our PipeOp. In many cases, it is imperative to treat training and test data separately in order not to leak information from the test set into the training step. To achieve this, we require a train function that learns the appropriate transformations from the training set and a predict function that applies these transformations to future data.

In the case of PipeOpLearner this means the following:

- $train() trains a model on its input Task and saves the trained model to an additional slot, $state. It returns list(NULL), as subsequent operators usually do not require any output.
- $predict() uses the model stored in $state in order to predict the class of a new input Task. It returns a [Prediction] object, which contains the learner's predictions.
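
To see this in action, a small sketch using the PipeOpLearner created above; note that $train() and $predict() each take a list of inputs:

  task = mlr_tasks$get("iris")
  op$train(list(task))    # returns list(NULL); the fitted model is stored in op$state
  op$is_trained           # TRUE
  op$predict(list(task))  # returns a list holding a Prediction object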

Connecting PipeOps to a Graph

We define three different PipeOps, which we will then connect into a Pipeline:

pou = PipeOpSubsample$new()
pop = PipeOpPCA$new()
pos = PipeOpScale$new()

Depending on the order in which we connect these PipeOps, different results can arise.

There are two basic ways of connecting PipeOps to a Graph:

Using the %>>% operator

In order to connect PipeOps, we can use the %>>% operator. It is conceptually similar to [mlrCPO](https://github.com/mlr-org/mlrCPO)'s %>>%, which in turn stems from the idea of magrittr's %>% operator.

The following defines a Graph that connects the output of the left-hand side PipeOp to the input of the right-hand side PipeOp.

gr = pou %>>% pos

The object returned by this operator is a Graph, an R6 class that contains the PipeOps and some meta-information, i.e. how the different operators are connected.

If we want to extend the Graph by adding another PipeOp, we can simply append another operator. This connects the output of the last PipeOp in the Graph with the input of the new operator.

gr = gr %>>% pop

The %>>% operator also allows us to connect Graphs with Graphs by connecting the first output of the left-hand side Graph with the first input of the right-hand side Graph.
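
As a short sketch, we can chain two small Graphs into one (using fresh PipeOp instances for illustration):

  gr_left = PipeOpSubsample$new() %>>% PipeOpScale$new()
  gr_right = PipeOpPCA$new() %>>% PipeOpLearner$new(mlr_learners$get("classif.rpart"))
  gr_all = gr_left %>>% gr_right  # output of gr_left feeds into the input of gr_right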

Building the Graph from scratch

We can define the same Graph by sequentially adding PipeOps to the Graph and connecting them with an edge.

gr = Graph$new()
gr$add_pipeop(pou)
gr$add_pipeop(pos)
gr$add_edge("subsample", "scale")

Analogously, we can add further PipeOps:

gr$add_pipeop(pop)
gr$add_edge("scale", "pca")

The latter notation is more verbose, but it gives us explicit control over which PipeOps are added and how their channels are connected.

In order to see how the nodes in the graph are connected, we can simply visualize the graph:

gr$plot()

Graph

A Graph allows us to connect several PipeOps together and thus lets us control how, and in which order, data flows through them. It defines the edges that connect the nodes (PipeOps) into a Graph. It is a container class for the complete computational graph, i.e. it allows us to traverse the Graph and train or predict with every node.

In the example above, a new Graph was constructed using Graph$new(). New PipeOps were then added to the $pipeops slot using $add_pipeop(). $pipeops is a list of the PipeOps contained in the Graph, named by each PipeOp's $id. Afterwards we added edges between the PipeOps to the $edges slot using $add_edge(). $edges is a data.table with character columns "src_id", "src_channel", "dst_id", "dst_channel". This table records the connections between the PipeOps, i.e. which PipeOp is connected to which, and through which channels.

The full Graph also has input and output nodes, i.e. the inputs of the first PipeOp and the outputs of the last PipeOp in our Graph, respectively. These can be accessed using $input and $output, which return a data.table with character columns "name", "train", "predict", "op.id", "channel.name". We can obtain the ids of the input and output PipeOps using $lhs and $rhs.
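
A quick way to inspect these slots for the Graph built above:

  gr$lhs     # id(s) of the input PipeOp(s), here "subsample"
  gr$rhs     # id(s) of the output PipeOp(s), here "pca"
  gr$input   # data.table describing the Graph's input channels
  gr$output  # data.table describing the Graph's output channels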

Additionally, a Graph collects information from the different PipeOps it contains.

We can obtain a topologically sorted or unsorted list of the ids of all PipeOps contained in a Graph using $ids(). The collection of all packages required to run the Pipeline can be found in the $packages slot.
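
For example:

  gr$ids()               # ids of all contained PipeOps
  gr$ids(sorted = TRUE)  # the same ids, topologically sorted
  gr$packages            # packages required to run the Pipeline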

The $param_set collects the $param_sets of all contained PipeOps into a single ParamSet, which holds the parameters and parameter constraints of every PipeOp. The actual parameter values can be set or obtained via $param_set$values. Parameter names, as seen by the Graph, follow the naming scheme <PipeOp$id>.<PipeOp parameter name>. Changing $param_set$values propagates the changes directly to the contained PipeOps and is an alternative to changing a PipeOp's $param_set$values directly.
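
For the Graph built above, this looks roughly as follows (exact parameter names depend on the contained PipeOps):

  gr$param_set$ids()
  # e.g. "subsample.frac", "scale.center", "pca.center", ...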

In order to compare or check whether a Graph has changed, we can obtain its hash:

# First we compute the hash
gr$hash
# Now we set the scale parameter of the PCA operator
gr$param_set$values$pca.scale. = TRUE
# Compute hash again to see whether the object changed
gr$hash

Training and Prediction

The main components of each PipeOp are its train and predict functions. The Graph orchestrates training and prediction by calling each PipeOp's respective function in topological order. Training a Graph thus corresponds to training every PipeOp it contains; once all PipeOps are trained, the Graph can be used for prediction.

We can, for example, train our Graph on the iris Task:

gr$train(mlr_tasks$get("iris"))

and transform new data with the trained Pipeline:

gr$predict(mlr_tasks$get("iris"))

Whether the intermediate results are stored in each PipeOp's $.result slot can be controlled via the Graph's $keep_results flag. This is mostly useful for debugging purposes and defaults to FALSE.
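
A small sketch of how this can be used for debugging:

  gr$keep_results = TRUE
  gr$train(mlr_tasks$get("iris"))
  gr$pipeops$scale$.result  # intermediate output of the scale PipeOp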

Graph Union and Replication

We define two important helper functions that are useful for building larger Graphs. PipeOps can either be added to a Graph sequentially using %>>%, or next to each other, i.e. in parallel. The latter can be achieved using gunion() and greplicate(). Putting PipeOps next to each other is especially useful in situations where we either want to do Branching, e.g. have tuning select which of the operators to use, or Copying, where an input is copied to all following nodes, which can then be evaluated in parallel. The resulting outputs can then, for example, be collected into a single input using PipeOpFeatureUnion.

gunion

We can use gunion() to combine a list of PipeOps or Graphs into a new Graph. This results in a Graph without any edges between the unioned operators.

gr = gunion(list(pou, pop, pos))
gr$plot()

greplicate

A PipeOp or Graph can also be replicated $n$ times using greplicate(). In this example, we again use the PipeOps for subsampling (pou) and PCA (pop) defined above and connect them to a simple Graph. We then replicate this Graph $4$ times in order to run a Principal Component Analysis on $4$ different subsamples of our input data.

gr = pou %>>% pop
gr2 = greplicate(gr, 4)
gr2$plot()
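
The replicated PipeOps receive automatically suffixed ids so that all ids within the new Graph stay unique:

  gr2$ids()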

Meta PipeOps

We additionally want to detail some meta operators that can be useful in the process of building pipelines.

PipeOpNOP

It is often useful to simply pass the data on unchanged while it is transformed by some other PipeOp in parallel. This can be done using PipeOpNOP, which passes on its inputs unchanged during both the training and the predict phase. An example of this can be seen in the vignette on Stacking.

In order to, for example, pass on the original data alongside some transformed features, we can use gunion() to add a PipeOpNOP and a PipeOpPCA to the pipeline. We can then use PipeOpFeatureUnion to concatenate the resulting outputs.

po_null = PipeOpNOP$new()
pop = PipeOpPCA$new()
gunion(list(pop, po_null)) %>>% PipeOpFeatureUnion$new(2)

PipeOpFeatureUnion

In the example above, we used a PipeOpFeatureUnion to concatenate data from different PipeOps. Internally, it checks whether the targets of all input Tasks are equal and then cbind()s the features from all Tasks together.
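
A brief sketch of the resulting features for a Graph like the one above:

  gr_fu = gunion(list(PipeOpPCA$new(), PipeOpNOP$new())) %>>% PipeOpFeatureUnion$new(2)
  res = gr_fu$train(mlr_tasks$get("iris"))
  res[[1]]$feature_names  # PCA components cbind()ed with the untouched original features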

PipeOpBranch and PipeOpUnbranch

In order to build Pipelines that involve a choice of one or several methods over others, we can use PipeOpBranch and PipeOpUnbranch. This is mostly useful if we want to tune over different options, i.e. have the tuning algorithm decide between different learners. For a more detailed introduction to branching and unbranching, see the vignette Showcase: Branching.
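
A minimal sketch of the pattern, choosing between PCA and scaling (the active branch is controlled by the branch PipeOp's selection parameter):

  opb = PipeOpBranch$new(2)    # one input, two output branches
  opu = PipeOpUnbranch$new(2)  # collects the branches again
  gr_br = opb %>>% gunion(list(PipeOpPCA$new(), PipeOpScale$new())) %>>% opu
  gr_br$plot()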


