Concept

Executive summary

This design document describes some classes that represent the current API for mlr3pipelines. In short, a Pipeline is a doubly connected graph, that contains several GraphNodes. Each GraphNode contains a PipeOP, that contains the logic used to manipulate the inputs, i.e. functions that transform a pipe's inputs to its outputs.


class PipeOp

Description: A PipeOp is a single tranformation of inputs into outputs. During training it takes inputs, tranforms them, while doing that learns and stores its parameters (into its 'state') and then returns the output. During prediction it applies the learned params to the input and transforms the input to an output.

A PipeOp specifies the types of inputs and outputs as intype and outtype, a list of . The length of these lists determines the length of input / output the PipeOp produces. Typically the PipeOp input / output is a list of specified length, but PipeOps with input / output length 1 can specify that they don't use lists but use singular values instead (.takeslist / .returnslist set to FALSE)

Usage: - new(id, params) [character(1), ParamSet] -> [PipeOp] - id [character(1)] - param_set [ParamSet] - param_vals [named list] - state [any] - result [any] # debug only - train(input) : [list of any] -> [list of any] - predict(input) : [list of any] -> [list of any] - packages : [character] - train_intypes : [character] - train_outtypes : [character] - predict_intypes : [character] - predict_outtypes : [character] - is_trained: [logical(1)]

Details: id: AB that allows to return and set the id of the PipeOps. Ids are user-configurable, and ids of PipeOps in graphs must be unique. param_set: The set of all exposed parameters of the PipeOp. param_vals: (AB r/w, automatically checks feasibility) checks A named list of parameter settings where all setting must come from param_set. state: The object of learned parameters, obtained in the training step, and applied in the predict step. If a PO creates an "empty" state during training this should be set to list() new: Baseclass constructor, called by derived classes. train: Function that is responsible to train on input, transform it to output and store the learned params. If the PipeOp is already trained, already present params are overwritten. predict: Function that is responsible to predict on input, and transform it to output by applying the learned params. If is_trained = FALSE the function cannot be applied. result: A slot to store the result of either the train or the predict step, after it was applied. Set by the framework, (planned: depending on debug option) is_trained: (AB, checks is.null(state)) Is the PipeOp currently trained? train_intypes: Types of the inputs expected during training. Can be any if the PipeOp does not require a specific type. Index specifies the position in the input. train_outtypes: Types of the inputs expected during training. Can be any if the PipeOp does not require a specific type. predict_intypes: Types of the inputs expected during prediction. Can be any if the PipeOp does not require a specific type. predict_outtypes: Types of the inputs expected during prediction. Can be any if the PipeOp does not require a specific type.

class NodeChannel

Description Represents an input / output slot of a node. A node can have multiple of these slots for input / output, they may be named and / or numbered.

Usage: - node: [GraphNode] - channel: [numeric(1) | character(1)] - direction: [character(1)] "in" or "out"

Details: - node: The GraphNode whose input / output is considered. - channel: Indexes the connection. May be a character(1) or a numeric(1). - direction must be "in" if this is a channel representing input values of a PipeOp or "out" if this represents an item of the output list given by that PipeOp.

class GraphNode

node$next_nodes = list(node_a, node_b, node_c)

node$next_nodes = list([node_a, eingang 4543], [node_b, eingang 34345 von node_b], [node_c, eingang 2])

Description: A GraphNode is a node of a (doubly connected) directed acyclic multigraph, where each node carries as payload a single PipeOp, and each edge represents the flow of a single list-element of a list returned by one PipeOp, into a single list-element of a list used as input of another PipeOp. Computational results in that graph only flow forward, but we store connections to predecessors of a node for convenience. A GraphNode is always a member of exactly one Graph, an can not have connections to GraphNodes not in this graph.

A GraphNode has a set of "input channels" and "output channels"; edges go from output channels of one node to the input channel of another node.

Usage: - graph [Graph] - pipeop [PipeOp] - in_channels [list of NodeChannel] - out_channels [list of NodeChannel] - next_node_channels [list of NodeChannel] - prev_node_channels [list of NodeChannel] - next_nodes [list of GraphNode]: - prev_nodes [list of GraphNode]: - input_complete [logical(1)] - output_complete [logical(1)] - intype [list of any] - outtype [list of any]

Details: graph: the graph that this node belongs to pipeop: Operator in that node in_channels: list of NodeChannel that can be put into next_node_channels of another node to form a connection. out_channels: list of NodeChannel that can be put into prev_node_channels of another node to form a connection. next_node_channels: the respective in_channels of successor nodes where data flows. Mutable to change connections. prev_node_channels: the respective out_channels of predecessor nodes where data comes from. Mutable to change connections. next_nodes: nodes connected by next_node_channels, readonly prev_nodes: nodes connected by prev_node_channels, readonly input_complete: whether all input channels have a connected node output_complete: whether all output channels have a connected note intype mirrors the pipeop intype outtype mirrors the pipeop outtype

class Graph

Description: The graph is a container class for the complete computational graph. It is made up of a list of (connected) GraphNodes, it can be trained and predicted.

Usage: - new [Graph] - node_list [list of GraphNode] - sorted_node_list [list of GraphNode] - intype [list of any]] - outtype [list of any]] - in_channels [list of NodeChannel] - out_channels [list of NodeChannel] - source_nodes [list of GraphNode] - sink_nodes [list of GraphNode] - add_node() [GraphNode | PipeOp] -> [Graph] - train() [any] -> [any] - predict() [any] -> [any] - plot() - extend() [Graph] -> [Graph] - map() [function], [logical] -> [list of any]

Aggregated info - param_set [ParamSet] - param_vals [list] - packages [character]

Remove the following?: - is_learnt - rhs - lhs

Details: - Get node by [[id]] - new with optional Graph argument: copy constructor - node_list list of GraphNode, indexed by ID - sorted_node_list like node_list, but ordered by their connections - intype: types of the in_channels - outtype: types of the out_channels - source_nodes: nodes that have unconnected input channels and therefore act as graph input - sink_nodes: nodes that have unconnected output channels and therefore act as graph output - add_node: Mutates graph by adding a PipeOp or GraphNode. GraphNode calls this automatically on construction, so a user should only call this with a PipeOp.

Operation G >> H

Connection operator, connects 2 partial graphs G and H.

[Graph], [Graph] -> [Graph] If G or H are PipeOps, they the will leveled up to a Graph, containing a single node.

We check that the ids(G) and ids(H) are disjoint (or throw an error), then we connect the rhs(G) to the lhs(H) as outlined below. Case 1-1: If length(rhs(G)) == 1 and length(lhs(H)) == 1, so G has a single sink and H has a single source, we directly connect them. Case n-1: If length(rhs(G)) == n and length(lhs(H)) == 1, so G has multiple sinks, and H a single source, we connect all nodes in rhs(G) withe the node lhs(H). Case 1-n: If length(rhs(G)) == 1 and length(lhs(H)) == n, we require that the PipeOp in rhs(G) is a so-called broadcaster, see below. We connect this broadcaster with all elements in lhs(H). All other cases: Error.

Operator >> will always deep-clone its arguments on call.

greplicate(G, k): Replicate a graph. Results in a graph

[Graph], integer(1) -> [Graph] If G is a PipeOp it will be leveled up to a Graph, containing a single node.

Copy the structure of G k-times, make all PipeOP-ids unique by post-fixing them all with "_rep_1" (for the first copy), "_rep_2" (for the second copy) and so on. greplicate will always deep-clone its arguments on call.

Aggregation and Broadcasting PipeOps

PipeOps can have the property "aggregate" or "broadcast".

An aggregating PipeOp will only take an input which is a list. This list is made up of the results of all of its predecessors, when the PipeOp is used in a graph. Its specific aggregation functionality handles this list (typically by somehow combinining the elements to a larger object, or selecting something from the input elements). Examples of aggregators are PipeOpModelAvg, PipeOpFeatureUnion, PipeOpUnbranch.

A broadcasting PipeOp always returns a list. If the PipeOp is used in a graph, then its result-list is split up into single elements and these elements are propagated one-per-edge on the outgoing edges. Note that this implies that the PipeOp always has to return a list of the same length as its encapsulating GraphNode has number of successors. Examples of broadcasters are PipeOpCopy, PipeOpBranch, PipeOpChunk

Topological sorting and layers of a Graph

A DAG is made up of layers of nodes, depending on how "late" in the DAG a node appears, i.e., many how previous computational steps are needed to be processed until we can finally compute that node. To make this more explicit: We call the set of all nodes with no incoming edges the set of "sources" of a graph, or the LHS of a graph. This is layer 0. Layer 1 is made up of all nodes, whose direct predecessors are only in layer 0. Layer 2 is made up of all nodes, whose direct predecessors are only in layer 1 or 0. And so on. The last layer, which is the set of all nodes without outgoing edges, we call the set of "sinks" of a graphs or the RHS of the graph.

Training and predicting on a computational graph

When we train a computational graph, we proceed in topological layers. We initialize by feeding the input object (a task) into each node in layer 0. We then compute all results in layer 0, store the respective results on each node. Then we train layer 1. This is possible, as all nodes in layer 1 only require results from layer 0. Then we train layer 2. This is possible as all nodes in layer 2 only require results from layer 1 or 0. And so on. Computing the results of one layer can be done in parallel. After a finite amount of steps, the RHS of the graph will have been processed. This list of results of the RHS is the result of the graph computation.

Prediction on a graph works exactly the same. We feed the input object to the LHS, then call the predict on each node, layer by layer, until we have done that also on the RHS.

FIXME: we need to specifiy the concept of branching / multiplexing

Graph Operation Concepts

The four classes involved in graph / operation processing are PipeOp, GraphNode, NodeChannel, and Graph.

PipeOp

At the lowest level is the PipeOp, which represents a single set of data transformation operations (one for training and prediction each). This operation should not be aware that it is being processed in a graph, and especially should not have to care about the layout of that graph.

In a way, a PipeOp can be seen as a generalized kind of function definition with multiple inputs and multiple outputs ("returns"). However, instead of having functions with multiple arguments function(a, b, c), the actual train and predict functions only take one argument, which must be lists of a given length. (Node: we could change this and make the train and predict actually take multiple arguments). Also, the functions must always return lists of a given length. What kind of information the train/predict function take, and what kind of information they return, is defined in the .intype and .outtype private member variables. These are (possibly named) lists which specify the "type" that each position of an input / output list must take.

--- side note ---

Currently the way types are specified is not yet specified. Some thoughts here: It should probably be specified kind of data backends a PipeOp can handle. Long-term we could consider automatically converting data to fit each PipeOp if necessary It should be stated whether a PipeOp takes / returns an actual dataset, or a prediction, a learner model, a set of optimised parameter values, or something entirely different Maybe we should have .intype and .outtype separated for train and predict. Maybe there is a principle by which e.g. an operation that takes a Task during training should always take a data.frame / data.table during prediction, whereas an operation that creates a Prediction object during prediction should always return NULL during training etc--i.e. there may be a coherent relationship between training and prediction types that could be exploited to simplify type specification Maybe there is a hierarchy between types so that an operation that takes a DataFrame should also be able to take a DataTable * There should probably be a way to specify multiple supported types. E.g. many operations should probably handle both Tasks and DataFrame / DataTable during prediction.

--- side note end ---

For example, if a given operation takes three separate inputs (that could be the training input data directly, or have been generated by other PipeOps) and produces two outputs that could be processed separately by other PipeOPs (an example would be to take a Task, a Learner and a resampling instance, and to output a trained Model and a new Task with a column of resampled predictions), then .intype would need to be a list of length 3, and .outtype a list of length 2. Each entry of .intype would need to further specify what kind of input is expected. The train and predict functions will then, during training / prediction, be called with a list of 3 elements that contain data of the requested type, and they are expected to return a list with length 2, with each entry satisfying the type requirement given for the respective entry in the .outtype list.

Entries in the .intype or .outtype lists may be named, which may be useful for identifying node channels (see below).

(A PipeOp also has a param_set and param_vals, but these do not pertain to the graph structure and should be pretty obvious.)

GraphNode, NodeChannel

A GraphNode instance represents the positioning of a PipeOp inside a Graph. The user should never create a GraphNode on his own, instead it is created automatically whenever a PipeOp is added to a Graph, e.g. by graph$add_node() or the %>>% operator.

A single PipeOp operation may use data from multiple different sources, and may produce different output slots that should then be used by other PipeOps that are distinct. Therefore each GraphNode must specify, for each element of its function result lists, not only the following GraphNode to which the result will be passed, but also which input list slot of that GraphNode the result will be passed. E.g. the example above with three input list slots could get all three input values from different other PipeOps (it could also get all inputs from the same PipeOp), but each of these PipeOps would be contained in GraphNodes that specify which of the three input slots specifically their results will be written in, before the three-input-PipeOp is called.

Each GraphNode keeps track of where each of its input values comes from, and where each of its output values is sent afterwards. As noted above, if a result was computed, the necessary information is 1) what GraphNode to send this result to, and 2) which input slot specifically of the following PipeOp to put this result into

This information is contained in an instance of NodeChannel. The metaphor here is that each PipeOp contains a number of "output channels" where data comes out, which can then be fed into subsequent "input channels" of different PipeOps. A NodeChannel instance identifies such a channel: it knows which GraphNode data comes out of / goes into, and which data slot / channel ("channel_id") the data goes into. What channels are available is decided by the number of slots of the .intype and .outtype member variables of a PipeOp. For each entry in the .outtype list, a GraphNode saves the channel that the specific slot of the result data will be routed to in the .next_node_channels member variable. For the example above with two output slots, .next_node_channels is therefore a list of two NodeChannel objects which point to GraphNode objects and channel indexes that identify the input of subsequent PipeOps that a result will be written to.

For reasons, a GraphNode also contains information about the origin of each of its input list slots. The functionality mirrors the functionality of .next_node_channels: A list .prev_node_channels contains instances of NodeChannel that specify which GraphNodes, and which output slots from these GraphNodes, generate the data that is fed to a given slot of a train / predict function input list.

GraphNode provides active bindings next_node_channels and prev_node_channels that make it easy to create and change connections between GraphNodes. To make sure that the second list element of the result of node AA is fed to the third list element of the input of node BB, one would just assign the corresponding NodeChannel to next_node_channels of AA:

AA$next_node_channels[[2]] = NodeChannel$new(node = BB, channel_id = 3, direction = "in")

Alternatively, it is possible to assign the output node channel of AA to the prev_node_channels entry of BB:

BB$prev_node_channels[[3]] = NodeChannel$new(node = AA, channel_id = 2, direction = "out")

The active binding code makes sure that these two assignments have the same effect. In addition, whenever a connection is updated like this, the code ensures that previous connections to other nodes are removed.

As part of the "channel" metaphor, one could imagine that a NodeChannel object actually belongs to the GraphNode that it points to. Maybe, instead of creating a new NodeChannel object, one could actually retrieve the NodeChannel object directly from the GraphNode that it belongs to. As an idea, all the NodeChannel objects for a GraphNode are already cached in the in_channels and out_channels lists of a GraphNode. Then, for the statements above, one could instead write the following effectively identical statements:

AA$next_node_channels[[2]] = BB$in_channels[[3]]

or

BB$prev_node_channels[[3]] = AA$out_channels[[2]]

Graph

A Graph object contains all the PipeOps that are part of the Graph's operation, encapsulated in GraphNode objects that have the information about the sequence and information flow of operations. It offers information about the graph nodes that are not fully connected within the graph (i.e. have channels that are still open and could be connected to new nodes, or to other graphs), and exports a unified parameter set for the collection of all contained PipeOps.

The .node_list private member variable contains all the GraphNodes. It is kept in sorted order (so that the output of each GraphNode in this list is only fed to GraphNodes coming after it). Whenever a new GraphNode is added (through add_node) or connections of nodes within the graph are changed (through the next_node_channels / prev_node_channels active bindings in GraphNode) the update_connections() function should be called to ensure this.

All the nodes and respective input / output slots within the graph that are not yet connected are made visible through the in_channels and out_channels variables (which are updated in update_connections() as well). This information may be useful to decide whether and how a graph should be connected to other graphs. The input / output types of these channels is stored in intype / outtype.



mlr-org/mlr3pipelines documentation built on March 29, 2024, 5:52 p.m.