PipeOp: PipeOp Base Class

Description

A PipeOp represents a transformation of a given "input" into a given "output", with two stages: "training" and "prediction". It can be understood as a generalized function that has not only multiple inputs but also multiple outputs (as well as two stages). The "training" stage is used when training a machine learning pipeline or fitting a statistical model, and the "prediction" stage is then used for making predictions on new data.

To perform training, the $train() function is called, which takes inputs and transforms them while simultaneously storing information in the $state slot. For prediction, the $predict() function is called, where the $state information can be used to influence the transformation of the new data.
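
The two stages can be tried out on an existing PipeOp. The following is a minimal sketch, assuming the mlr3 and mlr3pipelines packages are installed; the po() and tsk() constructor shortcuts and the choice of the "scale" PipeOp and the "iris" task are illustrative assumptions, not part of the description above.

```r
library(mlr3)
library(mlr3pipelines)

task = tsk("iris")   # example task shipped with mlr3
pos = po("scale")    # a concrete PipeOp that centers and scales features

# Before training there is no state yet.
state_before = pos$state

# Training: inputs are given as a list with one entry per input channel.
# The PipeOp transforms the task and stores what it learned in $state.
train_out = pos$train(list(task))

# Prediction: the stored $state is reused to transform (new) data.
predict_out = pos$predict(list(task))

names(train_out)   # entries are named after the output channel(s)
```

The object returned in both stages is a list, even though this PipeOp has only a single output channel.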

A PipeOp is usually used in a Graph object, a representation of a computational graph. It can have multiple input channels (think of these as multiple arguments to a function, for example when averaging different models) and multiple output channels (a transformation may return different objects, for example different subsets of a Task). The purpose of the Graph is to connect different outputs of some PipeOps to inputs of other PipeOps.
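
As a rough illustration of how a Graph connects PipeOps, the sketch below chains two PipeOps with the %>>% operator. It assumes mlr3 and mlr3pipelines are installed; the particular PipeOps ("scale", "pca") and the "iris" task are only example choices.

```r
library(mlr3)
library(mlr3pipelines)

# Connect the single output of "scale" to the single input of "pca".
graph = po("scale") %>>% po("pca")

task = tsk("iris")

# The Graph calls $train() / $predict() of each PipeOp in turn and
# passes results along the connecting edges.
train_res = graph$train(task)
predict_res = graph$predict(task)

names(graph$pipeops)   # the PipeOps contained in the Graph, by ID
```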

Input and output channel information of a PipeOp is defined in the $input and $output slots; each channel has a name, a required type during training, and a required type during prediction. The $train() and $predict() functions are called with a list argument that has one entry for each declared channel (with one exception, see next paragraph). The list is automatically type-checked for each channel against $input and then passed on to the private$.train() or private$.predict() functions. There the data is processed and a result list is created. This list is again type-checked against the declared output types of each channel. The length and types of the result list are as declared in $output.

A special input channel name is "...", which creates a vararg channel that takes arbitrarily many arguments, all of the same type. If the ⁠$input⁠ table contains an "..."-entry, then the input given to ⁠$train()⁠ and ⁠$predict()⁠ may be longer than the number of declared input channels.
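
The channel tables, including a vararg channel, can be inspected directly. This is a small sketch assuming mlr3 and mlr3pipelines are installed; it uses the "featureunion" PipeOp as an example of a PipeOp whose default input is a "..." channel, and the three single-purpose tasks are made up for illustration.

```r
library(mlr3)
library(mlr3pipelines)

# A PipeOp with one ordinary input channel.
scale_input = po("scale")$input

# A PipeOp whose (default) input table has a single "..." vararg channel.
pofu = po("featureunion")
union_input = pofu$input

# Three tasks over the same rows, each holding different features.
t1 = tsk("iris")$select(c("Sepal.Length", "Sepal.Width"))
t2 = tsk("iris")$select("Petal.Length")
t3 = tsk("iris")$select("Petal.Width")

# The input list is longer than the number of declared channels (one),
# which is allowed because that channel is a vararg channel.
union_out = pofu$train(list(t1, t2, t3))
union_out$output$feature_names
```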

This class is an abstract base class that all PipeOps being used in a Graph should inherit from, and is not intended to be instantiated.

Format

Abstract R6Class.

Construction

PipeOp$new(id, param_set = ParamSet$new(), param_vals = list(), input, output, packages = character(0), tags = "abstract")
  • id :: character(1)
    Identifier of resulting object. See ⁠$id⁠ slot.

  • param_set :: ParamSet | list of expression
    Parameter space description. This should be created by the subclass and given to super$initialize(). If this is a ParamSet, it is used as the PipeOp's ParamSet directly. Otherwise it must be a list of expressions, e.g. created by alist(), that evaluate to ParamSets. These ParamSets are combined using a ParamSetCollection.

  • param_vals :: named list
    List of hyperparameter settings, overwriting the hyperparameter settings given in param_set. The subclass should have its own param_vals parameter and pass it on to super$initialize(). Default list().

  • input :: data.table with columns name (character), train (character), predict (character)
    Sets the ⁠$input⁠ slot of the resulting object; see description there.

  • output :: data.table with columns name (character), train (character), predict (character)
    Sets the ⁠$output⁠ slot of the resulting object; see description there.

  • packages :: character
    Set of all required packages for the PipeOp's ⁠$train⁠ and ⁠$predict⁠ methods. See ⁠$packages⁠ slot. Default is character(0).

  • tags :: character
    A set of tags associated with the PipeOp. Tags describe a PipeOp's purpose. Can be used to filter as.data.table(mlr_pipeops). Default is "abstract", indicating an abstract PipeOp.
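
How a subclass might hand these constructor arguments to super$initialize() can be sketched as follows. This is an illustrative assumption, not code from the package: PipeOpMultiply, its "factor" hyperparameter, and the "example" tag are hypothetical, and the sketch assumes that mlr3pipelines and paradox (for ps() and p_dbl()) are installed.

```r
library(mlr3pipelines)
library(paradox)

# Hypothetical PipeOp that multiplies a numeric input by a hyperparameter.
PipeOpMultiply = R6::R6Class("PipeOpMultiply",
  inherit = PipeOp,
  public = list(
    initialize = function(id = "multiply", param_vals = list()) {
      # Parameter space created by the subclass ...
      param_set = ps(factor = p_dbl(lower = 0))
      # ... with an initial setting that param_vals may overwrite.
      param_set$values = list(factor = 1)
      super$initialize(id, param_set = param_set, param_vals = param_vals,
        input = data.table::data.table(name = "input",
          train = "numeric", predict = "numeric"),
        output = data.table::data.table(name = "output",
          train = "numeric", predict = "numeric"),
        tags = "example"
      )
    }
  ),
  private = list(
    .train = function(inputs) {
      self$state = list()  # nothing is learned, but state must be non-NULL
      list(inputs[[1]] * self$param_set$values$factor)
    },
    .predict = function(inputs) {
      list(inputs[[1]] * self$param_set$values$factor)
    }
  )
)

pom_default = PipeOpMultiply$new()
pom_three = PipeOpMultiply$new(param_vals = list(factor = 3))

res_default = pom_default$train(list(2))$output
res_three = pom_three$train(list(2))$output
```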

Internals

PipeOp is an abstract class with abstract functions private$.train() and private$.predict(). To create a functional PipeOp class, these two methods must be implemented. Each of these functions receives a named list according to the PipeOp's input channels, and must return a list (names are ignored) with values in the order of output channels in $output. The private$.train() and private$.predict() functions should not be called by the user; instead, $train() and $predict() should be used. The most convenient usage is to add the PipeOp to a Graph (possibly as a singleton in that Graph) and use the Graph's $train() / $predict() methods.

private$.train() and private$.predict() should treat their inputs as read-only. If they are R6 objects, they should be cloned before being manipulated in-place. Objects, or parts of objects, that are not changed, do not need to be cloned, and it is legal to return the same identical-by-reference objects to multiple outputs.
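
The read-only rule can be illustrated with a PipeOp that receives an R6 Task and needs to modify it. The class below is a hypothetical sketch (PipeOpDropFirstFeature is not part of mlr3pipelines), assuming mlr3 and mlr3pipelines are installed; it clones the input before calling the in-place $select() method.

```r
library(mlr3)
library(mlr3pipelines)

# Hypothetical PipeOp that drops the first feature of a Task.
PipeOpDropFirstFeature = R6::R6Class("PipeOpDropFirstFeature",
  inherit = PipeOp,
  public = list(
    initialize = function(id = "dropfirst", param_vals = list()) {
      super$initialize(id, param_vals = param_vals,
        input = data.table::data.table(name = "input",
          train = "Task", predict = "Task"),
        output = data.table::data.table(name = "output",
          train = "Task", predict = "Task")
      )
    }
  ),
  private = list(
    .train = function(inputs) {
      # Task is an R6 object: clone it before changing it in-place.
      task = inputs[[1]]$clone(deep = TRUE)
      keep = task$feature_names[-1]
      self$state = list(keep = keep)
      list(task$select(keep))
    },
    .predict = function(inputs) {
      task = inputs[[1]]$clone(deep = TRUE)
      list(task$select(self$state$keep))
    }
  )
)

task = tsk("iris")
podrop = PipeOpDropFirstFeature$new()
drop_out = podrop$train(list(task))

length(task$feature_names)              # the caller's task is untouched
length(drop_out$output$feature_names)   # the returned task has one fewer
```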

Fields

  • id :: character
    ID of the PipeOp. IDs are user-configurable, and IDs of PipeOps must be unique within a Graph. IDs of PipeOps must not be changed once they are part of a Graph; instead, the Graph's $set_names() method should be used.

  • packages :: character
    Packages required for the PipeOp. Functions that are not in base R should still be called using :: (or explicitly attached using require()) in private$.train() and private$.predict(), but packages declared here are checked before any (possibly expensive) processing has started within a Graph.

  • param_set :: ParamSet
    Parameters and parameter constraints. Parameter values that influence the functioning of ⁠$train⁠ and / or ⁠$predict⁠ are in the ⁠$param_set$values⁠ slot; these are automatically checked against parameter constraints in ⁠$param_set⁠.

  • state :: any | NULL
    Method-dependent state obtained during training step, and usually required for the prediction step. This is NULL if and only if the PipeOp has not been trained. The ⁠$state⁠ is the only slot that can be reliably modified during ⁠$train()⁠, because private$.train() may theoretically be executed in a different R-session (e.g. for parallelization). ⁠$state⁠ should furthermore always be set to something with copy-semantics, since it is never cloned. This is a limitation not of PipeOp or mlr3pipelines, but of the way the system as a whole works, together with GraphLearner and mlr3.

  • input :: data.table with columns name (character), train (character), predict (character)
    Input channels of PipeOp. Column name gives the names (and order) of values in the list given to ⁠$train()⁠ and ⁠$predict()⁠. Column train is the (S3) class that an input object must conform to during training, column predict is the (S3) class that an input object must conform to during prediction. Types are checked by the PipeOp itself and do not need to be checked by private$.train() / private$.predict() code.
    A special name is "...", which creates a vararg input channel that accepts a variable number of inputs.
    If a row has both train and predict values enclosed by square brackets ("[", "]"), then this channel is Multiplicity-aware. If the PipeOp receives a Multiplicity value on these channels, this Multiplicity is given to the .train() and .predict() functions directly. Otherwise, the Multiplicity is transparently unpacked and the .train() and .predict() functions are called multiple times, once for each Multiplicity element. The type enclosed by square brackets indicates that only a Multiplicity containing values of this type is accepted. See Multiplicity for more information.

  • output :: data.table with columns name (character), train (character), predict (character)
    Output channels of PipeOp, in the order in which they will be given in the list returned by $train and $predict functions. Column train is the (S3) class that an output object must conform to during training, column predict is the (S3) class that an output object must conform to during prediction. The PipeOp checks values returned by private$.train() and private$.predict() against these type specifications.
    If a row has both train and predict values enclosed by square brackets ("[", "]"), then this signals that the channel emits a Multiplicity of the indicated type. See Multiplicity for more information.

  • innum :: numeric(1)
    Number of input channels. This equals ⁠nrow($input)⁠.

  • outnum :: numeric(1)
    Number of output channels. This equals ⁠nrow($output)⁠.

  • is_trained :: logical(1)
    Indicates whether the PipeOp was already trained and can therefore be used for prediction.

  • tags :: character
    A set of tags associated with the PipeOp. Tags describe a PipeOp's purpose. Can be used to filter as.data.table(mlr_pipeops). PipeOp tags are inherited and child classes can introduce additional tags.

  • hash :: character(1)
    Checksum calculated on the PipeOp, depending on the PipeOp's class and the slots ⁠$id⁠ and ⁠$param_set$values⁠. If a PipeOp's functionality may change depending on more than these values, it should inherit the ⁠$hash⁠ active binding and calculate the hash as ⁠digest(list(super$hash, <OTHER THINGS>), algo = "xxhash64")⁠.

  • phash :: character(1)
    Checksum calculated on the PipeOp, depending on the PipeOp's class and the slot $id, but ignoring $param_set$values. If a PipeOp's functionality may change depending on more than these values, it should inherit the $hash active binding and calculate the hash as digest(list(super$hash, <OTHER THINGS>), algo = "xxhash64").

  • .result :: list
    If the Graph's $keep_results flag is set to TRUE, then the intermediate results of $train() and $predict() are saved to this slot, exactly as they are returned by these functions. This is mainly for debugging purposes and done, if requested, by the Graph backend itself; it should not be done explicitly by private$.train() or private$.predict().

  • man :: character(1)
    Identifying string of the help page that shows with help().
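
The fields listed above can be read off any concrete PipeOp. The sketch below assumes mlr3 and mlr3pipelines are installed and uses the "pca" PipeOp with its rank. hyperparameter purely as an example; it also shows how $hash and $phash react to a change in $param_set$values.

```r
library(mlr3)
library(mlr3pipelines)

pop = po("pca")

pop$id            # "pca"
pop$innum         # number of rows of pop$input
pop$outnum        # number of rows of pop$output
pop$packages
pop$tags

trained_before = pop$is_trained

# $hash depends on $param_set$values, $phash ignores them.
hash_before = pop$hash
phash_before = pop$phash
pop$param_set$values$rank. = 2
hash_after = pop$hash
phash_after = pop$phash

# After training, $state is set and $is_trained flips to TRUE.
invisible(pop$train(list(tsk("iris"))))
trained_after = pop$is_trained
```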

Methods

  • train(input)
    (list) -> named list
    Train the PipeOp on inputs, transform them to output, and store the learned $state. If the PipeOp is already trained, the existing $state is overwritten. The input list is type-checked against the $input train column. The return value is a list with as many entries as $output has rows, with each entry named after the $output name column and of a class according to the $output train column.
    The workhorse function for training each PipeOp is the private function
    .train(input)
    (named list) -> list
    This is an abstract function that must be implemented by concrete subclasses. private$.train() is called by $train() after type checking. It must change the $state value to something non-NULL and return a list of transformed data according to the $output train column. Names of the returned list are ignored.
    The private$.train() method should not be called by a user; instead, the $train() method should be used, which does some checking and possibly type conversion.

  • predict(input)
    (list) -> named list
    Predict on new data in input, possibly using the stored $state. Input and output are specified by $input and $output in the same way as for $train(), except that the predict column is used for type checking.
    The workhorse function for predicting with each PipeOp is the private function
    .predict(input)
    (named list) -> list
    This is an abstract function that must be implemented by concrete subclasses. private$.predict() is called by $predict() after type checking and works analogously to private$.train(). Unlike private$.train(), private$.predict() should not modify the PipeOp in any way.
    Just as with private$.train(), private$.predict() should not be called by a user; instead, the $predict() method should be used.

  • print()
    () -> NULL
    Prints the PipeOp's most salient information: $id, $is_trained, $param_set$values, $input and $output.

  • help(help_type)
    (character(1)) -> help file
    Displays the help file of the concrete PipeOp instance. help_type is one of "text", "html", "pdf" and behaves as the help_type argument of R's help().

Inheriting

To create your own PipeOp, you need to overload the private$.train() and private$.predict() functions. It is most likely also necessary to overload the $initialize() function to do additional initialization. The $initialize() method should have at least the arguments id and param_vals, which should be passed on to super$initialize() unchanged. id should have a useful default value, and param_vals should have the default value list(), meaning no initialization of hyperparameters.

If the ⁠$initialize()⁠ method has more arguments, then it is necessary to also overload the private$.additional_phash_input() function. This function should return either all objects, or a hash of all objects, that can change the function or behavior of the PipeOp and are independent of the class, the id, the ⁠$state⁠, and the ⁠$param_set$values⁠. The last point is particularly important: changing the ⁠$param_set$values⁠ should not change the return value of private$.additional_phash_input().
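
A subclass with an extra construction argument might look like the sketch below. PipeOpAddConstant and its constant argument are hypothetical names chosen for illustration, and the sketch assumes mlr3pipelines is installed. The extra argument is stored in a private field and returned from private$.additional_phash_input(), so that two instances differing only in that argument get different hashes.

```r
library(mlr3pipelines)

# Hypothetical PipeOp that adds a constant fixed at construction time.
PipeOpAddConstant = R6::R6Class("PipeOpAddConstant",
  inherit = PipeOp,
  public = list(
    initialize = function(id = "addconst", param_vals = list(), constant = 1) {
      private$.constant = constant
      # id and param_vals are passed on to super$initialize() unchanged.
      super$initialize(id, param_vals = param_vals,
        input = data.table::data.table(name = "input",
          train = "numeric", predict = "numeric"),
        output = data.table::data.table(name = "output",
          train = "numeric", predict = "numeric")
      )
    }
  ),
  private = list(
    .constant = NULL,
    .train = function(inputs) {
      self$state = list()  # nothing is learned, but state must be non-NULL
      list(inputs[[1]] + private$.constant)
    },
    .predict = function(inputs) {
      list(inputs[[1]] + private$.constant)
    },
    # Everything that changes behavior but is not class, id, state or
    # param_set$values goes here.
    .additional_phash_input = function() {
      private$.constant
    }
  )
)

po_a = PipeOpAddConstant$new(constant = 1)
po_b = PipeOpAddConstant$new(constant = 2)

res_a = po_a$train(list(5))$output
res_b = po_b$train(list(5))$output
```

Because the return value depends only on the construction argument, changing $param_set$values leaves it unchanged, as required above.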

See Also

https://mlr-org.com/pipeops.html

Other mlr3pipelines backend related: Graph, PipeOpTargetTrafo, PipeOpTaskPreprocSimple, PipeOpTaskPreproc, mlr_graphs, mlr_pipeops_updatetarget, mlr_pipeops

Other PipeOps: PipeOpEnsemble, PipeOpImpute, PipeOpTargetTrafo, PipeOpTaskPreprocSimple, PipeOpTaskPreproc, mlr_pipeops_boxcox, mlr_pipeops_branch, mlr_pipeops_chunk, mlr_pipeops_classbalancing, mlr_pipeops_classifavg, mlr_pipeops_classweights, mlr_pipeops_colapply, mlr_pipeops_collapsefactors, mlr_pipeops_colroles, mlr_pipeops_copy, mlr_pipeops_datefeatures, mlr_pipeops_encodeimpact, mlr_pipeops_encodelmer, mlr_pipeops_encode, mlr_pipeops_featureunion, mlr_pipeops_filter, mlr_pipeops_fixfactors, mlr_pipeops_histbin, mlr_pipeops_ica, mlr_pipeops_imputeconstant, mlr_pipeops_imputehist, mlr_pipeops_imputelearner, mlr_pipeops_imputemean, mlr_pipeops_imputemedian, mlr_pipeops_imputemode, mlr_pipeops_imputeoor, mlr_pipeops_imputesample, mlr_pipeops_kernelpca, mlr_pipeops_learner, mlr_pipeops_missind, mlr_pipeops_modelmatrix, mlr_pipeops_multiplicityexply, mlr_pipeops_multiplicityimply, mlr_pipeops_mutate, mlr_pipeops_nmf, mlr_pipeops_nop, mlr_pipeops_ovrsplit, mlr_pipeops_ovrunite, mlr_pipeops_pca, mlr_pipeops_proxy, mlr_pipeops_quantilebin, mlr_pipeops_randomprojection, mlr_pipeops_randomresponse, mlr_pipeops_regravg, mlr_pipeops_removeconstants, mlr_pipeops_renamecolumns, mlr_pipeops_replicate, mlr_pipeops_scalemaxabs, mlr_pipeops_scalerange, mlr_pipeops_scale, mlr_pipeops_select, mlr_pipeops_smote, mlr_pipeops_spatialsign, mlr_pipeops_subsample, mlr_pipeops_targetinvert, mlr_pipeops_targetmutate, mlr_pipeops_targettrafoscalerange, mlr_pipeops_textvectorizer, mlr_pipeops_threshold, mlr_pipeops_tunethreshold, mlr_pipeops_unbranch, mlr_pipeops_updatetarget, mlr_pipeops_vtreat, mlr_pipeops_yeojohnson, mlr_pipeops

Examples

# Example (bogus) PipeOp that returns the sum of two numbers during $train()
# as well as a letter of the alphabet corresponding to that sum during $predict().

library(mlr3pipelines)  # provides the PipeOp base class

PipeOpSumLetter = R6::R6Class("sumletter",
  inherit = PipeOp,  # inherit from PipeOp
  public = list(
    initialize = function(id = "posum", param_vals = list()) {
      super$initialize(id, param_vals = param_vals,
        # declare "input" and "output" during construction here
        # training takes two 'numeric' and returns a 'numeric';
        # prediction takes 'NULL' and returns a 'character'.
        input = data.table::data.table(name = c("input1", "input2"),
          train = "numeric", predict = "NULL"),
        output = data.table::data.table(name = "output",
          train = "numeric", predict = "character")
      )
    }
  ),
  private = list(
    # PipeOp deriving classes must implement .train and
    # .predict; each taking an input list and returning
    # a list as output.
    .train = function(input) {
      sum = input[[1]] + input[[2]]
      self$state = sum
      list(sum)
    },
    .predict = function(input) {
      list(letters[self$state])
    }
  )
)
posum = PipeOpSumLetter$new()

print(posum)

posum$train(list(1, 2))
# note the name 'output' is the name of the output channel specified
# in the $output data.table.

posum$predict(list(NULL, NULL))

mlr3pipelines documentation built on May 31, 2023, 9:26 p.m.