PipeOp | R Documentation |
A PipeOp
represents a transformation of a given "input" into a given "output", with two stages: "training"
and "prediction". It can be understood as a generalized function that not only has multiple inputs, but
also multiple outputs (as well as two stages). The "training" stage is used when training a machine learning pipeline or
fitting a statistical model, and the "prediction" stage is then used for making predictions
on new data.
To perform training, the $train()
function is called, which takes inputs and transforms them while simultaneously storing information
in its $state
slot. For prediction, the $predict()
function is called, where the $state
information can be used to influence the transformation
of the new data.
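The two stages can be illustrated with one of the concrete PipeOps shipped with the package. This is a minimal sketch, assuming the mlr3 and mlr3pipelines packages are installed; the choice of the "scale" PipeOp and the "iris" task is only for illustration.

```r
library(mlr3)
library(mlr3pipelines)

task = tsk("iris")
op = po("scale")           # a concrete PipeOp that centers and scales numeric features

is.null(op$state)          # no state is stored before training

# training transforms the task and stores what was learned in $state
trained = op$train(list(task))
names(op$state)            # includes the learned "center" and "scale" values

# prediction reuses the stored $state on (here: the same) data
predicted = op$predict(list(task))
```

Both calls take a list with one entry per input channel and return a list with one entry per output channel.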
A PipeOp
is usually used in a Graph
object, a representation of a computational graph. It can have
multiple input channels (think of these as multiple arguments to a function, for example when averaging
different models) and multiple output channels (a transformation may
return different objects, for example different subsets of a Task
). The purpose of the Graph
is to
connect different outputs of some PipeOp
s to inputs of other PipeOp
s.
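As a rough illustration of how a Graph wires PipeOps together, the following sketch chains two concrete PipeOps with the %>>% operator. The particular PipeOps ("scale", "pca") and the task are assumptions chosen for the example.

```r
library(mlr3)
library(mlr3pipelines)

# %>>% connects the output channel of the left PipeOp to the input channel of the right one
graph = po("scale") %>>% po("pca")

graph$ids()      # IDs of the PipeOps contained in the Graph
graph$edges      # one row per connection: source / destination PipeOp and channel

# the Graph's $train() and $predict() call the PipeOps in order
out = graph$train(tsk("iris"))
pred = graph$predict(tsk("iris"))
```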
Input and output channel information of a PipeOp
is defined in the $input
and $output
slots; each channel has a name, a required
type during training, and a required type during prediction. The $train()
and $predict()
functions are called with a list
argument
that has one entry for each declared channel (with one exception, see next paragraph). The list
is automatically type-checked
for each channel against $input
and then passed on to the private$.train()
or private$.predict()
functions. There the data is processed and
a result list
is created. This list
is again type-checked for declared output types of each channel. The length and types of the result
list
are as declared in $output
.
A special input channel name is "..."
, which creates a vararg channel that takes arbitrarily many arguments, all of the same type. If the $input
table contains an "..."
-entry, then the input given to $train()
and $predict()
may be longer than the number of declared input channels.
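A vararg channel can be inspected on an existing PipeOp. The sketch below assumes that po("featureunion") is constructed with its default settings, in which case it declares a single "..." input channel; the two feature subsets of "iris" are made up for the example.

```r
library(mlr3)
library(mlr3pipelines)

fu = po("featureunion")
fu$input                   # a single declared channel named "..."

# two Tasks with the same rows but disjoint feature sets
t1 = tsk("iris")$select(c("Sepal.Length", "Sepal.Width"))
t2 = tsk("iris")$select(c("Petal.Length", "Petal.Width"))

# the input list has two entries although $input declares only one (vararg) channel
out = fu$train(list(t1, t2))
out$output$feature_names
```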
This class is an abstract base class that all PipeOp
s being used in a Graph
should inherit from, and
is not intended to be instantiated.
Abstract R6Class
.
PipeOp$new(id, param_set = ps(), param_vals = list(), input, output, packages = character(0), tags = character(0))
id
:: character(1)
Identifier of resulting object. See $id
slot.
param_set
:: ParamSet
| list
of expression
Parameter space description. This should be created by the subclass and given to super$initialize()
.
If this is a ParamSet
, it is used as the PipeOp
's ParamSet
directly. Otherwise it must be a list
of expressions e.g. created by alist()
that evaluate to ParamSet
s.
These ParamSets
are combined using a ParamSetCollection
.
param_vals
:: named list
List of hyperparameter settings, overwriting the hyperparameter settings given in param_set
. The
subclass should have its own param_vals
parameter and pass it on to super$initialize()
. Default list()
.
input :: data.table
with columns name
(character
), train
(character
), predict
(character
)
Sets the $input
slot of the resulting object; see description there.
output :: data.table
with columns name
(character
), train
(character
), predict
(character
)
Sets the $output
slot of the resulting object; see description there.
packages :: character
Set of all required packages for the PipeOp
's $train
and $predict
methods. See $packages
slot.
Default is character(0)
.
tags
:: character
A set of tags associated with the PipeOp
. Tags describe a PipeOp's purpose.
Can be used to filter as.data.table(mlr_pipeops)
. Default is "abstract"
, indicating an abstract PipeOp
.
PipeOp
is an abstract class with abstract functions private$.train()
and private$.predict()
. To create a functional
PipeOp
class, these two methods must be implemented. Each of these functions receives a named list
according to
the PipeOp
's input channels, and must return a list
(names are ignored) with values in the order of output
channels in $output
. The private$.train()
and private$.predict()
functions should not be called by the user;
instead, $train()
and $predict()
should be used. The most convenient usage is to add the PipeOp
to a Graph
(possibly as a singleton in that Graph
), and to use the Graph
's $train()
/ $predict()
methods.
private$.train()
and private$.predict()
should treat their inputs as read-only. If they are R6
objects,
they should be cloned before being manipulated in-place. Objects, or parts of objects, that are not changed, do
not need to be cloned, and it is legal to return the same identical-by-reference objects to multiple outputs.
id
:: character
ID of the PipeOp
. IDs are user-configurable, and IDs of PipeOp
s must be unique within a Graph
. IDs of
PipeOp
s must not be changed once they are part of a Graph
; instead, the Graph
's $set_names()
method
should be used.
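Renaming through the Graph might look as follows. This is a small sketch; the new ID "standardize" is an arbitrary choice for illustration.

```r
library(mlr3pipelines)

graph = po("scale") %>>% po("pca")

# rename via the Graph instead of assigning to the PipeOp's $id directly
graph$set_names("scale", "standardize")
graph$ids()
```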
packages
:: character
Packages required for the PipeOp
. Functions that are not in base R should still be called using ::
(or explicitly attached using require()
) in private$.train()
and private$.predict()
, but
packages declared here are checked before any (possibly expensive) processing has started within a Graph
.
param_set
:: ParamSet
Parameters and parameter constraints. Parameter values that influence the functioning of $train
and / or $predict
are
in the $param_set$values
slot; these are automatically checked against parameter constraints in $param_set
.
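The automatic check of parameter values can be observed by assigning a value of the wrong type. The sketch below assumes the "center" parameter of po("scale") is a logical parameter; the invalid value "yes" is deliberately wrong.

```r
library(mlr3pipelines)

op = po("scale")
op$param_set$values$center = FALSE     # valid: 'center' is a logical parameter

# an invalid value is rejected at assignment time
res = tryCatch({
  op$param_set$values$center = "yes"
  "accepted"
}, error = function(e) "rejected")
res
```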
state
:: any
| NULL
Method-dependent state obtained during the training step, and usually required for the prediction step. This is NULL
if and only if the PipeOp
has not been trained. The $state
is the only slot that can be reliably modified during
$train()
, because private$.train()
may theoretically be executed in a different R
-session (e.g. for parallelization).
$state
should furthermore always be set to something with copy-semantics, since it is never cloned. This is a limitation
not of PipeOp
or mlr3pipelines
, but of the way the system as a whole works, together with GraphLearner
and mlr3
.
input :: data.table
with columns name
(character
), train
(character
), predict
(character
)
Input channels of PipeOp
. Column name
gives the names (and order) of values in the list given to
$train()
and $predict()
. Column train
is the (S3) class that an input object must conform to during
training, column predict
is the (S3) class that an input object must conform to during prediction. Types
are checked by the PipeOp
itself and do not need to be checked by private$.train()
/ private$.predict()
code.
A special name is "..."
, which creates a vararg input channel that accepts a variable number of inputs.
If a row has both train
and predict
values enclosed by square brackets ("[
", "]
"), then this channel is
Multiplicity
-aware. If the PipeOp
receives a Multiplicity
value on these channels, this Multiplicity
is given to the .train()
and .predict()
functions directly. Otherwise, the Multiplicity
is transparently
unpacked and the .train()
and .predict()
functions are called multiple times, once for each Multiplicity
element.
The type enclosed by square brackets indicates that only a Multiplicity
containing values of this type is accepted.
See Multiplicity
for more information.
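The transparent unpacking described above can be sketched with a PipeOp whose input channel is not Multiplicity-aware. The example assumes that po("scale") declares a plain (non-bracketed) input type; the two tasks are arbitrary.

```r
library(mlr3)
library(mlr3pipelines)

op = po("scale")
op$input                   # the declared types are not enclosed in square brackets

# a Multiplicity holding two Tasks is unpacked; the PipeOp is trained once per element
out = op$train(list(Multiplicity(tsk("iris"), tsk("mtcars"))))

is.Multiplicity(out[[1]])  # the results are collected into a Multiplicity again
length(out[[1]])
```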
output :: data.table
with columns name
(character
), train
(character
), predict
(character
)
Output channels of PipeOp
, in the order in which they will be given in the list returned by $train
and
$predict
functions. Column train
is the (S3) class that an output object must conform to during training,
column predict
is the (S3) class that an output object must conform to during prediction. The PipeOp
checks
values returned by private$.train()
and private$.predict()
against these type specifications.
If a row has both train
and predict
values enclosed by square brackets ("[
", "]
"), then this signals that the channel
emits a Multiplicity
of the indicated type. See Multiplicity
for more information.
innum
:: numeric(1)
Number of input channels. This equals nrow($input)
.
outnum
:: numeric(1)
Number of output channels. This equals nrow($output)
.
is_trained
:: logical(1)
Indicate whether the PipeOp
was already trained and can therefore be used for prediction.
tags
:: character
A set of tags associated with the PipeOp
. Tags describe a PipeOp's purpose.
Can be used to filter as.data.table(mlr_pipeops)
.
PipeOp tags are inherited and child classes can introduce additional tags.
hash
:: character(1)
Checksum calculated on the PipeOp
, depending on the PipeOp
's class
and the slots $id
and $param_set$values
. If a
PipeOp
's functionality may change depending on more than these values, it should inherit the $hash
active
binding and calculate the hash as digest(list(super$hash, <OTHER THINGS>), algo = "xxhash64")
.
phash
:: character(1)
Checksum calculated on the PipeOp
, depending on the PipeOp
's class
and the slot $id
but ignoring $param_set$values
. If a
PipeOp
's functionality may change depending on more than these values, it should inherit the $hash
active
binding and calculate the hash as digest(list(super$hash, <OTHER THINGS>), algo = "xxhash64")
.
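The difference between the two checksums can be seen by changing a hyperparameter value. This is a sketch using po("scale"); which parameter is changed is an arbitrary choice.

```r
library(mlr3pipelines)

op = po("scale")
hash_before = op$hash
phash_before = op$phash

op$param_set$values$center = FALSE   # change a hyperparameter value

op$hash != hash_before      # $hash depends on $param_set$values
op$phash == phash_before    # $phash ignores $param_set$values
```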
.result
:: list
If the Graph
's $keep_results
flag is set to TRUE
, then the intermediate results of $train()
and $predict()
are saved to this slot, exactly as they are returned by these functions. This is mainly for debugging purposes
and done, if requested, by the Graph
backend itself; it should not be done explicitly by private$.train()
or private$.predict()
.
man
:: character(1)
Identifying string of the help page that shows with help()
.
properties
:: character()
The properties of the PipeOp.
Currently supported values are:
"validation"
: the PipeOp
can make use of the $internal_valid_task
of an mlr3::Task
.
This is for example used for PipeOpLearner
s that wrap a Learner
with this property, see mlr3::Learner
.
PipeOp
s that have this property also have a $validate
field, which controls whether to use the validation task,
as well as a $internal_valid_scores
field, which allows access to the internal validation scores after training.
"internal_tuning"
: the PipeOp
is able to internally optimize hyperparameters.
This works analogously to the internal tuning implementation for mlr3::Learner
.
PipeOp
s with that property also implement the standardized accessor $internal_tuned_values
and have at least one
parameter tagged with "internal_tuning"
.
An example for such a PipeOp
is a PipeOpLearner
that wraps a Learner
with the "internal_tuning"
property.
Programmatic access to all available properties is possible via mlr_reflections$pipeops$properties
.
train(input)
(list
) -> named list
Train PipeOp
on inputs
, transform them to outputs, and store the learned $state
. If the PipeOp is already
trained, the existing $state
is overwritten. The input list is type-checked against the $input
train
column.
Return value is a list with as many entries as $output
has
rows, with each entry named after the $output
name
column and class according to the $output
train
column.
The workhorse function for training each PipeOp
is the private
.train(input)
: (named list
) -> list
function.
It is an abstract function that must be implemented by concrete subclasses. private$.train()
is called by $train()
after
typechecking. It must change the $state
value to something non-NULL
and return a list of transformed data according to
the $output
train
column. Names of the returned list are ignored.
The private$.train()
method should not be called by a user; instead, the $train()
method should be used which does some
checking and possibly type conversion.
predict(input)
(list
) -> named list
Predict on new data in input
, possibly using the stored $state
. Input and output are specified by $input
and $output
in the same way as for $train()
, except that
the predict
column is used for type checking.
The workhorse function for predicting with each PipeOp
is the private
.predict(input)
(named list
) -> list
Abstract function that must be implemented by concrete subclasses. private$.predict()
is called by $predict()
after
typechecking and works analogously to private$.train()
. Unlike private$.train()
, private$.predict()
should not modify
the PipeOp
in any way.
Just like private$.train()
, private$.predict()
should not be called by a user; instead, the $predict()
method should be used.
print()
() -> NULL
Prints the PipeOp
's most salient information: $id
, $is_trained
, $param_set$values
, $input
and $output
.
help(help_type)
(character(1)
) -> help file
Displays the help file of the concrete PipeOp
instance. help_type
is one of "text"
, "html"
, "pdf"
and behaves
as the help_type
argument of R's help()
.
To create your own PipeOp
, you need to overload the private$.train()
and private$.predict()
functions.
It is most likely also necessary to overload the $initialize()
function to do additional initialization.
The $initialize()
method should have at least the arguments id
and param_vals
, which should be passed on to super$initialize()
unchanged.
id
should have a useful default value, and param_vals
should have the default value list()
, meaning no initialization of hyperparameters.
If the $initialize()
method has more arguments, then it is necessary to also overload the private$.additional_phash_input()
function.
This function should return either all objects, or a hash of all objects, that can change the function or behavior of the PipeOp
and are independent
of the class, the id, the $state
, and the $param_set$values
. The last point is particularly important: changing the $param_set$values
should
not change the return value of private$.additional_phash_input()
.
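A minimal sketch of these inheritance rules follows. The class PipeOpAddConst, its extra construction argument const, and the public field of the same name are hypothetical and exist only for this illustration; the sketch assumes that returning the field from private$.additional_phash_input() is enough to make it part of the checksum.

```r
library(mlr3pipelines)

# Hypothetical PipeOp that adds a constant given at construction to a numeric input.
PipeOpAddConst = R6::R6Class("PipeOpAddConst",
  inherit = PipeOp,
  public = list(
    const = NULL,
    # 'id' has a useful default, 'param_vals' defaults to list();
    # both are passed on to super$initialize() unchanged
    initialize = function(const, id = "addconst", param_vals = list()) {
      self$const = const
      super$initialize(id, param_vals = param_vals,
        input = data.table::data.table(name = "input",
          train = "numeric", predict = "numeric"),
        output = data.table::data.table(name = "output",
          train = "numeric", predict = "numeric")
      )
    }
  ),
  private = list(
    .train = function(inputs) {
      self$state = list()    # must be set to something non-NULL
      list(inputs[[1]] + self$const)
    },
    .predict = function(inputs) {
      list(inputs[[1]] + self$const)
    },
    # 'const' changes behavior but is neither id, state, nor a hyperparameter value
    .additional_phash_input = function() self$const
  )
)

a = PipeOpAddConst$new(const = 1)
b = PipeOpAddConst$new(const = 2)
a$phash != b$phash         # the construction argument is reflected in the checksum
a$train(list(1))
```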
https://mlr-org.com/pipeops.html
Other mlr3pipelines backend related:
Graph
,
PipeOpTargetTrafo
,
PipeOpTaskPreproc
,
PipeOpTaskPreprocSimple
,
mlr_graphs
,
mlr_pipeops
,
mlr_pipeops_updatetarget
Other PipeOps:
PipeOpEnsemble
,
PipeOpImpute
,
PipeOpTargetTrafo
,
PipeOpTaskPreproc
,
PipeOpTaskPreprocSimple
,
mlr_pipeops
,
mlr_pipeops_boxcox
,
mlr_pipeops_branch
,
mlr_pipeops_chunk
,
mlr_pipeops_classbalancing
,
mlr_pipeops_classifavg
,
mlr_pipeops_classweights
,
mlr_pipeops_colapply
,
mlr_pipeops_collapsefactors
,
mlr_pipeops_colroles
,
mlr_pipeops_copy
,
mlr_pipeops_datefeatures
,
mlr_pipeops_encode
,
mlr_pipeops_encodeimpact
,
mlr_pipeops_encodelmer
,
mlr_pipeops_featureunion
,
mlr_pipeops_filter
,
mlr_pipeops_fixfactors
,
mlr_pipeops_histbin
,
mlr_pipeops_ica
,
mlr_pipeops_imputeconstant
,
mlr_pipeops_imputehist
,
mlr_pipeops_imputelearner
,
mlr_pipeops_imputemean
,
mlr_pipeops_imputemedian
,
mlr_pipeops_imputemode
,
mlr_pipeops_imputeoor
,
mlr_pipeops_imputesample
,
mlr_pipeops_kernelpca
,
mlr_pipeops_learner
,
mlr_pipeops_missind
,
mlr_pipeops_modelmatrix
,
mlr_pipeops_multiplicityexply
,
mlr_pipeops_multiplicityimply
,
mlr_pipeops_mutate
,
mlr_pipeops_nmf
,
mlr_pipeops_nop
,
mlr_pipeops_ovrsplit
,
mlr_pipeops_ovrunite
,
mlr_pipeops_pca
,
mlr_pipeops_proxy
,
mlr_pipeops_quantilebin
,
mlr_pipeops_randomprojection
,
mlr_pipeops_randomresponse
,
mlr_pipeops_regravg
,
mlr_pipeops_removeconstants
,
mlr_pipeops_renamecolumns
,
mlr_pipeops_replicate
,
mlr_pipeops_scale
,
mlr_pipeops_scalemaxabs
,
mlr_pipeops_scalerange
,
mlr_pipeops_select
,
mlr_pipeops_smote
,
mlr_pipeops_spatialsign
,
mlr_pipeops_subsample
,
mlr_pipeops_targetinvert
,
mlr_pipeops_targetmutate
,
mlr_pipeops_targettrafoscalerange
,
mlr_pipeops_textvectorizer
,
mlr_pipeops_threshold
,
mlr_pipeops_tunethreshold
,
mlr_pipeops_unbranch
,
mlr_pipeops_updatetarget
,
mlr_pipeops_vtreat
,
mlr_pipeops_yeojohnson
# Example (bogus) PipeOp that returns the sum of two numbers during $train()
# as well as a letter of the alphabet corresponding to that sum during $predict().
library("mlr3pipelines")

PipeOpSumLetter = R6::R6Class("sumletter",
  inherit = PipeOp,  # inherit from PipeOp
  public = list(
    initialize = function(id = "posum", param_vals = list()) {
      super$initialize(id, param_vals = param_vals,
        # declare "input" and "output" during construction here
        # training takes two 'numeric' and returns a 'numeric';
        # prediction takes 'NULL' and returns a 'character'.
        input = data.table::data.table(name = c("input1", "input2"),
          train = "numeric", predict = "NULL"),
        output = data.table::data.table(name = "output",
          train = "numeric", predict = "character")
      )
    }
  ),
  private = list(
    # PipeOp deriving classes must implement .train and
    # .predict; each taking an input list and returning
    # a list as output.
    .train = function(input) {
      sum = input[[1]] + input[[2]]
      self$state = sum  # store the sum so that .predict can use it
      list(sum)
    },
    .predict = function(input) {
      list(letters[self$state])
    }
  )
)

posum = PipeOpSumLetter$new()
print(posum)
posum$train(list(1, 2))
# note the name 'output' is the name of the output channel specified
# in the $output data.table.
posum$predict(list(NULL, NULL))