mlr_pipeops_imputeoor | R Documentation |
Impute factor features by adding a new level ".MISSING".
Impute numerical features by constant values shifted below the minimum or above the maximum, using min(x) - offset - multiplier * diff(range(x)) or max(x) + offset + multiplier * diff(range(x)).
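As a worked illustration of the two formulas above, the following base R sketch computes both out-of-range constants for a small vector. The vector x and the settings offset = 1 and multiplier = 1 are made up for illustration, and na.rm = TRUE is an assumption added here so that the missing entry does not propagate into min(), max() and range().

```r
# Illustrative feature with one missing value (made-up data)
x <- c(2, 5, NA, 10)
offset <- 1
multiplier <- 1

# Spread of the observed values: max(x) - min(x)
spread <- diff(range(x, na.rm = TRUE))

# Constant placed below the observed minimum
below <- min(x, na.rm = TRUE) - offset - multiplier * spread
# Constant placed above the observed maximum
above <- max(x, na.rm = TRUE) + offset + multiplier * spread

# Replace the missing entry with the below-minimum constant
x_imputed <- ifelse(is.na(x), below, x)

print(c(below = below, above = above))
print(x_imputed)
```

Because the constant lies outside the observed range, a tree-based learner can isolate the imputed observations with a single split.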
This type of imputation is especially sensible in the context of tree-based methods, see also Ding & Simonoff (2010).
Learners expect input Tasks to have the same factor (or ordered) levels during training as well as prediction. This PipeOp modifies the levels of factor and ordered features, and since it may occur that a factor or ordered feature contains missing values only during prediction, but not during training, the output Task could also have different levels during the two stages.
To avoid problems with the Learners' expectation, controlling the PipeOp's handling of this edge case is necessary. For this, use the create_empty_level hyperparameter inherited from PipeOpImpute.
If create_empty_level is set to TRUE, then an unseen level ".MISSING" is added to the feature during training and missing values are imputed as ".MISSING" during prediction. However, empty factor levels during training can be a problem for many Learners.
If create_empty_level is set to FALSE, then no empty level is introduced during training, but columns that have missing values only during prediction will not be imputed. This is why it may still be necessary to use po("imputesample", affect_columns = selector_type(types = c("factor", "ordered"))) (or another imputation method) after this imputation method.
Note that setting create_empty_level to FALSE is the same as setting it to TRUE and using PipeOpFixFactors after this PipeOp.
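The level mismatch described above can be reproduced with plain factors. The following base R sketch is not the PipeOp's implementation. The helper impute_missing_level() is hypothetical and only mimics the idea of adding a ".MISSING" level; the data are made up.

```r
# Feature as seen during training: no missing values
train_f <- factor(c("a", "b", "c"))
# Same feature during prediction: contains a missing value
predict_f <- factor(c("a", NA, "b"), levels = levels(train_f))

# Hypothetical helper: append the ".MISSING" level and fill NAs with it
impute_missing_level <- function(f) {
  levels(f) <- c(levels(f), ".MISSING")
  f[is.na(f)] <- ".MISSING"
  f
}

# Adding the level only where NAs occur leaves the two stages
# with different level sets
predict_imputed <- impute_missing_level(predict_f)
print(levels(train_f))
print(levels(predict_imputed))
print(identical(levels(train_f), levels(predict_imputed)))

# Adding the (empty) level during training as well keeps the levels aligned,
# at the cost of a level with zero training observations
train_with_level <- impute_missing_level(train_f)
print(identical(levels(train_with_level), levels(predict_imputed)))
print(table(train_with_level))
```

The last table shows the trade-off: the levels now match across both stages, but ".MISSING" has no observations during training.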
R6Class object inheriting from PipeOpImpute/PipeOp.
PipeOpImputeOOR$new(id = "imputeoor", param_vals = list())
id :: character(1)
Identifier of resulting object, default "imputeoor".
param_vals :: named list
List of hyperparameter settings, overwriting the hyperparameter settings that would otherwise be set during construction. Default list().
Input and output channels are inherited from PipeOpImpute.
The output is the input Task with all affected features having missing values imputed as described above.
The $state is a named list with the $state elements inherited from PipeOpImpute.
The $state$model contains either ".MISSING" used for character and factor (also ordered) features or numeric(1) indicating the constant value used for imputation of integer and numeric features.
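A minimal sketch of inspecting the state after training, assuming the mlr3 and mlr3pipelines packages are installed. The data frame and its column names (num, fct, target) are made up for illustration, and the sketch assumes that $state$model is a list named by feature.

```r
library("mlr3")
library("mlr3pipelines")

# Made-up data: one numeric and one factor feature, each with a missing value
df <- data.frame(
  num = c(1.5, 4, NA, 9.5),
  fct = factor(c("u", NA, "v", "u")),
  target = factor(c("a", "b", "a", "b"))
)
task <- as_task_classif(df, target = "target")

# Train the PipeOp with its default hyperparameters
op <- po("imputeoor")
op$train(list(task))

# Show the stored imputation values
# (assumed to be one entry per affected feature)
str(op$state$model)
```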
The parameters are the parameters inherited from PipeOpImpute, as well as:
min :: logical(1)
Should integer and numeric features be shifted below the minimum? Initialized to TRUE. If FALSE they are shifted above the maximum. See also the description above.
offset :: numeric(1)
Numerical non-negative offset as used in the description above for integer and numeric features. Initialized to 1.
multiplier :: numeric(1)
Numerical non-negative multiplier as used in the description above for integer and numeric features. Initialized to 1.
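The effect of the three parameters can be checked on a toy task. The sketch below assumes the mlr3 and mlr3pipelines packages are installed. The data frame, its column names and the chosen values (min = FALSE, offset = 0.5, multiplier = 2) are made up for illustration, and the comparison value is computed by hand from the formula in the description.

```r
library("mlr3")
library("mlr3pipelines")

# Made-up data: a double-valued feature with one missing entry
df <- data.frame(
  num = c(1.5, 4, NA, 9.5),
  target = factor(c("a", "b", "a", "b"))
)
task <- as_task_classif(df, target = "target")

# Shift above the maximum instead of below the minimum
op <- po("imputeoor", min = FALSE, offset = 0.5, multiplier = 2)
imputed <- op$train(list(task))[[1]]$data()

# Value implied by max(x) + offset + multiplier * diff(range(x))
expected <- max(df$num, na.rm = TRUE) + 0.5 +
  2 * diff(range(df$num, na.rm = TRUE))

print(expected)
print(imputed$num)
```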
Adds an explicit new level() to factor and ordered features, but not to character features.
For integer and numeric features, the min, max, diff and range functions are used.
Features of type integer and numeric that are entirely NA are imputed as 0. Features of type factor and ordered that are entirely NA are imputed as ".MISSING".
Only fields inherited from PipeOp.
Only methods inherited from PipeOpImpute/PipeOp.
Ding Y, Simonoff JS (2010). “An Investigation of Missing Data Methods for Classification Trees Applied to Binary Response Data.” Journal of Machine Learning Research, 11(6), 131-170. https://jmlr.org/papers/v11/ding10a.html.
https://mlr-org.com/pipeops.html
Other PipeOps: PipeOp, PipeOpEncodePL, PipeOpEnsemble, PipeOpImpute, PipeOpTargetTrafo, PipeOpTaskPreproc, PipeOpTaskPreprocSimple, mlr_pipeops, mlr_pipeops_adas, mlr_pipeops_blsmote, mlr_pipeops_boxcox, mlr_pipeops_branch, mlr_pipeops_chunk, mlr_pipeops_classbalancing, mlr_pipeops_classifavg, mlr_pipeops_classweights, mlr_pipeops_colapply, mlr_pipeops_collapsefactors, mlr_pipeops_colroles, mlr_pipeops_copy, mlr_pipeops_datefeatures, mlr_pipeops_decode, mlr_pipeops_encode, mlr_pipeops_encodeimpact, mlr_pipeops_encodelmer, mlr_pipeops_encodeplquantiles, mlr_pipeops_encodepltree, mlr_pipeops_featureunion, mlr_pipeops_filter, mlr_pipeops_fixfactors, mlr_pipeops_histbin, mlr_pipeops_ica, mlr_pipeops_imputeconstant, mlr_pipeops_imputehist, mlr_pipeops_imputelearner, mlr_pipeops_imputemean, mlr_pipeops_imputemedian, mlr_pipeops_imputemode, mlr_pipeops_imputesample, mlr_pipeops_kernelpca, mlr_pipeops_learner, mlr_pipeops_learner_pi_cvplus, mlr_pipeops_learner_quantiles, mlr_pipeops_missind, mlr_pipeops_modelmatrix, mlr_pipeops_multiplicityexply, mlr_pipeops_multiplicityimply, mlr_pipeops_mutate, mlr_pipeops_nearmiss, mlr_pipeops_nmf, mlr_pipeops_nop, mlr_pipeops_ovrsplit, mlr_pipeops_ovrunite, mlr_pipeops_pca, mlr_pipeops_proxy, mlr_pipeops_quantilebin, mlr_pipeops_randomprojection, mlr_pipeops_randomresponse, mlr_pipeops_regravg, mlr_pipeops_removeconstants, mlr_pipeops_renamecolumns, mlr_pipeops_replicate, mlr_pipeops_rowapply, mlr_pipeops_scale, mlr_pipeops_scalemaxabs, mlr_pipeops_scalerange, mlr_pipeops_select, mlr_pipeops_smote, mlr_pipeops_smotenc, mlr_pipeops_spatialsign, mlr_pipeops_subsample, mlr_pipeops_targetinvert, mlr_pipeops_targetmutate, mlr_pipeops_targettrafoscalerange, mlr_pipeops_textvectorizer, mlr_pipeops_threshold, mlr_pipeops_tomek, mlr_pipeops_tunethreshold, mlr_pipeops_unbranch, mlr_pipeops_updatetarget, mlr_pipeops_vtreat, mlr_pipeops_yeojohnson
Other Imputation PipeOps: PipeOpImpute, mlr_pipeops_imputeconstant, mlr_pipeops_imputehist, mlr_pipeops_imputelearner, mlr_pipeops_imputemean, mlr_pipeops_imputemedian, mlr_pipeops_imputemode, mlr_pipeops_imputesample
library("mlr3")
library("mlr3pipelines")  # provides po(), selector_type() and %>>%
set.seed(2409)
# add a factor and an ordered column with missing values to the pima data
data = tsk("pima")$data()
data$y = factor(c(NA, sample(letters, size = 766, replace = TRUE), NA))
data$z = ordered(c(NA, sample(1:10, size = 767, replace = TRUE)))
task = TaskClassif$new("task", backend = data, target = "diabetes")
task$missings()
# named `op` so that the po() constructor is not shadowed
op = po("imputeoor")
new_task = op$train(list(task = task))[[1]]
new_task$missings()
new_task$data()
# recommended use when missing values are expected during prediction on
# factor columns that had no missing values during training
gr = po("imputeoor", create_empty_level = FALSE) %>>%
  po("imputesample", affect_columns = selector_type(types = c("factor", "ordered")))
# t1 has no missing values, t2 has missing values in the ordered feature
t1 = as_task_classif(data.frame(l = as.ordered(letters[1:3]), t = letters[1:3]), target = "t")
t2 = as_task_classif(data.frame(l = as.ordered(c("a", NA, NA)), t = letters[1:3]), target = "t")
gr$train(t1)[[1]]$data()
# missing values during prediction are sampled randomly
gr$predict(t2)[[1]]$data()