library("mlr") library("BBmisc") library("ParamHelpers") # show grouped code output instead of single lines knitr::opts_chunk$set(collapse = TRUE)
Function makeImputeMethod()
permits to create your own imputation method.
For this purpose you need to specify a learn function that extracts the necessary information and an impute function that does the actual imputation.
The learn and impute functions both have at least the following formal arguments:
data
is a base::data.frame()
with missing values in some features.col
indicates the feature to be imputed.target
indicates the target variable(s) in a supervised learning task.Let's have a look at function imputeMean (imputations()
).
showFunctionDef = function(fun, name = sub("^[^:]*:+", "", deparse(substitute(fun)))) { l = deparse(fun) l[1] = paste(name, l[1], sep = " = ") l }
imputeMean (imputations()
) calls the unexported mlr
function simpleImpute
which is defined as follows.
The learn function calculates the mean of the non-missing observations in column col
.
The mean is passed via argument const
to the impute function that replaces all missing values in feature col
.
Now let's write a new imputation method: A frequently used simple technique for longitudinal data is last observation carried forward (LOCF). Missing values are replaced by the most recent observed value.
In the R code below the learn function determines the last observed value previous to each NA
(values
) as well as the corresponding number of consecutive NA's
(times
).
The impute function generates a vector by replicating the entries in values
according to times
and replaces the NA's
in feature col
.
imputeLOCF = function() { makeImputeMethod( learn = function(data, target, col) { x = data[[col]] ind = is.na(x) dind = diff(ind) lastValue = which(dind == 1) # position of the last observed value previous to NA lastNA = which(dind == -1) # position of the last of potentially several consecutive NA's values = x[lastValue] # last observed value previous to NA times = lastNA - lastValue # number of consecutive NA's return(list(values = values, times = times)) }, impute = function(data, target, col, values, times) { x = data[[col]] replace(x, is.na(x), rep(values, times)) } ) }
Note that this function is just for demonstration and is lacking some checks for real-world usage (for example 'What should happen if the first value in x
is already missing?').
Below it is used to impute the missing values in features Ozone
and Solar.R
in the airquality (datasets::airquality()
) data set.
data(airquality) imp = impute(airquality, cols = list(Ozone = imputeLOCF(), Solar.R = imputeLOCF()), dummy.cols = c("Ozone", "Solar.R")) head(imp$data, 10)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.