knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 6, fig.align = "center" )
This vignette showcases the functions regressionImp()
and rangerImpute()
,
which can both be used to generate imputations for several variables in a
dataset using a formula interface.
For data, a subset of sleep
is used. The columns have been selected
deliberately to include some interactions between the missing values.
library(VIM) dataset <- sleep[, c("Dream", "NonD", "BodyWgt", "Span")] dataset$BodyWgt <- log(dataset$BodyWgt) dataset$Span <- log(dataset$Span) aggr(dataset) str(dataset)
In order to invoke the imputation methods, a formula is used to specify which
variables are to be estimated and which variables should be used as regressors.
We will start by imputing NonD
based in BodyWgt
and Span
.
imp_regression <- regressionImp(NonD ~ BodyWgt + Span, dataset) imp_ranger <- rangerImpute(NonD ~ BodyWgt + Span, dataset) aggr(imp_regression, delimiter = "_imp")
We can see that for regrssionImp()
there are still missings in NonD
for all observations where
Span
is unobserved. This is because the regression model could not be applied
to those observations. The same is true for the values imputed via
rangerImpute()
.
As we can see in the next two plots, the correlation structure of NonD
and
BodyWgt
is preserved by both imputation methods. In the case of
regressionImp()
all imputed values almost follow a straight line. This
suggests that the variable Span
had little to no effect on the model.
imp_regression[, c("NonD", "BodyWgt", "NonD_imp")] |> marginplot(delimiter = "_imp")
For rangerImpute()
on the other hand, Span
played an important role in the
generation of the imputed values.
imp_ranger[, c("NonD", "BodyWgt", "NonD_imp")] |> marginplot(delimiter = "_imp") imp_ranger[, c("NonD", "Span", "NonD_imp")] |> marginplot(delimiter = "_imp")
To impute several variables at once, the formula in rangerImpute()
and
regressionImp()
can be specified with more than one column name in the
left hand side.
imp_regression <- regressionImp(Dream + NonD ~ BodyWgt + Span, dataset) imp_ranger <- rangerImpute(Dream + NonD ~ BodyWgt + Span, dataset) aggr(imp_regression, delimiter = "_imp")
Again, there are missings left for both Dream
and NonD
.
In order to validate the performance of regressionImp()
the iris
dataset is used. Firstly, some values are randomly set to NA
.
library(reactable) data(iris) df <- iris colnames(df) <- c("S.Length","S.Width","P.Length","P.Width","Species") # randomly produce some missing values in the data set.seed(1) nbr_missing <- 50 y <- data.frame(row=sample(nrow(iris),size = nbr_missing,replace = T), col=sample(ncol(iris)-1,size = nbr_missing,replace = T)) y<-y[!duplicated(y),] df[as.matrix(y)]<-NA aggr(df) sapply(df, function(x)sum(is.na(x)))
We can see that there are missings in all variables and some observations reveal missing values on several points. In the next step we perform a multiple variable imputation and Species
serves as a regressor.
imp_regression <- regressionImp(S.Length + S.Width + P.Length + P.Width ~ Species, df) aggr(imp_regression, delimiter = "imp")
The plot indicates that all missing values have been imputed by the regressionImp()
algorithm. The following table displays the rounded first five results of the imputation for all variables.
results <- cbind("TRUE1" = as.numeric(iris[as.matrix(y[which(y$col==1),])]), "IMPUTED1" = round(as.numeric(imp_regression[as.matrix(y[which(y$col==1),])]),2), "TRUE2" = as.numeric(iris[as.matrix(y[which(y$col==2),])]), "IMPUTED2" = round(as.numeric(imp_regression[as.matrix(y[which(y$col==2),])]),2), "TRUE3" = as.numeric(iris[as.matrix(y[which(y$col==3),])]), "IMPUTED3" = round(as.numeric(imp_regression[as.matrix(y[which(y$col==3),])]),2), "TRUE4" = as.numeric(iris[as.matrix(y[which(y$col==4),])]), "IMPUTED4" = round(as.numeric(imp_regression[as.matrix(y[which(y$col==4),])]),2))[1:5,] reactable(results, columns = list( TRUE1 = colDef(name = "True"), IMPUTED1 = colDef(name = "Imputed"), TRUE2 = colDef(name = "True"), IMPUTED2 = colDef(name = "Imputed"), TRUE3 = colDef(name = "True"), IMPUTED3 = colDef(name = "Imputed"), TRUE4 = colDef(name = "True"), IMPUTED4 = colDef(name = "Imputed") ), columnGroups = list( colGroup(name = "S.Length", columns = c("TRUE1", "IMPUTED1")), colGroup(name = "S.Width", columns = c("TRUE2", "IMPUTED2")), colGroup(name = "P.Length", columns = c("TRUE3", "IMPUTED3")), colGroup(name = "P.Width", columns = c("TRUE4", "IMPUTED4")) ), striped = TRUE, highlight = TRUE, bordered = TRUE )
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.