knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
Imputating missing values is an iterative process. naniar
aims to make it easier to manage imputed values by providing the nabular
data structure to simplify managing missingness. This vignette provides some useful recipes for imputing and exploring imputed data.
naniar
implements a few imputation methods to facilitate exploration and visualisations, which were not otherwise available: impute_below
, and impute_mean
. For single imputation, the R package simputation
works very well with naniar
, and provides the main example given.
impute_below
impute_below
imputes values below the minimum of the data, with some noise to reduce overplotting. The amount data is imputed below, and the amount of jitter, can be changed by changing the arguments prop_below
and jitter
.
library(dplyr) library(naniar) airquality %>% impute_below_at(vars(Ozone)) %>% select(Ozone, Solar.R) %>% head()
impute_mean
The mean can be imputed using impute_mean
, and is useful to explore structure in missingness, but are not recommended for use in analysis. Similar to simputation
, each impute_
function returns the data with values imputed.
Imputation functions in naniar
implement "scoped variants" for imputation: _all
, _at
and _if
.
This means:
_all
operates on all columns_at
operates on specific columns, and _if
operates on columns that meet some condition (such as is.numeric
or is.character
). If the impute_
functions are used as-is - e.g., impute_mean
, this will work on a single vector, but not a data.frame.
Some examples for impute_mean
are now given:
impute_mean(oceanbuoys$air_temp_c) %>% head() impute_mean_at(oceanbuoys, .vars = vars(air_temp_c)) %>% head() impute_mean_if(oceanbuoys, .predicate = is.integer) %>% head() impute_mean_all(oceanbuoys) %>% head()
When we impute data like this, we cannot identify where the imputed values are - we need to track them. We can track the imputed values using the nabular
format of the data.
We can track the missing values by combining the verbs bind_shadow
, impute_
, add_label_shadow
. We can then refer to missing values by their shadow variable, _NA
. The add_label_shadow
function adds an additional column called any_missing
, which tells us if any observation has a missing value.
We can impute the data using the easy-to-use simputation
package, and then track the missingness using bind_shadow
and add_label_shadow
:
library(simputation) ocean_imp <- oceanbuoys %>% bind_shadow() %>% impute_lm(air_temp_c ~ wind_ew + wind_ns) %>% impute_lm(humidity ~ wind_ew + wind_ns) %>% impute_lm(sea_temp_c ~ wind_ew + wind_ns) %>% add_label_shadow()
We can then show the previously missing (now imputed!) data in a scatterplot with ggplot2 by setting the color
aesthetic in ggplot to any_missing
:
library(ggplot2) ggplot(ocean_imp, aes(x = air_temp_c, y = humidity, color = any_missing)) + geom_point() + scale_color_brewer(palette = "Dark2") + theme(legend.position = "bottom")
Or, if you want to look at one variable, you can look at a density plot of one variable, using fill = any_missing
ggplot(ocean_imp, aes(x = air_temp_c, fill = any_missing)) + geom_density(alpha = 0.3) + scale_fill_brewer(palette = "Dark2") + theme(legend.position = "bottom") ggplot(ocean_imp, aes(x = humidity, fill = any_missing)) + geom_density(alpha = 0.3) + scale_fill_brewer(palette = "Dark2") + theme(legend.position = "bottom")
We can also compare imputed values to complete cases by grouping by any_missing
, and summarising.
ocean_imp %>% group_by(any_missing) %>% summarise_at(.vars = vars(air_temp_c), .funs = list( min = ~ min(.x, na.rm = TRUE), mean = ~ mean(.x, na.rm = TRUE), median = ~ median(.x, na.rm = TRUE), max = ~ max(.x, na.rm = TRUE) ))
One thing that we notice with our imputations are that they aren't very good - we can improve upon the imputation by including the variables year and latitude and longitude:
ocean_imp_yr <- oceanbuoys %>% bind_shadow() %>% impute_lm(air_temp_c ~ wind_ew + wind_ns + year + longitude + latitude) %>% impute_lm(humidity ~ wind_ew + wind_ns + year + longitude + latitude) %>% impute_lm(sea_temp_c ~ wind_ew + wind_ns + year + longitude + latitude) %>% add_label_shadow()
ggplot(ocean_imp_yr, aes(x = air_temp_c, y = humidity, color = any_missing)) + geom_point() + scale_color_brewer(palette = "Dark2") + theme(legend.position = "bottom")
Not all imputation packages return data in tidy
We can explore using a single imputation of Hmisc::aregImpute()
, which allows for multiple imputation with bootstrapping, additive regression, and predictive mean matching. We are going to explore predicting mean matching, and single imputation.
library(Hmisc) aq_imp <- aregImpute(~Ozone + Temp + Wind + Solar.R, n.impute = 1, type = "pmm", data = airquality) aq_imp
We are now going to get our data into nabular
form, and then insert the imputed values:
# nabular form! aq_nab <- nabular(airquality) %>% add_label_shadow() # insert imputed values aq_nab$Ozone[is.na(aq_nab$Ozone)] <- aq_imp$imputed$Ozone aq_nab$Solar.R[is.na(aq_nab$Solar.R)] <- aq_imp$imputed$Solar.R
In the future there will be a more concise way to insert these imputed values into data, but for the moment the method above is what I would recommend for single imputation.
We can then explore the imputed values like so:
ggplot(aq_nab, aes(x = Ozone, y = Solar.R, colour = any_missing)) + geom_point()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.