knitr::opts_chunk$set(echo = TRUE, comment = NA, error = TRUE)
This tutorial covers techniques of multiple imputation. Multiple imputation is a strategy for dealing with missing data. Whereas we typically (i.e., automatically) deal with missing data through casewise deletion of any observations that have missing values on key variables, imputation attempts to replace missing values with an estimated value.
In single imputation
, we guess that missing value one time (perhaps based on the means of observed values, or a random sampling of those values). In multiple imputation
, we instead draw multiple values for each missing value, effectively building multiple datasets, each of which replaces the missing data in a different way.
There are numerous algorithms for this, each of which builds those multiple datasets in different ways. We're not going to discuss the details here, but instead focus on executing multiple imputation in R. The main challenge of multiple imputation is not the analysis (it simply proceeds as usual on each imputed dataset) but instead the aggregation of those separate analyses. The examples below discuss how to do this.
To get a basic feel for the process, let's imagine that we're trying to calculate the mean of a vector of values that contains missing values. We can impute the missing values by drawing from the observed values, repeat the process several times, and then average across the estimated means
to get an estimate of the mean with a measure of uncertainty that accounts for the uncertainty due to imputation
. Let's create a vector of ten values, seven of which we observe and three of which are missing, and imagine that they are random draws from the population whose mean we're trying to estimate:
Source: http://thomasleeper.com/Rcourse/Tutorials/mi.html
n <- 10 # "n" values; vector of "n" length bad <- 3 # number of NAs good <- n - bad # valid values set.seed(10) x <- c(sample(1:n, good, replace = TRUE), rep(NA, bad)) # build vector of "n" lements x
plot(x, col = "blue")
We can find the mean using case deletion:
mean(x, na.rm = TRUE)
Our estimate of the sample standard error
is then:
# Standard Error of only the elements that are not NA sum(!is.na(x)) sd(x, na.rm = TRUE) / sqrt(sum(!is.na(x)))
Now let's impute several times to generate a list of imputed vectors:
# Generate 15 samples of new data replacing the NA elements with values # extracted from the GOOD values. na.count <- length(x[is.na(x) == TRUE]) # calculate the # of NAs imp <- replicate(15, c(x[!is.na(x)], # get the 7 non-NA elements sample(x[!is.na(x)], na.count, TRUE)), # get 3 elements from non-NA vec simplify = FALSE) # output to be a vector imp
plot(imp[[1]])
library(zoo) St <- x Sti <- na.locf(x) nas.count(Sti) par(mfrow=c(1,1)) limRow <- length(x) colvec <- ifelse(is.na(St), mdc(2), mdc(1)) # missing data coloring (Mdc) plot(Sti[1:limRow], col = colvec, type="l", xlab="Observation", ylab="Value") points(Sti[1:limRow], col = colvec, pch=20, cex=2) # points(imp[[1]], col = colvec, pch=21, cex=1.5) # points(imp[[2]], col = colvec, pch=22, cex=1.5) # points(imp[[3]], col = colvec, pch=23, cex=1.5) # points(imp[[4]], col = colvec, pch=24, cex=1.5) # points(imp[[5]], col = colvec, pch=25, cex=1.5) for (i in 1:length(imp)) { points(imp[[i]], col = colvec, pch=i, cex=1) }
The result is a list of fifteen vectors. The first seven values of each is the same as our original data, but the three missing values have been replaced with different combinations of the observed values.
To get our new estimated mean
, we simply take the mean of each vector, and then average across them:
means <- sapply(imp, mean) means
# the grand mean or the mean of all means grandm <- mean(means) grandm
The result is 4.147, about the same as our original estimate. To get the standard error of our multiple imputation estimate, we need to combine the standard errors
of each of our estimates, so that means we need to start by getting the SEs of each imputed vector:
# standard error ses <- sapply(imp, sd) / sqrt(10) ses
Aggregating the standard errors is a bit complicated, but basically sums the mean of the SEs (i.e., the “within-imputation variance”) with the variance across the different estimated means (the “between-imputation variance”). To calculate the within-imputation variance, we simply average the SE estimates:
within <- mean(ses) within
To calculate the between-imputation variance, we calculate the sum of squared deviations of each imputed mean from the grand mean estimate:
between <- sum((means - grandm)^2) / (length(imp) - 1) between
Then we sum the within- and between-imputation variances (multiply the latter by a small correction):
grandvar <- within + ((1 + (1 / length(imp))) * between) grandse <- sqrt(grandvar) grandse
The resulting standard error is interesting because we increase the precision of our estimate by using 10 rather than 7 values (and standard errors are proportionate to sample size), but is larger than our original standard error because we have to account for uncertainty due to imputation. Thus, if our missing values are truly missing at random, we can get a better estimate that is actually representative of our original population.
Most multiple imputation algorithms are, however, applied to multivariate
data rather than a single data vector and thereby use additional information about the relationship between observed values and missingness to reach even more precise estimates of target parameters.
There are three main R packages that offer multiple imputation techniques. Several other packages - described in the OfficialStatistics Task View - supply other imputation techniques, but packages Amelia (by Gary King and collaborators), mi (by Andrew Gelman and collaborators), and mice (by Stef van Buuren and collaborators) provide more than enough to work with. Let's start by installing these packages:
install.packages(c("Amelia", "mi", "mice"), repos = "http://cran.r-project.org")
Now, let's consider an imputation situation where we plan to conduct a regression analysis predicting y by two covariates: x1 and x2 but we have missing data in x1 and x2. Let's start by creating the dataframe:
x1 <- runif(100, 0, 5) x2 <- rnorm(100) x3 <- rnorm(100) y <- x1 + x2 + x3 + rnorm(100) mydf <- cbind.data.frame(x1, x2, x3, y)
Now, let's randomly remove some of the observed values of the independent variables:
mydf$x1[sample(1:nrow(mydf), 20, FALSE)] <- NA mydf$x2[sample(1:nrow(mydf), 10, FALSE)] <- NA mydf$x3[sample(1:nrow(mydf), 15, FALSE)] <- NA
The result is the removal of thirty values, 20 from x1 and 10 from x2:
summary(mydf)
If we estimate the regression on these data, R will force casewise deletion of 28 cases:
lm <- lm(y ~ x1 + x2, data = mydf) summary(lm)
We should thus be quite skeptical of our results given taht we're discarding a substantial portion of our observations (28%, in fact). Let's see how the various multiple imputation packages address this and affect our inference.
library(Amelia)
imp.amelia <- amelia(mydf)
Once we've run our multiple imputation, we can see where are missing data lie:
missmap(imp.amelia)
We can also run our regression model on each imputed dataset. We'll use the lapply function to do this quickly on each of the imputed dataframes:
lm.amelia.out <- lapply(imp.amelia$imputations, function(i) lm(y ~ x1 + x2, data = i))
If we look at lm.amelia.out we'll see the results of the model run on each imputed dataframe separately:
# each of the imputations is a data frame 100x3 with their own imputted values. str(imp.amelia$imputations)
lm.amelia.out
To aggregate across the results is a little bit tricky because we have to extract the coefficients and standard errors from each model, format them in a particular way, and then feed that structure into the mi.meld function:
coefs.amelia <- do.call(rbind, lapply(lm.amelia.out, function(i) coef(summary(i))[, 1])) ses.amelia <- do.call(rbind, lapply(lm.amelia.out, function(i) coef(summary(i))[, 2])) mi.meld(coefs.amelia, ses.amelia)
Now let's compare these results to those of our original model:
t(do.call(rbind, mi.meld(coefs.amelia, ses.amelia)))
coef(summary(lm))[, 1:2] # original results
library(mi)
mp.plot(mydf)
# A function that plots missingness # requires `reshape2` library(reshape2) library(ggplot2) ggplot_missing <- function(x){ x %>% is.na %>% melt %>% ggplot(data = ., aes(x = Var2, y = Var1)) + geom_raster(aes(fill = value)) + scale_fill_grey(name = "", labels = c("Present","Missing")) + theme_minimal() + theme(axis.text.x = element_text(angle=45, vjust=0.5)) + labs(x = "Variables in Dataset", y = "Rows / observations") }
par(ask=F) library(dplyr) ggplot_missing(mydf)
RepDataPeerAssessment1::ggplot_missing(mydf)
par(ask = F) missing.pattern.plot(mydf, clustered = FALSE)
library(extracat) visna(mydf)
visdf(mydf)
mdf <- missing_data.frame(mydf)
mdf <- change(mdf, y = "y", what = "transformation", to = "identity")
summary(mdf)
imputations <- mi(mdf)
par(ask = FALSE) plot(imputations)
# ============================================================================== # missing pattern plot # ============================================================================== mp.plot <- missing.pattern.plot <- function ( data, y.order = FALSE, x.order = FALSE, clustered = TRUE, xlab = "Index", ylab = "Variable", main = NULL, gray.scale = FALSE, obs.col = "blue", mis.col = "red", ... ) { if (is.null(main)) { main <- deparse( substitute( data ) ) } index <- seq(nrow(data)) x.at = 1:nrow( data ) x.lab = index if( y.order ) { data <- data[ ,order( colSums( is.na( data ) ), decreasing = TRUE ) ] ylab = "Ordered by number of missing items" } if( x.order ) { data <- data[order( rowSums( is.na( data ) ), decreasing = FALSE ), ] index<- row.names( data ) xlab = "Ordered by number of missing items" x.at = NULL x.lab= FALSE } missingIndex <- as.matrix(is.na(data))*1 if(clustered){ orderIndex <- order.dendrogram(as.dendrogram(hclust(dist(missingIndex), method="mcquitty"))) missingIndex <- missingIndex[orderIndex,] } col <- if( gray.scale ){ gray(c(0, 1)) } else { c(obs.col, mis.col) } # par( mar = c( 4.5, 11, 3, 1 ) ) # par( mgp = c( 1, .3, 0 ) ) # par( cex.lab = 0.7 ) image(x = 1:nrow(data), y = 1:ncol(data), z = missingIndex, ylab = "", xlab = xlab, main = main, col = col ,yaxt = "n", tck = -0.05, xaxt="n", ...) box( "plot" ) axis( side = 2, at = 1:ncol( data ), labels = names( data ), las = 1, tick = FALSE, yaxs = "r", tcl = 0.3, xaxs ="i", yaxs = "i" ) mtext( ylab, side = 3 , line = 10, cex=0.7) if( x.order ) { axis( side = 1, at =x.at, labels = x.lab, tick = FALSE, xaxs = "i", las = 1 ) } }
We can then see some summary information about the dataset and the nature of the missingness:
mi.info(mydf)
With that information confirmed, it is incredibly issue to conduct our multiple imputation using the mi function:
imp.mi <- mi(mydf)
imp.mi
The results above report how many imputed datasets were produced and summarizes some of the results we saw above. For linear regression (and several other common models), the mi package includes functions that automatically run the model on each imputed dataset and aggregate the results:
library(stats) lm.mi.out <- lm.mi(y ~ x1 + x2, imp.mi)
library(mice)
imp.mice <- mice(mydf)
summary(imp.mice)
lm.mice.out <- with(imp.mice, lm(y ~ x1 + x2)) summary(lm.mice.out)
pool.mice <- pool(lm.mice.out)
summary(pool.mice) # multiply imputed results
coef(summary(lm))[, 1:2] # original results
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.