knitr::opts_chunk$set(echo = TRUE, comment = NA, error = TRUE)

This tutorial covers techniques of multiple imputation. Multiple imputation is a strategy for dealing with missing data. Whereas we typically (i.e., automatically) deal with missing data through casewise deletion of any observations that have missing values on key variables, imputation attempts to replace missing values with an estimated value.

In single imputation, we guess that missing value one time (perhaps based on the means of observed values, or a random sampling of those values). In multiple imputation, we instead draw multiple values for each missing value, effectively building multiple datasets, each of which replaces the missing data in a different way.

There are numerous algorithms for this, each of which builds those multiple datasets in different ways. We're not going to discuss the details here, but instead focus on executing multiple imputation in R. The main challenge of multiple imputation is not the analysis (it simply proceeds as usual on each imputed dataset) but instead the aggregation of those separate analyses. The examples below discuss how to do this.

To get a basic feel for the process, let's imagine that we're trying to calculate the mean of a vector of values that contains missing values. We can impute the missing values by drawing from the observed values, repeat the process several times, and then average across the estimated means to get an estimate of the mean with a measure of uncertainty that accounts for the uncertainty due to imputation. Let's create a vector of ten values, seven of which we observe and three of which are missing, and imagine that they are random draws from the population whose mean we're trying to estimate:

Source: http://thomasleeper.com/Rcourse/Tutorials/mi.html

n <- 10            # "n" values; vector of "n" length
bad <- 3           # number of NAs
good <- n - bad    # valid values

set.seed(10)

x <- c(sample(1:n, good, replace = TRUE), rep(NA, bad))  # build vector of "n" lements
x
plot(x, col = "blue")

We can find the mean using case deletion:

mean(x, na.rm = TRUE)

Our estimate of the sample standard error is then:

# Standard Error of only the elements that are not NA
sum(!is.na(x))
sd(x, na.rm = TRUE) / sqrt(sum(!is.na(x)))

impute with random values

Now let's impute several times to generate a list of imputed vectors:

# Generate 15 samples of new data replacing the NA elements with values 
# extracted from the GOOD values.

na.count <- length(x[is.na(x) == TRUE])    # calculate the # of NAs

imp <- replicate(15, 
                 c(x[!is.na(x)],                           # get the 7 non-NA elements
                   sample(x[!is.na(x)], na.count, TRUE)),  # get 3 elements from non-NA vec
                 simplify = FALSE)                         # output to be a vector
imp
plot(imp[[1]])
library(zoo)

St <- x

Sti <- na.locf(x)
nas.count(Sti)
par(mfrow=c(1,1))
limRow <- length(x)

colvec <- ifelse(is.na(St), mdc(2), mdc(1))    # missing data coloring (Mdc)

plot(Sti[1:limRow], 
     col = colvec, 
     type="l", 
     xlab="Observation", 
     ylab="Value")

points(Sti[1:limRow], col = colvec, pch=20, cex=2)
# points(imp[[1]], col = colvec, pch=21, cex=1.5)
# points(imp[[2]], col = colvec, pch=22, cex=1.5)
# points(imp[[3]], col = colvec, pch=23, cex=1.5)
# points(imp[[4]], col = colvec, pch=24, cex=1.5)
# points(imp[[5]], col = colvec, pch=25, cex=1.5)

for (i in 1:length(imp)) {
  points(imp[[i]], col = colvec, pch=i, cex=1)
}

The result is a list of fifteen vectors. The first seven values of each is the same as our original data, but the three missing values have been replaced with different combinations of the observed values.

To get our new estimated mean, we simply take the mean of each vector, and then average across them:

means <- sapply(imp, mean)
means
# the grand mean or the mean of all means
grandm <- mean(means)
grandm

The result is 4.147, about the same as our original estimate. To get the standard error of our multiple imputation estimate, we need to combine the standard errors of each of our estimates, so that means we need to start by getting the SEs of each imputed vector:

# standard error
ses <- sapply(imp, sd) / sqrt(10)
ses

Aggregating the standard errors is a bit complicated, but basically sums the mean of the SEs (i.e., the “within-imputation variance”) with the variance across the different estimated means (the “between-imputation variance”). To calculate the within-imputation variance, we simply average the SE estimates:

within <- mean(ses)
within

To calculate the between-imputation variance, we calculate the sum of squared deviations of each imputed mean from the grand mean estimate:

between <- sum((means - grandm)^2) / (length(imp) - 1)
between

Then we sum the within- and between-imputation variances (multiply the latter by a small correction):

grandvar <- within + ((1 + (1 / length(imp))) * between)
grandse <- sqrt(grandvar)
grandse

The resulting standard error is interesting because we increase the precision of our estimate by using 10 rather than 7 values (and standard errors are proportionate to sample size), but is larger than our original standard error because we have to account for uncertainty due to imputation. Thus, if our missing values are truly missing at random, we can get a better estimate that is actually representative of our original population.

Most multiple imputation algorithms are, however, applied to multivariate data rather than a single data vector and thereby use additional information about the relationship between observed values and missingness to reach even more precise estimates of target parameters.

Packages for multiple imputation

There are three main R packages that offer multiple imputation techniques. Several other packages - described in the OfficialStatistics Task View - supply other imputation techniques, but packages Amelia (by Gary King and collaborators), mi (by Andrew Gelman and collaborators), and mice (by Stef van Buuren and collaborators) provide more than enough to work with. Let's start by installing these packages:

install.packages(c("Amelia", "mi", "mice"), repos = "http://cran.r-project.org")

Now, let's consider an imputation situation where we plan to conduct a regression analysis predicting y by two covariates: x1 and x2 but we have missing data in x1 and x2. Let's start by creating the dataframe:

x1 <- runif(100, 0, 5)
x2 <- rnorm(100)
x3 <- rnorm(100)
y <- x1 + x2 + x3 + rnorm(100)
mydf <- cbind.data.frame(x1, x2, x3, y)

Now, let's randomly remove some of the observed values of the independent variables:

mydf$x1[sample(1:nrow(mydf), 20, FALSE)] <- NA
mydf$x2[sample(1:nrow(mydf), 10, FALSE)] <- NA
mydf$x3[sample(1:nrow(mydf), 15, FALSE)] <- NA

The result is the removal of thirty values, 20 from x1 and 10 from x2:

summary(mydf)

If we estimate the regression on these data, R will force casewise deletion of 28 cases:

lm <- lm(y ~ x1 + x2, data = mydf)
summary(lm)

We should thus be quite skeptical of our results given taht we're discarding a substantial portion of our observations (28%, in fact). Let's see how the various multiple imputation packages address this and affect our inference.

Amelia

library(Amelia)
imp.amelia <- amelia(mydf)

Once we've run our multiple imputation, we can see where are missing data lie:

missmap(imp.amelia)

We can also run our regression model on each imputed dataset. We'll use the lapply function to do this quickly on each of the imputed dataframes:

lm.amelia.out <- lapply(imp.amelia$imputations, function(i) lm(y ~ x1 + x2, data = i))

If we look at lm.amelia.out we'll see the results of the model run on each imputed dataframe separately:

# each of the imputations is a data frame 100x3 with their own imputted values.
str(imp.amelia$imputations)
lm.amelia.out

To aggregate across the results is a little bit tricky because we have to extract the coefficients and standard errors from each model, format them in a particular way, and then feed that structure into the mi.meld function:

coefs.amelia <- do.call(rbind, 
                        lapply(lm.amelia.out, function(i) coef(summary(i))[, 1]))
ses.amelia <- do.call(rbind, 
                      lapply(lm.amelia.out, function(i) coef(summary(i))[, 2]))

mi.meld(coefs.amelia, ses.amelia)

Now let's compare these results to those of our original model:

t(do.call(rbind, mi.meld(coefs.amelia, ses.amelia)))
coef(summary(lm))[, 1:2]  # original results

mi

library(mi)
mp.plot(mydf)
# A function that plots missingness
# requires `reshape2`

library(reshape2)
library(ggplot2)

ggplot_missing <- function(x){

  x %>% 
    is.na %>%
    melt %>%
    ggplot(data = .,
           aes(x = Var2,
               y = Var1)) +
    geom_raster(aes(fill = value)) +
    scale_fill_grey(name = "",
                    labels = c("Present","Missing")) +
    theme_minimal() + 
    theme(axis.text.x  = element_text(angle=45, vjust=0.5)) + 
    labs(x = "Variables in Dataset",
         y = "Rows / observations")
}
par(ask=F)
library(dplyr)
ggplot_missing(mydf)
RepDataPeerAssessment1::ggplot_missing(mydf)
par(ask = F)
missing.pattern.plot(mydf, clustered = FALSE)
library(extracat)

visna(mydf)
visdf(mydf)
mdf <- missing_data.frame(mydf) 
mdf <- change(mdf, y = "y", what = "transformation", to = "identity")
summary(mdf)
imputations <- mi(mdf)
par(ask = FALSE)
plot(imputations)
# ==============================================================================
# missing pattern plot
# ==============================================================================
mp.plot <- missing.pattern.plot <- function ( data, y.order = FALSE, x.order = FALSE, 
                                    clustered = TRUE, 
                                    xlab = "Index", ylab = "Variable", 
                                    main = NULL, gray.scale = FALSE,
                                    obs.col = "blue", mis.col = "red", ... ) 
{

  if (is.null(main)) {
    main <- deparse( substitute( data ) )
  }
  index <- seq(nrow(data))
  x.at =  1:nrow( data )
  x.lab = index
  if( y.order ) { 
    data <- data[ ,order( colSums( is.na( data ) ), decreasing = TRUE ) ] 
    ylab = "Ordered by number of missing items" 
  }
  if( x.order ) { 
    data <- data[order( rowSums( is.na( data ) ), decreasing = FALSE ), ] 
    index<- row.names( data )
    xlab = "Ordered by number of missing items" 
    x.at = NULL
    x.lab= FALSE
  }
  missingIndex <- as.matrix(is.na(data))*1
  if(clustered){
    orderIndex <-  order.dendrogram(as.dendrogram(hclust(dist(missingIndex), method="mcquitty")))
    missingIndex <- missingIndex[orderIndex,]
  }
  col <- if( gray.scale ){ 
           gray(c(0, 1)) 
         } 
         else { 
           c(obs.col, mis.col) 
         }
#  par( mar = c( 4.5, 11, 3, 1 ) )
#  par( mgp = c( 1, .3, 0 ) )
#  par( cex.lab = 0.7 )
  image(x = 1:nrow(data), y = 1:ncol(data), z = missingIndex, 
        ylab = "", xlab = xlab, main = main, col = col ,yaxt = "n",
        tck = -0.05, xaxt="n", ...)
  box( "plot" )
  axis( side = 2, at = 1:ncol( data ), labels = names( data ), las = 1, 
         tick = FALSE, yaxs = "r", tcl = 0.3, xaxs ="i", yaxs = "i" )
  mtext( ylab, side = 3 , line = 10, cex=0.7)
  if( x.order ) { 
    axis( side = 1, at =x.at, labels = x.lab, tick = FALSE, 
          xaxs = "i", las = 1 )   
  } 
}

We can then see some summary information about the dataset and the nature of the missingness:

mi.info(mydf)

With that information confirmed, it is incredibly issue to conduct our multiple imputation using the mi function:

imp.mi <- mi(mydf)
imp.mi

The results above report how many imputed datasets were produced and summarizes some of the results we saw above. For linear regression (and several other common models), the mi package includes functions that automatically run the model on each imputed dataset and aggregate the results:

library(stats)
lm.mi.out <- lm.mi(y ~ x1 + x2, imp.mi)

mice

library(mice)
imp.mice <- mice(mydf)
summary(imp.mice)
lm.mice.out <- with(imp.mice, lm(y ~ x1 + x2))
summary(lm.mice.out)
pool.mice <- pool(lm.mice.out)
summary(pool.mice)  # multiply imputed results
coef(summary(lm))[, 1:2]  # original results


AlfonsoRReyes/RepDataPeerAssessment1 documentation built on May 5, 2019, 4:53 a.m.