Using Multiple Hot Deck Data Sets for Inference"
In hot.deck: Multiple Hot Deck Imputation

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

options(useFancyQuotes=FALSE, width=100)

This document will walk you through some of the methods you could use to generate pooled model results that account for both sampling variability and across imputation variability. The package hot.deck does not come with a set of functions to do inference, so we will show you how you could use the data generated by hot.deck in combination with glm.mids (and similarly lm.mids) from the mice package, zelig from the Zelig package and by using MIcombine from the mitools package on a list of model objects.

Generating Imputations

The data we will use come from @Poeetal1999 dealing with democracy and state repression. First we need to call the hot.deck routine on the dataset.

library(hot.deck)
data(isq99)
out <- hot.deck(isq99, sdCutoff=3, IDvars = c("IDORIGIN", "YEAR"))

This shows us that there are still 45 observations with fewer than 5 donors. Using a different method or further widening the sdCutoff parameter may alleviate the problem. If you want to see the frequency distribution of the number of donors, you could look at:

numdonors <- sapply(out$donors, length)
numdonors <- sapply(out$donors, length)
numdonors <- ifelse(numdonors > 5, 6, numdonors)
numdonors <- factor(numdonors, levels=1:6, labels=c(1:5, ">5"))
table(numdonors)

Before running a model, three variables have to be created from those existing. Generally, if variables are deterministic functions of other variables (e.g., transformations, lags, etc...) it is advisable to impute the constituent variables of the calculations and then do the calculations after the fact. Here, we need to lag the AI variable and create percentage change variables for both population and per-capita GNP. First, to create the lag of AI, PCGNP and LPOP. To do this, we will make a little function.

tscslag <- function(dat, x, id, time){
  obs <- apply(dat[, c(id, time)], 1, paste, collapse=".")
  tm1 <- dat[[time]] - 1
  lagobs <- apply(cbind(dat[[id]], tm1), 1, paste, collapse=".")
  lagx <- dat[match(lagobs, obs), x]
}
for(i in 1:length(out$data)){
  out$data[[i]]$lagAI <- tscslag(out$data[[i]], "AI", "IDORIGIN", "YEAR")
  out$data[[i]]$lagPCGNP <- tscslag(out$data[[i]], "PCGNP", "IDORIGIN", "YEAR")
  out$data[[i]]$lagLPOP <- tscslag(out$data[[i]], "LPOP", "IDORIGIN", "YEAR")
}

Now, we can use the lagged values of PCGNP and LPOP, to create percentage change variables:

for(i in 1:length(out$data)){
  out$data[[i]]$pctchgPCGNP <- with(out$data[[i]], c(PCGNP-lagPCGNP)/lagPCGNP)
  out$data[[i]]$pctchgLPOP <- with(out$data[[i]], c(LPOP-lagLPOP)/lagLPOP)
}

Using MIcombine

You can use the MIcombine command from the mitools package to generate inferences, too. Here, you have to produce a list of model estimates and the function will combine across the different results.

# initialize list
out <- hd2amelia(out)
results <- list()
# loop over imputed datasets
for(i in 1:length(out$imputations)){
    results[[i]] <- lm(AI ~ lagAI + pctchgPCGNP + PCGNP + pctchgLPOP + LPOP + MIL2 + LEFT +
    BRIT + POLRT + CWARCOW + IWARCOW2, data=out$imputations[[i]])
}
summary(mitools::MIcombine(results))

Using mids

The final method for combining results is to convert the data object returned by the hot.deck function to an object of class mids. This can be done with the datalist2mids function from the miceadds package.

out.mids <- miceadds::datalist2mids(out$imputations)
s <- summary(mice::pool(mice::lm.mids(AI ~ lagAI + pctchgPCGNP + PCGNP + pctchgLPOP + LPOP + MIL2 + LEFT +
BRIT + POLRT + CWARCOW + IWARCOW2, data=out.mids)))
print(s, digits=4)