catImp: Imputation for categorical variables using log linear models

View source: R/catimp.R

catImp    R Documentation

Imputation for categorical variables using log linear models

Description

This function performs multiple imputation under a log-linear model as described by Schafer (1997), using his cat package, either with or without posterior draws.

Usage

catImp(
  obsData,
  M = 10,
  pd = FALSE,
  type = 1,
  margins = NULL,
  steps = 100,
  rseed
)

Arguments

obsData

The data frame to be imputed. Variables must be coded such that they take consecutive positive integer values, i.e. 1,2,3,...

M

Number of imputations to generate.

pd

Specify whether to use posterior draws (TRUE) or not (FALSE).

type

An integer specifying the type of log-linear model used for imputation. type=1, the default, allows for all two-way associations in the log-linear model. type=2 allows for all three-way associations (plus lower-order associations). type=3 fits a saturated model.

margins

An optional argument that can be used instead of type to specify the desired log-linear model. See the documentation for the margins argument in ecm.cat and Schafer (1997) on how to specify this.

steps

If pd is TRUE, the steps argument specifies how many MCMC iterations to perform in order to generate the model parameter value for each imputation.

rseed

The value to which the cat package's random number seed is set, using the rngseed function of cat. rngseed must be called at least once before imputing using cat. If the user wishes to set the seed by calling rngseed themselves before calling catImp, set rseed=NULL.
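The coding requirement on obsData (consecutive positive integers 1,2,3,...) can be met by converting each variable to a factor and taking its integer codes. The following is a minimal sketch in base R; the data frame raw, the helper toCodes and the object levelLabels are hypothetical names used only for illustration and are not part of the package.

```r
# Hypothetical raw data with character and factor columns and some missing values
raw <- data.frame(
  smoker = c("no", "yes", NA, "no"),
  region = factor(c("north", "south", "west", NA)),
  stringsAsFactors = FALSE
)

# Convert a variable to a factor, drop unused levels so the codes are
# consecutive, and return the integer codes 1, 2, 3, ... (NA stays NA)
toCodes <- function(x) as.integer(droplevels(as.factor(x)))

coded <- as.data.frame(lapply(raw, toCodes))

# Keep the level labels so that imputed codes can be mapped back afterwards
levelLabels <- lapply(raw, function(x) levels(droplevels(as.factor(x))))
```

Here coded could then be passed as obsData, and levelLabels used to translate the imputed integer codes back to the original categories.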

Details

By default catImp will impute using a log-linear model allowing for all two-way associations, but not higher order associations. This can be modified through use of the type and margins arguments.
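As an illustration of the margins argument, the sketch below builds a margins vector for the model with all two-way associations. It assumes the format described in the ecm.cat documentation, in which each margin is given by the indices of its variables and margins are separated by zeros. The helper twoWayMargins is a hypothetical name, not a function of mlmi or cat, and whether the result matches exactly what type=1 constructs internally has not been checked against the package source.

```r
# Build the margins vector for all two-way associations among p variables,
# e.g. for p = 3 the margins (1,2), (1,3) and (2,3), separated by zeros
twoWayMargins <- function(p) {
  stopifnot(p >= 2)
  pairs <- combn(p, 2)              # one column per pair of variables
  m <- as.vector(rbind(pairs, 0))   # append a 0 separator after each pair
  m[-length(m)]                     # drop the trailing separator
}

twoWayMargins(3)
```

A vector built this way could be supplied as, for example, margins=twoWayMargins(3) in place of the type argument.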

With pd=FALSE, all imputed datasets are generated conditional on the MLE of the model parameter, referred to as maximum likelihood multiple imputation by von Hippel and Bartlett (2021).

With pd=TRUE, regular 'proper' multiple imputation is used, where each imputation is drawn from a distinct value of the model parameter. Specifically, for each imputation, a single MCMC chain is run, iterating for steps iterations.
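The two modes can be compared side by side. The sketch below simulates data as in the Examples section and then calls catImp once with posterior draws and once without; the values M=5 and steps=200 are arbitrary choices for illustration, and the calls are only attempted if the mlmi package is installed.

```r
# Simulate a small categorical dataset with missing values
set.seed(1234)
n <- 100
temp <- data.frame(x1 = ceiling(3 * runif(n)),
                   x2 = ceiling(2 * runif(n)),
                   x3 = ceiling(2 * runif(n)))
for (i in 1:3) {
  temp[runif(n) < 0.25, i] <- NA
}

# Only attempt the imputations if mlmi is available
if (requireNamespace("mlmi", quietly = TRUE)) {
  # posterior draws: an MCMC chain of 200 iterations is run for each imputation
  impsPD <- mlmi::catImp(temp, M = 5, pd = TRUE, steps = 200, rseed = 4423)

  # no posterior draws: all imputations are generated conditional on the MLE
  impsML <- mlmi::catImp(temp, M = 5, pd = FALSE, rseed = 4423)
}
```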

Imputed datasets can be analysed using withinBetween or scoreBased, or with, for example, the bootImpute package.
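For orientation, the sketch below shows the standard Rubin's rules combination of a scalar estimate across imputed datasets. It is an illustrative base R helper with the hypothetical name poolRubin, not the implementation of withinBetween or scoreBased, and its variance formula is appropriate only for imputations generated with posterior draws (pd=TRUE). Imputations generated with pd=FALSE require the variance estimators described by von Hippel and Bartlett (2021).

```r
# Combine M point estimates and their variances using Rubin's rules.
# est:  numeric vector of estimates, one per imputed dataset
# vars: numeric vector of the corresponding variance estimates
poolRubin <- function(est, vars) {
  M <- length(est)
  stopifnot(M >= 2, length(vars) == M)
  qbar <- mean(est)                  # pooled point estimate
  W <- mean(vars)                    # within-imputation variance
  B <- stats::var(est)               # between-imputation variance
  totalVar <- W + (1 + 1 / M) * B    # total variance
  c(est = qbar, se = sqrt(totalVar))
}

# Toy inputs, chosen by hand purely to show the calculation
pooled <- poolRubin(est = c(1, 2, 3), vars = c(0.5, 0.5, 0.5))
```

In practice est and vars would be obtained by applying the same analysis to each element of the list of imputed datasets, for example with sapply.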

Value

A list of imputed datasets, or if M=1, just the imputed data frame.
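Because the return value is a bare data frame when M=1 and a list otherwise, code that loops over imputations may want to normalise the result first. A minimal sketch follows; asImpList is a hypothetical helper name, not part of the package, and the toy objects stand in for actual catImp output.

```r
# Wrap a single imputed data frame in a list so that the result can always
# be processed with lapply/sapply, whatever the value of M
asImpList <- function(imps) {
  if (is.data.frame(imps)) list(imps) else imps
}

# Toy stand-ins for the two possible shapes of the return value
single <- data.frame(x1 = c(1, 2), x2 = c(2, 1))
several <- list(single, single)

length(asImpList(single))
length(asImpList(several))
```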

References

Schafer J.L. (1997). Analysis of incomplete multivariate data. Chapman & Hall, Boca Raton, Florida, USA.

von Hippel P.T. and Bartlett J.W. (2021). Maximum likelihood multiple imputation: faster, more efficient imputation without posterior draws. Statistical Science, 36(3), 400-420. doi:10.1214/20-STS793.

Examples

#simulate a partially observed categorical dataset
set.seed(1234)
n <- 100

#for simplicity we simulate completely independent variables
temp <- data.frame(x1=ceiling(3*runif(n)), x2=ceiling(2*runif(n)), x3=ceiling(2*runif(n)))

#make some data missing
for (i in 1:3) {
  temp[(runif(n)<0.25),i] <- NA
}

#impute using catImp, assuming two-way associations in the log-linear model
imps <- catImp(temp, M=10, pd=FALSE, rseed=4423)

#impute assuming a saturated log-linear model
imps <- catImp(temp, M=10, pd=FALSE, type=3, rseed=4423)

jwb133/mlmi documentation built on June 4, 2023, 9:39 a.m.