catImp: Imputation for categorical variables using log linear models
In jwb133/mlmi: Maximum Likelihood Multiple Imputation

catImp

R Documentation

Description

This function performs multiple imputation of categorical variables under a log-linear model, as described by Schafer (1997), using the cat package. Imputations can be generated either with or without posterior draws.

Usage

catImp(
  obsData,
  M = 10,
  pd = FALSE,
  type = 1,
  margins = NULL,
  steps = 100,
  rseed
)

Arguments

`obsData`	The data frame to be imputed. Variables must be coded such that they take consecutive positive integer values, i.e. 1,2,3,...
`M`	Number of imputations to generate.
`pd`	Specify whether to use posterior draws (`TRUE`) or not (`FALSE`).
`type`	An integer specifying which log-linear model to use for imputation. `type=1`, the default, allows for all two-way associations. `type=2` allows for all three-way (and lower order) associations. `type=3` fits a saturated model.
`margins`	An optional argument that can be used instead of `type` to specify the desired log-linear model. See the documentation for the `margins` argument in `ecm.cat` and Schafer (1997) on how to specify this.
`steps`	If `pd` is `TRUE`, the `steps` argument specifies how many MCMC iterations to perform in order to generate the model parameter value for each imputation.
`rseed`	The value to set the `cat` package's random number seed to, using the `rngseed` function of `cat`. This function must be called at least once before imputing using `cat`. If the user wishes to set the seed using `rngseed` before calling `catImp`, set `rseed=NULL`.
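The coding requirement on `obsData` can be met with a small recoding step before imputation. The following is a minimal sketch in base R, not part of the package: the data frame `raw` and its columns are hypothetical, and it assumes that mapping each variable to its factor level codes is an acceptable coding for your data.

```r
# Hypothetical raw data with a factor column, a character column and a missing value
raw <- data.frame(
  smoker = factor(c("no", "yes", NA, "no")),
  region = c("north", "south", "east", "south"),
  stringsAsFactors = FALSE
)

# Convert each column to a factor, then to its integer level codes (1, 2, 3, ...).
# Re-applying factor() drops unused levels, so the codes are consecutive.
# NA values stay NA, so the missingness pattern is preserved.
recoded <- as.data.frame(lapply(raw, function(v) as.integer(factor(v))))

# Keep the level labels so that imputed codes can be mapped back afterwards
code_labels <- lapply(raw, function(v) levels(factor(v)))
```

After imputation, a code `k` in column `region` could be translated back with `code_labels$region[k]`.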

Details

By default catImp will impute using a log-linear model allowing for all two-way associations, but not higher order associations. This can be modified through use of the type and margins arguments.
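As an illustration of the margins argument, the sketch below builds a margins vector for all two-way associations among three variables. It assumes the encoding described in the cat documentation for `ecm.cat`, where each margin is listed by its variable indices and margins are separated by zeros; check that documentation before relying on it. The name `two_way_margins` is purely illustrative.

```r
p <- 3  # number of variables (columns) in the data frame

# All pairs of variable indices, one pair per column
pairs <- combn(p, 2)

# Put a 0 under each pair, flatten column by column, and drop the trailing 0
two_way_margins <- head(as.vector(rbind(pairs, 0)), -1)
two_way_margins
```

If the encoding assumption holds, passing this vector as `margins = two_way_margins` should request the same kind of model as `type=1` for a three-variable data frame. The same construction with other index sets could describe models between the two-way and saturated cases.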

With pd=FALSE, all imputed datasets are generated conditional on the maximum likelihood estimate (MLE) of the model parameter. von Hippel and Bartlett (2021) refer to this approach as maximum likelihood multiple imputation.

With pd=TRUE, regular 'proper' multiple imputation is used, where each imputed dataset is generated conditional on a distinct draw of the model parameter. Specifically, for each imputation, a single MCMC chain is run for steps iterations.
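The two modes can be compared side by side. The sketch below reuses the simulation approach from the Examples section and calls catImp once with pd=FALSE and once with pd=TRUE. The values `M = 5` and `steps = 200` are arbitrary choices for illustration, and the calls are guarded so they only run if the mlmi and cat packages are installed.

```r
# Simulate a small categorical dataset with values missing in each column
set.seed(1234)
n <- 100
dat <- data.frame(
  x1 = ceiling(3 * runif(n)),
  x2 = ceiling(2 * runif(n)),
  x3 = ceiling(2 * runif(n))
)
for (i in 1:3) {
  dat[runif(n) < 0.25, i] <- NA
}

if (requireNamespace("mlmi", quietly = TRUE) &&
    requireNamespace("cat", quietly = TRUE)) {
  # Maximum likelihood multiple imputation: all imputations use the MLE
  imps_ml <- mlmi::catImp(dat, M = 5, pd = FALSE, rseed = 4423)

  # Posterior draw imputation: one MCMC chain of 200 iterations per imputation
  imps_pd <- mlmi::catImp(dat, M = 5, pd = TRUE, steps = 200, rseed = 4423)
}
```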

Imputed datasets can be analysed using the withinBetween or scoreBased functions, or, for example, using the bootImpute package.
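A possible analysis workflow with withinBetween is sketched below. It assumes, based on my reading of the mlmi documentation, that withinBetween takes the list of imputations and an analysis function returning a list with elements `est` (parameter estimates) and `var` (their variance-covariance matrix); consult the withinBetween help page to confirm. The logistic regression model is an arbitrary illustration, and the imputation step is guarded so it only runs if mlmi and cat are installed.

```r
# Analysis function applied to one completed dataset.
# Returns the estimates (est) and their variance-covariance matrix (var).
analysisFun <- function(inputData) {
  mod <- glm(I(x2 == 2) ~ factor(x1), family = binomial, data = inputData)
  list(est = coef(mod), var = vcov(mod))
}

# Simulate complete data and check the analysis function on it
set.seed(1234)
n <- 100
full <- data.frame(
  x1 = ceiling(3 * runif(n)),
  x2 = ceiling(2 * runif(n)),
  x3 = ceiling(2 * runif(n))
)
res <- analysisFun(full)

if (requireNamespace("mlmi", quietly = TRUE) &&
    requireNamespace("cat", quietly = TRUE)) {
  # Make some values of x2 missing, impute, then combine the analyses
  miss <- full
  miss[runif(n) < 0.25, "x2"] <- NA
  imps <- mlmi::catImp(miss, M = 10, pd = FALSE, rseed = 4423)
  print(mlmi::withinBetween(imps, analysisFun))
}
```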

Value

A list of imputed datasets, or if M=1, just the imputed data frame.

References

Schafer J.L. (1997). Analysis of incomplete multivariate data. Chapman & Hall, Boca Raton, Florida, USA.

von Hippel P.T. and Bartlett J.W. (2021). Maximum likelihood multiple imputation: faster, more efficient imputation without posterior draws. Statistical Science, 36(3), 400-420. doi:10.1214/20-STS793.

Examples

#simulate a partially observed categorical dataset
set.seed(1234)
n <- 100

#for simplicity we simulate completely independent variables
temp <- data.frame(x1=ceiling(3*runif(n)), x2=ceiling(2*runif(n)), x3=ceiling(2*runif(n)))

#make some data missing
for (i in 1:3) {
  temp[(runif(n)<0.25),i] <- NA
}

#impute using catImp, assuming two-way associations in the log-linear model
imps <- catImp(temp, M=10, pd=FALSE, rseed=4423)

#impute assuming a saturated log-linear model
imps <- catImp(temp, M=10, pd=FALSE, type=3, rseed=4423)

jwb133/mlmi documentation built on June 4, 2023, 9:39 a.m.