smcfcs for coarsened factor covariates

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

smcfcs was originally developed to multiply impute missing covariate values in regression models. As of 2025, it can also impute unobserved values of factor variables that are 'coarsened', based on the developments in van der Burg et al (2025). By coarsened, we mean that for some of the missing values partial information is available: we know that the value belongs to some subset of the possible values. In this vignette we demonstrate how to use smcfcs to impute such variables.

We illustrate using the dataset ex_coarsening, which is included in the smcfcs package:

library(smcfcs)
summary(ex_coarsening)
head(ex_coarsening)

The variable x is a factor variable which has r sum(is.na(ex_coarsening$x)) missing values. The variable xobs gives the known information about (some of) the missing values:

table(ex_coarsening$x,ex_coarsening$xobs,useNA = "ifany")

From this we can see that, among the r sum(is.na(ex_coarsening$x)) missing values in x, there are r sum(ex_coarsening$xobs=="a/c") individuals for whom we know that x was either a or c, as indicated by the string 'a/c', and r sum(ex_coarsening$xobs=="b/c") individuals for whom we know that x was either b or c, as indicated by the string 'b/c'. For the remainder we have no further information, as indicated by the character string "NA".

Note: the variable xobs is a character variable, and for rows where x is (plain) missing, xobs takes the character value "NA", rather than R's missing value indicator NA. This is important: if we used the missing value indicator NA, smcfcs would refuse to run, as we have not told it how to impute the missing values in xobs.
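The distinction between the string "NA" and R's NA can be sketched in base R. The toy vectors x and partial below are hypothetical and are not taken from ex_coarsening. Storing the observed value of x in xobs for rows where x is known is also an assumption made for illustration.

```r
# Hypothetical toy data: x is the factor of interest, NA where unobserved.
x <- factor(c("a", NA, "b", NA, NA), levels = c("a", "b", "c"))

# Hypothetical partial information about x; R's NA where nothing is known.
partial <- c(NA, "a/c", NA, "b/c", NA)

# Build xobs: the observed value where x is known, the coarsened label where
# one is available, and the character string "NA" otherwise.
xobs <- ifelse(!is.na(x), as.character(x),
               ifelse(!is.na(partial), partial, "NA"))

xobs
anyNA(xobs)        # are any true missing values left in xobs?
sum(xobs == "NA")  # rows with no information at all about x
```

Because xobs contains no true missing values, it would not itself need an imputation method.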

To impute the missing values in x using smcfcs, we have to define a value for the restrictions argument. For this we must pass a list of length equal to the number of variables in the data frame. For the element in this list corresponding to x, we must give a vector of formula-type expressions that specify the possible values for x when xobs equals a/c or b/c. To achieve this we use:

restrictionsX = c("xobs = a/c ~ a + c",
                  "xobs = b/c ~ b + c")
restrictions = append(list(restrictionsX), as.list(c("", "", "")))
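As a sketch of an alternative construction, the same list can be built by looking up the position of x by name, which avoids relying on x being the first column. The vector vars is a hypothetical stand-in for names(ex_coarsening), and its column order is an assumption.

```r
# Hypothetical column names; in practice this would be names(ex_coarsening).
vars <- c("x", "xobs", "y", "z")

restrictionsX <- c("xobs = a/c ~ a + c",
                   "xobs = b/c ~ b + c")

# Start with an empty string for every variable, then fill in the entry
# for x at whatever position x occupies in the data frame.
restrictions <- as.list(rep("", length(vars)))
restrictions[[match("x", vars)]] <- restrictionsX

str(restrictions)
```

With x in the first position this produces the same list as the append() call above.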

We can then impute the missing values accounting for the partial information with:

set.seed(68204812)
imps <- smcfcs(originaldata=ex_coarsening,
               smtype="lm",
               smformula = "y~z+x",
               method = c("mlogit","", "", ""),
               restrictions = restrictions
)

To check that smcfcs has correctly used the partial information about the missing values in x, first we check the first few rows in the first imputed dataset:

head(imps$impDatasets[[1]])

This looks fine - when xobs=a/c we have imputed values either of a or c, whereas when xobs=b/c we have imputed values of b or c. To check properly, we can repeat the earlier cross-tabulation:

table(imps$impDatasets[[1]]$x,imps$impDatasets[[1]]$xobs,useNA = "ifany")

This shows that (at least in the first imputed dataset) the imputed values respect the partial information contained in xobs, as desired.
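To extend the check beyond the first imputed dataset, one possible approach is a small helper that is applied to every imputed data frame. The helper respects_restrictions, the list allowed, and the toy data frames in toy_imps are all hypothetical and are used only so that the sketch runs on its own.

```r
# Allowed values of x for each coarsened label, mirroring the restrictions.
allowed <- list("a/c" = c("a", "c"), "b/c" = c("b", "c"))

# TRUE if every coarsened row in one imputed data frame has an allowed x.
respects_restrictions <- function(d) {
  all(vapply(names(allowed), function(lab) {
    all(as.character(d$x[d$xobs == lab]) %in% allowed[[lab]])
  }, logical(1)))
}

# Hypothetical stand-ins for imps$impDatasets: the first respects the
# partial information, the second does not (x = "b" where xobs = "a/c").
toy_imps <- list(
  data.frame(x = factor(c("a", "c", "b")), xobs = c("a", "a/c", "b/c")),
  data.frame(x = factor(c("a", "b", "b")), xobs = c("a", "a/c", "b/c"))
)

vapply(toy_imps, respects_restrictions, logical(1))
```

Applied to the real output, all(vapply(imps$impDatasets, respects_restrictions, logical(1))) should return TRUE if every imputed dataset respects the information in xobs.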

The restrictions argument can also be used for ordered factor variables in the same way.




smcfcs documentation built on April 4, 2025, 1:58 a.m.