gerbil: General Efficient Regression-Based Imputation with Latent...

gerbil    R Documentation

General Efficient Regression-Based Imputation with Latent processes

Description

Coherent multiple imputation of general multivariate data as implemented through the GERBIL algorithm described by Robbins (2020). The algorithm is

  • coherent in that imputations are sampled from a valid joint distribution, ensuring MCMC convergence;

  • general in that data of general structure (binary, categorical, continuous, etc.) may be allowed;

  • efficient in that computational performance is optimized using the SWEEP operator for both modeling and sampling;

  • regression-based in that the joint distribution is built through a sequence of conditional regression models;

  • latent in that a latent multivariate normal process underpins all variables; and

  • flexible in that the user may specify which dependencies are enabled within the conditional models.

Usage

gerbil(
  dat,
  m = 1,
  mcmciter = 25,
  predMat = NULL,
  type = NULL,
  visitSeq = NULL,
  ords = NULL,
  semi = NULL,
  bincat = NULL,
  cont.meth = "EMP",
  num.cat = 12,
  r = 5,
  verbose = TRUE,
  n.cores = NULL,
  cl.type = NULL,
  mass = rep(0, length(semi)),
  ineligible = NULL,
  trace = TRUE,
  seed = NULL,
  fully.syn = FALSE
)

Arguments

dat

The dataset that is to be imputed. Missing values must be coded with NA.

m

The number of multiply imputed datasets to be created. By default, m = 1.

mcmciter

The number of iterations of Markov chain Monte Carlo that will be used to create each imputed dataset. By default, mcmciter = 25.

predMat

A numeric matrix of ncol(dat) columns and no more than ncol(dat) rows, containing 0/1 data specifying the set of predictors to be used for each target row. Each row corresponds to a variable. A value of 1 means that the column variable is used as a predictor for the variable in the target row. By default, predMat is a square matrix of ncol(dat) rows and columns with 1's below the diagonal and 0's on and above the diagonal. Any non-zero value on or above the diagonal will be set to zero.
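The default structure described above can be sketched in base R. This is only an illustration: the toy data frame and the helper clean_predMat are hypothetical and are not part of gerbil, whose internal handling may differ.

```r
# Hypothetical toy data; column names are for illustration only
dat <- data.frame(sex = c(0, 1, NA), age = c(34, NA, 51), income = c(NA, 20, 35))
p <- ncol(dat)

# Default structure: 1's below the diagonal, 0's on and above it
predMat <- matrix(0, p, p, dimnames = list(names(dat), names(dat)))
predMat[lower.tri(predMat)] <- 1

# Hypothetical helper: zero out any entry on or above the diagonal
clean_predMat <- function(pm) {
  pm[upper.tri(pm, diag = TRUE)] <- 0
  pm
}

# A matrix of all 1's reduces to the default structure after cleaning
full <- matrix(1, p, p, dimnames = dimnames(predMat))
cleaned <- clean_predMat(full)
```

Reading predMat row by row, each variable is predicted only by variables that appear earlier in the column order.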

type

A named vector that gives the type of each variable contained in dat. Possible types include 'binary', 'categorical', 'ordinal', 'semicont' (semi-continuous), and 'continuous'. The vector type should be named, where the names indicate the corresponding columns of dat. Types for variables not listed in type will be determined by default, in which case a variable with no more than num.cat possible values is set as binary/categorical and as continuous otherwise.

visitSeq

A vector of variable names that contains (at least) the names of all columns of dat that have missing values. Within the I-Step and P-Step of gerbil, the variables will be modeled and imputed in the sequence given by visitSeq. If visitSeq = TRUE, visitSeq is reset as being equal to the columns of dat ordered from least to most missingness. If visitSeq = NULL (the default) or visitSeq = FALSE, variables are ordered in accordance with the order of the rows of predMat or (if unavailable) the order in which they appear in dat.
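The ordering from least to most missingness can be approximated in base R as follows. This is a sketch on hypothetical toy data; restricting the sequence to columns that actually have missing values is an assumption made here, and gerbil's own tie-breaking and handling of fully observed columns may differ.

```r
# Hypothetical toy data with different missingness rates per column
dat <- data.frame(
  a = c(1, NA, NA, NA),
  b = c(1, 2, 3, 4),
  c = c(NA, 2, 3, 4),
  d = c(NA, NA, 3, 4)
)

# Proportion of missing values in each column
miss.rate <- colMeans(is.na(dat))

# Keep only columns with missing values (an assumption of this sketch),
# then order them from least to most missingness
has.miss <- miss.rate > 0
visitSeq <- names(sort(miss.rate[has.miss]))
```

A vector built this way could then be passed as the visitSeq argument.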

ords

A character vector giving the column names of dat for the variables that are to be treated as ordinal. Elements of ords are overridden by any conflicting information in type. By default, ords = NULL.

semi

A character vector giving the column names of dat for the variables that are to be treated as semi-continuous. Elements of semi are overridden by any conflicting information in type. By default, semi = NULL.

bincat

A character vector giving the column names of dat for the variables that are to be treated as binary or unordered categorical. Elements of bincat are overridden by any conflicting information in type. By default, bincat = NULL.

cont.meth

The type of marginal transformation used for continuous variables. Set to "EMP" by default for the empirical distribution transformation of Robbins (2014). The current version also includes an option for no transformation (cont.meth = "none"). Other transformation types will be available in future versions of gerbil.
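To give a feel for what an empirical distribution transformation does, the sketch below uses a rank-based normal-scores transform as a stand-in. The exact form used by Robbins (2014) and by gerbil is not reproduced here and may differ; the data and variable names are hypothetical.

```r
# Hypothetical continuous variable with one missing value
x <- c(3.2, 1.5, 8.9, NA, 4.4)
obs <- !is.na(x)

# Forward transform (assumed form): ranks -> probabilities -> normal scores
z <- rep(NA_real_, length(x))
z[obs] <- qnorm(rank(x[obs]) / (sum(obs) + 1))

# Back-transform a value drawn on the normal scale to the original scale.
# With type = 1 (inverse empirical CDF), the result is always one of the
# observed values of x.
z.new <- 0.1
x.new <- quantile(x[obs], probs = pnorm(z.new), type = 1, names = FALSE)
```

The back-transform illustrates why values imputed under this kind of scheme coincide with observed values of the variable.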

num.cat

Any variable that does not have a type specified by any of the other parameters will be treated as categorical if it takes on no more than num.cat possible values and as continuous if it takes on more than num.cat possible values. By default, num.cat = 12.
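The rule above amounts to counting distinct observed values. A minimal sketch follows; guess_type is a hypothetical helper written for illustration, not a gerbil function.

```r
# Hypothetical helper mirroring the num.cat rule described above
guess_type <- function(x, num.cat = 12) {
  n.values <- length(unique(x[!is.na(x)]))  # distinct non-missing values
  if (n.values <= num.cat) "binary/categorical" else "continuous"
}

type.flag   <- guess_type(c(0, 1, 1, NA, 0))        # two distinct values
type.twelve <- guess_type(1:12)                     # exactly num.cat values
type.many   <- guess_type(seq(0.5, 20, by = 0.5))   # forty distinct values
```

A numeric variable with few distinct values (for example, a count that ranges from 0 to 10) would therefore be treated as categorical unless its type is stated explicitly or num.cat is lowered.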

r

The number of pairwise completely observed cases that must be available for any pair of variables to have dependencies enabled within the conditional models for imputation. By default, r = 5.
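The pairwise counts that r is compared against can be computed with a cross-product of observation indicators. The following is a sketch on hypothetical data; it treats r as a minimum ("at least r cases"), which is an assumption, and uses r = 3 rather than the default so that the toy example shows both outcomes.

```r
# Hypothetical toy data
dat <- data.frame(
  x = c(1, 2, NA, 4, 5, 6),
  y = c(1, NA, NA, 4, 5, 6),
  z = c(NA, NA, NA, NA, 5, 6)
)

# 1 where a value is observed, 0 where it is missing
obs <- (!is.na(dat)) * 1

# Entry [i, j] counts the cases in which variables i and j are both observed
n.pair <- crossprod(obs)

# Pairs with too few jointly observed cases would have dependencies disabled
r <- 3
enabled <- n.pair >= r
```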

verbose

If TRUE (the default), history is printed on the console. Use verbose = FALSE for silent computation.

n.cores

The number of CPU cores to use for parallelization. If n.cores is not given as a number, it is determined using the detectCores function in the parallel package. If NULL (the default), it is set as floor((detectCores() + 1)/2). If TRUE, it is set as detectCores(). If FALSE, it is set as 1, in which case parallelization is not invoked. Note that the documentation for detectCores makes clear that it is not failsafe and could return a spurious number of available cores.
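The resolution of n.cores can be sketched as below, following the Usage signature in which NULL is the default. resolve_cores and its available argument are hypothetical names used for illustration; gerbil's internal logic may differ.

```r
# Hypothetical helper; 'available' stands in for parallel::detectCores()
resolve_cores <- function(n.cores = NULL, available = parallel::detectCores()) {
  if (is.null(n.cores)) return(floor((available + 1) / 2))  # default: about half
  if (isTRUE(n.cores)) return(available)                    # use all cores
  if (isFALSE(n.cores)) return(1)                           # no parallelization
  n.cores                                                   # user-supplied number
}

half  <- resolve_cores(NULL, available = 8)
all   <- resolve_cores(TRUE, available = 8)
one   <- resolve_cores(FALSE, available = 8)
fixed <- resolve_cores(3, available = 8)
```

Because detectCores() can return NA or a misleading count on some systems, passing an explicit number is the safer choice on shared machines.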

cl.type

The cluster type that is passed into the makeCluster() function in the parallel package. Defaults to 'PSOCK'.

mass

A named vector of the same length as the number of semi-continuous variables in dat that gives the location (value) of the point mass for each such variable. The point mass for each semi-continuous variable is set to zero by default.

ineligible

Either a scalar or a matrix that is used to determine which values are to be considered missing but ineligible for imputation. Such values will be imputed internally within gerbil to ensure a coherent imputation model but will be reset as missing after imputations have been created. If ineligible is a scalar, all data points that take on the respective value will be considered missing but ineligible for imputation. If ineligible is a matrix (with the same number of rows as dat and column names that overlap with dat), entries of TRUE or 1 in ineligible indicate values that are missing but ineligible for imputation. If ineligible = NULL (the default), all missing values will be considered eligible for imputation.
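The scalar form of ineligible can be pictured as follows. This is a base R sketch on hypothetical data in which -99 is assumed to be the code for "not applicable"; it shows the bookkeeping only and is not gerbil's implementation.

```r
# Hypothetical toy data; -99 marks values that are ineligible for imputation
dat <- data.frame(x = c(1, -99, NA, 4), y = c(-99, 2, 3, NA))
ineligible <- -99

# Logical matrix flagging ineligible entries (ordinary NAs are not flagged)
inel <- !is.na(dat) & dat == ineligible

# Working copy in which ineligible entries are treated as missing, so that
# they can be filled in internally alongside the ordinary missing values
dat.work <- dat
dat.work[inel] <- NA

# After imputation, entries flagged in 'inel' would be reset to NA,
# whereas ordinary missing values would keep their imputed values
```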

trace

A logical that, if TRUE, implies that means and variances of variables are tracked across iterations. Set to FALSE to save computation time. However, trace plots and R hat statistics are disabled for gerbil objects created with trace = FALSE. Defaults to TRUE.

seed

An integer that, when specified, is used to set the random number generator via set.seed().

fully.syn

A logical that, if TRUE, implies that a fully synthetic dataset will be created (although variables without missingness are not altered).

Details

gerbil is designed to handle the following classes of variables:

  • 'continuous': Variables are transformed to be (nearly) standard normal prior to imputation. The default transformation method is based on empirical distributions (see Robbins, 2014) and ensures that imputed values of a variable are sampled from the observed values of that variable.

  • 'binary': Dichotomous variables are handled through probit-type models in that they are underpinned by a unit-variance normally distributed random variable.

  • 'categorical': Unordered categorical variables are handled by creating nested binary variables that underpin the categorical data. Missingness is artificially imposed in the nested variables in order to ensure conditional independence between them. See Robbins (2020) for details.

  • 'ordinal': Ordered categorical variables (ordinal) are handled through a probit-type model in that a latent normal distribution is assumed to underpin the ordinal observations. See Robbins (2020) for details.

  • 'semicont': Mixed discrete/continuous (semi-continuous) variables are assumed to observe a mass at a specific value (most often zero) and are continuous otherwise. A binary variable is created that indicates whether the semi-continuous variable takes on the point-mass value; the continuous portion is set as missing when the observed semi-continuous variable takes on the value at the point-mass. See Robbins et al. (2013) for details.
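The decomposition of a semi-continuous variable described in the last item can be sketched in base R. The names Y.B and Y follow the naming convention of the expanded dataset; coding the indicator as 1 at the point mass is an assumption of this sketch, and the data are hypothetical.

```r
# Hypothetical semi-continuous variable with a point mass at zero
y <- c(0, 2.5, 0, NA, 7.1)
mass <- 0

# Binary indicator of the point mass (assumed coding: 1 = at the point mass);
# it is missing wherever y itself is missing
Y.B <- as.integer(y == mass)

# Continuous portion: missingness is imposed where y sits at the point mass
Y <- ifelse(!is.na(y) & y == mass, NA, y)
```

Only the fourth case is genuinely missing; the NAs in the first and third positions of Y are imposed by construction.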

The parameter type allows the user to specify the class for each variable. Routines are in place to establish the class by default for variables not stated in type. Note that it is not currently possible for a variable to be assigned a class of semi-continuous by default.

gerbil uses a joint modeling approach to imputation that builds a joint model using a sequence of conditional models, as outlined in Robbins et al. (2013). This approach differs from fully conditional specification in that the regression model for any given variable is only allowed to depend upon variables that precede it in an index ordering. The order is established by the parameter visitSeq. gerbil gives the user the flexibility to establish which of the permissible dependencies are enabled within the conditional models. Enabled dependencies are stated within the parameter predMat. Note that the data matrix used for imputation is an expanded version of the data that are fed into the algorithm (variables are created that underpin unordered categorical and semi-continuous variables). Note also that conditional dependencies between the nested binary variables of a single unordered categorical variable or between the discrete and continuous portions of a semi-continuous variable are not permitted.
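The way an ordering and a predictor matrix together define a sequence of conditional models can be sketched by deriving one regression formula per variable. The variable names below are hypothetical, and the formulas gerbil actually stores may be formatted differently.

```r
# Hypothetical variables, listed in their visit order
vars <- c("sex", "age", "income")

# Lower-triangular predictor matrix: each variable may depend only on
# variables that precede it in the ordering
predMat <- matrix(0, 3, 3, dimnames = list(vars, vars))
predMat[lower.tri(predMat)] <- 1

# One conditional model per variable; the first has no predictors
forms <- lapply(vars, function(v) {
  preds <- vars[predMat[v, ] == 1]
  reformulate(if (length(preds) > 0) preds else "1", response = v)
})
names(forms) <- vars
```

Setting an entry below the diagonal of predMat to 0 removes the corresponding predictor from that variable's model, which is how dependencies are switched off.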

The output of gerbil is an object of class gerbil, which is a list that contains the imputed datasets (imputed), missingness indicators (missing and missing.latent), summary information (summary), output used for MCMC convergence diagnostics (chainSeq and R.hat), and modeling summaries (visitSeq.initial, visitSeq.final, predMat.initial, predMat.final, drops, and forms). Some output regarding convergence diagnostics and modeling pertains to the expanded dataset used for imputation (the expanded dataset includes binary indicators for unordered categorical and semi-continuous variables). Note that the nested binary variables corresponding to an unordered categorical variable X with categories labeled a, b, c, etc., are named X.a, X.b, X.c, and so forth in the expanded dataset. Likewise, the binary variable indicating the point mass of a semi-continuous variable Y is named Y.B in the expanded dataset, and the positive portion (with missingness imposed) is left as being named Y.

gerbil automatically checks each regression model for perfect collinearities and reduces the model as needed. Variables that have been dropped from a given model are listed in the element named 'drops' in a gerbil object.

Value

gerbil() returns an object of class gerbil that contains the following slots:

imputed

A list of length m that contains the imputed datasets.

missing

A matrix of 0s, 1s, 2s, and 4s of the same dimension as dat that indicates which values were observed or missing. A 0 indicates a fully observed value, a 1 indicates a missing value that was imputed, and a 4 indicates a missing value that was ineligible for imputation.

summary

A matrix with ncol(dat) number of rows that contains summary information, including the type of each variable and missingness rates. Note that for continuous variables, the type listed indicates the method of transformation used.

chainSeq

A list of six elements. Each element is a matrix with mcmciter columns and up to ncol(dat) rows. Objects means.all and means.mis give the variable means across iterations of MCMC when all observations are incorporated and when only imputed values are incorporated, respectively. (Means of continuous variables are given on the transformed scale.) Similar objects are provided to track variances of variables. Variables are listed in the order provided by the gerbil object visitSeq.latent. Variables reported in this output are those contained in the dataset that has been expanded to include binary indicators for categorical and semi-continuous variables.

R.hat

The values of the R hat statistic of Gelman and Rubin (1992) for the means and variances of each variable. The R hat statistic is also provided for the means of binary variables. Variables include those contained in the expanded dataset and are listed in the order provided by object visitSeq.latent. Only calculated if m > 2 and mcmciter >= 4.
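For orientation, the basic potential scale reduction factor of Gelman and Rubin (1992) can be computed as in the sketch below. r_hat is a hypothetical helper; gerbil's own calculation (for example, which iterations are discarded) is not documented here and may differ.

```r
# Hypothetical helper: rows are chains, columns are iterations of a
# tracked quantity (for example, the mean of one variable)
r_hat <- function(chains) {
  n <- ncol(chains)                     # iterations per chain
  W <- mean(apply(chains, 1, var))      # mean within-chain variance
  B <- n * var(rowMeans(chains))        # between-chain variance
  var.plus <- (n - 1) / n * W + B / n   # pooled variance estimate
  sqrt(var.plus / W)
}

# Two chains that agree, and two chains that sit far apart
agree <- rbind(1:4, 1:4)
apart <- rbind(1:4, 11:14)

rhat.agree <- r_hat(agree)
rhat.apart <- r_hat(apart)
```

Values close to 1 are consistent with convergence, while values well above 1, as for the chains that sit far apart, indicate that the chains have not mixed.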

missing.latent

A matrix of the same dimensions as the expanded dataset, but used to indicate missingness in the expanded dataset. In this matrix, 0s indicate fully observed values, 1s indicate fully missing values, 3s indicate values that have imposed missingness (for binary indicators corresponding to categorical or semi-continuous variables), and 4s indicate missing values that are ineligible for imputation (as determined by the input 'ineligible').

visitSeq.initial

A vector of variable names giving the sequential ordering of variables that is used for imputation prior to expanding the dataset to include nested binary and point-mass indicators. Variables without missing values are excluded.

visitSeq.final

A vector of variable names giving the sequential ordering of variables in the expanded dataset that is used for imputation. Variables without missing values are excluded.

predMat.initial

A matrix of ones and zeros indicating the dependencies enabled in the conditional models used for imputation. This matrix is determined from the input 'predMat'. Rows corresponding to variables with no missing values are removed.

predMat.final

A matrix of ones and zeros indicating the dependencies enabled in the conditional models used for imputation. This is of a similar format to the input 'predMat' but pertains to the expanded dataset. Rows corresponding to variables with no missing values are removed.

drops

A list of length equal to the number of variables in the expanded dataset that have missing values. Elements of the list indicate which variables were dropped from the conditional model for the corresponding variable due to either insufficient pairwise complete observations (see the input 'r') or perfect collinearities.

forms

A list of length equal to the number of variables in the expanded dataset that have missing values. Elements of the list indicate the regression formula used for imputation of the respective variable.

mass.final

The final version of the input parameter mass.

ineligibles

A logical matrix with the same number of rows and columns as dat that indicates which elements are considered missing but ineligible for imputation.

nams.out

A vector used to link column names in the expanded data to corresponding names in the original data.

References

Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences. Statistical Science, 7(4), 457-472.

Robbins, M. W. (2014). The Utility of Nonparametric Transformations for Imputation of Survey Data. Journal of Official Statistics, 30(4), 675-700.

Robbins, M. W. (2020). A flexible and efficient algorithm for joint imputation of general data. arXiv preprint arXiv:2008.02243.

Robbins, M. W., Ghosh, S. K., & Habiger, J. D. (2013). Imputation in high-dimensional economic data as applied to the Agricultural Resource Management Survey. Journal of the American Statistical Association, 108(501), 81-95.

Examples

# Load the India Human Development Survey-II dataset
data(ihd_mcar)

# Gerbil without types specified
imps.gerbil <- gerbil(ihd_mcar, m = 1, mcmciter = 10)

# Gerbil with types specified (method #1)
types.gerbil <- c(
       sex = "binary", age = "continuous", 
       marital_status = "binary", job_field = "categorical", 
       farm_labour_days = "semicont", own_livestock = "binary", 
       education_level = "ordinal", income = "continuous")
imps.gerbil <- gerbil(ihd_mcar, m = 1, type = types.gerbil)

# Gerbil with types specified (method #2)
imps.gerbil <- gerbil(ihd_mcar, m = 1, ords = "education_level", semi = "farm_labour_days", 
       bincat = c("sex", "marital_status", "job_field", "own_livestock"))

# Gerbil with types specified (method #3)
types.gerbil <- c("binary", "continuous", "binary", "categorical", "semicont", 
       "binary", "ordinal", "continuous")
imps.gerbil <- gerbil(ihd_mcar, m = 1, type = types.gerbil)

# Variables of class factor are treated as binary/categorical by default
ihd.fac <- ihd_mcar
ihd.fac$sex <- factor(ihd_mcar$sex)
ihd.fac$marital_status <- factor(ihd_mcar$marital_status)
ihd.fac$job_field <- factor(ihd_mcar$job_field)
ihd.fac$own_livestock <- factor(ihd_mcar$own_livestock)
ihd.fac$education_level <- ordered(ihd_mcar$education_level)
imps.gerbil <- gerbil(ihd.fac, m = 1)

# Univariate plotting of one variable
plot(imps.gerbil, type = 1, y = "job_field")

# gerbil with predMat specified (method #1)
predMat <- matrix(c(1, 0, 0, 1), 2, 2)
dimnames(predMat) <- list(c("education_level", "income"), c("sex", "job_field"))
imps.gerbil <- gerbil(ihd_mcar, m = 1, type = types.gerbil, predMat = predMat)

# gerbil with predMat specified (method #2)
predMat <- rbind(
       c(0, 0, 0, 0, 0, 0, 0, 0), 
       c(1, 0, 0, 0, 0, 0, 0, 0), 
       c(1, 1, 0, 0, 0, 0, 0, 0), 
       c(1, 1, 1, 0, 0, 0, 0, 0), 
       c(1, 1, 1, 1, 0, 0, 0, 0), 
       c(1, 1, 1, 1, 1, 0, 0, 0), 
       c(1, 1, 1, 0, 1, 1, 0, 0), 
       c(0, 1, 1, 1, 1, 1, 1, 0) 
       )
imps.gerbil <- gerbil(ihd_mcar, type = types.gerbil, predMat = predMat)

# Multiple imputation with more iterations
imps.gerbil.5 <- gerbil(ihd_mcar, m = 5, mcmciter = 100, ords = "education_level", 
       semi = "farm_labour_days", bincat = "job_field", n.cores = 1)

plot(imps.gerbil.5, type = 1, y = "job_field", imp = 1:5) 

# Extract the first imputed dataset
imputed.gerb <- imputed(imps.gerbil.5, imp = 1)

# Write all imputed datasets to an Excel file
write.gerbil(imps.gerbil.5, file = file.path(tempdir(), "gerbil_example.xlsx"), imp = 1:5)


## Not run: 
if(requireNamespace('mice')){
# Impute using mice for comparison
library(mice)  # attach mice so that mice() and complete() can be called directly

types.mice <- c("logreg", "pmm", "logreg", "polyreg", "pmm", "logreg", "pmm", "pmm")
imps.mice <- mice(ihd.fac, m = 1, method = types.mice, maxit = 100)

imps.mice1 <- mice(ihd.fac, m = 1, method = "pmm", maxit = 100)

imps.gerbil <- gerbil(ihd_mcar, m = 1, mcmciter = 100, ords = "education_level", 
    semi = "farm_labour_days", bincat = "job_field")

# Compare the performance of mice and gerbil

# Replace some gerbil datasets with mice datasets
imps.gerbil.m <- imps.gerbil.5
imps.gerbil.m$imputed[[2]] <- complete(imps.mice, action = 1)
imps.gerbil.m$imputed[[3]] <- complete(imps.mice1, action = 1)

# Perform comparative correlation analysis
cor_gerbil(imps.gerbil.m, imp = 1, log = "income")
cor_gerbil(imps.gerbil.m, imp = 2, log = "income")
cor_gerbil(imps.gerbil.m, imp = 3, log = "income")

# Perform comparative univariate goodness-of-fit testing
gof_gerbil(imps.gerbil.m, type = 1, imp = 1)
gof_gerbil(imps.gerbil.m, type = 1, imp = 2)
gof_gerbil(imps.gerbil.m, type = 1, imp = 3)

# Perform comparative bivariate goodness-of-fit testing
gof_gerbil(imps.gerbil.m, type = 2, imp = 1)
gof_gerbil(imps.gerbil.m, type = 2, imp = 2)
gof_gerbil(imps.gerbil.m, type = 2, imp = 3)

# Produce univariate plots for comparisons 
plot(imps.gerbil.m, type = 1, file = file.path(tempdir(), "gerbil_vs_mice_univariate.pdf"), 
     imp = c(1, 2, 3), log = "income", lty = c(1, 2, 4, 5), col = c("blue4", "brown2", 
     "green3", "orange2"), legend = c("Observed", "gerbil", "mice: logistic", "mice: pmm"))

# Produce bivariate plots for comparisons
plot(imps.gerbil.m, type = 2, file = file.path(tempdir(), "gerbil_vs_mice_bivariate.pdf"), 
    imp = c(1, 2, 3), log = "income", lty = c(1, 2, 4, 5), col = c("blue4", "brown2", 
    "green3", "orange2"), pch = c(1, 3, 4, 5), legend = c("Observed", "gerbil", 
    "mice: logistic", "mice: pmm"))

}

## End(Not run)

gerbil documentation built on Jan. 12, 2023, 5:10 p.m.