knitr::opts_chunk$set(echo = TRUE)

\newcommand{\bX}{\mathbf{X}} \newcommand{\bZ}{\mathbf{Z}} \newcommand{\bx}{\mathbf{x}} \newcommand{\bz}{\mathbf{z}} \newcommand{\bigf}{\mathbf{f}} \newcommand{\bo}{\mathbf{0}}

Introduction

The package gcimputeR is for imputing missing values of continuous and ordinal mixed dataset. The imputation is done by fitting a Gaussian copula model to the incomplete mixed observation. More details can be found in our paper. In this vignette, we reproduce a simulation setting used in our paper.

Simulate mixed type data observation

We first generate rows of $\bZ\in \mathbb{R}^{n\times p}$ as $\bz^1,\ldots,\bz^{n}\overset{i.i.d.}{\sim} \mathcal{N}(\bo,\Sigma)$, then generate $\bX=\bigf(\bZ)$ through coordinate-wise monotone functions $\bigf$. $\bX$ consists of $n$ observations and $p$ variables. We let $p=15$ and $n=2000$ and randomly generate $\Sigma$. Use $\bigf$ such that $\bX_1,\ldots,\bX_5$ have exponential distributions with rate parameter $1/3$, $\bX_6,\ldots,\bX_{10}$ are binary and $\bX_{11},\ldots,\bX_{15}$ are ordinal with 5 levels. Then we randomly remove $50\%$ of the entries of $\bX$.

library(gcimputeR)
set.seed(410)
# Generate random correlation matrix
Sigma = generate_sigma(p = 15) 
# Specify variable types
var_types = list('cont'=1:5, 'ord'=6:10, 'bin'=11:15) 
X = generate_GC(corr = Sigma, var_types = var_types, n = 2000)
 # mask 50% of the original observation
X_obs = mask_MCAR(X, mask_fraction = 0.5)

Fit Gaussian copula model from incomplete mixed observation

Implementation

Implementing our algorithm is easy, since it does not involve any tuning parameter. Just pass your observation into function impute_GC.

fit = impute_GC(X_obs)# takes around 20 secs

Imputation evaluation

We then evaluate the imputation error for each data type using the scaled mean absolute error (SMAE) proposed in our paper to measure the imputation error on columns in $I$: [ \mbox{SMAE}:=\frac{1}{|I|}\sum_{j\in I}\frac{||\hat {\bX_j} - \bX_j||_1}{||\bX^{\mbox{med}}_j - \bX_j||_1} ] where $\hat {\bX_j}, \bX^{\mbox{med}}_j$ are the imputed values and observed median for $j$-th column, respectively. For each data type, we compute the SMAE on corresponding columns. The estimator's SMAE is smaller than $1$ if it outperforms column median imputation. We now compute the SMAE for each data type.

# SMAE for each column
err_imp = cal_smae(xhat = fit$Ximp, xobs = X_obs, xtrue = X, reduce = FALSE) 
for (t in names(var_types)){
  err = round(mean(err_imp[var_types[[t]]]), 4)
  print(paste0('SMAE of ', var_types[t], ' : ', err))
}

Fitted copula correlation matrix

You can also extract the fitted copula correlation matrix to describe the correlation relationship among columns. To evaluate the estimated correlation, we use relative error $||\hat{\Sigma}-\Sigma||{F}/||\Sigma||{F}$, where $\hat{\Sigma}$ is the estimated correlation matrix.

# relative Frobeinus error of imputed correlation matrix
norm(fit$corr - Sigma, type = 'F')/norm(Sigma, type = 'F') 

Monitor the convergence

The likelihood function or objective function we are maximizing is hard to evaluate. We provide an approximation of the likelihood evaluation during iterations, which can be used to monitor the fitting process. If a Gaussian copula model is well fitted to the data, we expect to see monotonically increasing likelihood values during iterations.

plot(fit$loglik, ylab = 'value', type = 'b', pch = 20,
     xlab = 'iterations', main = 'approximated likelihood achieved during iterations')


udellgroup/mixedgcImp documentation built on Jan. 25, 2023, 7:55 p.m.