```r
knitr::opts_chunk$set(echo = TRUE)
```
\newcommand{\bX}{\mathbf{X}} \newcommand{\bZ}{\mathbf{Z}} \newcommand{\bx}{\mathbf{x}} \newcommand{\bz}{\mathbf{z}} \newcommand{\bigf}{\mathbf{f}} \newcommand{\bo}{\mathbf{0}}
The package gcimputeR imputes missing values in datasets with mixed continuous and ordinal variables. The imputation is done by fitting a Gaussian copula model to the incomplete mixed observations. More details can be found in our paper. In this vignette, we reproduce a simulation setting used in our paper.
We first generate the rows of $\bZ\in \mathbb{R}^{n\times p}$ as $\bz^1,\ldots,\bz^{n}\overset{i.i.d.}{\sim} \mathcal{N}(\bo,\Sigma)$, then generate $\bX=\bigf(\bZ)$ through coordinate-wise monotone functions $\bigf$, so that $\bX$ consists of $n$ observations of $p$ variables. We let $p=15$ and $n=2000$ and randomly generate $\Sigma$. We choose $\bigf$ such that $\bX_1,\ldots,\bX_5$ have exponential distributions with rate parameter $1/3$, $\bX_6,\ldots,\bX_{10}$ are ordinal with 5 levels, and $\bX_{11},\ldots,\bX_{15}$ are binary, matching the variable types specified in the code below. We then randomly remove $50\%$ of the entries of $\bX$.
```r
library(gcimputeR)
set.seed(410)
# Generate a random correlation matrix
Sigma = generate_sigma(p = 15)
# Specify variable types
var_types = list('cont' = 1:5, 'ord' = 6:10, 'bin' = 11:15)
# Generate mixed data from the Gaussian copula model
X = generate_GC(corr = Sigma, var_types = var_types, n = 2000)
# Mask 50% of the original observation
X_obs = mask_MCAR(X, mask_fraction = 0.5)
```
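For intuition, masking completely at random (MCAR) simply removes a uniformly random subset of entries. Below is a minimal base-R sketch of this idea; it is an illustration only, not gcimputeR's actual implementation of mask_MCAR.

```r
# Illustration only: remove a uniformly random fraction of entries (MCAR)
mask_mcar_sketch = function(X, mask_fraction) {
  Xm = as.matrix(X)
  n_mask = round(length(Xm) * mask_fraction)  # number of entries to remove
  Xm[sample(length(Xm), n_mask)] = NA         # set random entries to NA
  Xm
}
```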
Running our algorithm is easy, since it involves no tuning parameters: just pass your incomplete observations to the function impute_GC.
```r
fit = impute_GC(X_obs)  # takes around 20 secs
```
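The returned object is used throughout the rest of this vignette. A few quick sanity checks may be helpful (assuming, as the later code suggests, that fit contains the fields Ximp, corr, and loglik):

```r
dim(fit$Ximp)    # 2000 x 15, same shape as X_obs
anyNA(fit$Ximp)  # FALSE: every masked entry has been imputed
dim(fit$corr)    # 15 x 15 fitted copula correlation matrix
```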
We then evaluate the imputation error for each data type using the scaled mean absolute error (SMAE) proposed in our paper. For a set of columns $I$,
$$\mbox{SMAE}:=\frac{1}{|I|}\sum_{j\in I}\frac{||\hat{\bX}_j - \bX_j||_1}{||\bX^{\mbox{med}}_j - \bX_j||_1},$$
where $\hat{\bX}_j$ and $\bX^{\mbox{med}}_j$ denote the imputed values and the observed column median for the $j$-th column, respectively. For each data type, we compute the SMAE over the corresponding columns. An estimator's SMAE is smaller than $1$ when it outperforms column median imputation. We now compute the SMAE for each data type.
```r
# SMAE for each column
err_imp = cal_smae(xhat = fit$Ximp, xobs = X_obs, xtrue = X, reduce = FALSE)
# Average the per-column SMAE within each variable type
for (t in names(var_types)){
  err = round(mean(err_imp[var_types[[t]]]), 4)
  print(paste0('SMAE of ', t, ' : ', err))
}
```
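For intuition, the SMAE of a single column can also be computed by hand. Here is a minimal sketch, assuming the error is evaluated over the masked entries only (matching the arguments passed to cal_smae above):

```r
# Sketch: SMAE for one column, evaluated on the masked entries,
# relative to imputing the observed column median
smae_col = function(xhat_j, xobs_j, xtrue_j) {
  miss = is.na(xobs_j)               # entries that were masked
  med = median(xobs_j, na.rm = TRUE) # observed column median
  sum(abs(xhat_j[miss] - xtrue_j[miss])) / sum(abs(med - xtrue_j[miss]))
}
smae_col(fit$Ximp[, 1], X_obs[, 1], X[, 1])
```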
You can also extract the fitted copula correlation matrix to describe the correlation structure among columns. To evaluate the estimated correlation matrix, we use the relative error $||\hat{\Sigma}-\Sigma||_F/||\Sigma||_F$, where $\hat{\Sigma}$ is the estimated correlation matrix and $||\cdot||_F$ denotes the Frobenius norm.
```r
# Relative Frobenius error of the estimated correlation matrix
norm(fit$corr - Sigma, type = 'F') / norm(Sigma, type = 'F')
```
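Beyond this aggregate relative error, it can also be informative to inspect the largest entrywise deviation:

```r
max(abs(fit$corr - Sigma))  # largest absolute entrywise error
```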
The likelihood, which is the objective function we maximize, is hard to evaluate exactly. We therefore provide an approximation of the likelihood at each iteration, which can be used to monitor the fitting process. If a Gaussian copula model fits the data well, we expect to see monotonically increasing likelihood values across iterations.
```r
plot(fit$loglik, ylab = 'value', type = 'b', pch = 20,
     xlab = 'iterations',
     main = 'approximated likelihood achieved during iterations')
```
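A quick numerical check of that monotonicity (a small tolerance absorbs numerical noise in the approximation):

```r
# TRUE if the approximated likelihood never decreases between iterations
all(diff(fit$loglik) >= -1e-8)
```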