knitr::opts_chunk$set(echo = TRUE)
\newcommand{\bX}{\mathbf{X}} \newcommand{\bZ}{\mathbf{Z}} \newcommand{\bx}{\mathbf{x}} \newcommand{\bz}{\mathbf{z}} \newcommand{\bigf}{\mathbf{f}} \newcommand{\bo}{\mathbf{0}}
The package gcimputeR imputes missing values in mixed datasets of continuous and ordinal variables. The imputation is done by fitting a Gaussian copula model to the incomplete mixed observations. More details can be found in our paper Zhao, Y., & Udell, M. (2019). In this vignette, we reproduce a simulation setting used in our paper.
We first generate the rows of $\bZ\in \mathbb{R}^{n\times p}$ as $\bz^1,\ldots,\bz^{n}\overset{i.i.d.}{\sim} \mathcal{N}(\bo,\Sigma)$, then generate $\bX=\bigf(\bZ)$ through coordinate-wise monotone functions $\bigf$. $\bX$ consists of $n$ observations and $p$ variables. We let $p=15$, $n=2000$, and randomly generate $\Sigma$. We choose $\bigf$ such that $\bX_1,\ldots,\bX_5$ have exponential distributions with rate parameter $1/3$, $\bX_6,\ldots,\bX_{10}$ are binary, and $\bX_{11},\ldots,\bX_{15}$ are ordinal with 5 levels. Then we randomly remove $50\%$ of the entries of $\bX$.
library(gcimputeR)
library(MASS)
set.seed(410)
n = 2000
p = 15
Sigma = matrix(rnorm(p*p), nrow = p)
Sigma = cov2cor(Sigma %*% t(Sigma)) # generate random correlation matrix
Z = mvrnorm(n, mu = rep(0,p), Sigma = Sigma) # generate copula observation
X = Z # generate observation with specified marginals
# five continuous columns with exponential marginal
X[,1:5] = qexp(pnorm(Z[,1:5]), rate = 1/3)
# five binary columns
for (i in 6:10) X[,i] = continuous2ordinal(Z[,i], k = 2)
# five ordinal columns with five levels
for (i in 11:15) X[,i] = continuous2ordinal(Z[,i], k = 5)
# mask 50% of the original observation
X_obs = X
loc = sample(n*p, size = floor(n*p*0.5))
X_obs[loc] = NA
Implementing our algorithm is easy, since it does not involve any tuning parameters. Just pass your observations to the function impute_mixedgc.
fit = impute_mixedgc(X_obs) # takes around 20 secs
We then evaluate the imputation error for each data type using the scaled mean absolute error (SMAE) proposed in our paper. For a set of columns $I$, the SMAE is
$$\mbox{SMAE}:=\frac{1}{|I|}\sum_{j\in I}\frac{||\hat{\bX}_j - \bX_j||_1}{||\bX^{\mbox{med}}_j - \bX_j||_1},$$
where $\hat{\bX}_j$ and $\bX^{\mbox{med}}_j$ are the imputed values and the observed column median for the $j$-th column, respectively. For each data type, we compute the SMAE on the corresponding columns. The estimator's SMAE is smaller than $1$ if it outperforms column median imputation. We now compute the SMAE for each data type.
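To make the definition concrete, here is a minimal sketch of the SMAE for a single column in base R, computed only on the masked entries. The helper name smae_col is ours for illustration; the package function cal_mae_scaled used below performs this computation for all columns at once.

# Illustration only, not part of gcimputeR:
# xhat_j: imputed column, xobs_j: observed column with NAs, xtrue_j: complete column
smae_col = function(xhat_j, xobs_j, xtrue_j) {
  miss = is.na(xobs_j)                # entries that were masked
  med = median(xobs_j, na.rm = TRUE)  # median of the observed entries
  sum(abs(xhat_j[miss] - xtrue_j[miss])) / sum(abs(med - xtrue_j[miss]))
}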
err_imp = cal_mae_scaled(xhat = fit$Ximp, xobs = X_obs, xtrue = X) # SMAE for each column
mean(err_imp[1:5])   # SMAE across continuous columns
mean(err_imp[6:10])  # SMAE across binary columns
mean(err_imp[11:15]) # SMAE across ordinal columns
You can also extract the fitted copula correlation matrix to describe the dependence among columns. To evaluate the estimated correlation matrix, we use the relative error $||\hat{\Sigma}-\Sigma||_F/||\Sigma||_F$, where $\hat{\Sigma}$ is the estimated correlation matrix.
# relative Frobenius error of the estimated correlation matrix
norm(fit$R - Sigma, type = 'F')/norm(Sigma, type = 'F')
The likelihood function, i.e. the objective we are maximizing, is hard to evaluate exactly. We provide an approximation of the likelihood at each iteration, which can be used to monitor the fitting process. If a Gaussian copula model fits the data well, we expect to see monotonically increasing likelihood values across iterations.
plot(fit$loglik, ylab = 'value', type = 'b', pch = 20, xlab = 'iterations',
     main = 'approximated likelihood values achieved during iterations')
In empirical studies, we find that early stopping can sometimes improve the imputation performance, since it helps avoid overfitting. Our suggestion is to first run with the default tolerance threshold $1e-3$, which usually gives good performance in our experiments. With a fitted model, one can inspect the likelihood values recorded during iterations. If the likelihood values increase only slowly over many of the final iterations, we recommend fitting again with a larger tolerance threshold, as sketched below. In the future, we will add explicit regularization to our algorithm to avoid overfitting.
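For example, a refit with a looser stopping tolerance might look like the following. This is only a sketch: we assume the tolerance is exposed through an argument named eps; check ?impute_mixedgc for the exact argument name in your installed version.

# Refit with a larger (looser) tolerance threshold to stop earlier.
# NOTE: the argument name `eps` is an assumption; consult ?impute_mixedgc.
fit_early = impute_mixedgc(X_obs, eps = 0.01)
err_early = cal_mae_scaled(xhat = fit_early$Ximp, xobs = X_obs, xtrue = X)
mean(err_early) # compare with mean(err_imp) from the default fit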