
BayesImp

The Bayesian Imputation Method (Preliminary Version)

Step 1: Install the dependency REBayes

Step 1-a: Install the system requirements for REBayes

• Install the Mosek software
• Obtain a Mosek license (a free academic license is available)
• Install Rmosek, the R-to-Mosek interface package

Step 1-b: Install the REBayes package from CRAN
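A minimal sketch of Step 1-b using the standard CRAN installer. It assumes Mosek and Rmosek from Step 1-a are already set up, since the steps above list them first; the `requireNamespace` guard is only a convenience to skip the download when REBayes is already installed.

```r
# Install REBayes from CRAN only if it is not already available.
# Assumes Step 1-a (Mosek, license, Rmosek) has been completed beforehand.
if (!requireNamespace("REBayes", quietly = TRUE)) {
  install.packages("REBayes")
}
```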

Step 2: Install the BayesImp package from GitHub with the following R code:

# Requires devtools; run install.packages("devtools") first if it is missing.
library(devtools)
install_github("ltanecon/BayesImp")

In this R package, I focus on imputing top-coded income observations in longitudinal surveys. The standard imputation approaches in the literature originate from cross-sectional applications, in which top-coded observations are handled wave by wave. Although these approaches are frequently used in longitudinal applications, they are not designed for the additional structure of longitudinal data, in which the same individual is tracked across many periods. Ignoring this extra information has several unfavorable consequences, including over-prediction of within-individual income volatility.

I develop two new imputation methods to tackle this problem. First, I show that the quality of imputed income values for top earners in longitudinal surveys can be improved significantly by incorporating information from multiple time periods into the imputation process in a simple way, which I refer to as the rank-based method. The rank-based method adds only modest model complexity relative to the standard approaches, yet it considerably improves imputation accuracy at both the distributional and the individual level. With the 1996 SIPP data, I show that the rank-based method reduces root mean squared error (RMSE) by 9% to 40% relative to the standard approaches. Second, I improve on the rank-based method by developing a nonparametric empirical Bayes method, which performs even better empirically. It closely recovers the distributions of income levels and volatility while also achieving better imputation accuracy at the individual level: it reduces RMSE by 19% to 46% relative to the standard approaches.
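To make the RMSE comparisons above concrete, here is a small sketch of how a relative RMSE reduction can be computed in R. It is not the package's code: the `rmse` helper and the three vectors are hypothetical, and the numbers are made up purely for illustration rather than taken from the SIPP data or the paper.

```r
# Root mean squared error between true and imputed values.
rmse <- function(truth, imputed) {
  sqrt(mean((truth - imputed)^2))
}

# Toy numbers, invented for illustration only (not from the paper).
truth     <- c(160, 210, 340)  # true incomes above a top-code threshold
imputed_a <- c(200, 200, 200)  # hypothetical baseline imputation
imputed_b <- c(180, 220, 290)  # hypothetical alternative imputation

# Percentage reduction in RMSE of imputation b relative to imputation a.
reduction <- 100 * (1 - rmse(truth, imputed_b) / rmse(truth, imputed_a))
print(reduction)
```

A positive `reduction` means the alternative imputation is closer to the true values than the baseline, which is the sense in which the percentages above are reported.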

For more details, see my job market paper:

Tan, Li (2017), Imputing Top-Coded Income Data in Longitudinal Surveys, Working Paper. (link)