In gbasulto/kerdec: Kernel deconvolution methods

rm(list=ls()) ### To clear namespace
library(knitr)

# opts_chunk$set(fig.width=12, fig.height=8, fig.path='Figs/',
#                echo=TRUE, warning=FALSE, message=FALSE)
opts_chunk$set(fig.width= 7, fig.height= 5,
               echo=TRUE, warning=FALSE, message=FALSE)

General Description

This package is about kernel deconvolution density estimation. Several sampling scenarios are considered and it works (so far) for univariate and bivariate samples.

Kernel deconvolution density estimators (KDDEs) are nonparametric estimators of densities based on independent and identically distributed (i.i.d.) samples which have been subject to some measurement error. Kernel density estimation is perhaps the most popular nonparametric density estimator but it may not be used on samples that have been contaminated with non-negligible error. The KDDEs is a modified version of the usual kernel density estimator which is, unlike its the usual version, suitable for contaminated samples.

There are two popular R packages that provide functions to do this, decon (@Wang2011DeconvolutionDecon) and deamer (@Stirnemann2012Deamer:Error). The former package performs kdde with known error distribution allowing the suer to pick the bandwidth selection method between several methods summarized in @Delaigle2004PracticalEstimation. The latter package covers the univariate case and it allows known error case, repeated observations and pure error case, the bandwidth selection method follows @Comte2011Data-drivenDistribution.

The package kerdec provides functions to handle univariate and bivariate kdde with unknown error distribution, by using the empirical characteristic function, or known errors, which can be Laplace or Gaussian. Also functions are provided to estimate the parameters based on a sample of errors or panel data structured data.

Considerations for future developments of this package include - Development to handle higher sample dimensions. At this moment, this package is intended for univariate and bivariate samples. - Including other bandwidth selection methods. Besides giving the user the possibility of manually selecting the bandwidth, two methods are currently implemented. - Consider other kernel methods for measurement error, such as local polynomial regression with measurement error.

Kerdel deconvolution density estimation

A kernel deconvolution density estimator is a nonparametric approximation to the density of a random vector $\boldsymbol{X}$ based on samples that have been contaminated with additive noise, say, samples from $$\boldsymbol{Y} = \boldsymbol{X} + \boldsymbol{\epsilon},$$ where $\boldsymbol{\epsilon}$ has mean zero and it is independent of $\boldsymbol{X}$.

Let $\boldsymbol{Y}1, \ldots, \boldsymbol{Y}_n$ be a sample of independent and identically distributed (_iid) random vectors from the latter model. Assume (for the moment) that the error distribution ($\boldsymbol{\epsilon}$) is perfectly known. The kernel deconvolution density estimator (KDDE) is given by the KDDE formula: $$ \hat{f}{\boldsymbol{X}}(\boldsymbol{x}) = \frac{1}{(2\pi)^d} \int e^{-\imath \langle \boldsymbol{x}, \boldsymbol{t} \rangle} \hat{\phi}{\boldsymbol{Y}, n}(\boldsymbol{t}) \frac{K^{Ft}(H^{-1/2}\boldsymbol{t})} {\phi_{\boldsymbol{\epsilon}}(\boldsymbol{t})} d\boldsymbol{t}, $$ where $\langle \cdot, \cdot \rangle$ is the usual inner product in $\mathbb{R}^d$; $$\hat{\phi}{\boldsymbol{Y}, n}(\boldsymbol{t}) = n^{-1} \sum_j \exp \left( \imath \langle \boldsymbol{t}, \boldsymbol{Y}_j \rangle\right)$$ is the empirical characteristic function of the contaminated sample; $K:\mathbb{R}^d \rightarrow \mathbb{R}$ is a symmetric kernel and $$K^{Ft}(\boldsymbol{s}) = \int e^{\imath \langle \boldsymbol{s}, \boldsymbol{x} \rangle}K(\boldsymbol{x}) d\boldsymbol{x}$$ its Fourier transform; $$\phi{\boldsymbol{\epsilon}}(\boldsymbol{t}) = \int e^{\imath \langle \boldsymbol{t}, \boldsymbol{x} \rangle} f_\boldsymbol{\epsilon}(\boldsymbol{x}) d\boldsymbol{x}$$ is the characteristic function of the error, and $H$ is a positive definite bandwidth matrix.

At this point, we have assumed that the error distribution is perfectly known (so it is its characteristic function). This is, however, very rarely known. More realistic sampling scenarios that can allow to approximate $\phi_{\boldsymbol{\epsilon}}(\cdot)$ include (a) having an independent sample of errors besides the contaminated sample and (b) having repeated (contaminated) measurements per subject so differences can be obtained to approximate the characteristic function of the error. Both options are available in kerdec and they are discussed with detail in "Sampling Scenarios" sections.

As it can be seen in the KDDE formula, a kernel function $K(\cdot)$ must be selected. For theoretical and numerical reasons, it is more convenient to select a kernel function whose Fourier transform, $K^{Ft}(\cdot)$, has compact support. Kernel functions used in traditional (i.e. error free) settings, like Gaussian, uniform and Epanechnikov kernels, do not usually meet this condition. The kerdec package offers 5 choices for kernel, including the sinc kernel, which is perhaps the most popular kernel in deconvolution. Go to the section "Kernels" to learn the details and usage of these kernels in KDDE.

We will consider two approaches to provide bandwidth. The first of them consists on minimizing the AMISE by plugging in a normal reference estimator to R(f''). The second is my crossed-validation. We will only consider cases with one smoothing parameter.