```r
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
```
Given realizations $x_1, \dots, x_n$ of $n$ independent and identically distributed (i.i.d.) random variables $X_1, \dots, X_n \sim \mathbb{P}$ with density $f$ w.r.t. the Lebesgue measure $\lambda$, a kernel density estimator (KDE) for the density $f$ is defined as
\begin{align} \hat f_h(x) = \frac{1}{nh} \sum_{i=1}^n K\left(\frac{x - x_i}{h}\right) \end{align}
The estimator depends on the kernel $K: \mathbb{R} \rightarrow \mathbb{R}$, $K \in \mathscr{L}^1(\mathbb{R}, \lambda)$ with $\lambda(K) := \int_\mathbb{R} K \, \mathrm{d}\lambda = 1$, and on the bandwidth $h \in \mathbb{R}^+ \setminus \{0\}$.
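For intuition, the estimator translates directly into a few lines of R. The helper `kde_at` below is purely illustrative and not part of the package:

```r
# bare-bones evaluation of the KDE formula above (illustrative only,
# not the package's kernel_density_estimator)
kde_at <- function(x, samples, K, h) {
  sapply(x, function(x0) mean(K((x0 - samples) / h)) / h)
}
# example: standard normal kernel, bandwidth 0.5
kde_at(c(0, 1), samples = rnorm(100), K = dnorm, h = 0.5)
```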
This package aims to provide the necessary tools for practical kernel density estimation and bandwidth selection. The estimator can be constructed using `kernel_density_estimator`. The bandwidth selection algorithms are accessible through `cross_validation`, `goldenshluger_lepski` and `pco_method`.
We start off by loading the package:

```r
knitr::opts_chunk$set(dpi=300, fig.width=7)
set.seed(1)
library(KDE)
```
The main objects in the context of kernel density estimation are kernels and densities. Since many algorithms rely on the properties of these objects, `KDE` deploys the S3 classes `Density` and `Kernel` (both inheriting from `IntegrableFunction`) to numerically ensure some of the mathematical conditions (cf. `?IntegrableFunction`).
These objects require a numerically compact support for the represented function. This may seem like a strong restriction at first, but as we'll see, such a numerically compact support can be found easily (even algorithmically) in most cases.
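For illustration, such a support can be found by expanding an interval until the function is numerically zero at both endpoints. The helper `find_numerical_support` below is a hand-rolled sketch, not the search routine the package actually uses:

```r
# hand-rolled sketch -- not the package's actual support search:
# double the interval until f is numerically zero at both endpoints
find_numerical_support <- function(f, start = c(-1, 1),
                                   tol = .Machine$double.eps) {
  lower <- start[1]; upper <- start[2]
  while (abs(f(lower)) > tol) lower <- 2 * lower
  while (abs(f(upper)) > tol) upper <- 2 * upper
  c(lower, upper)
}
find_numerical_support(dnorm)
```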
Let's consider these objects in practice. One can simply construct `Density` objects from the Base-R `stats::dDISTR` functions.

```r
Density(dnorm, mean=2)
```
The output may seem complex at first. The underlying function is printed as well as the support (which was found numerically). In particular, we'd like to draw attention to the numerically compact support,

```r
all.equal(dnorm(-10), 0)
```

as well as to the `subdivisions`.
The `subdivisions` entry is a core component of the `KDE` package and you'll encounter it many times throughout this vignette, as it specifies the accuracy of many of the implemented functions. In this example the subdivisions indicate that integrating `dnorm` using the included `integrate_primitive` with 1000 subdivisions yields a result sufficiently close to 1 to validate `dnorm` as a `Density` (cf. `?integrate_primitive`).
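The idea behind this validation can be illustrated with a simple trapezoidal rule; the following sketch shows the concept only and is not the implementation of `integrate_primitive`:

```r
# conceptual sketch only -- not the implementation of integrate_primitive:
# integrate f over [lower, upper] using a fixed number of subdivisions
trapezoid <- function(f, lower, upper, subdivisions = 1000L) {
  x <- seq(lower, upper, length.out = subdivisions + 1)
  y <- f(x)
  sum((y[-1] + y[-length(y)]) / 2 * diff(x))
}
trapezoid(dnorm, -10, 10)  # sufficiently close to 1 to pass validation
```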
Constructing a custom density function could look like this:

```r
dens_fun <- function(x) {
  (1 + sin(2*pi*x)) * (0 <= x) * (x <= 1)
}

# automatic construction does not work in this case
# (wrapped in try() so the error does not abort the script)
try(sin_den <- Density(dens_fun))

# specifying the support by hand or increasing the number of subdivisions works
sin_den <- Density(dens_fun, subdivisions = 1500L)
sin_den <- Density(dens_fun, support = c(0, 1))
sin_den
```
In a similar manner, `KDE` gives you the possibility to build `Kernel` objects:

```r
# the sigmoid function
sigmoid_function <- function(x) 2/pi * (1/(exp(x) + exp(-x)))
all.equal(sigmoid_function(-20), 0)

# constructing the sigmoid kernel
sigmoid_kernel <- Kernel(sigmoid_function, support=c(-20, 20))
```
For the exact requirements for constructing a `Kernel` object and further information, consult `?Kernel`.
This package also provides a selection of various built-in kernels known from the literature:

```r
# using a built-in Kernel
gaussian_kernel <- gaussian
print(gaussian_kernel)
```
Run `?Kernel` to see the full list of built-in kernels.
In many situations it is favorable to create tests based on simulated data. With `rejection_sampling`, the package provides a way to draw samples from custom densities (cf. `?rejection_sampling`).
```r
# create sampler from the Density object sin_den
custom_sampler <- rejection_sampling(sin_den, Density(dunif), runif, 2)

# sample from custom sampler
set.seed(1)
num_samples <- 100
samples <- custom_sampler(num_samples)
```
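Conceptually, rejection sampling draws proposals from an instrumental density $g$ and accepts a proposal $y$ with probability $f(y) / (M g(y))$, where $M$ satisfies $f \le M g$. The following self-contained sketch illustrates the principle; it is not the package's implementation:

```r
# minimal sketch of rejection sampling -- not the KDE package's
# implementation; requires f(y) <= M * g_density(y) for all y
rejection_sample <- function(n, f, g_density, g_sampler, M) {
  out <- numeric(0)
  while (length(out) < n) {
    y <- g_sampler(n)  # proposals drawn from g
    u <- runif(n)      # uniform acceptance variables
    out <- c(out, y[u <= f(y) / (M * g_density(y))])
  }
  out[seq_len(n)]
}
# same setup as above: target dens_fun, proposal dunif, constant M = 2
head(rejection_sample(100, dens_fun, dunif, runif, 2))
```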
Now that we have created a kernel and generated a set of samples, it is time to choose a suitable bandwidth for the kernel density estimator.
We consider a bandwidth $h$ optimal if it minimizes the mean integrated squared error (MISE), also called the risk,
\begin{align} \text{MISE}(h) = \mathbb{E}_f\left[\left|\left|\hat f_h - f\right|\right|_2^2\right]. \end{align}
Since the MISE still depends on the unknown density $f$, the general idea behind the implemented bandwidth selection methods is to estimate the MISE or a corresponding upper bound. Subsequently, the bandwidth (from a finite bandwidth set $\mathcal{H}_n$) minimizing this estimate is selected.
The `KDE` package is equipped with three different solutions to this task.
Frequently, the finite bandwidth collection $\mathcal{H}_n$ to choose the optimal bandwidth from is constructed as a linearly or logarithmically spaced sequence. Analogous to the Base-R `seq` function for linearly spaced sequences, the package provides `logarithmic_bandwidth_set` to construct logarithmically spaced sequences.
```r
# setting up the bandwidth set
bandwidth_set <- logarithmic_bandwidth_set(from=1/length(samples), to=1, length.out=10)
```
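For illustration, a logarithmically spaced set is simply a linearly spaced sequence on the log scale, exponentiated back:

```r
# equivalent in spirit to the call above (illustration only)
exp(seq(log(1/length(samples)), log(1), length.out = 10))
```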
To select the bandwidth considered optimal by (leave-one-out) cross-validation, we run (cf. `?cross_validation`):
```r
# cross-validation method
bw_cv_gaussian <- cross_validation(gaussian_kernel, samples, bandwidth_set)
bw_cv_sigmoid <- cross_validation(sigmoid_kernel, samples, bandwidth_set)
cat("bandwidth for gaussian kernel: ", bw_cv_gaussian,
    "\nbandwidth for sigmoid kernel: ", bw_cv_sigmoid)
```
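Schematically, leave-one-out cross-validation minimizes the classical unbiased risk estimate (consult `?cross_validation` for the exact criterion used by the package)
\begin{align} \mathrm{CV}(h) = \left|\left|\hat f_h\right|\right|_2^2 - \frac{2}{n} \sum_{i=1}^n \hat f_{h,-i}(x_i), \end{align}
where $\hat f_{h,-i}$ denotes the estimator computed without the $i$-th observation. Up to the use of $n-1$ instead of $n$ samples, its expectation equals $\text{MISE}(h) - ||f||_2^2$, which differs from the MISE only by a term not depending on $h$.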
The second implemented bandwidth selection algorithm, the Goldenshluger-Lepski method (cf. `?goldenshluger_lepski`), can be used as follows:
```r
# Goldenshluger-Lepski method
bw_gl_gaussian <- goldenshluger_lepski(gaussian_kernel, samples, bandwidth_set)
bw_gl_sigmoid <- goldenshluger_lepski(sigmoid_kernel, samples, bandwidth_set)
cat("bandwidth for gaussian kernel: ", bw_gl_gaussian,
    "\nbandwidth for sigmoid kernel: ", bw_gl_sigmoid)
```
In the documentation you'll notice an optional tuning parameter `kappa`. More information about this parameter can be found in the literature or in the simulation study.
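In schematic form (see the references in `?goldenshluger_lepski` for the precise definitions), the method selects
\begin{align} \hat h = \arg\min_{h \in \mathcal{H}_n} \left\{ A(h) + \kappa V(h) \right\}, \end{align}
where $V(h)$ is a variance-type penalty and $A(h)$ compares estimators across bandwidth pairs as a proxy for the squared bias; the tuning parameter $\kappa$ balances the two terms.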
Penalized Comparison to Overfitting (PCO) is the third and last bandwidth selection algorithm implemented in the `KDE` package. Again, the interface looks similar (cf. `?pco_method`).
```r
# Penalized Comparison to Overfitting
bw_pco_gaussian <- pco_method(gaussian_kernel, samples, bandwidth_set)
bw_pco_sigmoid <- pco_method(sigmoid_kernel, samples, bandwidth_set)
cat("bandwidth for gaussian kernel: ", bw_pco_gaussian,
    "\nbandwidth for sigmoid kernel: ", bw_pco_sigmoid)
```
Again, there is an optional tuning parameter `lambda`. For additional information, consult the simulation study and the literature.
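In schematic form, PCO compares each estimator to the most overfitting one, i.e. the estimator built with the smallest bandwidth $h_{\min} = \min \mathcal{H}_n$, and selects
\begin{align} \hat h = \arg\min_{h \in \mathcal{H}_n} \left\{ \left|\left|\hat f_h - \hat f_{h_{\min}}\right|\right|_2^2 + \mathrm{pen}_\lambda(h) \right\}, \end{align}
where $\mathrm{pen}_\lambda$ is a penalty governed by the tuning parameter $\lambda$ (see `?pco_method` and the references therein for its exact form).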
With these algorithmic bandwidth suggestions, we can finally estimate the density $f$ using the kernel density estimator $\hat f_h$.
```r
kde_cv_gaussian <- kernel_density_estimator(gaussian_kernel, samples, bw_cv_gaussian)
kde_gl_gaussian <- kernel_density_estimator(gaussian_kernel, samples, bw_gl_gaussian)
kde_pco_gaussian <- kernel_density_estimator(gaussian_kernel, samples, bw_pco_gaussian)

kde_cv_sigmoid <- kernel_density_estimator(sigmoid_kernel, samples, bw_cv_sigmoid)
kde_gl_sigmoid <- kernel_density_estimator(sigmoid_kernel, samples, bw_gl_sigmoid)
kde_pco_sigmoid <- kernel_density_estimator(sigmoid_kernel, samples, bw_pco_sigmoid)
```
The function `kernel_density_estimator` returns an object of the S3 class `IntegrableFunction`. Therefore, we have to explicitly access the underlying function for our plots.
```r
support <- sin_den$support + c(-1/4, 1/4)
grid <- seq(support[1], support[2], by=0.01)
par(mar=c(0, 0, 0, 0))
plot(grid, dens_fun(grid), type="l", col="dark red", lty="dashed", lwd=2,
     xlab="", ylab="", ylim=c(0, 2.5))
lines(grid, kde_pco_gaussian$fun(grid), type="l", col="dark green")
legend("topright", legend = c("density", "kde"),
       col = c("dark red", "dark green"), lty = c(2, 1))
```
As a numeric comparison of the selected bandwidths, we calculate the integrated squared error (ISE):
```r
# comparing the ISE
ISE <- function(kde, den, subdivisions=100L){
  support <- range(kde$support, den$support)
  integrate(function(x){(kde$fun(x)-den$fun(x))^2},
            lower=support[1], upper=support[2],
            subdivisions=subdivisions)$value
}
```
```r
# ISE for the gaussian kernel
cat("ISE for cross validation / gaussian kernel: ", ISE(kde_cv_gaussian, sin_den))
cat("ISE for Goldenshluger-Lepski method / gaussian kernel: ", ISE(kde_gl_gaussian, sin_den))
cat("ISE for PCO method / gaussian kernel: ", ISE(kde_pco_gaussian, sin_den))
```
Using the gaussian kernel, cross-validation produced the best estimate in this example, while the Goldenshluger-Lepski method gave the worst. A graphic comparison confirms the numeric error evaluation.
```r
# plotting the KDEs and the true density
x_lim_lower <- -0.5
x_lim_upper <- 1.5
x <- seq(from = x_lim_lower, to = x_lim_upper, length.out = 1000)
par(mar=c(0, 0, 2, 0))
plot(x, sin_den$fun(x),
     xlim = c(x_lim_lower, x_lim_upper), ylim = c(-0.5, 2.5),
     main = "KDE using the gaussian kernel", xlab = "", ylab = "",
     col = "dark red", type = "l", lwd = 2, lty = "dashed")
legend("topright",
       legend = c("density", "samples", "PCO method", "Cross-validation", "Goldenshluger-Lepski"),
       col = c("dark red", "royal blue", "dark green", "violet", "steelblue2"),
       lty = c(2, NA, 1, 1, 1), lwd = c(2, 1, 1, 1, 1), cex = 0.75,
       pch = c(NA, ".", NA, NA, NA))
lines(x, kde_cv_gaussian$fun(x), col = "violet")
lines(x, kde_gl_gaussian$fun(x), col = "steelblue2")
lines(x, kde_pco_gaussian$fun(x), col = "dark green")
points(samples, integer(length(samples)), pch = ".", col = "blue")
```
```r
# ISE for the sigmoid kernel
cat("ISE for cross validation / sigmoid kernel: ", ISE(kde_cv_sigmoid, sin_den))
cat("ISE for Goldenshluger-Lepski method / sigmoid kernel: ", ISE(kde_gl_sigmoid, sin_den))
cat("ISE for PCO method / sigmoid kernel: ", ISE(kde_pco_sigmoid, sin_den))
```
In the example using the sigmoid kernel, the PCO method and the Goldenshluger-Lepski method return the same estimated optimal bandwidth. Again, cross-validation selects a bandwidth yielding a more accurate density approximation.
```r
par(mar=c(0, 0, 2, 0))
plot(x, sin_den$fun(x),
     xlim = c(x_lim_lower, x_lim_upper), ylim = c(-0.5, 2.5),
     main = "KDE using the sigmoid kernel", xlab = "", ylab = "",
     col = "dark red", type = "l", lwd = 2, lty = "dashed")
legend("topright",
       legend = c("density", "samples", "PCO method", "Cross-validation", "Goldenshluger-Lepski"),
       col = c("dark red", "royal blue", "dark green", "violet", "steelblue2"),
       lty = c(2, NA, 1, 1, 1), lwd = c(2, 1, 1, 1, 1), cex = 0.75,
       pch = c(NA, ".", NA, NA, NA))
lines(x, kde_cv_sigmoid$fun(x), col = "violet")
lines(x, kde_gl_sigmoid$fun(x), col = "steelblue2")
lines(x, kde_pco_sigmoid$fun(x), col = "dark green")
points(samples, integer(length(samples)), pch = ".", col = "blue")
```
At last, a side-by-side comparison of the ISE values calculated above:
```r
table <- data.frame(
  method = c("cross_validation", "goldenshluger_lepski", "pco_method"),
  gaussian = c(ISE(kde_cv_gaussian, sin_den),
               ISE(kde_gl_gaussian, sin_den),
               ISE(kde_pco_gaussian, sin_den)),
  sigmoid = c(ISE(kde_cv_sigmoid, sin_den),
              ISE(kde_gl_sigmoid, sin_den),
              ISE(kde_pco_sigmoid, sin_den)))
knitr::kable(table)
```
The `KDE` package is equipped with a small web application, built with the `shiny` package. Use `shiny_kde()` to open the `shiny` interface.