ks: Kernel density estimation for bivariate data
In ks: Kernel Smoothing

knitr::opts_chunk$set(global.par=TRUE, collapse=TRUE, comment="#>", fig.width=5, fig.height=5, fig.align="center", dpi=96)
options(tibble.print_min=3L, tibble.print_max=3L)

Introduction

Kernel density estimation is a popular tool for visualising the distribution of data. See Chacon \& Duong (2018), for example, for an overview. When multivariate kernel density estimation is considered it is usually in the constrained context with diagonal bandwidth matrices, e.g. in the R packages sm and KernSmooth. In contrast, the ks package implements both diagonal and unconstrained data-driven bandwidth matrices for kernel density estimation.

Bivariate data example

For our target density, we use the `dumbbell' density, given by the normal mixture $$ \frac{4}{11} N \bigg( \begin{bmatrix}-2 \ 2\end{bmatrix}, \begin{bmatrix}1 & 0 \ 0 & 1 \end{bmatrix} \bigg)+ \frac{3}{11} N \bigg( \begin{bmatrix}0 \ 0\end{bmatrix}, \begin{bmatrix}0.8 & -0.72 \ -0.72 & 0.8\end{bmatrix} \bigg)+ \frac{4}{11} N \bigg( \begin{bmatrix}2 \ -2\end{bmatrix}, \begin{bmatrix}1 & 0 \ 0 & 1 \end{bmatrix} \bigg). $$ This density is unimodal and we draw a sample of 2000 data points.

library(ks)
mus <- rbind(c(-2,2), c(0,0), c(2,-2))
Sigmas <- rbind(diag(2), matrix(c(0.8, -0.72, -0.72, 0.8), nrow=2), diag(2))
cwt <- 3/11
props <- c((1-cwt)/2, cwt, (1-cwt)/2)
plotmixt(mus=mus, Sigmas=Sigmas, props=props, display="filled.contour", xlim=c(-4,4), ylim=c(-4,4))
set.seed(88192)
samp <- 2000
x <- rmvnorm.mixt(n=samp, mus=mus, Sigmas=Sigmas, props=props)
colnames(x) <- c("x","y")
plot(x, pch=16, cex=0.5, xlim=c(-4,4), ylim=c(-4,4))

We use Hpi for unconstrained plug-in selectors and Hpi.diag for diagonal plug-in selectors.

Hpi1 <- Hpi(x=x)
Hpi2 <- Hpi.diag(x=x)

To compute a kernel density estimate, the command is kde, which creates a kde class object

fhat.pi1 <- kde(x=x)         ## same as Hpi(x=x, H=Hpi1)
fhat.pi2 <- kde(x=x, H=Hpi2)

We use the plot method for kde objects to display these kernel density estimates. The default is a contour plot with the upper 25%, 50% and 75% contours of the (sample) highest density regions. The diagonal bandwidth matrix constrains the smoothing to be performed in directions parallel to the co-ordinate axes, so it is not able to apply accurate levels of smoothing to the obliquely oriented central portion. The result is a multimodal density estimate. The unconstrained bandwidth matrix correctly produces a unimodal density estimate.

plot(fhat.pi1, main="Plug-in", display="filled.contour", xlim=c(-4,4), ylim=c(-4,4))
plot(fhat.pi2, main="Plug-in diagonal", display="filled.contour", xlim=c(-4,4), ylim=c(-4,4))

The unconstrained SCV (Smoothed Cross Validation) selector is Hscv and its diagonal version is Hscv.diag. The most reasonable density estimate below is from the unconstrained SCV selector.

Hscv1 <- Hscv(x=x)
Hscv2 <- Hscv.diag(x=x)
fhat.cv1 <- kde(x=x, H=Hscv1)
fhat.cv2 <- kde(x=x, H=Hscv2)
plot(fhat.cv1, main="SCV", display="filled.contour", xlim=c(-4,4), ylim=c(-4,4))
plot(fhat.cv2, main="SCV diagonal", display="filled.contour", xlim=c(-4,4), ylim=c(-4,4))

The unconstrained bandwidth selectors will be better than their diagonal counterparts when the data have large mass oriented obliquely to the co-ordinate axes, like for the dumbbell data. The unconstrained plug-in and the SCV selectors can be viewed as generally recommended selectors.