Simulation Study for the getSamplesDPmeasure function

knitr::opts_chunk$set(echo = TRUE)

Description

We want to test the getSamplesDPmeasure function. The idea is to check, in a small Monte Carlo (MC) study, whether the samples it generates are close to the true random measure. For this we look at the discrepancies between the samples of the random measure generated by the getSamplesDPmeasure function and the true random measure (more precisely, between functions of them!), over repetitions of the same "experiment". Discrepancies are measured by a slightly modified version of the Kolmogorov-Smirnov (K-S) statistic, which compares cumulative distribution functions (cdfs).

True distribution

In this test, the true random measure, denoted $G_0$, is a mixture of three point masses at -2.5, 2.5, and 8, all with the same weight, and the true cdf, denoted $F_0$, is a mixture of three normal distributions with means -2.5, 2.5, and 8, equal weights, and common variance 0.5. More specifically,
\begin{align}
F_0(y) &= \int \Phi(y \mid \theta, 0.5)\,G_0(d\theta) \nonumber \\
&= (1/3)\Phi(y\mid -2.5, 0.5) + (1/3)\Phi(y\mid 2.5, 0.5) + (1/3)\Phi(y\mid 8, 0.5), \quad (1) \\
G_0(\theta) &= (1/3)\delta_{(-2.5)}(\theta) + (1/3)\delta_{(2.5)}(\theta) + (1/3)\delta_{(8)}(\theta), \nonumber
\end{align}
where $\delta_x$ denotes the Dirac measure at $x$ and $\Phi(y\mid \mu, \sigma^2)$ denotes the cdf of a normal distribution with mean $\mu$ and variance $\sigma^2$.
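For concreteness, $F_0$ can be written as a small R function (a minimal sketch; the code in the appendix evaluates the same cdf on a grid):

# true cdf F0 in (1): equal-weight mixture of three normals with variance 0.5
F0 <- function(y) {
  (pnorm(y, -2.5, sqrt(0.5)) + pnorm(y, 2.5, sqrt(0.5)) + pnorm(y, 8, sqrt(0.5))) / 3
}
F0(0)  # about 1/3: essentially all the mass of the first component lies below 0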

Monte Carlo Study

As mentioned before, we perform a small MC study in which we simulate 10 data sets of size 2,001 from (1). See the appendix for more details on the simulated data sets and the NIMBLE model.

Comparison criteria

For each data set, we look at the posterior distribution of a modified K-S statistic. For continuous cdfs, the modified K-S statistic is given by $K(F, F_0)=\sqrt{n}\,\sup_y \vert F(y) - F_0(y) \vert$, where $n$ is the sample size and $F$ is an approximation to the cdf $F_0$ (in the usual K-S statistic, $F$ is the empirical cdf; in our modified version, $F$ is a posterior sample of the cdf). We approximate the modified K-S statistic by $K(F, F_0)\approx \sqrt{n}\max_{l=1, \ldots, L}\vert F(y_l) - F_0(y_l)\vert$, where the $y_l$ are points on an equally spaced grid (with separation 0.02) of size $L=901$ that ranges from -6 to 12. Posterior samples of the modified statistic are computed as $K(F^{(t)}, F_0)$, $t=1, \ldots, T$, where $F^{(t)}$ denotes posterior sample $t$ of the cdf, computed from the output of the getSamplesDPmeasure function as
\begin{align}\label{getDP1}
F^{(t)}(y_l) &= \sum_{j=1}^{N} w_{j}^{(t)}\Phi(y_l\mid \mu_j^{(t)}, 0.5),
\end{align}
where $N$ is the truncation level and $\{w_{j}^{(t)}, \mu_j^{(t)}\}_{j=1}^N$ denote the weights and point masses of posterior sample $t$ of the random measure. The getSamplesDPmeasure function computes posterior samples of the random measure from a Markov chain Monte Carlo (MCMC) output of 5,000 saved iterations, obtained after a burn-in period of 5,000 iterations.
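For example, $K(F^{(t)}, F_0)$ can be computed as follows (a sketch using the F0 function above; the weights and atoms values are illustrative stand-ins for one element of the getSamplesDPmeasure output, as in the appendix code):

# illustrative weights and point masses of one posterior sample of the random measure
weights <- c(0.3, 0.3, 0.4)
atoms   <- c(-2.4, 2.6, 7.9)
grid    <- seq(-6, 12, by = 0.02)    # L = 901 grid points
F0grid  <- F0(grid)                  # true cdf from the sketch above
# F^{(t)}(y_l) = sum_j w_j Phi(y_l | mu_j, 0.5)
Ft <- sapply(grid, function(y) sum(weights * pnorm(y, atoms, sqrt(0.5))))
sqrt(2001) * max(abs(Ft - F0grid))   # modified K-S statistic, n = 2,001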

For large sample sizes, the K-S statistic converges (in distribution) to the Kolmogorov distribution, which does not depend on $F_0$ and whose 0.95 quantile is approximately $K_{\alpha}=1.3582$. So, to show that the getSamplesDPmeasure function works well, we want the posterior samples of the modified K-S statistic, $K(F^{(t)}, F_0)$, $t=1, \ldots, 5{,}000$, to be, in general, smaller than $K_{\alpha}$. We also compute the proportion of samples of the modified K-S statistic that are smaller than $K_{\alpha}$.
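For reference, $K_{\alpha}$ is obtained by numerically inverting the cdf of the Kolmogorov distribution, which admits the series representation
\begin{align}
P(K \le x) &= \frac{\sqrt{2\pi}}{x}\sum_{k=1}^{\infty} \exp\left(-\frac{(2k-1)^2\pi^2}{8x^2}\right);
\end{align}
the pK function in the appendix evaluates this series truncated at 5,000 terms and inverts it on a fine grid.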

Results

The following table summarizes, over the $R=10$ data sets of the MC study, the proportion of modified K-S statistics that are smaller than the 0.95 Kolmogorov quantile.

  R     min   $q_{0.1}$   mean   median   $q_{0.9}$    max
 ---   -----  ----------  -----  -------  ----------  -----
  10   0.971    0.975     0.981   0.982     0.987     0.988

For every one of the 10 data sets, at least 97.1\% of the posterior samples of the cdf show (nearly) negligible discrepancies with respect to the true cdf. These proportions are higher than the nominal 95\% we would expect, but this may be because the Kolmogorov distribution is an asymptotic result for the K-S statistic (based on the empirical cdf), whereas here we use a modified version of it. See the appendix for plots of the posterior distribution of the modified K-S statistic for each data set, and for the code.

Code

library(nimble)

#------------------------------------------------------
# cdf of the Kolmogorov distribution, P(K <= x), via a truncated series (5,000 terms)
pK <- function(x) {
  ks <- 1:5000
  aux <- ((2*ks-1)^2*pi^2) / (8*x^2)
  aux <- sum(exp(-aux))
  aux <- sqrt(2*pi) * aux / x 
  return(aux)
}
gridk <- seq(10^(-6), 3, len=10000)
alpha <- 0.05
# 0.95 Kolmogorov quantile (approximately 1.3582)
kAlpha <- gridk[which.max(sapply(gridk, pK) >= 1 - alpha)]
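# illustrative sanity check (not in the original script): the Kolmogorov cdf
# at the computed quantile should be approximately 1 - alpha = 0.95
pK(kAlpha)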
#------------------------------------------------------

consts <- list(var0 = 0.5, n=2001)
var0 <- consts$var0

# true cdf F0 from (1), evaluated on the grid of L = 901 points
grid <- seq(-6, 12, by=0.02) 
trueNormalCDF <- sapply(grid, function(x) pnorm(x,-2.5,sqrt(var0))/3 + pnorm(x,2.5,sqrt(var0))/3 + pnorm(x,8,sqrt(var0))/3 )

#------------------------------------------------------
# Monte Carlo Study
proportions <- c()

for(rep in 1:10) {
  set.seed(rep)
  code <- nimbleCode({
    for(i in 1:n) {
      y[i] ~ dnorm(thetaTilde[xi[i]], var = var0)   # normal kernel with known variance
    }
    for(i in 1:50) {
      thetaTilde[i] ~ dnorm(0, var = 100)   # cluster-specific means (at most 50 clusters)
    }
    xi[1:n] ~ dCRP(1, size = n)   # CRP prior on cluster memberships, concentration 1
  })

  Inits <- list(xi = rep(1, consts$n), thetaTilde = rep(0,50))
  y1 <- c(rnorm(667, -2.5, sqrt(var0)), rnorm(667, 2.5, sqrt(var0)), rnorm(667, 8, sqrt(var0)) )
  Data <- list(y = y1)
  model <- nimbleModel(code, data = Data, inits = Inits, constants = consts, calculate = TRUE)
  cmodel <- compileNimble(model)
  mConf <- configureMCMC(model, monitors = c('xi','thetaTilde'))
  mMCMC <- buildMCMC(mConf)
  cMCMC <- compileNimble(mMCMC, project = model)


  nsave <- 5000 
  nburn <- 5000
  out <- runMCMC(cMCMC, niter = (nsave+nburn), nburnin = nburn)

  sampleG <- getSamplesDPmeasure(cMCMC)
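  # sampleG is a list with one matrix per saved MCMC iteration: column 1
  # holds the weights and column 2 the atoms (point masses) of the
  # posterior sample of the random measure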


  #--------------------------
  # sup-norm distance between each posterior cdf sample and the true cdf
  totalVarNormNormal <- c()

  for(iter in 1:nsave) {
    weights <- sampleG[[iter]][, 1]
    atoms <- sampleG[[iter]][, 2]

    sampNormalCDF <- sapply(grid, function(x)sum(weights * pnorm(x, atoms, sqrt(var0))))

    totalVarNormNormal[iter] <- max(abs(trueNormalCDF - sampNormalCDF))
  }

  #--------------------------
  # modified K-S statistic
  DnNormal <- sqrt(consts$n) * totalVarNormNormal

  # hist(DnNormal, freq = FALSE, xlab = "", ylab = "", main = "")
  # abline(v = kAlpha, col = "red", lwd = 2)

  proportions[rep] <- mean(DnNormal <= kAlpha)
}

# values shown in the table
length(proportions)
min(proportions)
quantile(proportions, 0.1)
mean(proportions)
quantile(proportions, 0.5)
quantile(proportions, 0.9)
max(proportions)

