```r
knitr::opts_chunk$set(echo = TRUE)
```
We want to test the `getSamplesDPmeasure` function. The idea is to assess, in a small Monte Carlo (MC) study, whether the samples it produces are close to the true random measure. To do so, we look at the discrepancies between the samples of the random measure generated by `getSamplesDPmeasure` and the true random measure (more precisely, between functions of them!) across repetitions of the same ``experiment''. Discrepancies are measured by a slightly modified version of the Kolmogorov-Smirnov (K-S) statistic, which compares cumulative distribution functions (cdfs).
In this test, the true random measure, denoted $G_0$, is a mixture of three point masses at -2.5, 2.5, and 8, all with equal weights, and the true cdf, denoted $F_0$, is a mixture of three normal distributions with means -2.5, 2.5, and 8, equal weights, and common variance 0.5. More specifically,
\begin{align} \label{cdf0}
F_0(y) &= \int \Phi(y \mid \theta, 0.5)\, G_0(d\theta) \nonumber \\
&= (1/3)\Phi(y\mid -2.5, 0.5) + (1/3)\Phi(y\mid 2.5, 0.5) + (1/3)\Phi(y\mid 8, 0.5), \quad (1) \\
G_0(\theta) &= (1/3)\delta_{(-2.5)}(\theta) + (1/3)\delta_{(2.5)}(\theta) + (1/3)\delta_{(8)}(\theta), \nonumber
\end{align}
where $\delta_x$ denotes the Dirac measure at $x$ and $\Phi(y\mid \mu, \sigma^2)$ denotes the cdf of a normal distribution with mean $\mu$ and variance $\sigma^2$.
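As a quick sanity check, $F_0$ in (1) can be evaluated directly in R; this is a minimal sketch, and the function name `F0` is ours:

```r
# True cdf F0 from (1): equal-weight mixture of three normal cdfs, variance 0.5
var0 <- 0.5
F0 <- function(y) {
  (pnorm(y, -2.5, sqrt(var0)) +
   pnorm(y,  2.5, sqrt(var0)) +
   pnorm(y,  8,   sqrt(var0))) / 3
}
F0(0)  # approximately 1/3: essentially all of the first component's mass lies below 0
```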
As mentioned before, we perform a small MC study in which we simulate 10 data sets of size 2,001 from (1). See the appendix for more details on the simulated data sets and the NIMBLE model.
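For illustration, one such data set can be simulated as follows (a sketch consistent with the appendix code, where each of the three mixture components contributes 667 observations):

```r
# Simulate one data set of size 2001 = 3 x 667 from the mixture (1)
var0 <- 0.5
set.seed(1)
y1 <- c(rnorm(667, -2.5, sqrt(var0)),
        rnorm(667,  2.5, sqrt(var0)),
        rnorm(667,  8,   sqrt(var0)))
length(y1)  # 2001
```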
For each data set, we look at the posterior distribution of a modified K-S statistic. For continuous cdfs, the modified K-S statistic is given by $K(F, F_0)=\sqrt{n}\,\Vert F(y) - F_0(y) \Vert_{TV}$, where $n$ is the sample size, $\Vert \cdot \Vert_{TV}$ denotes the total variation norm, and $F$ is an approximation to the cdf $F_0$ (in the K-S statistic, $F$ is the empirical cdf; in our modified K-S statistic, $F$ is a posterior sample of the cdf). We approximate the modified K-S statistic by $K(F, F_0)\approx \sqrt{n}\max_{l=1, \ldots, L}\vert F(y_l) - F_0(y_l)\vert$, where the $y_l$ are points on an equally spaced grid (with separation 0.02) of size $L=901$ ranging from -6 to 12. Posterior samples of the modified statistic are computed as $K(F^{(t)}, F_0)$, $t=1, \ldots, T$, where $F^{(t)}$ denotes posterior sample $t$ of the cdf, computed from the output of the `getSamplesDPmeasure` function as follows:
\begin{align}\label{getDP1}
F^{(t)}(y_l) &= \sum_{j=1}^{N}w_{j}^{(t)}\Phi(y_l\mid \mu_j^{(t)}, 0.5),
\end{align}
where $N$ is the truncation level and $\{w_{j}^{(t)}, \mu_j^{(t)}\}_{j=1}^N$ denote the weights and point masses of posterior sample $t$ of the random measure. The `getSamplesDPmeasure` function computes posterior samples of the random measure from a 5,000-iteration Markov chain Monte Carlo (MCMC) output, obtained after a burn-in period of 5,000 iterations.
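The computation of $K(F^{(t)}, F_0)$ for a single posterior sample of the random measure can be sketched as follows; the weights `w` and atoms `mu` below are made-up placeholders, not actual MCMC output:

```r
var0 <- 0.5
n <- 2001
grid <- seq(-6, 12, by = 0.02)  # L = 901 equally spaced points
# true cdf F0 on the grid (equal-weight mixture of three normal cdfs)
F0grid <- sapply(grid, function(y) mean(pnorm(y, c(-2.5, 2.5, 8), sqrt(var0))))
# hypothetical weights and point masses of one posterior sample of the measure
w  <- c(0.34, 0.33, 0.33)
mu <- c(-2.45, 2.52, 7.90)
# posterior sample of the cdf, F^{(t)}(y_l) = sum_j w_j * Phi(y_l | mu_j, 0.5)
Ft <- sapply(grid, function(y) sum(w * pnorm(y, mu, sqrt(var0))))
# modified K-S statistic for this sample
K <- sqrt(n) * max(abs(Ft - F0grid))
```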
For large sample sizes, the K-S statistic converges in distribution to the Kolmogorov distribution, which does not depend on $F_0$ and whose 0.95 quantile is approximately $K_{\alpha}=1.3582$. So, to show that the `getSamplesDPmeasure` function works well, we want to see that the posterior samples of the modified K-S statistic, $K(F^{(t)}, F_0)$, $t=1, \ldots, 5{,}000$, are, in general, smaller than $K_{\alpha}$. We also compute the proportion of samples of the modified K-S statistic that are smaller than $K_{\alpha}$.
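The quoted 0.95 Kolmogorov quantile can be checked numerically from the series representation of the Kolmogorov cdf, $P(K \le x) = \frac{\sqrt{2\pi}}{x}\sum_{k \ge 1} e^{-(2k-1)^2\pi^2/(8x^2)}$ (a sketch; the function name `pKolmogorov` is ours):

```r
# Kolmogorov cdf via its (truncated) series representation
pKolmogorov <- function(x, kmax = 1000) {
  k <- 1:kmax
  sqrt(2 * pi) / x * sum(exp(-(2 * k - 1)^2 * pi^2 / (8 * x^2)))
}
pKolmogorov(1.3582)  # approximately 0.95
```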
The following table summarizes, over the MC study, the proportion of modified K-S statistics that are smaller than the 0.95 Kolmogorov quantile.
| R  | min   | $q_{0.1}$ | mean  | median | $q_{0.9}$ | max   |
|----|-------|-----------|-------|--------|-----------|-------|
| 10 | 0.971 | 0.975     | 0.981 | 0.982  | 0.987     | 0.988 |
For each of the 10 data sets, at least 97.1\% of the posterior samples of the cdf show (nearly) negligible discrepancies with respect to the true cdf. These proportions are higher than what we would expect, but this may be because the Kolmogorov distribution is an asymptotic result for the K-S statistic (based on the empirical cdf), while here we use a modified version of it. See the appendix for plots of the posterior distribution of the modified K-S statistic for each data set, and for the code.
```r
library(nimble)

#------------------------------------------------------
# Kolmogorov cdf: P(K <= x) via its series representation
pK <- function(x) {
  ks <- 1:5000
  aux <- ((2 * ks - 1)^2 * pi^2) / (8 * x^2)
  aux <- sum(exp(-aux))
  aux <- sqrt(2 * pi) * aux / x
  return(aux)
}
gridk <- seq(10^(-6), 3, len = 10000)
alpha <- 0.05
# K-S quantile: smallest grid point with cdf >= 1 - alpha
kAlpha <- gridk[which.max(sapply(gridk, pK) >= 1 - alpha)]  # quantile to be compared to

#------------------------------------------------------
consts <- list(var0 = 0.5, n = 2001)
var0 <- consts$var0

# true cdf F0 evaluated on the grid
grid <- seq(-6, 12, by = 0.02)
trueNormalCDF <- sapply(grid, function(x)
  pnorm(x, -2.5, sqrt(var0)) / 3 +
  pnorm(x,  2.5, sqrt(var0)) / 3 +
  pnorm(x,  8,   sqrt(var0)) / 3)

#------------------------------------------------------
# Monte Carlo study
proportions <- c()
for (rep in 1:10) {
  set.seed(rep)

  # DP mixture model with a CRP prior on the cluster memberships
  code <- nimbleCode({
    for (i in 1:n) {
      y[i] ~ dnorm(thetaTilde[xi[i]], var = var0)
    }
    for (i in 1:50) {
      thetaTilde[i] ~ dnorm(0, var = 100)
    }
    xi[1:n] ~ dCRP(1, size = n)
  })
  Inits <- list(xi = rep(1, consts$n), thetaTilde = rep(0, 50))
  y1 <- c(rnorm(667, -2.5, sqrt(var0)),
          rnorm(667,  2.5, sqrt(var0)),
          rnorm(667,  8,   sqrt(var0)))
  Data <- list(y = y1)

  model  <- nimbleModel(code, data = Data, inits = Inits,
                        constants = consts, calculate = TRUE)
  cmodel <- compileNimble(model)
  mConf  <- configureMCMC(model, monitors = c('xi', 'thetaTilde'))
  mMCMC  <- buildMCMC(mConf)
  cMCMC  <- compileNimble(mMCMC, project = model)

  nsave <- 5000
  nburn <- 5000
  out <- runMCMC(cMCMC, niter = nsave + nburn, nburnin = nburn)
  sampleG <- getSamplesDPmeasure(cMCMC)

  #--------------------------
  # sup distance between each sampled cdf and the true cdf
  totalVarNormNormal <- c()
  for (iter in 1:nsave) {
    weights <- sampleG[[iter]][, 1]
    atoms   <- sampleG[[iter]][, 2]
    sampNormalCDF <- sapply(grid, function(x)
      sum(weights * pnorm(x, atoms, sqrt(var0))))
    totalVarNormNormal[iter] <- max(abs(trueNormalCDF - sampNormalCDF))
  }

  #--------------------------
  # modified K-S statistic
  DnNormal <- sqrt(consts$n) * totalVarNormNormal
  # hist(DnNormal, freq = FALSE, xlab = "", ylab = "", main = "")
  # abline(v = kAlpha, col = "red", lwd = 2)

  proportions[rep] <- mean(DnNormal <= kAlpha)
}

# values shown in the table
length(proportions)
min(proportions)
quantile(proportions, 0.1)
mean(proportions)
quantile(proportions, 0.5)
quantile(proportions, 0.9)
max(proportions)
```