Markov random field models are common for spatial data. Rather than specifying a joint distribution directly, a model is specified through a set of full conditional distributions for each spatial location. The R package conclique provides a way to simulate from a MRF using a Gibbs sampler based on concliques as well as implements a formal goodness-of-fit (GOF) test for model assessment. For reference, see [@kaiser2012goodness].
This vignette shows an example of how to test for Goodness-of-Fit (GOF) using a conclique-based approach in the package conclique. This example is preceeded by an example of simulating spatial data using conclique-based Gibbs sampling in the vignette Conclique-based Gibbs sampling
library(conclique) library(dplyr) library(tidyr) library(ggplot2) set.seed(300) N <- 6 grid <- matrix(1:(N*N), nrow = N) lattice <- lattice_4nn_torus(c(N, N)) concliques <- min_conclique_cover(lattice) neighbors <- get_neighbors(lattice) params <- list(rho = 1, kappa = 0, eta = .24) inits <- matrix(0, nrow = N, ncol = N) num_sample <- 10000 conclique_simulation <- run_conclique_gibbs(concliques, neighbors, inits, "gaussian_single_param", params, num_sample) sequential_simulation <- run_sequential_gibbs(neighbors, inits, "gaussian_single_param", params, num_sample)
Let $F(y|\boldsymbol y(N_i), \boldsymbol \theta)$ be the conditional cdf of the data, $Y(\boldsymbol s_i)$ under the model. We can define generlized spatial residuals through substitution of random variables, $Y(\boldsymbol s_i)$ and neighbors ${Y(\boldsymbol s_j): \boldsymbol s_j \in N_i }$, into (continuous) conditional cdf:
$$ R(\boldsymbol s_i) = F(Y(\boldsymbol s_i)|{Y(\boldsymbol s_j): \boldsymbol s_j \in N_i }, \boldsymbol \theta). $$
These residuals, $R(\boldsymbol s_i)$, can be defined for non-continuous $F$ too.
It then holds that within a conclique, the generalized spatial residuals are iid Uniform$(0, 1)$-distributed.
The spatial residuals are obtainable using the conclique package. We can continue with our example from the vignette Conclique-based Gibbs sampling using data generated from a Gaussian MRF with $\rho = 1, \kappa = 0, \eta = .24$. Use the last dataset we generated as our "data".
conclique_dat <- conclique_simulation[num_sample, ] conclique_resids <- spatial_residuals(conclique_dat, neighbors, "gaussian_single_param", params)
ecdf_vals <- lapply(concliques, function(conc) {fn <- ecdf(conclique_dat[conc]); fn(conclique_dat[conc])}) names(ecdf_vals) <- 1:length(concliques) ecdf_vals <- do.call(cbind, ecdf_vals) %>% data.frame() ecdf_vals %>% data.frame() %>% gather(conclique, ecdf) %>% separate(conclique, into = c("junk", "conclique"), 1) -> ecdf_vals ecdf_vals[, "u"] <- do.call(c, lapply(concliques, function(conc) conclique_resids[conc])) ecdf_vals %>% ggplot() + geom_point(aes(u, ecdf, colour = conclique)) + geom_abline(aes(slope = 1, intercept = 0)) + theme_bw() + theme(aspect.ratio = 1)
We could also look at the residuals from a poorly fit model, say with $\eta = 0.05$.
conclique_resids_err <- spatial_residuals(conclique_dat, neighbors, "gaussian_single_param", list(rho = 1, kappa = 0, eta = 0.05))
ecdf_vals_err <- lapply(concliques, function(conc) {fn <- ecdf(conclique_dat[conc]); fn(conclique_dat[conc])}) names(ecdf_vals_err) <- 1:length(concliques) ecdf_vals_err <- do.call(cbind, ecdf_vals_err) %>% data.frame() ecdf_vals_err %>% data.frame() %>% gather(conclique, ecdf) %>% separate(conclique, into = c("junk", "conclique"), 1) -> ecdf_vals_err ecdf_vals_err[, "u"] <- do.call(c, lapply(concliques, function(conc) conclique_resids_err[conc])) ecdf_vals_err %>% ggplot() + geom_point(aes(u, ecdf, colour = conclique)) + geom_abline(aes(slope = 1, intercept = 0)) + theme_bw() + theme(aspect.ratio = 1)
We can see that for both concliques, all of the residuals are below the Uniform$(0,1)$ cdf, showing this may be a poor fit (which in fact it is by design).
In order to combine generalized spatial residuals across concliques to form a single test statistic, we look at the empirical cdf of the residuals and its difference to the Uniform$(0,1)$ cdf using some discrepancy measure (like the Kolmogorov-Smirnov statistic) and combine using an aggregation function (like maximum).
Both the Kolmogorov-Smirnov statistic and the Cramèr-von-Mises criterion are implemented in conclique, and any user specified aggregation can be supplied; mean and max are recommended.
conclique_gof <- gof_statistics(conclique_resids, concliques, "ks", "max") conclique_gof_err <- gof_statistics(conclique_resids_err, concliques, "ks", "max")
data.frame(True = conclique_gof, Incorrect = conclique_gof_err) %>% knitr::kable()
We can see that the incorrectly specified model has a higher value of the statistic, which points to potentially rejecting this model, however without a reference distribution, it's impossible to say. Luckily, we have the ability to sample from the model and bootstrap this distribution using our simulation functions in conclique.
concliqueOnce again, one of the key advantages to using conclique-based approaches for simulation and GOF tests is the ability to extend to models beyond the Gaussian MRF with 4-nearest neighbor structure. In order to extend conclique to use an arbitrary conditional distibution for the generalized spatial residuals, the user must specify a cdf function which takes as parameters
data A list containing two elements, sums and nums, which contain the sum of the data in each neighborhood as well as the number of locations in the neighborhood for each point in the conclique. params A named list of parameter values, that parameterize the distribution.For example, the cdf function for the single dependency parameter Gaussian model is reproduced below.
gaussian_single_param_cdf <- function(data, params) { rho <- params$rho kappa <- params$kappa eta <- params$eta mean_structure <- kappa + eta*(data$sums - data$nums*kappa) pnorm(data$data, mean = mean_structure, sd = rho) }
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.