spmeshed: Posterior sampling for models based on MGPs
In meshed: Bayesian Regression with Meshed Gaussian Processes

spmeshed

R Documentation

Posterior sampling for models based on MGPs

Description

Fits Bayesian multivariate spatial or spatiotemporal regression models with latent MGPs via Markov chain Monte Carlo.

Usage

spmeshed(y, x, coords, k=NULL,
       family = "gaussian",
       axis_partition = NULL, 
       block_size = 30,
       grid_size = NULL,
       grid_custom = NULL,
       n_samples = 1000,
       n_burnin = 100,
       n_thin = 1,
       n_threads = 4,
       verbose = 0,
       predict_everywhere = FALSE,
       settings = list(adapting=TRUE, forced_grid=NULL, 
                          cache=NULL, ps=TRUE, saving=TRUE, low_mem=FALSE, hmc=4),
       prior = list(beta=NULL, tausq=NULL, sigmasq = NULL,
                          phi=NULL, a=NULL, nu = NULL,
                          toplim = NULL, btmlim = NULL, set_unif_bounds=NULL),
       starting = list(beta=NULL, tausq=NULL, theta=NULL, lambda=NULL, v=NULL, 
                       a=NULL, nu = NULL,
                       mcmcsd=.05, 
                       mcmc_startfrom=0),
       debug = list(sample_beta=TRUE, sample_tausq=TRUE, 
                    sample_theta=TRUE, sample_w=TRUE, sample_lambda=TRUE,
                    verbose=FALSE, debug=FALSE),
       indpart=FALSE
)

Arguments

`y`	matrix of multivariate outcomes with n rows and q columns. Each row of `y` corresponds to a row of `coords`. `NA` values are accepted in any combination and will be predicted via MCMC.
`x`	matrix of covariates with n rows and p columns.
`coords`	matrix of coordinates with n rows and d=2 or d=3 columns for spatial or spacetime regression, respectively.
`k`	integer k≤q q, number of latent processes to use for the linear model of coregionalization. If unspecified, this is set to q`=ncol(y)`.
`family`	a vector with length 1 or q whose elements corresponds to the data types of columns of `y`. Available choices are `gaussian`, `poisson`, `binomial`, `beta` for outcomes that are continuous, count, binary, or (0,1) proportions.
`axis_partition`	integer vector of size d: number of intervals each coordinate axis is split into
`block_size`	integer approximate size of the blocks after domain partitioning. Only used if `axis_partition` is not specified.
`grid_size`	integer vector of size d: number of 'knots' of the reference grid along each axis. This grid is then partitioned using either `axis_partition` or `block_size`. If unspecified, this is set so that the eventual grid size is close to n. This parameter is ignored if `settings$forced_grid=FALSE` in which case the data are assumed to be on a grid.
`grid_custom`	list with elements `grid` and `axis_interval_partition`. `grid` is a data.frame with the user supplied grid of knots. It is possible to include covariate values for the grid locations as additional columns, as long as their number matches `ncol(x)` - this is useful to make raster images of predictions. `axis_interval_partition` is the user supplied set of cuts for each coordinate axis (Note: these are the actual cutpoints along the axes, not the number of cuts). If left empty, `axis_partition` will be used to partition the custom grid. No checks are made on the validity of this grid. This parameter is ignored if `settings$forced_grid=FALSE` in which case the data are assumed to be on a grid.
`n_samples`	integer number of MCMC samples at which all the unknowns are stored (including the latent effects).
`n_burnin`	integer number of MCMC samples to discard at the beginning of the chain.
`n_thin`	integer thinning parameter for the MCMC chain. Only the chain of latent effects (w) is thinned to save memory in big data problems. Chains for other unknowns are not thinned and thus will be of length `n_thin * n_samples`.
`n_threads`	integer number of OpenMP threads. This is ineffective if `meshed` was not compiled with OpenMP support.
`verbose`	integer. If `verbose<=20`, then this is the number of times a message is displayed during MCMC. If `verbose>20`, then this is the number of MCMC iterations to wait until the next message update. If `verbose=Inf`, then a message will be printed at each MCMC iteration.
`predict_everywhere`	bool used if settings$forced_grid=T. Should predictions be made at the reference grid locations? If not, predictions will be made only at the supplied NA values of Y.
`settings`	list: `settings$adapting` turns the adaptation of MCMC on/off, `settings$forced_grid` determines whether or not to use the data grid or a forced grid; if unspecified, the function will try to see what the data look like. Note: if `forced_grid=FALSE` and n is very large and coords are irregularly spaced, then expect slowdowns in preprocessing and consider using `forced_grid=TRUE` instead. `settings$saving` will save model data if set to `TRUE`. `settings$low_mem` will only save `beta_mcmc`, `lambda_mcmc`, `v_mcmc`, `tausq_mcmc` (and not `w_mcmc` and `lp_mcmc`, which can be recovered from the others), thereby using less memory. All fitted predictions remain available in `yhat_mcmc` for convenience. `settings$ps` (default `TRUE`) determines whether to use the PS parametrization (Peruzzi et al 2021). `settings$hmc`, used if any outcome is not Gaussian, (1: MALA, 2: NUTS, 3: RM-MALA, 4: Simplified manifold preconditioning (default))
`prior`	list: setup for priors of unknown parameters. `prior$phi` needs to be specified as the support of the Uniform prior for φ. There is currently limited functionality here and some inputs are currently ignored. Defaults are: a vague Gaussian for β, τ^2_i \sim IG(2,1), θ_j \sim IG(2,2), all subject to change.
`starting`	list: setup for starting values of unknown parameters. `starting$mcmcsd` is the initial standard deviation of proposals. `starting$mcmc_startfrom` is input to the adaptive MCMC and can be used to manually restart MCMC. There is currently limited functionality here and some parameters may be ignored.
`debug`	list: setup for debugging things. Some parts of MCMC can be turned off here.
`indpart`	bool defaults to `FALSE`. If `TRUE`, this computes an independent partition model.

Details

This function targets the following model:

y(s) = x(s)^\top β + Λ v(s) + ε(s),

where y(s) is a q-dimensional vector of outcomes at spatial location s, x(s) is a p-dimensional vector of covariates with static coefficients β, Λ is a matrix of factor loadings of size (q, k), v(s) is a k-dimensional vector which collects the realization of independent Gaussian processes v_j \sim spmeshed(0, C_j) for j=1, …, k and where C_j(s, s') is a correlation function. s is a coordinate in space (d=2) or space plus time (d=3). The Meshed GP implemented here associates an axis-parallel tessellation of the domain to a cubic directed acyclic graph (mesh).

Value

`coordsdata`	data.frame including the original n coordinates plus the n_g knot coordinates if the model was run on a forced grid. The additional column `forced_grid` has value 1 if the corresponding coordinate is a knot in the forced grid. See examples.
`savedata`	Available if `settings$saving==TRUE`. Needed for making predictions using `predict()` after MCMC. Note: `NA` values of the output are automatically and more efficiently predicted when running `spmeshed`.
`yhat_mcmc`	list of length `n_samples` whose elements are matrices with n + n_g rows and q columns. Each matrix in the list is a posterior predictive sample of the latent spatial process. n_g = 0 if the data grid is being used. Given the possibly large n, only the thinned chain is output for y.
`v_mcmc`	list of length `n_samples` whose elements are matrices with n + n_g rows and k columns. Each matrix in the list is a posterior sample of the k latent spatial process. n_g = 0 if the data grid is being used. Given the possibly large n, only the thinned chain is output for v.
`w_mcmc`	list of length `n_samples` whose elements are matrices with n + n_g rows and q columns. Each matrix in the list is a posterior sample of w = Λ v. n_g = 0 if the data grid is being used. Given the possibly large n, only the thinned chain is output for w.
`lp_mcmc`	list of length `n_samples` whose elements are matrices with n + n_g rows and q columns. Each matrix in the list is a posterior sample of the linear predictor Xβ + Λ v. n_g = 0 if the data grid is being used. Given the possibly large n, only the thinned chain is output for w.
`beta_mcmc`	array of size `(p, q, n_thin*n_samples)` with the posterior sample for the static regression coefficients β. The jth column of each matrix (p rows and q columns) corresponds to the p linear effects on the jth outcome. The full chain minus burn-in is returned NOT thinned since `p` and `q` are relatively small.
`tausq_mcmc`	matrix of size `(q, n_thinn_samples)`. Each row corresponds to the full MCMC chain for the nugget τ^2_j* of the jth outcome in the coregionalization/factor model. The full chain minus burn-in is returned NOT thinned since `q` is relatively small.
`theta_mcmc`	array of size `(h, k, n_thinn_samples)` with the posterior sample for the correlation function parameters θ. `h` is 2 for spatial data (corresponding to the spatial decay of the exponential covariance (φ_i, i=1, …, k), and the variance σ^2_i, i=1, …, k), 4 for spacetime data (corresponding to temporal decay, spatial decay, and separability – these are referred to as a_i, φ_i, and β_i, i=1, …, k, in Gneiting (2002), see doi: 10.1198/016214502760047113, plus the variance σ^2, i=1, …, k). The full chain minus burn-in is returned NOT thinned since `h` and `k` are relatively small. If `settings$ps=TRUE`, the MCMC output for σ^2_i* (last row of `theta_mcmc`) should be discarded and Λ used instead.
`lambda_mcmc`	array of size `(q, k, n_thinn_samples)`. Each matrix (of size (q,k)*) is a posterior sample for Λ in the coregionalization/factor model. In univariate models, this is usually called σ. The full chain minus burn-in is returned NOT thinned since `q` and `k` are relatively small.
`paramsd`	Cholesky factorization of the proposal covariance for adaptive MCMC, after adaptation.
`mcmc`	Total number of MCMC iterations performed.
`mcmc_time`	Time in seconds taken for MCMC (not including preprocessing).

Author(s)

Michele Peruzzi michele.peruzzi@duke.edu

References

Peruzzi, M., Banerjee, S., and Finley, A.O. (2022) Highly Scalable Bayesian Geostatistical Modeling via Meshed Gaussian Processes on Partitioned Domains. Journal of the American Statistical Association, 117(538):969-982. doi: 10.1080/01621459.2020.1833889

Peruzzi, M., Banerjee, S., Dunson, D.B., and Finley, A.O. (2021) Grid-Parametrize-Split (GriPS) for Improved Scalable Inference in Spatial Big Data Analysis. https://arxiv.org/abs/2101.03579

Peruzzi, M. and Dunson, D.B. (2022) Spatial meshing for general Bayesian multivariate models. https://arxiv.org/abs/2201.10080

Examples


# toy example with tiny dataset and short MCMC
# on a univariate outcome
library(magrittr)
library(dplyr)
library(ggplot2)
library(meshed)

set.seed(2021)

SS <- 12
n <- SS^2 # total n. locations, including missing ones

coords <- expand.grid(xx <- seq(0,1,length.out=SS), xx) %>% 
  as.matrix()

# generate data
sigmasq <- 2.3
phi <- 6
tausq <- .1
B <- c(-1,.5,1)

CC <- sigmasq * exp(-phi * as.matrix(dist(coords)))
LC <- t(chol(CC))
w <- LC %*% rnorm(n)
p <- length(B)
X <- rnorm(n * p) %>% matrix(ncol=p)
y_full <- X %*% B + w + tausq^.5 * rnorm(n)

set_missing <- rbinom(n, 1, 0.1)

simdata <- data.frame(coords,
                      y_full = y_full,
                      w_latent = w) %>%
  mutate(y_observed = ifelse(set_missing==1, NA, y_full))

# MCMC setup
mcmc_keep <- 500
mcmc_burn <- 100
mcmc_thin <- 2

y <- simdata$y_observed
ybar <- mean(y, na.rm=TRUE)

meshout <- spmeshed(y-ybar, X, coords,
                    axis_partition=c(4,4),
                    n_samples = mcmc_keep, 
                    n_burn = mcmc_burn, 
                    n_thin = mcmc_thin, 
                    prior=list(phi=c(1,15)),
                    verbose = 0,
                    n_threads = 1)

# posterior means
best_post_mean <- meshout$beta_mcmc %>% apply(1:2, mean)
  
# process means
wmesh <- data.frame(w_mgp = meshout$w_mcmc %>% summary_list_mean())
# predictions
ymesh <- data.frame(y_mgp = meshout$yhat_mcmc %>% summary_list_mean())

outdf <- 
  meshout$coordsdata %>% 
  cbind(ymesh, wmesh) 

# plot predictions
pred_plot <- outdf %>% 
  ggplot(aes(Var1, Var2, color=y_mgp)) +
  geom_point() +
  scale_color_viridis_c()

# plot latent process
latent_plot <- outdf %>% 
  ggplot(aes(Var1, Var2, color=w_mgp)) +
  geom_point() + 
  scale_color_viridis_c()

# estimation of regression coefficients
plot(density(meshout$beta_mcmc[1,1,]))
abline(v=B[1], col="red")

meshed documentation built on Sept. 20, 2022, 1:06 a.m.