spamtree: Bayesian Spatial Multivariate Tree GP Regression
In spamtree: Spatial Multivariate Trees

Description Usage Arguments Details Value Author(s) References Examples

View source: R/spamtree_fit.R

Bayesian linear multivariate spatial regression using SpamTrees.

spamtree(y, x, coords,
        mv_id = rep(1, length(y)),
        cell_size = 25,
        K = rep(2, ncol(coords)),
        start_level = 0,
        tree_depth = Inf,
        last_not_reference = TRUE,
        limited_tree = FALSE,
        cherrypick_same_margin = TRUE,
        cherrypick_group_locations = TRUE,
        mvbias = 0,
        mcmc = list(keep = 1000, burn = 0, thin = 1),
        num_threads = 4,
        verbose = FALSE, 
        settings = list(adapting = TRUE, mcmcsd = 0.01, 
          debug = FALSE, printall = FALSE),
        prior = list(set_unif_bounds = NULL, 
          btmlim = NULL, toplim = NULL, vlim = NULL),
        starting = list(beta = NULL, tausq = NULL, theta = NULL, w = NULL),
        debug = list(sample_beta = TRUE, sample_tausq = TRUE, 
          sample_theta = TRUE, sample_w = TRUE, 
          sample_predicts = TRUE)
)

`y`	vector of outcomes of size n. Correspondingly, `y[i]` is the observation of outcome `mv_id[i]` at location `coords[i,]` and with covariates `x[i,]`. This means that if the number of outcomes is q>1 then these are all stacked in a vector, and their integer ID is stored in `mv_id`.
`x`	matrix of covariates with dimension (n, p).
`coords`	matrix of coordinates with dimension (n, 2).
`mv_id`	integer vector of outcome IDs of size n with values in \{1, …, q\}.
`cell_size`	integer number of knots for each node in the treed DAG. Defaults to 25. This is a target number and some nodes may include more or less locations. Here, knots can only be chosen among observed locations.
`K`	integer vector of dimension 2 indicating the number of intervals for axis-parallel recursive partitioning. Each tree level will thus have `prod(K)` times as many partitions as the previous. Defaults to `c(2,2)`, leading to a recursive quadtree.
`start_level`	integer indicating the root level. Example: `start_level=0` means there is 1 root node. `start_level=1` means there are `prod(K)` root nodes.
`tree_depth`	integer indicating the number of branching steps in the tree. Defaults to `Inf` meaning that observed locations will be placed on tree nodes as much as possible.
`last_not_reference`	bool indicating whether to treat the last level of the tree as a reference set. The default value `TRUE` is recommended when `tree_depth=Inf` or whenever only a very small number of observed locations remain at the last level.
`limited_tree`	bool determining whether to use a recursive tree. If `TRUE`, each non-root node has 1 parent and `prod(K)` children. Otherwise, each node at level L (where root nodes have L=0) has L parents.
`cherrypick_same_margin`	bool used only for multivariate outcomes. This determines how to assign parents to leaf nodes. In a SpamTree, outcome j at a new spatial location is assigned the same parent of its nearest neighbor. If `cherrypick_same_margin=TRUE` then the nearest neighbor is searched within the subset of locations for which outcome j was observed. Otherwise, the nearest neighbor is searched within all locations at which any outcome was observed. If outcomes are aligned (all observed at the same locations) and `cherrypick_group_locations=TRUE`, then this setting has minimal or no effect.
`cherrypick_group_locations`	bool used in multivariate settings to determine whether the allocation of knots to tree nodes should treat the q outcomes at location s \in D as either (1) a q dimensional vector observed at 1 location, or (2) one observed outcome at each of q locations (i.e. same spatial location but different outcome index).
`mvbias`	parameter used in settings of multivariate misalignment in which one or more outcomes are observed at a number of locations that is much smaller than others. `mvbias` can be used to disproportionately place the more sparsely observed outcomes near root nodes. This is justified by Prop. 1 in Peruzzi and Dunson (2021).
`mcmc`	list for setting up MCMC. `mcmc$keep` is the number of MCMC samples to be saved, `mcmc$burn` is the number of iterations for burn-in, `mcmc$thin` is the thinning level for the chain. The total number of iterations that will be run is `burn + thin*keep`.
`num_threads`	integer number of OpenMP threads to use within MCMC. Ineffective if source is compiled without OpenMP support.
`verbose`	level of verbosity. All messages are suppressed if `verbose=FALSE` (default). It is useful to set `verbose=TRUE` for data of medium size or long MCMC chains.
`settings`	list with additional settings. `settings$adapting` determines whether to use Robust Adaptive Metropolis algorithm of Vihola (2012). `settings$mcmcsd` is the initial standard deviation for the MCMC proposals before adaptation. `settings$debug` prints some debug messages. `settings$printall` determines whether to print to console at each iteration.
`prior`	setup for prior on θ, which currently only allows to specify the support of independent uniform distributions. See examples. (subject to change).
`starting`	list with starting values for all unknowns. Compatibility checks with `prior` are minimal and incompatible values may result in crashes.
`debug`	list with debug settings. Can be used to turn off parts of MCMC.

This implements the following model (in stacked vector form):

y = X β + w + ε,

where y is a n-dimensional vector of outcomes, X is a matrix of covariates of dimension (n, p) associated to coefficients β, w is a n-dimensional vector storing the realization of a spatial multivariate Gaussian tree w(\cdot) \sim SpamTree_{G}(0, C_{θ}) where G is a treed directed acyclic graph, and where C_{θ}(s, s') is a matrix-valued non-separable cross-covariance function on latent dimensions (see Peruzzi and Dunson (2021), equation 18, and CrossCovarianceAG10) where θ is a vector of unknown parameters. SpamTrees Gaussian processes are a scalable alternative to a spatial multivariate GP. Conditional independence across domain locations is assumed to be determined by the treed graph G, whose sparsity enables more efficient computations for the Gibbs sampler computed with this function. The graph architecture can be customized using inputs of the spamtree function. The example below computes SpamTrees on univariate data. A vignette exists with bivariate misaligned spatial data.

`coords`	reordered spatial coordinates
`coordsinfo`	reordered spatial coordinates plus partitioning information.
`mv_id`	reordered outcome IDs.
`w_mcmc`	posterior sample of the spatial random effect. This is a list of length `mcmc$thin` whose elements are `n`-dimensional vectors of multivariate spatial random effects whose q margins are listed in `mv_id` as output here.
`yhat_mcmc`	posterior predictive sample. This is a list of length `mcmc$thin` whose elements are `n`-dimensional vectors of predictions whose q margins are listed in `mv_id` as output here.
`beta_mcmc`	array of size `c(p, mcmc$keep, q)` with posterior samples of the regression coefficients on each outcome. Example: `beta_mcmc[2,,1]` is the posterior sample for the second regressor on the first outcome.
`tausq_mcmc`	matrix with posterior samples of the q nuggets, one for each outcome.
`theta_mcmc`	matrix with posterior samples of the cross-covariance parameters. These include the latent distance between outcomes which may be poorly identifiable.
`mcmc_time`	elapsed clock time for MCMC.

Michele Peruzzi michele.peruzzi@duke.edu,
David B. Dunson dunson@duke.edu

Peruzzi, M. and Dunson, D. B. (2021) Spatial Multivariate Trees for Big Data Bayesian Regression. https://arxiv.org/abs/2012.00943

Vihola, M. (2012) Robust adaptive Metropolis algorithm with coerced acceptance rate. Statistics and Computing, 22:997-1008. doi: 10.1007/s11222-011-9269-5

# toy example with tiny dataset and short MCMC
# on a univariate outcome

library(magrittr)
library(dplyr)
library(ggplot2)
library(spamtree)

set.seed(2021)

SS <- 15 
n <- SS^2 # total n. locations, including missing ones

coords <- data.frame(Var1=runif(n), Var2=runif(n)) %>%
  as.matrix()

# generate data
sigmasq <- 2.3
phi <- 6
tausq <- .1
B <- c(-1,.5,1)

CC <- sigmasq * exp(-phi * as.matrix(dist(coords)))
LC <- t(chol(CC))
w <- LC %*% rnorm(n)
p <- length(B)
X <- rnorm(n * p) %>% matrix(ncol=p)
y <- X %*% B + w + tausq^.5 * rnorm(n)

set_missing <- rbinom(n, 1, 0.1)

simdata <- data.frame(coords,
                      y_full = y,
                      w_latent = w) %>%
  mutate(y_observed = ifelse(set_missing==1, NA, y_full))

# MCMC setup
mcmc_keep <- 300
mcmc_burn <- 300
mcmc_thin <- 1

ybar <- mean(y, na.rm=TRUE)

# fit spamtree with defaults
spamtree_done <- spamtree(y - ybar, X, coords, 
          mcmc = list(keep=mcmc_keep, burn=mcmc_burn, thin=mcmc_thin), 
          num_threads = 1)

# predictions
y_out <- spamtree_done$yhat_mcmc %>% 
  abind::abind(along=3) %>% `[`(,1,) %>% 
  add(ybar) %>% apply(1, mean)
w_out <- spamtree_done$w_mcmc %>% 
  abind::abind(along=3) %>% `[`(,1,) %>% 
  apply(1, mean)

outdf <- spamtree_done$coordsinfo %>% 
  cbind(data.frame(w_spamtree = w_out, 
                   y_spamtree = y_out)) %>%
  left_join(simdata)

# plot predictions
pred_plot <- outdf %>% 
  ggplot(aes(Var1, Var2, color=y_spamtree)) +
  geom_point() +
  scale_color_viridis_c()

# plot latent process
latent_plot <- outdf %>% 
  ggplot(aes(Var1, Var2, color=w_spamtree)) +
  geom_point() + 
  scale_color_viridis_c()

# estimation of regression coefficients
plot(density(spamtree_done$beta_mcmc[1,,1]))
abline(v=B[1], col="red")