ggplot_installed <- require("ggplot2", quietly = TRUE)
dplyr_installed <- require("dplyr", quietly = TRUE)
spatstat_installed <- require("spatstat.data", quietly = TRUE)
pkgs <- ggplot_installed && spatstat_installed && dplyr_installed

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.asp = 0.618,
  eval = identical(Sys.getenv("NOT_CRAN"), "true") && pkgs
)
library(sdmTMB)
library(dplyr)
library(ggplot2)
library(spatstat.data)

Here we will cover using sdmTMB for forecasting in time and extrapolating spatially to unsampled areas. These approaches have multiple applications, including:

- forecasting: predicting for future, unobserved time slices
- interpolating over missed time slices within the sampled period
- interpolating in space to unsampled areas within the survey domain
- extrapolating in space beyond the survey domain

Predicting for future time and interpolating over missed time slices require the same approach, so we cover them together. To forecast in time, whether for future or missed years, we need an explicit model for time. For example, we cannot model years as factors here, because the model would have no way to assign a value to a factor level (year) without data.
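As a quick illustration with the built-in pcod Pacific cod data (used throughout this article), prediction from a year-as-factor model (using lm() for brevity rather than a full sdmTMB model) fails for a year that was never observed:

# a year-as-factor model has no coefficient for unobserved years,
# so predicting to 2006 (a missed survey year) errors
m <- lm(density ~ 0 + as.factor(year), data = pcod)
try(predict(m, newdata = data.frame(year = 2006))) # error: new factor level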

The options for including time in the model include:

- an AR(1) spatiotemporal field
- a random walk spatiotemporal field
- a random walk intercept (time-varying intercept) combined with AR(1) fields
- a smoother on year combined with AR(1) fields

We will use the Pacific cod data to show how to implement each of these options.

First, we need to make our mesh.

mesh <- make_mesh(pcod, c("X", "Y"), cutoff = 20)
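We can inspect the mesh before fitting:

plot(mesh)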

Next, we extract the years with observations. The DFO survey in this region only sampled the years 2003, 2004, 2005, 2007, 2009, 2011, 2013, 2015, and 2017.

data <- pcod
years <- unique(data$year)

To use the model to fill in Pacific cod density for the unsampled years, we create a vector of the missed years between 2003 and 2017. To predict into the future, we also add years after the observed data (i.e., after 2017), for however many years we want to forecast; here, 2018--2025. We name this vector of extra years extra_years.

extra_years <- c(
  2006, 2008, 2010, 2012, 2014, 2016, # missing years
  2018:2025 # predicted future years
)
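Equivalently, the missed years can be derived from the data rather than typed by hand (a small sketch using the years vector defined above):

missing_years <- setdiff(seq(min(years), max(years)), years)
extra_years <- c(missing_years, 2018:2025)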

Then, we will fit a model of Pacific cod density that includes depth covariates. The extra_time argument to sdmTMB() is how we request interpolation and forecasting; we also need to set time = "year".

In this example, we will choose to turn off spatial random fields (spatial = "off"), so we are only including spatiotemporal random fields.

We then have different options for including time in the model.

AR(1) spatiotemporal field

To include spatiotemporal variation as an AR(1) process, we can specify spatiotemporal = "AR1":

fit_ar1 <- sdmTMB(
  density ~ depth_scaled + depth_scaled2,
  time = "year",
  extra_time = extra_years, #<< our list of extra years to be included
  spatiotemporal = "AR1", #<< setting an AR(1) spatiotemporal process
  data = pcod,
  mesh = mesh,
  family = tweedie(link = "log"),
  spatial = "off",
  silent = FALSE # monitor progress
)
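Printing the fitted object shows the estimated parameters, including the spatiotemporal AR(1) correlation parameter (rho), which controls how strongly the field reverts towards its mean:

fit_ar1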

Random walk spatiotemporal field

Or, we can set spatiotemporal variation to a random walk with spatiotemporal = "RW":

fit_rw <- sdmTMB(
  density ~ depth_scaled + depth_scaled2,
  time = "year",
  extra_time = extra_years, #<<
  spatiotemporal = "RW", #<<
  data = pcod,
  mesh = mesh,
  family = tweedie(link = "log"),
  spatial = "off",
  silent = FALSE
)

Random walk intercept + AR(1) fields

We can also model the intercept as a random walk by removing the intercept from the main formula (adding 0 to the model equation) and including the argument time_varying = ~1:

fit_rw_ar1 <- sdmTMB(
  density ~ 0 + depth_scaled + depth_scaled2, #<< remove intercept with 0
  time = "year",
  time_varying = ~1, #<< instead include the intercept here as a random walk
  extra_time = extra_years,
  spatiotemporal = "AR1", #<<
  data = pcod,
  mesh = mesh,
  family = tweedie(link = "log"),
  spatial = "off",
  silent = FALSE
)

Smoother on year + AR(1) fields

We can also add a smoother on year to the model formula with s(year), while keeping spatiotemporal = "AR1":

fit_sm <- sdmTMB(
  density ~ s(year) + depth_scaled + depth_scaled2, #<< add smoother on year
  time = "year",
  extra_time = extra_years, #<<
  spatiotemporal = "AR1", #<<
  data = pcod,
  mesh = mesh,
  family = tweedie(link = "log"),
  spatial = "off",
  silent = FALSE
)

Deciding between methods

In deciding which method (AR(1), RW, etc.) to use for including time in the model, it is important to know that:

- an AR(1) spatiotemporal field reverts towards its mean of zero as predictions move away from years with data;
- a random walk field does not revert towards the mean, and its variance keeps growing through time without data;
- a smoother on year, s(year), extrapolates the fitted smooth trend beyond the observed years, which should be treated with caution.
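One quick quantitative comparison (useful alongside, not instead of, inspecting forecast behaviour as we do below) is AIC; the models differ in how time enters, so this should be interpreted cautiously:

AIC(fit_ar1, fit_rw, fit_rw_ar1, fit_sm)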

We can also visualize the differences between these methods by comparing predicted density at a single point in space over all observed and forecasted years.

# predict at a single 'average' location and depth for every modelled year
newdf <- data.frame(
  year = unique(fit_ar1$data$year),
  X = mean(pcod$X),
  Y = mean(pcod$Y),
  depth_scaled = mean(pcod$depth_scaled),
  depth_scaled2 = mean(pcod$depth_scaled)^2
)
fits <- list(fit_ar1, fit_rw, fit_rw_ar1, fit_sm)
names(fits) <- c("AR1-fields", "RW-fields", "RW-time, AR1-fields", "s(year), AR1-fields")
set.seed(123)
preds <- lapply(fits, function(.x) predict(.x, newdf, nsim = 50L))
preds_df <- lapply(seq_along(preds), function(i) {
  .x <- preds[[i]]
  reshape2::melt(.x, value.name = "est") %>%
    dplyr::rename(year = Var1, iter = Var2) %>% 
    dplyr::mutate(type = names(fits)[i])
})
pred_df <- lapply(seq_along(fits), function(i) {
  .x <- fits[[i]]
  dplyr::mutate(predict(.x, newdf, se_fit = TRUE), type = names(fits)[i])
})
preds_df <- do.call("rbind", preds_df)
pred_df <- do.call("rbind", pred_df)
set.seed(1922)
iters <- sample(1:max(preds_df$iter), 12L)
ggplot(pred_df, aes(year, exp(est), ymin = exp(est - 2 * est_se), ymax = exp(est + 2 * est_se))) +
  geom_ribbon(alpha = 0.2) +
  geom_line(lwd = 1) +
  scale_y_log10() +
  facet_wrap(vars(type)) +
  geom_vline(xintercept = unique(pcod$year), lty = 2, alpha = 0.2) +
  theme(panel.grid = element_blank()) +
  geom_line(aes(year, exp(est), group = iter),
    alpha = 0.4,
    data = dplyr::filter(preds_df, iter %in% iters), inherit.aes = FALSE, lwd = 0.5
  ) +
  ylab("Fish density (log-distributed axis)") +
  xlab("")

We can see how AR(1) spatiotemporal fields evolve towards mean zero by mapping the spatiotemporal component (epsilon_st) over the entire period 2003--2025:

one_yr <- dplyr::filter(qcs_grid, year == 2017)
# replicate the prediction grid for every modelled year
grid <- do.call(rbind, lapply(unique(fit_ar1$data$year), function(i) {
  one_yr$year <- i
  one_yr
}))
p_ar1 <- predict(fit_ar1, newdata = grid)
ggplot(p_ar1, aes(X, Y, fill = epsilon_st)) +
  geom_raster() +
  facet_wrap(~year) +
  scale_fill_gradient2() +
  coord_fixed()
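We can also quantify this reversion: for example, the mean absolute value of epsilon_st should shrink for the forecast years relative to the sampled years.

p_ar1 %>%
  dplyr::group_by(year) %>%
  dplyr::summarise(mean_abs_eps = mean(abs(epsilon_st)))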

Random walk fields, in contrast, do not revert towards the mean:

p_rw <- predict(fit_rw, newdata = grid)
ggplot(p_rw, aes(X, Y, fill = epsilon_st)) +
  geom_raster() +
  facet_wrap(~year) +
  scale_fill_gradient2() +
  coord_fixed()

We can also visualize how uncertainty in the spatiotemporal field grows in the absence of data by mapping its standard error. The standard error increases for the predicted future years (2018--2025) but is smaller for interpolated years that fall between sampled years (e.g., 2006, 2008). We show this for the AR(1) model, but uncertainty similarly grows away from the data with the random walk.

# approximate the standard error as the SD across simulation draws
eps_est <- predict(fit_ar1, newdata = grid, nsim = 100)
grid$se <- apply(eps_est, 1, sd)
ggplot(grid, aes(X, Y, fill = se)) +
  geom_raster() +
  facet_wrap(~year) +
  coord_fixed() +
  labs(fill = "Standard error\nof spatiotemporal field")

Forecasting in space: interpolating to unsampled areas within the survey domain

We can also interpolate predicted values to unsampled areas within the geographic extent of the data. For this example, we will use the locations of 3605 trees in a 1000 by 500 m rectangular sampling region from the spatstat.data package (the bei dataset).

First, we create a data frame of the x and y coordinates from the tree dataset and map the locations:

dat <- data.frame(
  x = spatstat.data::bei$x,
  y = spatstat.data::bei$y
)
ggplot(dat, aes(x, y)) +
  geom_point(col = "darkblue", alpha = 0.1) +
  coord_cartesian(expand = FALSE)

We first re-format the data to create density observations. We bin the x and y coordinates onto a grid, with the scale value controlling the resolution (increasing the scale value coarsens the grid). We then count the number of trees in each grid cell to create a column n of tree density. Finally, we create the mesh, which we can visualize by plotting.

# scale controls resolution
scale <- 50
dat$x <- scale * floor(dat$x / scale)
dat$y <- scale * floor(dat$y / scale)

dat <- dplyr::group_by(dat, x, y) %>%
  dplyr::summarise(n = n())

mesh <- make_mesh(
  dat,
  xy_cols = c("x", "y"),
  cutoff = 80 # min. distance between knots in X-Y units
)
plot(mesh)

Then, we can fit a model of tree density with only an intercept and a single time slice:

fit <- sdmTMB(n ~ 1,
  data = dat,
  mesh = mesh,
  family = truncated_nbinom2(link = "log")
)
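Before predicting, it can be worth running sdmTMB's sanity() check for common convergence issues:

sanity(fit)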

Next, we can predict to unsampled areas within the geographic extent of our data. We first expand the grid by adding in x and y coordinates between existing coordinates in our dataset. Here, we will add in points at intervals of 5 for x and y. This value controls the resolution of predicted data. Increasing the value will decrease the resolution of spatial predictions.

In this example, we include se_fit = TRUE in the predict function to generate standard errors, though this can slow down computation time.

We can map the predicted tree density at each of the interpolated points, together with the locations of the observed data, to see the increase in resolution we gain by predicting with this method:

# makes all combinations of x and y:
newdf <- expand.grid(
  x = seq(min(dat$x), max(dat$x), 5),
  y = seq(min(dat$y), max(dat$y), 5)
)
p <- predict(fit,
  newdata = newdf, se_fit = TRUE
)

ggplot(p, aes(x, y)) +
  geom_raster(aes(fill = est)) +
  geom_point(data = dat, aes(x, y)) +
  labs(fill = "tree density")
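Note that est is on the log link scale; to map values closer to the natural (count) scale we can fill with exp(est) instead (approximately, since this ignores the truncation adjustment of the truncated negative binomial):

ggplot(p, aes(x, y)) +
  geom_raster(aes(fill = exp(est))) +
  geom_point(data = dat, aes(x, y)) +
  labs(fill = "tree density")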

We can also add the argument nsim = 500 when predicting; the predictions are then returned as a matrix of simulation draws that we can summarize (here, taking the mean) at each location:

p2 <- predict(fit, newdata = newdf, nsim = 500)
newdf$p2 <- apply(p2, 1, mean)
ggplot(newdf, aes(x, y)) +
  geom_raster(aes(fill = p2)) +
  geom_point(data = dat, aes(x, y)) +
  labs(fill = "tree density")
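The same simulation matrix also gives an uncertainty surface: the standard deviation across draws at each location approximates the standard error from se_fit = TRUE and is often faster to compute.

newdf$p2_se <- apply(p2, 1, sd) # SD across draws approximates the SE
ggplot(newdf, aes(x, y, fill = p2_se)) +
  geom_raster() +
  labs(fill = "SD across\nsimulation draws")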

We can also inspect where uncertainty is highest relative to the mesh by plotting the standard errors over it. Uncertainty is higher at the mesh vertices, where there are fewer neighbouring observations:

ggplot() +
  inlabru::gg(mesh$mesh) +
  geom_point(data = p, aes(x = x, y = y, col = est_se)) +
  coord_equal() +
  labs(col = "Standard error\nof prediction")

Extrapolating outside the survey domain

We can also forecast spatially outside the geographic extent of the data; for instance, we can predict into a border area around the sampled region. To do so, we expand the x and y coordinates above and below the extent of the coordinates in the data. Here, we expand the geographic domain by 100 m in all directions and keep the resolution at 5 m.

Then, we can use the same model fit to predict to the expanded geographic domain.

# makes all combinations of x and y:
newdf <- expand.grid(
  x = seq(min(dat$x) - 100, max(dat$x) + 100, 5),
  y = seq(min(dat$y) - 100, max(dat$y) + 100, 5)
)
p3 <- predict(fit,
  newdata = newdf, se_fit = TRUE
)
ggplot(p3, aes(x, y)) +
  geom_raster(aes(fill = est)) +
  geom_point(data = dat, aes(x, y)) +
  labs(fill = "tree density")

