\newcommand{\bm}[1]{\boldsymbol{#1}} \newcommand{\tx}[1]{\mathrm{#1}} \newcommand{\xx}{{\bm{x}}} \newcommand{\yy}{{\bm{y}}} \newcommand{\XX}{{\bm{X}}} \newcommand{\YY}{{\bm{Y}}} \newcommand{\ZZ}{{\bm{Z}}} \newcommand{\bbe}{{\bm{\beta}}} \newcommand{\tth}{{\bm{\theta}}} \newcommand{\pps}{{\bm{\psi}}} \newcommand{\uu}{{\bm{u}}} \newcommand{\BB}{{\bm{B}}} \newcommand{\TTh}{{\bm{\Theta}}} \newcommand{\SSi}{{\bm{\Sigma}}} \newcommand{\VV}{{\bm{V}}} \newcommand{\iid}{{\overset{iid}{\sim}}} \newcommand{\ind}{{\overset{ind}{\sim}}} \newcommand{\cov}{{\operatorname{Cov}}} \newcommand{\obs}{{\tx{observed}}} \newcommand{\normal}{\operatorname{Normal}}

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.cap = "",
  fig.path = "SpatialGEV_figures/",
  cache.path = "SpatialGEV_cache/",
  cache = TRUE,
  tidy = "styler"
)

Introduction to the GEV-GP Model

The generalized extreme value (GEV) distribution is often used to analyze sequences of maxima within non-overlapping time periods. An example of this type of data is the monthly maximum rainfall levels recorded over years at a weather station. Since there are typically a large number of weather stations within a country or a state, it is more ideal to have a model that can borrow information from nearby weather stations to increase inference and prediction accuracy. Such spatial information is often pooled using the Gaussian process.

The GEV-GP model is a hierarchical model with a data layer and a spatial random effects layer. Let $\xx_1, \ldots, \xx_n \in \mathbb{R}^2$ denote the geographical coordinates of $n$ locations, and let $y_{ik}$ denote the extreme value measurement $k$ at location $i$, for $k = 1, \ldots, n_i$. The data layer specifies that each observation $y_{ik}$ has a generalized extreme value distribution, denoted by $y \sim \tx{GEV}(a, b, s)$, whose CDF is given by \begin{equation} F(y\mid a, b, s) = \begin{cases} \exp\left{-\left(1+s\frac{y-a}{b}\right)^{-\frac{1}{s}}\right} \ \ &s\neq 0,\ \exp\left{-\exp\left(-\frac{y-a}{b}\right)\right} \ \ &s=0, \end{cases} \label{eqn:gev-distn} \end{equation} where $a\in\mathbb{R}$, $b>0$, and $s\in\mathbb{R}$ are location, scale, and shape parameters, respectively. The support of the GEV distribution depends on the parameter values: $y$ is bounded below by $a-b/s$ when $s>0$, bounded above by $a-b/s$ when $s<0$, and unbounded when $s=0$. To capture the spatial dependence in the data, we assume some or all of the GEV parameters in the data layer are spatially varying. Thus they are introduced in the model as Gaussian process (GP) random effects.

A Gaussian process $z(\xx)\sim \mathcal{GP}(\mu(\xx), k(\xx, \xx'))$ is fully characterized by its mean $\mu(\xx)$ and kernel function $k(\xx, \xx') = \cov( z(\xx), z(\xx') )$, which captures the strength of the spatial correlation between locations. The mean function is typically modelled as $\mu(\xx) = \bm{c}(\xx)^T \bm{\beta}$, where $\bm{c}(\xx)=(c_1(\xx),\ldots,c_p(\xx))$ is a known function of the spatial covariates. We assume that given the locations, the data follow independent GEV distributions each with their own parameters. The complete GEV-GP hierarchical model then becomes \begin{equation} \begin{aligned} y_{ik} \mid a(\xx_i), b(\xx_i), s & \ind \tx{GEV}\big( a(\xx_i), \exp( b(\xx_i) ), \exp(s(\xx_i))\big)\ a(\xx) \mid \bm{\beta}_a, \tth_a &\sim \mathcal{GP}\big( \bm{c}_a(\xx)^T \bm{\beta}_a, k(\xx, \xx' \mid \tth_a) \big)\ log(b)(\xx) \mid \bm{\beta}_b, \tth_b &\sim \mathcal{GP}\big( \bm{c}_b(\xx)^T \bm{\beta}_b, k(\xx, \xx' \mid \tth_b) \big)\ s(\xx) \mid \bm{\beta}_s, \tth_s &\sim \mathcal{GP}\big( \bm{c}_s(\xx)^T \bm{\beta}_s, k(\xx, \xx' \mid \tth_a) \big). \end{aligned} \label{eqn:gev-gp-model} \end{equation} SpatialGEV supports three types of kernel functions:

What Does SpatialGEV Do?

The package provides an interface to estimate the approximate joint posterior distribution of the spatial random effects $a$, $b$ and $s$ in the GEV-GP model. The main package functions are:

A fundamental application of GEV-GP modeling is the prediction of return levels. For a given spatial location $\xx$, the return level $p$ is defined as the $(1-p) \times 100\%$ quantile of the GEV distribution at that location: \begin{equation} \begin{aligned} z_p(\xx) & = z_p(a(\xx), b(\xx), s(\xx)) \ & = F^{-1}(1-p \mid a(\xx), b(\xx), s(\xx)) \ & = a(\xx) - \exp(b(\xx) - s(\xx)) \left{1 - [-\log(1-p)]^{-\exp(s(\xx))}\right}. \end{aligned} \label{eq:return} \end{equation} Each of the functions above can be used to estimate the posterior distribution of return levels $p(z_p(\XX^\star) \mid \YY)$ at arbitary locations $\XX^\star$ (observed or new).

Details about the approximate posterior inference can be found in @chen-etal21.

Installation

SpatialGEV depends on the package TMB to perform the Laplace approximation. Make sure you have TMB installed as described here before installing SpatialGEV. Moreover, SpatialGEV uses several functions from the INLA package for approximating the Matérn covariance with the SPDE representation and for creating meshes on the spatial domain. If the user would like to use the SPDE approximation (highly recommended for larger datasets), please first install INLA. Since INLA is not on CRAN, it needs to be installed as described here.

Once TMB and INLA have been installed, one can install the CRAN version of SpatialGEV via the following command:

install.packages("SpatialGEV")

To install the latest version of SpatialGEV, run the following:

devtools::install_github("meixichen/SpatialGEV")

Using the SpatialGEV Package

Exploratory analysis

We now demonstrate how to use this package through a simulation study. The simulated data used in this example is shipped with the package as a list variable simulatedData2, which contains the following:

a, logb and logs are simulated from Gaussian random fields using the R package SpatialExtremes. Using the simulated GEV parameters, we generate 50 to 70 observed data $\yy_i$ at each location $i, \ i=1,\ldots,400$.

library(SpatialGEV)
library(fields) # for plots
library(evd) # for rgev() and qgev()
source("vignette-helper-functions.R") # a few plotting functions for vignette
set.seed(123)
a <- simulatedData2$a
logb <- simulatedData2$logb
logs <- simulatedData2$logs
locs <- simulatedData2$locs
n_loc <- nrow(locs)
y <- Map(rgev, n=sample(50:70, n_loc, replace=TRUE),
         loc=a, scale=exp(logb), shape=exp(logs))

Spatial variation of $a$, $\log(b)$ and $\log(s)$ can be viewed by plotting them on regular lattices:

par(mfrow=c(1,3))
grid_plot(
  x = unique(locs$x), y = unique(locs$y), z = matrix(a, ncol=sqrt(n_loc)), 
  title = "Spatial variation of a"
)
grid_plot(
  x = unique(locs$x), y = unique(locs$y), z = matrix(logb, ncol=sqrt(n_loc)),  
  title = "Spatial variation of log(b)",
  y_lab=""
)
grid_plot(
  x = unique(locs$x), y = unique(locs$y), z = matrix(logs, ncol=sqrt(n_loc)), 
  title = "Spatial variation of log(s)",
  y_lab=""
)

Number of observations at each location is shown in the figure below.

barplot(
  height = sapply(y, length), 
  xlab = "Location",
  ylab = "Number of observations at each location",
  main = "Summary of number of observations per location"
)

Below are histograms of observations at $8$ randomly sampled locations.

set.seed(123)
n_sam <-8
sam_inds <- sample(1:n_loc, n_sam, replace=FALSE)
par(mfrow=c(2, n_sam/2))
for(i in sam_inds) {
  obs_i <- y[[i]]
  hist(
    x = obs_i,
    breaks=8,
    xlab="Observation value",
    main=paste("Observations at location", i)
  )
}

Model fitting

SpatialGEV adopts a Bayesian approach based on the Laplace approximation for model inference, which is described in @chen-etal21. The function spatialGEV_fit() in the SpatialGEV package is used to find the mode and quadrature of the Laplace approximation to the marginal hyperparameter posterior, which is the basis of a Normal approximation to the posterior of both hyperparameters and random effects referred to as Laplace-MQ in @chen-etal21. The function spatialGEV_sample() is then used to sample from this distribution via a sparse representation of its precision matrix.

SpatialGEV uses a Lebesgue prior $\pi(\bm{B}, \TTh) \propto 1$ on the GP hyperparameters by default. However, informative priors for these parameters can be specified as detailed below.

To fit the GEV-GP model to this simulated dataset, the first step is calling the spatialGEV_fit() function, for which several arguments must be provided:

There are other arguments which user might want to specify to override the defaults:

The code below fits a GEV-GP model with Matérn SPDE kernel to the simulated data. The shape parameter is constrained to be positive. No covariates are included in this model, so by default an intercept parameter $\beta_0$ is estimated for the GP of each spatial random effect. We also specify the return levels of interest at 0.5 and 0.9.

fit <- spatialGEV_fit(
  data = y,
  locs = locs,
  random = "abs",
  init_param = list(
    a = rep(60, n_loc),
    log_b = rep(2,n_loc),
    s = rep(-3,n_loc),
    beta_a = 60, beta_b = 2, beta_s = -2,
    log_sigma_a = 1.5, log_kappa_a = -2,
    log_sigma_b = 1.5, log_kappa_b = -2,
    log_sigma_s = -1, log_kappa_s = -2
  ),
  reparam_s = "positive", kernel="spde", 
  return_levels = c(0.5, 0.9),
  silent = TRUE
)
class(fit)
print(fit)

The point estimates and the associated standard errors (calculated via the Delta method) for the first 5 locations can be obtained as follows.

fit_summary <- summary(fit)
fit_summary$return_levels[1:5,]

Posterior sampling

To obtain posterior samples of $a(\XX)$, $b(\XX)$, and $s(\XX)$, we pass the fitted model object fit to spatialGEV_sample(), which takes in three arguments:

The following line of code draws 2000 samples from the posterior distribution and the posterior predictive distribution:

sam <- spatialGEV_sample(model = fit, n_draw = 2000, observation = TRUE)
print(sam)

Then use summary() to view the summary statistics of the posterior samples.

pos_summary <- summary(sam)
pos_id <- c(
  1:5, 
  (n_loc+1):(n_loc+5), 
  (2*n_loc+1):(2*n_loc+5),
  (3*n_loc):(nrow(pos_summary$param_summary))
)
pos_summary$param_summary[pos_id,]
pos_summary$y_summary[1:5,]

Model checking

Since we know the true values of $a(\XX)$, $b(\XX)$, and $s(\XX)$ in this simulation study, we are able to compare the posterior mean with the true values. The posterior means of $a$, $b$ and $s$ at different locations are plotted against the true values below.

par(mfrow=c(1,3))
plot(
  x = a,
  y = pos_summary$param_summary[1:n_loc,"mean"], 
  xlim=c(55, 63), ylim=c(55,63),
  xlab="True a",
  ylab="Posterior mean of a",
  main="True vs Posterior Mean of a"
)
abline(0, 1, col="blue", lty="dashed")
plot(
  x = exp(logb),
  y = exp(pos_summary$param_summary[(n_loc+1):(2*n_loc),"mean"]), 
  xlim=c(17, 24), ylim=c(17, 24),
  xlab="True b",
  ylab="Posterior mean of b",
  main="True vs Posterior Mean of b"
)
abline(0, 1, col="blue", lty="dashed")
plot(
  x = exp(logs),
  y = exp(pos_summary$param_summary[(2*n_loc+1):(3*n_loc),"mean"]), 
  xlim=c(0, 0.5), ylim=c(0, 0.5),
  xlab="True s",
  ylab="Posterior mean of s",
  main="True vs Posterior Mean of s"
)
abline(0, 1, col="blue", lty="dashed")

Including covariates

The example below demonstrates how to include covariates in the model. The simulated data used here comes with the package as a list variable simulatedData_withcov, where $s$ is a fixed effect, $a_i$ and $b_i$ at each location $i$ are generated from functions that depend linearly on coefficients $\boldsymbol{\beta}_a$ and $\boldsymbol{\beta}_b$, and $y_i$ is simulated from a GEV distribution conditional on $a_i$, $b_i$ and $s$.

The simulated data are generated with two covariates, with $a_i$ depending on both covariates and $b_i$ depending on only the first covariate. The true values of $\boldsymbol{\beta}_a$ and $\boldsymbol{\beta}_b$ are given below, where the first elements both correspond to the intercept term. covar is a n_loc x 3 design matrix with the first column being all 1s. The GEV-GP model supported by SpatialGEV assumes an intercept is always included for each random effect, so it is important to make sure the first column of the design matrix (X_a, X_b, X_s) is all 1s.

y_withcov <- simulatedData_withcov$y
locs <- simulatedData_withcov$locs
covar <- simulatedData_withcov$covariates
n_loc <- nrow(locs)
beta_a <- simulatedData_withcov$beta_a
beta_b <- simulatedData_withcov$beta_b
print(dim(covar))
print(head(covar))
print(beta_a)
print(beta_b)

We include the covariate matrix for random effects $a_i$ and $b_i$ by providing the covar variable to both X_a and X_b arguments. For priors on $\boldsymbol{\beta}$, the package only supports specifying the same Normal prior for all elements of $\boldsymbol{\beta}_u$ for each of $u=a,b,s$. Here, a Normal prior $\normal(0, 10^2)$ is imposed on both $\boldsymbol{\beta}_a$ and $\boldsymbol{\beta}_b$.

fit_withcov <- spatialGEV_fit(
  data = y_withcov,
  locs = locs, 
  random = "ab",
  init_param = list(
    a = rep(0, n_loc),
    log_b = rep(0, n_loc), 
    s = 0,
    beta_a = rep(0, ncol(covar)), 
    beta_b = rep(0, ncol(covar)), 
    log_sigma_a = 1, 
    log_kappa_a = -2,
    log_sigma_b = 1, 
    log_kappa_b = -2
  ),
  X_a = covar, X_b = covar,
  beta_prior=list(beta_a=c(0,10), beta_b=c(0,10)),
  reparam_s = "positive", kernel="spde", 
  silent = TRUE
)

Next, we draw 1000 samples from the approximate posterior distribution of the parameters and display the summary statistics for $\boldsymbol{\beta}_a$ and $\boldsymbol{\beta}_b$ and their corresponding simulated (true) values below. While the credible intervals for the intercepts are wide, we see reasonably accurate and precise estimates for the coefficients corresponding to the two covariates.

sam <- spatialGEV_sample(model = fit_withcov, n_draw = 1000, observation = FALSE)
fe_summary <- summary(sam)$param_summary
beta_df <- cbind(fe_summary[rownames(fe_summary) %in% c("beta_a", "beta_b"),],
                 c(beta_a, beta_b, 0))
colnames(beta_df)[7] <- "Simulated_value"
print(beta_df)

Posterior prediction

Finally, we demonstrate how to make predictions at new locations. This is done using the spatialGEV_predict() function, which takes the following arguments:

In this dataset, the GEV parameters $a$ and $\log(b)$ are generated from the surfaces below, and the shape parameter is a constant $\exp(-2)$ across space.

a <- simulatedData$a
logb <- simulatedData$logb
logs <- simulatedData$logs
y <- simulatedData$y
locs <- simulatedData$locs
n_loc <- nrow(locs)
par(mfrow=c(1,2))
grid_plot(
  x = unique(locs$x), y = unique(locs$y), z = matrix(a, ncol=sqrt(n_loc)), 
  title = "Spatial variation of a"
)
grid_plot(
  x  = unique(locs$x), y = unique(locs$y), z = matrix(logb, ncol=sqrt(n_loc)), 
  title = "Spatial variation of log(b)",
  y_lab=""
)

We randomly sample 50 locations from the simulated dataset simulatedData as test locations which are left out. Data from the rest 350 training locations are used for model fitting. We fit a GEV-GP model with random $a$ and $b$ and log-transformed $s$ to the training dataset. The kernel used here is the SPDE approximation to the Matérn covariance function [@lindgren-etal11].

set.seed(123)
n_test <- 50
n_train <- n_loc - n_test
test_ind <- sample(1:n_loc, n_test)
train_ind <- setdiff(1:n_loc, test_ind)
# Obtain coordinate matrices and data lists
locs_test <- locs[test_ind,]
y_test <- y[test_ind]
locs_train <- locs[-test_ind,]
y_train <- y[-test_ind]

# Fit the GEV-GP model to the training set
train_fit_s <- spatialGEV_fit(
  data = y_train,
  locs = locs_train,
  random = "ab",
  init_param = list(
    a = rep(0, n_train),
    log_b = rep(0,n_train), s = 0,
    beta_a = 0, beta_b = 0,
    log_sigma_a = 1, log_kappa_a = -2,
    log_sigma_b = 1, log_kappa_b = -2
  ),
  reparam_s = "positive",
  kernel="spde",
  silent = TRUE
)

The fitted model object is passed to spatialGEV_predict() for 500 samples from the posterior predictive distributions. Note that this might take some time.

pred_s <- spatialGEV_predict(model = train_fit_s, locs_new = locs_test, n_draw = 500)
pred_s

Then we call summary() on the pred object to obtain summary statistics of the posterior predictive samples at the test locations.

pred_summary <- summary(pred_s)
pred_summary[1:5,]

Since we have the true observations at the test locations, we can compare summary statistics of the true observations to those of the posterior predictive distributions. In the figures below, each circle represents a test location.

par(mfrow=c(1,2))
plot(sapply(y_test, mean), pred_summary[,"mean"], 
     xlim=c(2.5, 6), ylim=c(2.5, 6),
     xlab="Test statistic from test data",
     ylab="Test statistic from predictive distribution",
     main="Test statistic = mean")
abline(0, 1, col="blue", lty="dashed")
plot(
  x = sapply(y_test, function(x) {
    quantile(x, probs=0.975)
  }),
  y = pred_summary[,"97.5%"], 
  xlim=c(4, 13), ylim=c(4, 13),
  xlab="Test statistic from test data",
  ylab="Test statistic from predictive distribution",
  main="Test statistic = 97.5% quantile"
)
abline(0, 1, col="blue", lty="dashed")

Case study: Yearly maximum snowfall data in Ontario, Canada

In this section, we show how to use the SpatialGEV package to analyze a real dataset. The data used here are the 1987-2021 monthly total snowfall data (in cm) obtained from Environment and Natural Resources, Government of Canada. The link to download the raw data is https://climate-change.canada.ca/climate-data/#/monthly-climate-summaries. This dataset is automatically loaded with the package and is named ONsnow.

library(dplyr)
#library(maps)
lon_range <- c(-96, -73)
lat_range <-  c(41.5, 55)
summary(ONsnow)
maps::map(xlim = lon_range, ylim = lat_range)
points(ONsnow$LONGITUDE, ONsnow$LATITUDE)

Data preprocessing

We first grid the data using cells of length $0.5^{\circ}$. By doing this, weather stations that are apart by less than $0.5^{\circ}$ in longitude/latitude are grouped together in the same grid cell. From now on, we refer to each grid cell as a location.

grid_locs <- grid_location(ONsnow$LONGITUDE, ONsnow$LATITUDE,
                           sp.resolution = 0.5)
data_grid <- cbind(grid_locs, ONsnow)
data_grid[1:5,]

For each location, we find the maximum snowfall amount each year and only keep locations where there are at least two years of records.

# Yearly max for each location
all_locs <- data_grid %>% 
  select(cell_ind, cell_lon, cell_lat) %>%
  distinct() 
yearly_max_records <- data_grid %>% 
  group_by(cell_ind, LOCAL_YEAR) %>% 
  slice(which.max(TOTAL_SNOWFALL)) %>%
  select(cell_ind, LOCAL_YEAR, LOCAL_MONTH, TOTAL_SNOWFALL) %>% 
  rename(YEARLY_MAX_SNOWFALL = TOTAL_SNOWFALL) %>%
  filter(YEARLY_MAX_SNOWFALL > 0) %>% # Remove records of 0s 
  left_join(all_locs, by="cell_ind")

# Coordinates of the locations
locs <- yearly_max_records %>% ungroup() %>% 
  select(cell_ind, cell_lon, cell_lat) %>% 
  distinct()
n_loc <- nrow(locs)

# Make data into a list in which each vector contains data from one location
Y <- vector(mode="list", length=n_loc)
for (i in 1:n_loc){
  id <- locs$cell_ind[i]
  Y[[i]] <- yearly_max_records %>% 
    ungroup() %>%
    filter(cell_ind==id) %>% 
    pull(YEARLY_MAX_SNOWFALL)
}

# Only keep locations with at least 2 years of records
chosen_loc_ind <- which(sapply(Y, length) >= 2)
Y <- Y[chosen_loc_ind]
locs <- locs %>% select(cell_lon, cell_lat) %>% slice(chosen_loc_ind)
n_loc <- nrow(locs)
maps::map(xlim = lon_range, ylim = lat_range)
points(locs$cell_lon, locs$cell_lat)

Now we fit the GEV-GP model to the data using the SPDE kernel. Both $a$ and $b$ are treated as spatial random effects. $s$ is constrained to be a positive constant. Note that here we have specified a $\normal(-5,5)$ prior on the log-transformed shape parameter. This is because we found that the shape parameter is estimated close to 0 and such a prior ensures model fitting procedure is numerically stable.

fit <- spatialGEV_fit(
  data = Y,
  locs = locs,
  random="ab",
  init_param = list(
    a=rep(40, n_loc),
    log_b=rep(3, n_loc),
    s=-5,
    beta_a=30, beta_b=3,
    log_sigma_a=1, log_kappa_a=-3,
    log_sigma_b=-2, log_kappa_b=-3
  ),
  reparam_s="positive",
  kernel="spde", 
  beta_prior=list(beta_a=c(0,100),beta_b=c(0,100)),
  s_prior=c(-5,5),
  silent=TRUE
)
fit

Next, 2000 samples are drawn from the joint posterior distribution of all parameters.

sam <- spatialGEV_sample(fit, n_draw=2000, observation=F)

Instead of using the return_levels argument in spatialGEV_fit(), we can also use the posterior samples to calculate the 5-year return level posterior samples, which are the upper $1/5$% quantiles of the extreme value distributions at these locations.

#library(evd)
p_draws <- sam$parameter_draws
# Get indices of each GEV parameter in the sample matrix
a_ind <- which(colnames(p_draws)%in% paste0("a", 1:n_loc))
logb_ind <- which(colnames(p_draws)%in% paste0("log_b", 1:n_loc))
logs_ind <- which(colnames(p_draws) == "s")

# Define a function to draw from the return level posterior distribution
rl_draw <- function(return_period) {
  apply(p_draws, 1, function(all_draw) {
    mapply(
      FUN=evd::qgev,
      p=1/return_period, 
      loc=all_draw[a_ind], 
      scale=exp(all_draw[logb_ind]),
      shape=exp(all_draw[logs_ind]),
      lower.tail=FALSE
    )
  })
}

# 5 year return-level
q5_draws <- rl_draw(5)
return5 <- apply(q5_draws, 1, mean)
return5sd <- apply(q5_draws, 1, sd)

Plotted below are the return levels.

par(mfrow=c(1,2))
map_plot(return5, title="5-year Return levels")
map_plot(return5sd, title="SD of 5-year Return levels")

References



meixichen/SpatialGEV documentation built on Nov. 10, 2024, 12:23 a.m.