In densitymodelling/dsmextra: A Toolkit for Extrapolation Assessments in Density Surface Models

Preamble

This vignette illustrates the use of the dsmextra R package for quantifying and visualising model extrapolation in environmental space. The package has its roots in density surface modelling (DSM) — as implemented in package dsm [@Miller2015dsm] — but can be easily applied to any kind of spatially-explicit ecological data. The underlying theory behind extrapolation detection is covered at length in @Mesgaran2014 and @King2007. A full description of the package is given in @Bouchet2020.

dsmextra builds upon the functions available in the ecospat [@Broennimann2016] and WhatIf [@Gandrud2017] packages to provide user-friendly tools relevant to the analysis of ecological data, including distance sampling data. Specifically, dsmextra enables an a priori evaluation of environmental extrapolation, when using a model built in a reference system to make predictions in a separate target system (e.g., a different geographical area and/or a past/future time period) [@Sequeira2018].

Here, we showcase the features of dsmextra with a case study on sperm whales (Physeter macrocephalus) in the Mid-Atlantic. The data come from two shipboard line transect surveys conducted by the National Oceanic and Atmospheric Administration (NOAA)'s Northeast and Southeast Fisheries Science Centers in 2004, and represent a subset of the survey effort shown in @Roberts2016 and @Mannocci2017. A basic level of familiarity with density surface modelling is assumed - for technical details, see @Miller2013.

Note that the vignette accompanies a detailed technical report on extrapolation in cetacean density surface models. See @Bouchet2019 for more information.

Installation

The latest development version of dsmextra can be installed from GitHub (this requires the remotes package):

remotes::install_github("densitymodelling/dsmextra", dependencies = TRUE)

The code below loads required libraries and sets some general options.

#'---------------------------------------------
# Other required libraries
#'---------------------------------------------
library(dsmextra)     # Extrapolation toolkit for ecological models
library(raster)       # Geographic data analysis and modeling
library(tidyverse)    # Packages for data science
library(magrittr)     # Pipe operator

#'--------------------------------------------------------------------
# Set tibble options
#'--------------------------------------------------------------------
options(tibble.width = Inf) # All tibble columns shown
options(pillar.neg = FALSE) # No colouring negative numbers
options(pillar.subtle = TRUE)
options(pillar.sigfig = 4)

#'--------------------------------------------------------------------
# Set knitr options
#'--------------------------------------------------------------------
knitr::opts_chunk$set(echo = TRUE)

Overview

This vignette covers all the steps required to undertake an extrapolation assessment, and explains how to:

Quantify extrapolation using compute_extrapolation.
Obtain a summary of extrapolation using summary.
Identify covariates responsible for extrapolation using compare_covariates.
Calculate a neighbourhood metric of extrapolation using compute_nearby.
Generate maps of extrapolation using map_extrapolation.
Run a full assessment using extrapolation_analysis.

The data for this example are included in dsmextra as a list containing information about both (1) the surveyed area (i.e., here, the line transects 'chopped' into segments) and (2) the prediction area. We refer to segs as the reference (or calibration) data/system/conditions, and to predgridas the target data/system/conditions [@Sequeira2018]. Note that predgrid holds the coordinates and covariate values for grid cells over which density surface model predictions are sought (although model fitting is outside the scope of this vignette).

#'---------------------------------------------
# Load and extract the data
#'---------------------------------------------
data("spermwhales")
segs <- spermwhales$segs
predgrid <- spermwhales$predgrid

#'---------------------------------------------
# Quick inspection
#'---------------------------------------------
knitr::kable(head(segs[,c(1, 3:12)]), format = "pandoc")
knitr::kable(head(predgrid), format = "pandoc")

A quick map is useful to assess the distribution of sperm whale sightings throughout the area. (Note: the transect lines (transects) and sightings (obs) are not included in the package, but are shown here for illustrative purposes).

#'---------------------------------------------
# Create an outline of the study area boundaries
#'---------------------------------------------
study_area <- predgrid[, c("x", "y")]
study_area$value <- 1
study_area <- raster::rasterFromXYZ(study_area, crs = sp::CRS("+proj=aea +lat_1=38 +lat_2=30 +lat_0=34 +lon_0=-73 +x_0=0 +y_0=0 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0"))
study_area <- raster::rasterToPolygons(study_area, dissolve = TRUE)
study_area <- sp::spTransform(study_area, CRSobj = sp::proj4string(transects))
study_area <- smoothr::smooth(study_area, method = "ksmooth", smoothness = 5)

#'---------------------------------------------
# Produce a simple plot
#'---------------------------------------------
plot(study_area, col = "lightskyblue1") # Region boundary
plot(transects, add = TRUE, col = "skyblue3") # Survey tracks
maps::map("world", fill = TRUE, col = "grey",
          xlim = range(obs$coords.x1),
          ylim = range(obs$coords.x2), add = TRUE)
pts <- obs # Sightings
coordinates(pts) <- ~ coords.x1 + coords.x2
axis(1); axis(2); box()
points(pts, pch = 16)

First, we define the projected coordinate system appropriate to the study area (i.e., Albers Equal Area, +proj=aea). Our explanatory covariates of interest are: seabed depth (Depth), sea surface temperature (SST), primary productivity (NPP), distance to the nearest canyons and seamounts (DistToCAS), and eddy kinetic energy (EKE) -- see @Roberts2016 for details.

#'---------------------------------------------
# Define projected coordinate system
#'---------------------------------------------
aftt_crs <- sp::CRS("+proj=aea +lat_1=38 +lat_2=30 +lat_0=34 +lon_0=-73 +x_0=0 +y_0=0 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0")

#'---------------------------------------------
# Define environmental covariates of interest
#'---------------------------------------------
covariates.spermwhale <- c("Depth", "SST", "NPP", "DistToCAS", "EKE")

Quantifying extrapolation

To assess extrapolation in environmental space, we can run the extrapolation detection (ExDet) tool proposed by @Mesgaran2014 using the function compute_extrapolation. Note that this tool was originally implemented in the ecospat package, although with more limited capabilities.

The following arguments are required:

samples: reference data
covariate.names: names of covariates of interest
prediction.grid: target data
coordinate.system: projected coordinate system relevant to study location

Here, we store the results in an object called spermwhale.extrapolation. Values are expressed as both the number (n) of grid cells subject to extrapolation, and the proportion (%) of the target system that this represents.

spermwhale.extrapolation <- compute_extrapolation(samples = segs, 
                                   covariate.names = covariates.spermwhale, 
                                   prediction.grid = predgrid,
                                   coordinate.system = aftt_crs)

In this example, the vast majority of prediction grid cells (n = 5,175 -- i.e., 97.92%) are characterised by environmental conditions (defined along the axes of the covariates specified above) similar (aka 'analogue') to those captured in the original sample (segs). A quick check reveals that those results align with our expectations from the data in predgrid.

# Number of cells subject to univariate extrapolation (see below for definition)
predgrid %>% 
  dplyr::filter(!dplyr::between(Depth, min(segs$Depth), max(segs$Depth)) | 
                  !dplyr::between(SST, min(segs$SST), max(segs$SST)) |
                  !dplyr::between(NPP, min(segs$NPP), max(segs$NPP)) |
                  !dplyr::between(DistToCAS, min(segs$DistToCAS), max(segs$DistToCAS)) |
                  !dplyr::between(EKE, min(segs$EKE), max(segs$EKE))) %>% 
  nrow()

In practice, compute_extrapolation returns a list containing both data.frames (extrapolation values) and rasters (spatial representation of these values), with the following structure:

str(spermwhale.extrapolation, 2)

Three types of extrapolation can be identified (Figure 2):

Univariate extrapolation occurs when ExDet values < 0. This is also known as mathematical, strict, or Type 1 extrapolation, and represents conditions outside the range of individual covariates in the reference sample.
Combinatorial extrapolation occurs when ExDet values > 1. This is also known as multivariate or Type 2 extrapolation, and describes novel combinations of values encountered within the univariate range of reference covariates. Such combinations are identified based on the Mahalanobis distance metric (D^2^), a well-known and scale-invariant measure of multivariate outliers [@Rousseeuw1990].
Lastly, values between 0 and 1 denote predictions made in analogue conditions. These correspond to what is commonly referred to as interpolation, although if found in a different region in space or past/future period in time, then the terms geographical/temporal extrapolation are sometimes also used.

knitr::include_graphics("dsmextra_Figure2.png")

compute_extrapolation also determines which covariate makes the largest contribution to extrapolation for any given grid cell, i.e., the most influential covariate (mic) [@Mesgaran2014]. In univariate extrapolation, this is the covariate that leads to the highest negative univariate distance from the initial covariate range. In combinatorial extrapolation, this corresponds to the covariate whose omission (while retaining all others) makes the largest reduction in the Mahalanobis distance to the centroid of the reference data. See @Mesgaran2014 for formulae and technical explanations.

ExDet values can be retrieved from the returned list, as below:

# Example from combinatorial extrapolation
head(spermwhale.extrapolation$data$combinatorial)

Summarising extrapolation

compute_extrapolation also generates a summary table, which can be printed as follows:

summary(spermwhale.extrapolation)

Comparing covariates

The extent and magnitude of extrapolation naturally vary with the type and number of covariates considered. It may be useful, therefore, to test different combinations of covariates to inform their selection a priori, i.e., before model fitting, to support model parsimony.

The compare_covariates function is available for this purpose. It is designed to assess extrapolation iteratively for all combinations of n.covariates, and returns an overview of those that minimise/maximise extrapolation (as measured by the number of grid cells subject to extrapolation in the prediction area).

n.covariates can be supplied as any vector of integers from 1 to p, where p is the total number of covariates available. Its default value of NULL is equivalent to 1:p, meaning that all possible combinations are tested, unless otherwise specified.

Additional arguments include:

extrapolation.type: one of 'univariate', 'combinatorial' or 'both'
extrapolation.object: output list object returned by compute_extrapolation.
create.plots: if TRUE, generates boxplots of results
display.percent: if TRUE, the y-axis of the boxplots is expressed as a percentage of the total number of grid cells in prediction.grid

Optionally, one can bypass compute_extrapolation and run this function directly by specifying the following arguments and leaving extrapolation.object set as NULL:

samples: reference data
covariate.names: names of covariates of interest
prediction.grid: target data
coordinate.system: projected coordinate system relevant to study location

Below is an example for both types of extrapolation (univariate and combinatorial), and the full set of five covariates.

compare_covariates(extrapolation.type = "both",
                   extrapolation.object = spermwhale.extrapolation,
                   n.covariates = NULL,
                   create.plots = TRUE,
                   display.percent = TRUE)

The top graph summarises the extent of univariate (yellow) and combinatorial (blue) extrapolation for all combinations of $1, 2, ..., n$ covariates. The total number of combinations is given above each boxplot as $N_c$. As expected, univariate extrapolation increases with the number of covariates considered.

The bottom graph displays the same results, summarised by covariate of interest. This may help identify any covariates that consistently lead to extrapolation, and should be avoided. Here, none of the covariates stand out as driving extrapolation more than any others, although combinatorial extrapolation is marginally more prevalent with NPP and SST.

Finding nearby data

While extrapolation is often seen as a binary concept (i.e., it either does or does not take place), it is reasonable to expect that predictions made at target points situated just outside the sampled environmental space may be more reliable than those made at points far outside it. The ExDet tool available through compute_extrapolation inherently quantifies this notion of 'distance' from the envelope (solid line) of the reference data (grey circles) (Figure 3A).

However, the multivariate distribution of reference data points is often far from homogeneous. It is possible, therefore, for target points representing analogue conditions to fall within sparsely sampled regions of the reference space; or conversely, for two target points reflecting an equal degree of extrapolation to have very different amounts of reference data within their vicinity [@Mannocci2018; @Virgili2017].

An example of this is shown in Figure 3B, where three target points $x_1$, $x_2$ and $x_3$ are located equally close to the envelope of the reference data. In essence, these reflect identical degrees of extrapolation, as defined under the ExDet framework. However, given the shape of the data cloud in multivariate space, it is clear that predictions made at target point $x_1$ will far 'better informed' by the sample than those made at $x_2$ or $x_3$, as a bigger cluster of reference points lies in their vicinity (green circle around $x_1$).

knitr::include_graphics("dsmextra_Figure3.png")

The notion of neighbourhood (or percentage/proportion of data nearby, hereafter %N) captures this very idea, and provides an additional measure of the reliability of extrapolations in multivariate environmental space [@Mannocci2018; @Virgili2017]. In practice, %N for any target point is taken as the proportion of reference data within a radius of one geometric mean Gower’s distance (G^2^, calculated between all pairs of reference points) of that point [@King2007]. The Gower’s distance between two points $i$ and $j$ defined along the axes of $K$ covariates is calculated as the average absolute distance between the values of these two points in each dimension, divided by the range of the data, such that:

$$G_{ij}^2=\frac{1}{K}\sum_{k=1}^{K}\frac{\left|x_{ik}-x_{jk}\right|}{\textrm{max}(X_k)-\textrm{min}(X_k)}$$

The compute_nearby function is adapted from the code given in @Mannocci2018, and allows the calculation of Gower’s distances (G^2^) as a basis for defining the neighbourhood.

The nearby argument to this function corresponds to the radius within which neighbouhood calculations will occur. It has a default value of 1, as per @Mannocci2018 and @Virgili2017. Increasing its value (e.g., to say, 2) would mean doubling the size of the green circles in Figure 3B, and would lead to larger percentages of data nearby per target point.

Important: The sperm whale dataset used here is sufficiently small that compute_nearby should run in a matter of seconds on most machines. However, applying the function to larger datasets may burn out memory allocation and lead to an R session failure. Additional arguments can be passed to compute_nearby to accommodate such 'big' data. Their use is unnecessary here, but further details can be found in the package documentation.

#'---------------------------------------------
# Calculate Gower's distances and %N
#'---------------------------------------------
spermwhale.nearby <- compute_nearby(samples = segs, 
                                    prediction.grid = predgrid, 
                                    coordinate.system = aftt_crs,
                                    covariate.names = covariates.spermwhale,
                                    nearby = 1)

Visualising extrapolation

Lastly, we can use the map_extrapolation function to create a series of interactive html maps of extrapolation in the prediction area.

Arguments to this function are identical to those of compute_extrapolation, with the exception of:

map.type: output to be shown on the map. One of 'extrapolation' for a map of extrapolation values, 'mic' for a map of the most influential covariates, or 'nearby' for a neighbourhood map (proportion of reference data near each target grid cell, a.k.a %N).
extrapolation.object: output object from compute_extrapolation or compute_nearby.

sightings and tracks are optional arguments, and only plotted if specified. If either is provided as a matrix, then coordinates must be labelled x and y, lest a warning be returned. Tracks can also be given as a SpatialLinesDataFrame object. The resulting maps can be panned, zoomed, and all layers toggled on and off independently. However, note that maps rely on the leaflet package [@Cheng2018], and thus require an internet connection; they will not work offline. Facilities to save map tiles locally for offline viewing are currently not implemented but may be rolled into a future version of the code. Finally, note that prediction.grid is only used here to set the map extent when plotting.

#'---------------------------------------------
# Rename coordinates and convert to SpatialPointsdf
#'---------------------------------------------
obs.sp <- obs %>% 
  dplyr::rename(., x = coords.x1, y = coords.x2) %>% 
  sp::SpatialPointsDataFrame(coords = cbind(.$x, .$y), data = ., proj4string = sp::CRS("+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0")) %>% 
  sp::spTransform(., CRSobj = aftt_crs)

Here's a map of extrapolation (ExDet) values:

map_extrapolation(map.type = "extrapolation", 
                  extrapolation.object = spermwhale.extrapolation,
                  sightings = obs.sp, 
                  tracks = transects)

Most influential covariates (MIC):

map_extrapolation(map.type = "mic",
                  extrapolation.object = spermwhale.extrapolation,
                  sightings = obs.sp, 
                  tracks = transects)

Proportion of data nearby (%N):

map_extrapolation(map.type = "nearby",
                  extrapolation.object = spermwhale.nearby,
                  sightings = obs.sp, 
                  tracks = transects)

Full analysis

For simplicity, all the above analyses can be performed in a single step by calling the wrapper function extrapolation_analysis. Most arguments are derived from the individual functions presented above, and should thus be self-explanatory. Note, however, that specific parts of the analysis can be run/bypassed using the following arguments:

compare.covariates = if TRUE, run compare_covariates
nearby.compute = if TRUE, run compute_nearby
map.generate = if TRUE, run map_extrapolation

#'---------------------------------------------
# One function to rule them all, one funtion to find them, 
# One function to bring them all, and in the darkness bind them.
#'---------------------------------------------
spermwhale.analysis <- extrapolation_analysis(samples = segs,
                                          covariate.names = covariates.spermwhale,
                                          prediction.grid = predgrid,
                                          coordinate.system = aftt_crs,
                                          compare.covariates = TRUE,
                                          compare.extrapolation.type = "both",
                                          compare.n.covariates = NULL,
                                          compare.create.plots = TRUE,
                                          compare.display.percent = TRUE,
                                          nearby.compute = TRUE,
                                          nearby.nearby = 1,
                                          map.generate = TRUE,
                                          map.sightings = obs.sp,
                                          map.tracks = transects)

Results are packaged in a list object with three levels, i.e., extrapolation, nearby, and maps, representing the outputs of each step in the analysis.

Conclusion

This vignette has outlined the steps need to quantify extrapolation in environmental space using the dsmextra R package. Note that many possible models (and model types) can be fitted to distance sampling or related data, each potentially behaving differently outside reference conditions for any given level of extrapolation [@YatesBouchet2018]. Because of this, the tools presented herein are model-agnostic and only designed to provide an a priori assessment. This may, in turn, help inform modelling decisions and model interpretations, but caution should always be exercised when performing model selection, discrimination and criticism.