Application with palaeodata
In tidysdm: Species Distribution Models with Tidymodels

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
# xgboost uses data.table
data.table::setDTthreads(2)
RhpcBLASctl::blas_set_num_threads(2)
RhpcBLASctl::omp_set_num_threads(2)

SDMs with tidymodels for palaeo data

In this article, we show how a Species Distribution Model can be fitted with tidysdm on time-scattered (i.e.palaeontological, archaeozoological, archaeological) data, with samples covering different time periods. We recommend users first read the "tidysdm overview" article, which introduces a number of functions and concepts that will be used in the present article.

We first load tidysdm:

library(tidysdm)

Preparing your data

We start by loading a set of radiocarbon dates (calibrated) for horses, covering from 22k years ago until 8k years ago.

data(horses)
horses

We convert our dataset into an sf data.frame so that we can easily plot it (here tidyterra shines):

library(sf)
horses <- st_as_sf(horses, coords = c("longitude", "latitude"))
st_crs(horses) <- 4326

As a background to our presences, we will use the land mask for the present, taken from pastclim, and cut to cover only Europe:

library(pastclim)
set_data_path(on_CRAN = TRUE)

library(pastclim)
land_mask <- pastclim::get_land_mask(time_bp = 0, dataset = "Example")
europe_poly <- vect(region_outline$Europe)
crs(europe_poly) <- "lonlat"
land_mask <- crop(land_mask, europe_poly)
land_mask <- mask(land_mask, europe_poly)

And use tidyterra to plot:

library(tidyterra)
ggplot() +
  geom_spatraster(data = land_mask, aes(fill = land_mask_0)) +
  geom_sf(data = horses, aes(col = time_bp))

We now thin our presences, so that locations are further than 100km and 2000 years apart.

set.seed(123)
horses <- thin_by_dist_time(horses,
  dist_min = km2m(100),
  interval_min = y2d(2000),
  time_col = "time_bp",
  lubridate_fun = pastclim::ybp2date
)
nrow(horses)

And see what we have left:

ggplot() +
  geom_spatraster(data = land_mask, aes(fill = land_mask_0)) +
  geom_sf(data = horses, aes(col = time_bp))

We now need a time series of palaeoclimate reconstructions. In this vignette, we will use the example dataset from pastclim. This dataset only has reconstructions every 5k years for the past 20k years at 1 degree resolution, with 3 bioclimatic variables. It will suffice for illustrative purposes, but we recommend that you download higher quality datasets with pastclim for real analysis. As for the land mask, we will cut the reconstructions to cover Europe only:

library(pastclim)
climate_vars <- c("bio01", "bio10", "bio12")
climate_full <- pastclim::region_series(
  bio_variables = climate_vars,
  data = "Example",
  crop = region_outline$Europe
)

Now we thin the observations to only keep one per cell in the raster (it would be better if we had an equal area projection...), and remove locations outside the desired area (if there was any):

set.seed(123)
horses <- thin_by_cell_time(horses,
  raster = climate_full,
  time_col = "time_bp",
  lubridate_fun = pastclim::ybp2date
)
nrow(horses)

Let's see what we have left of our points:

ggplot() +
  geom_spatraster(data = land_mask, aes(fill = land_mask_0)) +
  geom_sf(data = horses, aes(col = time_bp))

Now we sample pseudo-absences (we will constraint them to be at least 70km away from any presences), selecting three times the number of presences

set.seed(123)
horses <- sample_pseudoabs_time(horses,
  n_per_presence = 3,
  raster = climate_full,
  time_col = "time_bp",
  lubridate_fun = pastclim::ybp2date,
  method = c("dist_min", km2m(70))
)

Let's see our presences and absences:

ggplot() +
  geom_spatraster(data = land_mask, aes(fill = land_mask_0)) +
  geom_sf(data = horses, aes(col = class))

Now let's get the climate for these location. pastclim requires a data frame with two columns with coordinates and a column of time in years before present (where negative values represent time in the past). We manipulate the sf object accordingly:

horses_df <- horses %>%
  dplyr::bind_cols(sf::st_coordinates(horses)) %>%
  mutate(time_bp = date2ybp(time_step)) %>%
  as.data.frame() %>%
  select(-geometry)
# get climate
horses_df <- location_slice_from_region_series(horses_df,
  region_series = climate_full
)

# add the climate reconstructions to the sf object, and remove the time_step
# as we don't need it for modelling
horses <- horses %>%
  bind_cols(horses_df[, climate_vars]) %>%
  select(-time_step)

Fit the model by crossvalidation

Next, we need to set up a recipe to define how to handle our dataset. We don't want to transform our data, so we just need to define the formula (class is the outcome, all other variables are predictors; note that, for sf objects, geometry is automatically ignored as a predictor):

horses_rec <- recipe(horses, formula = class ~ .)
horses_rec

We can quickly check that we have the variables that we want with:

horses_rec$var_info

We now build a workflow_set of different models, defining which hyperparameters we want to tune. We will use glm, gam, random forest and boosted trees as our models, so only random forest and boosted trees have tunable hyperparameters. For the most commonly used models, tidysdm automatically chooses the most important parameters, but it is possible to fully customise model specifications.

horses_models <-
  # create the workflow_set
  workflow_set(
    preproc = list(default = horses_rec),
    models = list(
      # the standard glm specs  (no params to tune)
      glm = sdm_spec_glm(),
      # the standard sdm specs (no params to tune)
      gam = sdm_spec_gam(),
      # rf specs with tuning
      rf = sdm_spec_rf(),
      # boosted tree model (gbm) specs with tuning
      gbm = sdm_spec_boost_tree()
    ),
    # make all combinations of preproc and models,
    cross = TRUE
  ) %>%
  # set formula for gams
  update_workflow_model("default_gam",
    spec = sdm_spec_gam(),
    formula = gam_formula(horses_rec)
  ) %>%
  # tweak controls to store information needed later to create the ensemble
  option_add(control = control_ensemble_grid())

Note that gams are unusual, as we need to specify a formula to define to which variables we will fit smooths. By default, gam_formula() fits a smooth to every continuous predictor, but a custom formula can be provided instead.

We now want to set up a spatial block cross-validation scheme to tune and assess our models:

library(tidysdm)
set.seed(1005)
horses_cv <- spatial_block_cv(horses, v = 5)
autoplot(horses_cv)

We can now use the block CV folds to tune and assess the models:

set.seed(123)
horses_models <-
  horses_models %>%
  workflow_map("tune_grid",
    resamples = horses_cv, grid = 5,
    metrics = sdm_metric_set(), verbose = TRUE
  )

Note that workflow_set correctly detects that we have no tuning parameters for glm and gam. We can have a look at the performance of our models with:

autoplot(horses_models)

Now let's create an ensemble, selecting the best set of parameters for each model (this is really only relevant for the random forest, as there were not hype-parameters to tune for the glm and gam). We will use the Boyce continuous index as our metric to choose the best random forest and boosted tree. When adding members to an ensemble, they are automatically fitted to the full training dataset, and so ready to make predictions.

horses_ensemble <- simple_ensemble() %>%
  add_member(horses_models, metric = "boyce_cont")

And visualise it

autoplot(horses_ensemble)

Projecting to other times

We can now make predictions with this ensemble (using the default option of taking the mean of the predictions from each model) for the Last Glacial Maximum (LGM, 21,000 years ago).

climate_lgm <- pastclim::region_slice(
  time_bp = -20000,
  bio_variables = climate_vars,
  data = "Example",
  crop = region_outline$Europe
)

And predict using the ensemble:

prediction_lgm <- predict_raster(horses_ensemble, climate_lgm)
ggplot() +
  geom_spatraster(data = prediction_lgm, aes(fill = mean)) +
  scale_fill_terrain_c()