In sjevelazco/flexsdm: Tools for Data Preparation, Fitting, Prediction, Evaluation, and Post-Processing of Species Distribution Models

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  # fig.path = "man/figures/README-",
  # out.width = "100%",
  fig.width = 6,
  fig.height = 6,
  # dpi = 60,
  echo = TRUE,
  warning = FALSE,
  eval = TRUE
)

library(knitr)
# devtools::install_github('sjevelazco/flexsdm')

Introduction

Species distribution modeling (SDM) has become a standard tool in multiple research areas, including ecology, conservation biology, biogeography, paleobiogeography, and epidemiology. SDM is an area of active theoretical and methodological research. The flexsdm package provides users the ability to manipulate and parameterize models in a variety of ways that meet their unique research needs.

This flexibility enables users to define their own complete or partial modeling procedure specific for their modeling situations (e.g., number of variables, number of records, different algorithms and ensemble methods, algorithms tuning, etc.).

In this vignette, users will learn about the second set of functions in the flexsdm package that fall under the "modeling" umbrella. These functions were designed to construct and validate different types of models and can be grouped into fit_ , tune_ , and esm_* family functions. In addition there is a function to perform ensemble modeling.

The fit_ functions construct and validate models with default hyper-parameter values. The tune_ functions construct and validate models by searching for the best combination of hyper-parameter values, and esm_ functions can be used for constructing and validating Ensemble of Small Models. Finally, the fit_ensemble() function is for fitting and validating ensemble models.

These are the functions for model construction and validation:

fit_* functions family

fit_gam() Fit and validate Generalized Additive Models
fit_gau() Fit and validate Gaussian Process models
fit_gbm() Fit and validate Generalized Boosted Regression models
fit_glm() Fit and validate Generalized Linear Models
fit_max() Fit and validate Maximum Entropy models
fit_net() Fit and validate Neural Networks models
fit_raf() Fit and validate Random Forest models
fit_svm() Fit and validate Support Vector Machine models

tune_* functions family

tune_gbm() Fit and validate Generalized Boosted Regression models with exploration of hyper-parameters
tune_max() Fit and validate Maximum Entropy models with exploration of hyper-parameters
tune_net() Fit and validate Neural Networks models with exploration of hyper-parameters
tune_raf() Fit and validate Random Forest models with exploration of hyper-parameters
tune_svm() Fit and validate Support Vector Machine models with exploration of hyper-parameters

model ensemble

fit_ensemble() Fit and validate ensemble models with different ensemble methods

esm_* functions family

esm_gam() Fit and validate Generalized Additive Models with Ensemble of Small Model approach
esm_gau() Fit and validate Gaussian Process models Models with Ensemble of Small Model approach
esm_gbm() Fit and validate Generalized Boosted Regression models with Ensemble of Small Model approach
esm_glm() Fit and validate Generalized Linear Models with Ensemble of Small Model approach
esm_max() Fit and validate Maximum Entropy models with Ensemble of Small Model approach
esm_net() Fit and validate Neural Networks models with Ensemble of Small Model approach
esm_svm() Fit and validate Support Vector Machine models with Ensemble of Small Model approach

Installation

First, install the flexsdm package. You can install the released version of flexsdm from github with:

# devtools::install_github('sjevelazco/flexsdm')
require(flexsdm)
require(terra)
require(dplyr)

Project directory setup

Decide where on your computer you would like to store the inputs and outputs of your project (this will be your main directory). Use an existing one or use dir.create() to create your main directory. Then specify whether or not to include folders for projections, calibration areas, algorithms, ensembles, and thresholds. For more details see Vignette 01_pre_modeling

Data, species occurrence and background data

In this tutorial, we will be using species occurrences and environmental data that are available through the flexsdm package. The "abies" example dataset includes a pr_ab column (presence = 1, and absence = 0), location columns (x, y) and other environmental data. You can load the "abies" data into your local R environment by using the code below:

(THIS EXAMPLE LOOKS A LITTLE STRANGE BECAUSE WE ARE ALSO USING BACKGROUND DATA, WHILE THE ABIES DATASET CLEARLY HAS ABSENCES...)

data("abies")
data("backg")

dplyr::glimpse(abies)
dplyr::glimpse(backg)

If you want to replace the abies dataset with your own data, make sure that your dataset contains the environmental conditions related to presence-absence data. We use a pre-modeling family function for a k-fold partition method (to be used for cross-validation). In the partition method the number of folds or replications must be the same for presence-absence and for background points datasets.

abies2 <- part_random(
  data = abies,
  pr_ab = "pr_ab",
  method = c(method = "kfold", folds = 5)
)

dplyr::glimpse(abies2)

Now, in the abies2 object we have a new column called ".part" with the 5 k-folds (1, 2, 3, 4, 5), indicating which partition each record (row) is a member of. Next, we have to apply the same partition method and number of folds to the environmental conditions of the background points.

backg2 <- part_random(
  data = backg,
  pr_ab = "pr_ab",
  method = c(method = "kfold", folds = 5)
)

dplyr::glimpse(backg2)

In backg2 object we have a new column called ".part" with the 5 k-folds (1, 2, 3, 4, 5).

1. Fit and validate models

We fit and validate models: I. a maximum entropy model with default hyper-parameter values (flexsdm::fit_max) and II. a random forest model with exploration of hyper-parameters (flexsdm::tune_raf).

I. Maximum Entropy models with default hyper-parameter values.

max_t1 <- fit_max(
  data = abies2,
  response = "pr_ab",
  predictors = c("aet", "ppt_jja", "pH", "awc", "depth"),
  predictors_f = c("landform"),
  partition = ".part",
  background = backg2,
  thr = c("max_sens_spec", "equal_sens_spec", "max_sorensen"),
  clamp = TRUE,
  classes = "default",
  pred_type = "cloglog",
  regmult = 1
)

This function returns a list object with the following elements:

names(max_t1)

model: A "MaxEnt" class object. This object can be used for predicting.

options(max.print = 20)
max_t1$model

predictors: A tibble with quantitative (c column names) and qualitative (f column names) variables use for modeling.

max_t1$predictors

performance: The performance metric (see sdm_eval). Those metrics that are threshold dependent are calculated based on the threshold specified in the argument. We can see all the selected threshold values.

max_t1$performance

Predicted suitability for each test partition (row) based on the best model. This database is used in fit_ensemble.

max_t1$data_ens

II- Random forest models with exploration of hyper-parameters.

First, we create a data.frame that provides hyper-parameters values to be tested. It Is recommended to generate this data.frame. Hyper-parameter needed for tuning is 'mtry'. The maximum mtry must be equal to total number of predictors.

tune_grid <-
  expand.grid(
    mtry = seq(1, 7, 1),
    ntree = c(300, 500, 700, 900)
  )

We use the same data object abies2, with the same k-fold partition method:

rf_t <-
  tune_raf(
    data = abies2,
    response = "pr_ab",
    predictors = c(
      "aet", "cwd", "tmin", "ppt_djf",
      "ppt_jja", "pH", "awc", "depth"
    ),
    predictors_f = c("landform"),
    partition = ".part",
    grid = tune_grid,
    thr = "max_sens_spec",
    metric = "TSS",
  )

Let's see what the output object contains. This function returns a list object with the following elements:

names(rf_t)

model: A "randomForest" class object. This object can be used to see the formula details, a basic summary o fthe model, and for predicting.

rf_t$model

predictors: A tibble with quantitative (c column names) and qualitative (f column names) variables use for modeling.

rf_t$predictors

rf_t$performance

Predicted suitability for each test partition (row) based on the best model. This database is used in fit_ensemble.

rf_t$data_ens

These model objects can be used in flexsdm::fit_ensemble().

2. Model Ensemble

In this example we fit and validate and ensemble model using the two model objects that were just created.

# Fit and validate ensemble model
an_ensemble <- fit_ensemble(
  models = list(max_t1, rf_t),
  ens_method = "meansup",
  thr = NULL,
  thr_model = "max_sens_spec",
  metric = "TSS"
)

# Outputs
names(an_ensemble)

an_ensemble$thr_metric
an_ensemble$predictors
an_ensemble$performance

3. Fit and validate models with Ensemble of Small Model approach

This method consists of creating bivariate models with all pair-wise combinations of predictors and perform an ensemble based on the average of suitability weighted by Somers' D metric (D = 2 x (AUC -0.5)). ESM is recommended for modeling species with very few occurrences. This function does not allow categorical variables because the use of these types of variables could be problematic when applied to species with few occurrences. For more detail see Breiner et al. (2015, 2018)

data("abies")
library(dplyr)

# Create a smaller subset of occurrences
set.seed(10)
abies2 <- abies %>%
  na.omit() %>%
  group_by(pr_ab) %>%
  dplyr::slice_sample(n = 10) %>%
  group_by()

We can use different methods in the flexsdm::part_random function according to our data. See part_random for more details.

# Using k-fold partition method for model cross validation
abies2 <- part_random(
  data = abies2,
  pr_ab = "pr_ab",
  method = c(method = "kfold", folds = 3)
)
abies2

This function constructs Generalized Additive Models using the Ensembles of Small Models (ESM) approach (Breiner et al., 2015, 2018).

# We set the model without threshold specification and with the kfold created above
esm_gam_t1 <- esm_gam(
  data = abies2,
  response = "pr_ab",
  predictors = c("aet", "cwd", "tmin", "ppt_djf", "ppt_jja", "pH", "awc", "depth"),
  partition = ".part",
  thr = NULL
)

This function returns a list object with the following elements:

names(esm_gam_t1)

esm_model: A list with "GAM" class object for each bivariate model. This object can be used for predicting using the ESM approachwith sdm_predict function.

options(max.print = 10) # If you don't want to see printed all the output
esm_gam_t1$esm_model

predictors: A tibble with variables use for modeling.

esm_gam_t1$predictors

performance: Performance metric (see sdm_eval). Those threshold dependent metrics are calculated based on the threshold specified in the argument.

esm_gam_t1$performance

Now, we test the rep_kfold partition method. In this method 'folds' refers to the number of partitions for data partitioning and 'replicate' refers to the number of replicates. Both assume values >=1.

# Remove the previous k-fold partition
abies2 <- abies2 %>% select(-starts_with("."))

# Test with rep_kfold partition using 3 folds and 5 replicates
set.seed(10)
abies2 <- part_random(
  data = abies2,
  pr_ab = "pr_ab",
  method = c(method = "rep_kfold", folds = 3, replicates = 5)
)
abies2

We use the new rep_kfold partition in the gam model

esm_gam_t2 <- esm_gam(
  data = abies2,
  response = "pr_ab",
  predictors = c("aet", "cwd", "tmin", "ppt_djf", "ppt_jja", "pH", "awc", "depth"),
  partition = ".part",
  thr = NULL
)

Test with random bootstrap partitioning. In method 'replicate' refers to the number of replicates (assumes a value >=1), 'proportion' refers to the proportion of occurrences used for model fitting (assumes a value >0 and <=1). With this method we can configure the proportion of training and testing data according to the species occurrences. In this example, proportion='0.7' indicates that 70% of data will be used for model training, while 30% will be used for model testing. For this method, the function will return .partX columns with "train" or "test" words as the entries.

# Remove the previous k-fold partition
abies2 <- abies2 %>% select(-starts_with("."))

# Test with bootstrap partition using 10 replicates
set.seed(10)
abies2 <- part_random(
  data = abies2,
  pr_ab = "pr_ab",
  method = c(method = "boot", replicates = 10, proportion = 0.7)
)
abies2

Use the new rep_kfold partition in the gam model

esm_gam_t3 <- esm_gam(
  data = abies2,
  response = "pr_ab",
  predictors = c("aet", "cwd", "tmin", "ppt_djf", "ppt_jja", "pH", "awc", "depth"),
  partition = ".part",
  thr = NULL
)