knitr::opts_chunk$set(collapse = TRUE,
                      comment = "#>",
                      fig.align = "center")
options(knitr.table.format = "html")
library(kableExtra)
data <- SDMtune:::t  # Internal SDMtune object used as the example dataset in this article
files <- list.files(path = file.path(system.file(package = "dismo"), "ex"), 
                    pattern = "grd", 
                    full.names = TRUE)
predictors <- terra::rast(files)

Intro

In the previous articles you have learned how to prepare the data for the analysis, how to train a model, how to make predictions and how to evaluate a model using SDMtune. In this article you will learn different evaluation strategies that provide a more robust estimate of model performance.

Training and testing datasets

First we load the SDMtune package:

library(SDMtune)

It is always good practice to split the species locations into two parts, using one part to train the model and the remaining part to evaluate it. We can use the trainValTest() function for this purpose. Let's say we want to use 80% of the species locations to train our model and the remaining 20% as the testing dataset to evaluate it:

library(zeallot)  # For unpacking assignment
c(train, test) %<-% trainValTest(data, 
                                 test = 0.2, 
                                 only_presence = TRUE, 
                                 seed = 25)

maxnet_model <- train("Maxnet", 
                      data = train)

The only_presence argument tells the function to split only the presence locations and not the background locations. We can now evaluate the model using the testing dataset, which has not been used to train the model:

cat("Training auc: ", auc(maxnet_model))
cat("Testing auc: ", auc(maxnet_model, test = test))

We can plot the ROC curve for both the training and testing datasets with:

plotROC(maxnet_model, 
        test = test)
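plotROC() returns a ggplot object, like the other SDMtune plotting functions, so the figure can be customised or saved to disk. A minimal sketch, assuming the ggplot2 package is installed (the file name and size are just examples):

# Store the plot and save it as a PNG file
p <- plotROC(maxnet_model, 
             test = test)
ggplot2::ggsave("roc_curve.png", 
                plot = p, 
                width = 6, 
                height = 4)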

This approach is valid when we have a large dataset. In our case, with only `r nrow(train@data[train@pa == 1, ])` observations, the evaluation depends strongly on how we split the presence locations. Let's run a small experiment in which we perform ten different train/test splits and compute the AUC each time:

output <- data.frame(matrix(NA, nrow = 10, ncol = 3)) # Create an empty data.frame
colnames(output) <- c("seed", "trainAUC", "testAUC")

set.seed(25)
seeds <- sample.int(1000, 10) # Create 10 different random seeds

for (i in seq_along(seeds)) { # Loop through the seeds
  c(train, test) %<-% trainValTest(data, 
                                   test = 0.2, 
                                   seed = seeds[i], 
                                   only_presence = TRUE) # Make the train/test split

  m <- train("Maxnet", 
             data = train) # train the model

  # Populate the output data.frame
  output[i, 1] <- seeds[i]
  output[i, 2] <- auc(m)
  output[i, 3] <- auc(m, test = test)
}

The testing AUC varies from `r round(min(output[, 3]), 3)` to `r round(max(output[, 3]), 3)`.

# Print the output
output

# Render the output as a formatted HTML table
kableExtra::kable(output) |> 
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"),
                            position = "center",
                            full_width = FALSE)
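To quantify how much the estimate varies across splits, we can summarise the testing AUC column with base R. A quick sketch using summary() and sd():

# Spread of the testing AUC across the 10 train/test splits
summary(output$testAUC)
sd(output$testAUC)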

When dealing with a small dataset, a better approach is cross validation.

Cross validation

To perform cross validation in SDMtune we pass the folds argument to the train() function. First we have to create the folds. There are several ways to create them; here we show how to make a random partition of 4 folds using the randomFolds() function:

folds <- randomFolds(data, 
                     k = 4, 
                     only_presence = TRUE, 
                     seed = 25)

The output of the function is a list containing two matrices, the first for the training and the second for the testing locations. Each column of a matrix represents a fold, with TRUE for the locations included in and FALSE for those excluded from that partition.
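The structure can be inspected directly. A minimal sketch, accessing the two matrices by position (the first holds the training folds, the second the testing folds):

# Overview of the fold object
str(folds, max.level = 1)

# Number of locations used for training and testing in each fold
colSums(folds[[1]])
colSums(folds[[2]])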

Let's perform a 4-fold cross validation using the Maxnet method (note that we use the full dataset):

cv_model <- train("Maxnet", 
                  data = data, 
                  folds = folds)
cv_model

The output in this case is an SDMmodelCV object. It contains the four trained models in the models slot and the fold partitions in the folds slot. We can compute the AUC of an SDMmodelCV object using:

cat("Training AUC: ", auc(cv_model))
cat("Testing AUC: ", auc(cv_model, test = TRUE))

This returns the AUC value averaged across the four models.
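The individual models can also be accessed through the models slot, for example to check how the training AUC varies among folds. A minimal sketch using vapply() to loop over the four models:

# Number of models trained during cross validation
length(cv_model@models)

# Training AUC of each fold's model
vapply(cv_model@models, auc, numeric(1))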

Try yourself

Repeat the analysis using the default_model that we created in the train a model article.

Spatial cross validation

The train() function also accepts folds created with two other packages: ENMeval and blockCV. The folds are converted internally into the format required by SDMtune. These packages provide specific functions to create fold partitions that are spatially or environmentally independent.

Block partition using the package ENMeval:

library(ENMeval)
block_folds <- get.block(occ = data@coords[data@pa == 1, ], 
                         bg.coords = data@coords[data@pa == 0, ])

model <- train(method = "Maxnet", 
               data = data, 
               fc = "l", 
               reg = 0.8, 
               folds = block_folds)

Checkerboard1 partition using the package ENMeval:

cb_folds <- get.checkerboard1(occ = data@coords[data@pa == 1, ], 
                              env = predictors, 
                              bg.coords = data@coords[data@pa == 0, ], 
                              aggregation.factor = 4)

model <- train(method = "Maxnet", 
               data = data, 
               fc = "l", 
               reg = 0.8, 
               folds = cb_folds)

Spatial blocks using the package blockCV:

library(blockCV)
# Create sf object
sf_df <- sf::st_as_sf(cbind(data@coords, pa = data@pa),
                      coords = c("X", "Y"),
                      crs = terra::crs(predictors,
                                       proj = TRUE))

# Spatial blocks
spatial_folds <- cv_spatial(x = sf_df,
                            column = "pa",
                            rows_cols = c(8, 10),
                            k = 5,
                            hexagon = FALSE,
                            selection = "systematic")

model <- train(method = "Maxent", 
               data = data, 
               fc = "l", 
               reg = 0.8,
               folds = spatial_folds)
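To inspect the resulting spatial partition you can use the plotting helper provided by blockCV. A sketch, assuming blockCV >= 3.0, where cv_spatial() and cv_plot() are available:

# Visualise the blocks and the fold assignment of each location
cv_plot(cv = spatial_folds, 
        x = sf_df)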

Conclusion

In this article you have learned:

- how to split the data into training and testing datasets with trainValTest();
- why a single train/test split can give an unstable estimate when there are few presence locations;
- how to run cross validation using randomFolds() and the folds argument of train();
- how to use spatially independent folds created with ENMeval and blockCV.

In the next article you will learn how to display the variable importance and how to plot the response curve.


