Species distribution models (SDMs) are widely used tools for predicting the potential distribution of a species based on environmental variables. However, it is crucial to evaluate the performance of these models to ensure their accuracy and reliability. One commonly used method for evaluating the performance of SDMs is block cross-validation (read more in Valavi et al. 2019 and the Tutorial 1). This approach allows for a more robust evaluation of the model as it accounts for spatial autocorrelation and other spatial dependencies (Roberts et al. 2017). This document illustrates how to utilize the blockCV
package to evaluate the performance of SDMs using block cross-validation.
Two examples are provided: modelling using the randomForest
, and biomod2
packages.
Check new updates of blockCV
in the tutorial 1- blockCV introduction: how to create block cross-validation folds.
Please cite blockCV
by: Valavi R, Elith J, Lahoz-Monfort JJ, Guillera-Arroita G. blockCV: An R package for generating spatially or environmentally separated folds for k-fold cross-validation of species distribution models. Methods Ecol Evol. 2019; 10:225–232. doi: 10.1111/2041-210X.13107
The blockCV
package contains the raw format of the following data:
.tif
).csv
)These data are used to illustrate how the package is used. The raster data include several bioclimatic variables for Australia. The species data include presence-absence records (binary) of a simulated species.
options(scipen = 10)
To load the package raster data use:
library(sf) # working with spatial vector data library(terra) # working with spatial raster data library(tmap) # plotting maps # load raster data # the pipe operator |> is available for R version 4.1 or higher rasters <- system.file("extdata/au/", package = "blockCV") |> list.files(full.names = TRUE) |> terra::rast()
The presence-absence species data include 243
presence points and 257
absence points.
# load species presence-asence data and convert to sf points <- read.csv(system.file("extdata/", "species.csv", package = "blockCV")) head(points)
The appropriate format of species data for the blockCV
package is simple features (from the sf
package). The data is provide in GDA2020 / GA LCC coordinate reference system with "EPSG:7845"
as defined by crs = 7845
. We convert the data.frame
to sf
as follows:
pa_data <- sf::st_as_sf(points, coords = c("x", "y"), crs = 7845)
Let's plot the species data using tmap
package:
tm_shape(rasters[[1]]) + tm_raster(legend.show = FALSE, n = 30, palette = gray.colors(10)) + tm_shape(pa_data) + tm_dots(col = "occ", style = "cat", size = 0.1)
Here, we generate two CV strategies, one k-fold CV using cv_spatial
and one LOO CV using cv_nndm
. See more options and configurations in the Tutorial 1 - introduction to blockCV
.
library(blockCV)
Creating spatial blocks:
scv1 <- cv_spatial( x = pa_data, column = "occ", # the response column (binary or multi-class) r = rasters, k = 5, # number of folds size = 360000, # size of the blocks in metres selection = "random", # random blocks-to-fold iteration = 50, # find evenly dispersed folds progress = FALSE, # turn off progress bar biomod2 = TRUE, # also create folds for biomod2 raster_colors = terrain.colors(10, rev = TRUE) # options from cv_plot for a better colour contrast )
Now, let's create LOO CV folds with nearest neighbour distance matching (NNDM; Milà et al. 2022) algorithm. To run cv_nndm
, you need a measure of spatial autocorrelation present in your data. This can be done wither by 1) fitting a model and use it's residual to calculate spatial autocorrelation, or 2) use the autocorrelation of response variable for it (Milà et al, 2022; Roberts et al. 2017). Here, we calculate spatial autocorrelation range in the response using the cv_spatial_autocor
function.
range <- cv_spatial_autocor( x = pa_data, # species data column = "occ", # column storing presence-absence records (0s and 1s) plot = FALSE ) range$range
So the range of spatial autocorrelation is roughly 360
kilometres.
scv2 <- cv_nndm( x = pa_data, column = "occ", r = rasters, size = 360000, # range of spatial autocorrelation num_sample = 10000, # number of samples of prediction points sampling = "regular", # sampling methods; it can be random as well min_train = 0.1, # minimum portion to keep in each train fold plot = TRUE )
You can visualise the generated folds of both methods using cv_plot
function. Here is three folds from the cv_nndm
object:
# see the number of folds in scv2 object scv2$k
cv_plot( cv = scv2, # cv object x = pa_data, # species spatial data num_plots = c(1, 10, 100) # three of folds to plot )
In this section, we show how to use the folds generated by blockCV
in the previous sections for the evaluation of SDMs constructed on the species data available in the package. The blockCV
stores training and testing folds in three different formats. The common format for all three blocking strategies is a list of the indices of observations in each fold. For cv_spatial
and cv_cluster
(but not cv_buffer
and cv_nndm
), the folds are also stored in a matrix format suitable for the biomod2
package and a vector of fold's number for each observation. This is equal to the number of observation in species spatial data. These three formats are stored in the cv objects as folds_list
, biomod_table
and folds_ids
respectively.
blockCV
with Random Forest modelFolds generated by cv_nndm
function are used here (a training and testing fold for each record) to show how to use folds from this function (the cv_buffer
is also similar to this approach) for evaluation species distribution models.
Note that with cv_nndm
using presence-absence data (and any other type of data except for presence-background data when presence_bg = TRUE
is used), there is only one point in each testing fold, and therefore AUC cannot be calculated for each fold separately. Instead, the value of each point is first predicted to the testing point (of each fold), and then a unique AUC is calculated for the full set of predictions.
# loading the libraries library(randomForest) library(precrec) # extract the raster values for the species points as a dataframe model_data <- terra::extract(rasters, pa_data, df = TRUE, ID = FALSE) # adding species column to the dataframe model_data$occ <- as.factor(pa_data$occ) head(model_data) # extract the fold indices from buffering object # created in the previous section # the folds_list works for all four blocking strategies folds <- scv2$folds_list # create a data.frame to store the prediction of each fold (record) test_table <- pa_data test_table$preds <- NA for(k in seq_len(length(folds))){ # extracting the training and testing indices # this way works with folds_list list (but not folds_ids) trainSet <- unlist(folds[[k]][1]) # training set indices; first element testSet <- unlist(folds[[k]][2]) # testing set indices; second element rf <- randomForest(occ ~ ., model_data[trainSet, ], ntree = 500) # model fitting on training set test_table$preds[testSet] <- predict(rf, model_data[testSet, ], type = "prob")[,2] # predict the test set } # calculate Area Under the ROC and PR curves and plot the result precrec_obj <- evalmod(scores = test_table$preds, labels = test_table$occ) auc(precrec_obj)
# to not run the model and reduce run time; result are calculated and loaded read.csv("../man/figures/roc_rf.csv")
Plot the curves:
library(ggplot2) autoplot(precrec_obj)
blockCV
in biomod2
packagePackage biomod2
(Thuiller et al., 2017) is a commonly used platform for modelling species distributions in an ensemble framework. In this example, we show how to use blockCV
folds in biomod2
. In this example, the folds generated by cv_spatial
is used to evaluate three modelling methods implemented in biomod2
. The CV.user.table
can be generated by both cv_spatial
and cv_cluster
functions and it is stored as biomod_table
in their output objects (note: in the older versions of the biomod2
package, the argument data.split.table
was used for external CV folds. This has now changed to CV.user.table
that also requires CV.strategy = "user.defined"
and new column names. See the example below).
# loading the library library(biomod2) # extract the raster values for the species points as a dataframe raster_values <- terra::extract(rasters, pa_data, df = TRUE, ID = FALSE) # 1. Formatting Data biomod_data <- BIOMOD_FormatingData(resp.var = pa_data$occ, expl.var = raster_values, resp.xy = sf::st_coordinates(pa_data), resp.name = "occ", na.rm = TRUE) # 2. Defining the folds for CV.user.table # note that biomod_table should be used here not folds # use generated folds from cv_spatial in previous section spatial_cv_folds <- scv1$biomod_table # the new update of the package biomod2 (v4.2-3 <) requires the names to be as below colnames(spatial_cv_folds) <- paste0("_allData_RUN", 1:ncol(spatial_cv_folds)) # 3. Defining Models Options; using default options here. You can use your own settting here. biomod_options <- bm_ModelingOptions(data.type = "binary", strategy = "bigboss") # 4. Model fitting biomod_model_out <- BIOMOD_Modeling(biomod_data, models = c('GLM','MARS','GBM'), CV.strategy = "user.defined", CV.user.table = spatial_cv_folds, OPT.user = biomod_options, var.import = 0, metric.eval = c('ROC'), do.full.models = TRUE)
# 5. Model evaluation biomod_model_eval <- get_evaluations(biomod_model_out) biomod_model_eval[c("run", "algo", "metric.eval", "calibration", "validation")]
# to not run the model and reduce run time; result are calculated and loaded read.csv("../man/figures/evl_biomod.csv")
The validation
column shows the result of spatial cross-validation and each RUN is a CV fold.
C. Milà, J. Mateu, E. Pebesma, and H. Meyer, Nearest Neighbour Distance Matching Leave-One-Out Cross-Validation for map validation, Methods in Ecology and Evolution (2022).
Roberts, D.R., Bahn, V., Ciuti, S., Boyce, M.S., Elith, J., Guillera-Arroita, G., Hauenstein, S., Lahoz-Monfort, J.J., Schröder, B., Thuiller, W., others, 2017. Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Ecography. 40: 913-929.
Thuiller W, Georges D, Guéguen M, Engler R, Breiner F, Lafourcade B, Patin R, Blancheteau H (2024). biomod2: Ensemble Platform for Species Distribution Modeling. R package version 4.2-5. https://CRAN.R-project.org/package=biomod2.
Valavi R, Elith J, Lahoz-Monfort JJ, Guillera-Arroita G. blockCV: An R package for generating spatially or environmentally separated folds for k-fold cross-validation of species distribution models. Methods Ecol Evol. 2019; 10:225–232. doi: 10.1111/2041-210X.13107
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.