knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) # xgboost uses data.table data.table::setDTthreads(2) RhpcBLASctl::blas_set_num_threads(2) RhpcBLASctl::omp_set_num_threads(2) download_data <- FALSE # note that data were created in the overview vignette
In this vignette, we illustrate how a number of features from tidymodels
can
be used to enhance a conventional SDM pipeline. We recommend users first become familiar with tidymodels
;
there are a number of excellent tutorials (both introductory and advanced) on its dedicated website
We reuse the example on the Iberian lizard that we used
in the
tidysdm
overview article.
DALEX
An issue with machine learning algorithms is that it is not easy to understand the
role of different variables in giving the final prediction. A number of packages
have been created to explore and explain the behaviour of ML algorithms, such as those
used in tidysdm
. In the tidysdm
overview article, we illustrated how to use recipes
to create profiles.
Here we demonstrate how to use DALEX, an excellent package
that has methods to deal with tidymodels
. tidysdm
contains additional functions that
allow use to use the DALEX functions directly on tidysdm
ensembles.
We will use a simple ensemble that we built in the overview vignette.
library(tidysdm)
lacerta_ensemble
The first step in DALEX is to create an explainer object, which can then be
queried by different functions in the package, to turn the explainer into an
explanation (following the DALEX lingo). As a first step, we use the
custom function explain_tidysdm
to generate our explainer:
explainer_lacerta_ens <- explain_tidysdm(lacerta_ensemble)
Now that we have our explainer, we can explore variable importance for the ensemble:
library(DALEX) vip_ensemble <- model_parts(explainer = explainer_lacerta_ens) plot(vip_ensemble)
Or generate partial dependency plots for a given variable (e.g. bio05):
pdp_bio05 <- model_profile(explainer_lacerta_ens, N = 500, variables = "bio05") plot(pdp_bio05)
There are many other functions in DALEX that can be applied to the explainer to further explore the behaviour of the model; see several tutorial on https://modeloriented.github.io/DALEX/
It is also possible to explore the individual models that make up the ensemble:
explainer_list <- explain_tidysdm(tidysdm::lacerta_ensemble, by_workflow = TRUE)
The resulting list can be then used to build lists of explanations, which can then be plotted.
profile_list <- lapply(explainer_list, model_profile, N = 500, variables = "bio05" ) plot(profile_list)
The standard approach in tidymodels
is to make an initial split of the data
into a test and a training set. We will use retain 20% of the data (1/5) for the testing set, and
use the rest for training.
We start by loading a set of presences and absences and their associated climate,
analogous to the one that we generated in the tidysdm
overview article:
library(tidysdm) library(sf) lacerta_thin <- readRDS(system.file("extdata/lacerta_thin_all_vars.rds", package = "tidysdm" ))
We then use spatial_initial_split
to do the split, using a spatial_block_cv
scheme to partition the data:
set.seed(1005) lacerta_initial <- spatial_initial_split(lacerta_thin, prop = 1 / 5, spatial_block_cv ) autoplot(lacerta_initial)
And check the balance of presences vs pseudoabsences:
check_splits_balance(lacerta_initial, class)
We can now extract the training set from our lacerta_initial
split, and sample
folds to set up cross validation (note that we set the cellsize
and offset
based on the full dataset, lacerta_thin
; this allows us to use the same grid
we used for the initial_split
). In this example, we also have add a very small number
to the offset to avoid an error in the spatial_block_cv
function arising from
some points falling on the boundary of the grid cells). This is often not necessary,
so only introduce this if you encounter an error.
set.seed(1005) lacerta_training <- training(lacerta_initial) lacerta_cv <- spatial_block_cv(lacerta_training, v = 5, cellsize = grid_cellsize(lacerta_thin), offset = grid_offset(lacerta_thin) + 0.00001 ) autoplot(lacerta_cv)
And check the balance in the dataset:
check_splits_balance(lacerta_cv, class)
Only certain type of models (e.g. glm, svm) struggle with correlated variables; other algorithms, such as random forests, can handle correlated variables. So, we will create two recipes, one with all variables, and one only with the variables that are uncorrelated:
lacerta_rec_all <- recipe(lacerta_thin, formula = class ~ .) lacerta_rec_uncor <- lacerta_rec_all %>% step_rm(all_of(c( "bio01", "bio02", "bio03", "bio04", "bio07", "bio08", "bio09", "bio10", "bio11", "bio12", "bio14", "bio16", "bio17", "bio18", "bio19", "altitude" ))) lacerta_rec_uncor
And now use these two recipes in a workflowset
(we will keep it small for computational time),
selecting the appropriate recipe for each model. We will include a model (polynomial
support vector machines, or SVM) which does not have a wrapper in tidysdm
for creating a model specification. However, we can use a standard model spec from
yardstick
:
lacerta_models <- # create the workflow_set workflow_set( preproc = list( uncor = lacerta_rec_uncor, # recipe for the glm all = lacerta_rec_all, # recipe for the random forest all = lacerta_rec_uncor # recipe for svm ), models = list( # the standard glm specs glm = sdm_spec_glm(), # rf specs with tuning rf = sdm_spec_rf(), # svm specs with tuning svm = parsnip::svm_poly( cost = tune::tune(), degree = tune::tune() ) %>% parsnip::set_engine("kernlab") %>% parsnip::set_mode("classification") ), # make all combinations of preproc and models, cross = FALSE ) %>% # tweak controls to store information needed later to create the ensemble # note that we use the bayes version as we will use a Bayes search (see later) option_add(control = stacks::control_stack_bayes())
We can now use the block CV folds to
tune and assess the models. Note that there are multiple tuning approaches,
besides the standard grid method. Here we will use tune_bayes
from the
tune
package (see its
help page to see how a Gaussian Process model is used to choose parameter
combinations).
This tuning method (as opposed to use a standard grid) does not allow for
hyper-parameters with unknown limits, but
mtry
for random forest is undefined as its upper range depends on the number of
variables in the dataset. So, before tuning, we need to finalise mtry
by
informing the set dials with the actual data:
rf_param <- lacerta_models %>% # extract the rf workflow extract_workflow("all_rf") %>% # extract its parameters dials (used to tune) extract_parameter_set_dials() %>% # give it the predictors to finalize mtry finalize(x = st_drop_geometry(lacerta_thin) %>% select(-class)) # now update the workflowset with the new parameter info lacerta_models <- lacerta_models %>% option_add(param_info = rf_param, id = "all_rf")
And now we can tune the models:
set.seed(1234567) lacerta_models <- lacerta_models %>% workflow_map("tune_bayes", resamples = lacerta_cv, initial = 8, metrics = sdm_metric_set(), verbose = TRUE )
We can have a look at the performance of our models with:
autoplot(lacerta_models)
Instead of building a simple ensemble with the best version of each model type, we
can build a stack ensemble, as implemented in the package stacks
. Stacking
uses a meta-learning algorithm to learn how to best combine multiple models, including
multiple versions of the same algorithm with different hyper-parameters.
library(stacks) set.seed(1005) lacerta_stack <- # initialize the stack stacks() %>% # add candidate members add_candidates(lacerta_models) %>% # determine how to combine their predictions blend_predictions() %>% # fit candidates with non-zero weights (i.e. non-zero stacking coefficients) fit_members() autoplot(lacerta_stack, type = "weights")
We can see that three versions of the SVM and one of the random forests were selected; the stacking coefficients give an indication of the weight each model carries within the ensemble. We can now use the ensemble to make predictions about the testing data:
lacerta_testing <- testing(lacerta_initial) lacerta_test_pred <- lacerta_testing %>% bind_cols(predict(lacerta_stack, ., type = "prob"))
And look at the goodness of fit using some commonly used sdm metrics. Note that
sdm_metric_set
is first invoked to generate a function (with empty ()
) that is then used
on the data.
sdm_metric_set()(data = lacerta_test_pred, truth = class, .pred_presence)
We can now make predictions with this stacked ensemble. We start by extracting the climate for the variables of interest
download_dataset("WorldClim_2.1_10m") climate_vars <- lacerta_rec_all$var_info %>% filter(role == "predictor") %>% pull(variable) climate_present <- pastclim::region_slice( time_ce = 1985, bio_variables = climate_vars, data = "WorldClim_2.1_10m", crop = iberia_poly )
climate_present <- terra::readRDS( system.file("extdata/lacerta_climate_present_10m.rds", package = "tidysdm") ) climate_vars <- lacerta_rec_all$var_info %>% filter(role == "predictor") %>% pull(variable) if (!all(climate_vars %in% names(climate_present))) { stop("mismatched variables in the raster") }
prediction_present <- predict_raster(lacerta_stack, climate_present, type = "prob" ) library(tidyterra) ggplot() + geom_spatraster(data = prediction_present, aes(fill = .pred_presence)) + scale_fill_terrain_c() + # plot presences used in the model geom_sf(data = lacerta_thin %>% filter(class == "presence"))
Most machine learning algorithms do not natively use multilevel factors as predictors.
The solution is to create dummy variables, which are binary variables that represent
the levels of the factor. In tidymodels
, this is done using the step_dummy
function.
Let's create a factor variable with 3 levels based on altitude.
library(tidysdm) # load the dataset lacerta_thin <- readRDS(system.file("extdata/lacerta_thin_all_vars.rds", package = "tidysdm" )) # create a topography variable with 3 levels based on altitude lacerta_thin$topography <- cut(lacerta_thin$altitude, breaks = c(-Inf, 200, 800, Inf), labels = c("plains", "hills", "mountains") ) table(lacerta_thin$topography)
We then create the recipe by adding a step to create dummy variables for the topography
variable.
# subset to variable of interest lacerta_thin <- lacerta_thin %>% select( class, bio05, bio06, bio12, bio15, topography ) lacerta_rec <- recipe(lacerta_thin, formula = class ~ .) %>% step_dummy(topography) lacerta_rec
Let's us see what this does:
lacerta_prep <- prep(lacerta_rec) summary(lacerta_prep)
We have added two "derived" variables, topography_hills and topography_mountains, which are binary variables that allow us to code topography (with plains being used as the reference level, which is coded by both hills and mountains being 0 for a given location). We can look at the first few rows of the data to see the new variables by baking the recipe:
lacerta_bake <- bake(lacerta_prep, new_data = lacerta_thin) glimpse(lacerta_bake)
We can now run the sdm as usual:
# define the models lacerta_models <- # create the workflow_set workflow_set( preproc = list(default = lacerta_rec), models = list( # the standard glm specs glm = sdm_spec_glm(), # rf specs with tuning rf = sdm_spec_rf() ), # make all combinations of preproc and models, cross = TRUE ) %>% # tweak controls to store information needed later to create the ensemble option_add(control = control_ensemble_grid()) # tune set.seed(100) lacerta_cv <- spatial_block_cv(lacerta_thin, v = 3) lacerta_models <- lacerta_models %>% workflow_map("tune_grid", resamples = lacerta_cv, grid = 3, metrics = sdm_metric_set(), verbose = TRUE ) # fit the ensemble lacerta_ensemble <- simple_ensemble() %>% add_member(lacerta_models, metric = "boyce_cont")
We can now verify that the dummy variables were used by extracting the model fit from one of the models in the ensemble:
lacerta_ensemble$workflow[[1]] %>% extract_fit_parsnip()
We can see that we have coefficients for topography_hills and topography_mountains.
Let us now predict the presence of the lizard in the Iberian Peninsula using the ensemble. Note that,
for predict_raster()
to work, the name and levels for a categorical variable need to match with those used
when training the models (i.e. in the recipe with step_dummy()
):
climate_present <- terra::readRDS( system.file("extdata/lacerta_climate_present_10m.rds", package = "tidysdm" ) ) # first we add a topography variable to the climate data climate_present$topography <- climate_present$altitude climate_present$topography <- terra::classify(climate_present$topography, rcl = c(-Inf, 200, 800, Inf), include.lowest = TRUE, brackets = TRUE ) library(terra) levels(climate_present$topography) <- data.frame(ID = c(0, 1, 2), topography = c("plains", "hills", "mountains")) # now we can predict predict_factor <- predict_raster(lacerta_ensemble, climate_present) plot(predict_factor)
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.