4. Fit and Explore Selected Models"

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 6
)

Summary


Description

After the best performing models have been selected, users need to fit this models (using fit_selected()) in order to explore their characteristics and continue with the next steps. Fitted models can then be used to assess variable importance in models, as well as to explore variable response curves. Selected models can also be evaluated using independent records that were not used during calibration. This vignette contains examples to explore the multiple options available to fit and explore selected models.


Getting ready

At this point it is assumed that kuenm2 is installed (if not, see the Main guide). Load kuenm2 and any other required packages, and define a working directory (if needed).

Note: functions from other packages (i.e., not from base R or kuenm2) used in this guide will be displayed as package::function().

# Load packages
library(kuenm2)
library(terra)

# Current directory
getwd()

# Define new directory
#setwd("YOUR/DIRECTORY")  # uncomment and modify if setting a new directory

# Saving original plotting parameters
original_par <- par(no.readonly = TRUE)


Fitting selected models

To fit the selected models, we need a calibration_results object. For more details in model calibration, please refer to the vignette Model Calibration. The calibration_results object generated in this vignette is available as a data example in the package. Let's load it.

# Import an example of calibration results 
data("calib_results_maxnet", package = "kuenm2")

# Print calibration result
calib_results_maxnet


This object contains the results of candidate models calibrated using the maxnet algorithm. The package also provides a similar example wit models created using the glm algorithm. See below how to load the calib_results_glm object in case you want to explore it.

# Import calib_results_glm
data("calib_results_glm", package = "kuenm2")

# Print calibration result
calib_results_glm


Note that the calibration_results object stores only the information related to the calibration process and model evaluation, it does not include the fitted maxnet (or glm) models.

To obtain fitted models, we need to use the fit_selected() function. By default, this function fits a full model (i.e., without replicates and without splitting the data into training and testing sets). However, you can configure it to fit final models with replicates if desired.

In this example, we'll fit the final models with 4 k-fold replicates (leaving one fold out), as in the Model Calibration vignette.

# Fit models using calibration results
fm <- fit_selected(calibration_results = calib_results_maxnet, 
                   replicate_method = "kfolds", n_replicates = 4, 
                   verbose = FALSE, progress_bar = FALSE)
# Fit selected models using calibration results
fm <- fit_selected(calibration_results = calib_results_maxnet, 
                   replicate_method = "kfolds", n_replicates = 4)
# Fitting replicates...
#   |========================================================================| 100%
# Fitting full models...
#   |========================================================================| 100%


The fit_selected() function returns a fitted_models object, a list that contains essential information from the fitted models, which is required for the subsequent steps.

You can explore the contents of the fitted_models object by indexing its elements. For example, the fitted maxnet (or glm) model objects are stored within the Models element. Note that Models is a nested list: for each selected model (in this case, models 192 and 219), it includes both the replicates (if fitted with replicates) and the full model.

# See names of selected models
names(fm$Models)

# See models inside Model 192
names(fm$Models$Model_192)


The fitted_models object also stores the thresholds that can be used to binarize the models into suitable and unsuitable areas. These thresholds correspond to the omission error used during model selection (e.g., 5% or 10%).

You can access the omission error used to calculate the thresholds directly from the object:

# Get omission error used to select models and calculate the theshold values
fm$omission_rate


The omission error used to calculate the thresholds was 10%, meaning that when the predictions are binarized, approximately 10% of the presence records used to calibrate the models will fall into cells with predicted values below the threshold.

The thresholds are summarized in two ways: the mean and median across replicates for each model, and the consensus mean and median across all selected models (when more than one model is selected):

fm$thresholds


Response curves

The selected models in the fitted_models object can be used to generate response curves. Individual variable response curves illustrate how each environmental variables influences suitability, while keeping all other variables constant.

By default, the curves are generated with all other variables set to their mean values (or the mode for categorical variables). The mean or mode are calculated from the combined set of presence and background data (averages_from = "pr_bg"). You can change this behavior to use only the presence localities by setting averages_from = "pr".


All response curves

An easy way to explore response curves for all variables is as follows:

# Produce response curves for all variables in all fitted models
all_response_curves(fm)


In the previous plot, the dashed lines indicate the range of variables values in the data used to fit models. If multiple models were fitted, variability is shown via a Generalized Additive Model (GAM) using the mean and the 95% confidence interval. For a more detailed view of what the response curve for each of the models fitted look like, the argument show_lines can be set to TRUE.

# All response curves showing each model as a different line
all_response_curves(fm, show_lines = TRUE)


To explore response curves of all variables for each model independently, the argument model_ID can be used. The plot will show variable response curves for the full model.

# All response curves for model 219
all_response_curves(fm, modelID = "Model_219")


To get a view of variability in response curves for a model produced with replicates, use show_variability = TRUE. Both, the GAM or the multiple-line approaches can be used to show variability.

# All response curves for model 219 (GAM for variability)
all_response_curves(fm, modelID = "Model_219", show_variability = TRUE)

# All response curves for model 219 (each curve is a replicate)
all_response_curves(fm, modelID = "Model_219", show_variability = TRUE, 
                    show_lines = TRUE)


Customized response curves

The previous plots help to gain a quick understanding of response curves for fitted models, but little can be done make each panel look better. To produce nicer plots, response curves for each variable at a time can be produced. Let's check which variables are available to plot by examining the coefficients of the full models:

# Get variables with non-zero coefficients in the models
fm$Models[[1]]$Full_model$betas  # From the first model selected
fm$Models[[2]]$Full_model$betas  # From the second model selected


The variables bio_1, bio_7, bio_12 and bio_15have non-zero coefficient values, which means they contribute to the model and are available for generating response curves.

Remember that response curves are computed using all selected models. This time let's change the margins and labels for each plot

par(mfrow = c(2, 2), mar = c(4, 4, 1, 0.5))  # Set grid of plot
response_curve(models = fm, variable = "bio_1", las = 1)
response_curve(models = fm, variable = "bio_7", ylab = "", las = 1)
response_curve(models = fm, variable = "bio_12", las = 1)
response_curve(models = fm, variable = "bio_15", ylab = "", las = 1)


We can also specify which of the selected models should be used to generate the response curves:

par(mfrow = c(2, 2), mar = c(4, 4, 2.5, 0.5))  # Set grid of plot
response_curve(models = fm, variable = "bio_1", modelID = "Model_192", 
               main = "Model_192", las = 1)
response_curve(models = fm, variable = "bio_1", modelID = "Model_219", 
               main = "Model_219", ylab = "", las = 1)
response_curve(models = fm, variable = "bio_7", modelID = "Model_192", 
               main = "Model_192", las = 1)
response_curve(models = fm, variable = "bio_7", modelID = "Model_219", 
               main = "Model_219", ylab = "", las = 1)


By default, the response curve extends beyond the range of model fitting limits (dashed lines) based on the observed range of values and an extrapolation_factor (when extrapolation = TRUE). The default extrapolation factor is set to 10% of the range.

If extrapolation = FALSE, no extrapolation occurs, and the plot limits match the fitting data range exactly.

We can increase the extrapolation factor to allow a broader range beyond the observed data. Below is the response curve plotted with an extrapolation factor of 2:

par(mfrow = c(2, 2), mar = c(4, 4, 1, 0.5))  # Set grid of plot
response_curve(models = fm, variable = "bio_1", extrapolation_factor = 2, 
               las = 1)
response_curve(models = fm, variable = "bio_7", extrapolation_factor = 2,
               ylab = "", las = 1)
response_curve(models = fm, variable = "bio_12", extrapolation_factor = 2,
               las = 1)
response_curve(models = fm, variable = "bio_15", extrapolation_factor = 2,
               ylab = "", las = 1)


Note that the response curve now extends further beyond the observed data range. Optionally, we can manually set the lower and upper limits of the variables. For example, since bio_12 represents annual precipitation and negative values are not realistic, we can set its lower limit to 0:

response_curve(models = fm, variable = "bio_12", extrapolation_factor = 0.1, 
               l_limit = 0)


Now, the lower limit of the plot for bio_12 is set to 0. Since we did not specify an upper limit, the plot uses the extrapolation factor (here, 0.1) to define the upper limit.

The other aspect to notice after increasing the extrapolation factor is that the curves looked like they were not perfectly unimodal. Unfortunately, using a GAM to summarize and represent variability can some times generate this effect. However, plotting responses for each model as independent curves will show the real shape of the curve, see the differences after setting show_lines = TRUE.

par(mfrow = c(1, 2), mar = c(4, 4, 1, 0.5))  # Set grid of plot
response_curve(models = fm, variable = "bio_1", extrapolation_factor = 0.5, 
               las = 1)
response_curve(models = fm, variable = "bio_1", extrapolation_factor = 0.5, 
               show_lines = TRUE, ylab = "", las = 1)


Optionally, we can add the presence and background points used to fit the models to the response curve plot by setting add_points = TRUE:

response_curve(models = fm, variable = "bio_1", show_lines = TRUE,
               add_points = TRUE)


Bivariate response curves

All previous curve plots are showing responses to individual variables. Exploring responses to two variables at a time can help understand the small differences between models fitted with distinct terms. Let's see an example below:

# First, let's check the terms in the models fitted
## Model 192
fm$Models$Model_192$Full_model$betas

## Model 219
fm$Models$Model_219$Full_model$betas

# Example of a bivariate response plot 
bivariate_response(models = fm, variable1 = "bio_1", variable2 = "bio_15", 
                   modelID = "Model_192")


In the previous plot, the dashed lines represent the limits of the data used for model fitting, and darker colors represent higher suitability. Let's now compare the two models and check if there are any differences. Multiple bivariate response plots can be put together in a single plot if the suitability bar legend is not used.

par(mfrow = c(1, 2), mar = c(4, 4, 2.5, 0.5))

# Bivariate response model 192 
bivariate_response(models = fm, variable1 = "bio_1", variable2 = "bio_15", 
                   modelID = "Model_192", add_bar = FALSE, main = "Model 192")

# Bivariate response model 219 
bivariate_response(models = fm, variable1 = "bio_1", variable2 = "bio_15", 
                   modelID = "Model_219", add_bar = FALSE, main = "Model 219", 
                   ylab = "")


These two models look very similar, the only noticeable difference is that suitability values above the threshold extend farther on variable bio_15 for model 219. More noticeable effects can be seen in bivariate representations when fitted models include or not product terms. However, as in this example, even a small variation on the number or type of terms included in a model can change how predictions look like.


Variable importance

The relative importance of model predictors can be calculated exploring the deviance using the var_importance() function. This process starts by fitting the full model (maxnet or glm), which includes all predictors. Then, the function fits separate models excluding one predictor at a time, and quantifies how the removal affects model performance. The values of contribution are computed for variables by default, but explorations by terms are also possible using the argument by_terms (i.e., if the model includes linear and quadratic terms of a variable, the process will run for each of them separately).

By default, the function runs on a single core. You can enable parallel processing by setting parallel = TRUE and specifying the number of cores with ncores. Note that parallelization only speeds up the computation when there are many variables (e.g., >17) and the calibration dataset is large (e.g., over 15,000 total points).

Variable importance is computed for all selected models by default:

# Calculate variable importance
imp <- variable_importance(models = fm, progress_bar = FALSE, verbose = FALSE)
# Calculate variable importance
imp <- variable_importance(models = fm)

# Calculating variable contribution for model 1 of 2
#   |======================================================================| 100%
# Calculating variable contribution for model 2 of 2
#   |======================================================================| 100%


The function returns a data.frame with the relative contribution of each variable. If multiple models are included in the fitted object, an additional column identifies each model.

imp


We can visualize variable importance using the plot_importance() function. When the fitted_models object contains more than one selected model, contributions are displayed as a boxplot, along with the mean contribution, and the number (N) of fitted models with that predictor.

plot_importance(imp)


If variable importance is computed for a single model, the plot displays a bars instead of boxes:

# Calculate variable importance for a specific selected Model
imp_192 <- variable_importance(models = fm, modelID = "Model_192", 
                               progress_bar = FALSE)

# Plot variable contribution for model 192
plot_importance(imp_192, main = "Variable importance: Model 192")


The previous examples show how important are variables by checking the change in models if we remove a variable completely from the equation. To compute contribution of all variable terms (i.e., how important is to keep distinct terms of the same variable) set by_terms = TRUE:

# Calculate variable importance for a specific selected Model
imp_terms <- variable_importance(models = fm, by_terms = TRUE, 
                                 progress_bar = FALSE)

# Check results
imp_terms

# Plot variable contribution for model 192
par(cex = 0.7, mar = c(3, 4, 2.5, 0.5))  # Set grid of plot
plot_importance(imp_terms, main = "Importance of variable terms")


Evaluation with independent data

Selected models can be evaluated using an independent set of presence records (not used when fitting selected models). This approach is especially useful when new records become available. This can be useful to assess models for invasive species, for which model fitting is usually done in native areas with transfers to potential invasive areas.

The independent_evaluation() function computes omission rates (i.e., the proportion of independent records predicted as below the suitability threshold) and partial ROC. The function also assesses whether environmental conditions in independent data are analogous those under which models were fit using the mobility oriented-parity metric (MOP; Cobos et al. 2024).

Let's use the new_occ as an example of independent data (provided in the package). This dataset contains coordinates of Myrcia hatschbachii sourced from NeotropicTree (Oliveira-Filho, 2017), and they were not part of the data used to fit the models.

# Import independent records
data("new_occ", package = "kuenm2")

# See data structure
str(new_occ)


In order to evaluate predictive ability of models and how analogous conditions in independent data are to fitting conditions, environmental values need to be extracted to these new records. Let's import the raster variables and extract values for new_occ:

# Import raster layers
var <- terra::rast(system.file("extdata", "Current_variables.tif",
                               package = "kuenm2"))

# Extract variables to occurrences
new_data <- extract_occurrence_variables(occ = new_occ, x = "x", y = "y",
                                         raster_variables = var)

# See data structure
str(new_data)


Finding non-analogous conditions in independent records is not uncommon, especially in regions outside model calibration areas. To better illustrate this case, let's add three fake records, in which some variables have non-analogous values, either higher than the upper limit or lower than the lower limit observed in the data used to fit models.

# Add some fake data beyond the limits of calibration ranges
fake_data <- data.frame("pr_bg" = c(1, 1, 1),
                        "x" = c(NA, NA, NA),
                        "y" = c(NA, NA, NA),
                        "bio_1" = c(10, 15, 23),
                        "bio_7" = c(12, 16, 20),
                        "bio_12" = c(2300, 2000, 1000),
                        "bio_15" = c(30, 40, 50),
                        "SoilType" = c(1, 1, 1))

# Bind data
new_data <- rbind(new_data, fake_data)


Now, let's evaluate models with this independent dataset (keep in mind that the last three records are fake):

# Evaluate models with independent data
res_ind <- independent_evaluation(fitted_models = fm, new_data = new_data)


The output is a list with three elements:

  1. evaluation metrics (omission rate and pROC) for each selected model, as well as for the overall consensus.
  2. mop_results, a list containing the output of MOP analysis comparing conditions in independent data and those used to fit models. This happens because the perform_mop argument is set to TRUE (the default).
  3. predictions of continuous and binary suitability for each of the independent records (the default; return_predictions = TRUE). In addition, MOP results are included for all the records.
res_ind$evaluation


All selected models have significant pROC values, but show higher omission rates than expected based on the 10% omission threshold used for calibration (~35% of the independent records are predicted as unsuitable) .

For each records, the following MOP results are provided:

# Show the mop results for the last 5 independent records
res_ind$predictions$continuous[81:85 , c("mop_distance", "inside_range", 
                                         "n_var_out", "towards_low", 
                                         "towards_high")]


Note that two of the three fake records we added to new_data have non-analogous environmental conditions. One of them falls in a location where bio_7 and bio_1 have values lower than the minimum observed in the model fitting data, whereas bio_12 has a value higher than the maximum. Another record is in a location where bio_12 is below the model fitting range, and bio_1 exceeds the upper limit.


When we set return_predictions = TRUE (the default), the function also returns the predictions for each selected model and for the general consensus:

# Show the continuous predictions for the last 5 independent records
# Round to two decimal places
round(res_ind$predictions$continuous[81:85, 1:6], 2)

# Show the binary predictions for the last 5 independent records
res_ind$predictions$binary[81:85, 1:6]


These results can help determine whether independent data should be incorporated into the calibration dataset to re-run models. If omission rates for independent records are too high, pROC values are non-significant, or most of the records show non-analogous environmental conditions, it might be a good idea to re-run the models including independent records.


# Reset plotting parameters
par(original_par) 


Saving a fitted_models object

After fitting the best-performing models with fit_selected(), we can proceed to predict the models for single or multiple scenarios. As this object is essentially a list, users can save it to a local directory using saveRDS(). Saving the object facilitates loading it back into your R session later using readRDS(). See an example below:

# Set directory to save (here, in a temporary directory)
dir_to_save <- file.path(tempdir())

# Save the data:
saveRDS(fm, file.path(dir_to_save, "fitted_models.rds"))

# Import data
fm <- readRDS(file.path(dir_to_save, "fitted_models.rds"))


Try the kuenm2 package in your browser

Any scripts or data that you put into this service are public.

kuenm2 documentation built on April 21, 2026, 1:07 a.m.