variable_analysis: Function to evaluate relative importance of each variable.

View source: R/variable_analysis.R

variable_analysis    R Documentation

Description

Evaluate the relative importance of each variable within the model using the following methods:

  • Jackknife test based on the AUC ratio and on the Pearson correlation with the result of the model using all variables

  • SHapley Additive exPlanations (SHAP) according to Shapley values

Usage

variable_analysis(
  model,
  pts_occ,
  pts_occ_test = NULL,
  variables,
  shap_nsim = 100,
  visualize = FALSE,
  seed = 10
)

Arguments

model

(isolation_forest) The extended isolation forest SDM. It can be the model item of the POIsotree object returned by function isotree_po.

pts_occ

(sf) The sf-style table that includes the training occurrence locations.

pts_occ_test

(sf, or NULL) The sf-style table that includes the test occurrence locations. If NULL, it is set to the same as the training occurrences (pts_occ). The default is NULL.

variables

(stars) The stars object of environmental variables. It should store the variables as multiple attributes rather than as a dimension. If you have a raster object instead, you can use st_as_stars to convert it to stars, or use read_stars to read the source data directly as stars.
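
The difference between a band dimension and multiple attributes can be illustrated with the sample GeoTIFF shipped with the stars package. This is a minimal sketch, assuming stars is installed; the object names `env_dims` and `env_attr` are hypothetical, and the sample file is not an itsdm dataset.

```r
library(stars)

# Sample multi-band GeoTIFF distributed with stars (not an itsdm dataset)
tif <- system.file("tif/L7_ETMs.tif", package = "stars")

# read_stars() returns one attribute with dimensions x, y and band
env_dims <- read_stars(tif)

# split() turns the band dimension into one attribute per band
env_attr <- split(env_dims, "band")

print(names(env_attr))
print(dim(env_attr))
```

After the split, each band is a separate attribute on a two-dimensional (x, y) grid, which is the layout described for this argument.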

shap_nsim

(integer) The number of Monte Carlo repetitions used by the SHAP method to estimate each Shapley value. See the documentation of function explain in package fastshap for details.

visualize

(logical) If TRUE, plot the analysis figures. The default is FALSE.

seed

(integer) The seed for any random process. The default is 10L.

Details

The jackknife test measures variable importance as the decrease in model performance when an environmental variable is used alone or is excluded from the pool of environmental variables. This function uses Pearson correlation and AUC ratio as the performance measures.

Pearson correlation is computed between the predictions of each jackknife model (built with a single variable, or with one variable excluded) and the predictions of the full model, and serves as the assessment of model performance.
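
The jackknife logic with Pearson correlation can be sketched in base R. This is only an illustration under stated assumptions: a linear model on simulated data stands in for the isolation forest, and the names `env`, `fit_predict`, `cor_only` and `cor_without` are hypothetical, not part of itsdm.

```r
set.seed(10)
n <- 200
env <- data.frame(temp = rnorm(n), prec = rnorm(n), elev = rnorm(n))
# Simulated response: driven mostly by temp, weakly by prec, not by elev
env$suit <- 2 * env$temp + 0.5 * env$prec + rnorm(n, sd = 0.1)

vars <- c("temp", "prec", "elev")

# Fit a stand-in model on a subset of variables and return its predictions
fit_predict <- function(use_vars) {
  fml <- reformulate(use_vars, response = "suit")
  unname(predict(lm(fml, data = env)))
}

# Reference predictions from the model with all variables
pred_full <- fit_predict(vars)

# For each variable: correlation with the full model when the variable
# is used alone, and when it is the only one left out
jackknife <- do.call(rbind, lapply(vars, function(v) {
  data.frame(
    variable = v,
    cor_only = cor(pred_full, fit_predict(v)),
    cor_without = cor(pred_full, fit_predict(setdiff(vars, v)))
  )
}))
print(jackknife)
```

An important variable shows a high correlation when used alone and a marked drop in correlation when excluded.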

The area under the ROC curve (AUC) is a threshold-independent measure of model performance that requires both presence and absence data. A ROC curve plots the proportion of correctly predicted presences (y-axis) against 1 minus the proportion of correctly predicted absences (x-axis) for all thresholds. Several approaches have been used to evaluate the accuracy of presence-only models. Peterson et al. (2008) modified the AUC by plotting the proportion of presences falling above a range of thresholds against the proportion of cells of the whole area falling above the same thresholds. This is the so-called AUC ratio used in this package.
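
The idea behind the AUC ratio can be sketched in base R as follows. This is a simplified illustration, not the package's implementation: the suitability values are simulated, the function name `auc_ratio` is hypothetical, and the ratio is taken against the null expectation of 0.5 over the full range of thresholds.

```r
set.seed(10)
# Hypothetical suitability values for all cells of the study area and for
# the cells with observed presences (which score higher on average)
suit_area <- runif(5000)
suit_presence <- rbeta(200, shape1 = 5, shape2 = 2)

auc_ratio <- function(presence, area) {
  thresholds <- sort(unique(c(presence, area)), decreasing = TRUE)
  # y: proportion of presences at or above each threshold
  y <- c(0, vapply(thresholds, function(th) mean(presence >= th), numeric(1)))
  # x: proportion of the whole area at or above each threshold
  x <- c(0, vapply(thresholds, function(th) mean(area >= th), numeric(1)))
  # Trapezoidal area under the curve from (0, 0) to (1, 1)
  auc <- sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
  # Ratio to the null expectation (the diagonal, AUC = 0.5)
  auc / 0.5
}

ratio_model <- auc_ratio(suit_presence, suit_area)
# Presences drawn at random from the area should give a ratio close to 1
ratio_null <- auc_ratio(runif(200), suit_area)
print(c(model = ratio_model, null = ratio_null))
```

A ratio clearly above 1 indicates that presences are concentrated in cells the model rates as more suitable than the area as a whole.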

SHapley Additive exPlanations (SHAP) uses Shapley values to evaluate variable importance. The larger the absolute Shapley value, the more important the variable is. Positive Shapley values indicate a positive effect, while negative Shapley values indicate a negative effect. See the references for more details.
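
The role of the Monte Carlo repetitions (shap_nsim) can be illustrated with a generic sampling estimator of Shapley values in base R. This is a sketch under stated assumptions, not the fastshap implementation: `predict_fun` is a hypothetical linear stand-in for the SDM, and `shapley_mc` and the data are invented for the example.

```r
set.seed(10)
n <- 500
X <- data.frame(temp = rnorm(n), prec = rnorm(n), elev = rnorm(n))

# Hypothetical prediction function standing in for the fitted model
predict_fun <- function(d) 2 * d$temp - 1 * d$prec + 0 * d$elev

# Monte Carlo estimate of the Shapley value of `feature` for one row `x`
shapley_mc <- function(x, feature, X, predict_fun, nsim = 100) {
  vars <- names(X)
  contrib <- numeric(nsim)
  for (i in seq_len(nsim)) {
    z <- X[sample(nrow(X), 1), , drop = FALSE]  # random background row
    ord <- sample(vars)                         # random feature ordering
    pos <- match(feature, ord)
    from_x <- ord[seq_len(pos)]                 # features taken from x
    with_f <- z
    with_f[from_x] <- x[from_x]
    without_f <- with_f
    without_f[feature] <- z[feature]            # target feature reverted to z
    contrib[i] <- predict_fun(with_f) - predict_fun(without_f)
  }
  mean(contrib)
}

# One observation to explain
x <- data.frame(temp = 1.5, prec = 1, elev = -2)
phi <- sapply(names(X), function(f) shapley_mc(x, f, X, predict_fun, nsim = 100))
print(phi)
```

In this toy setting the sign of each estimate follows the direction of the variable's effect, and a variable the model ignores gets a Shapley value of zero. A larger nsim reduces the Monte Carlo noise of each estimate at the cost of more predictions.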

Value

(VariableAnalysis) A list of

  • variables (vector of character) The names of environmental variables

  • pearson_correlation (tibble) A table of jackknife test results based on Pearson correlation

  • full_AUC_ratio (tibble) A table of the AUC ratios of the training and test datasets using all variables, which act as references for the jackknife test

  • AUC_ratio (tibble) A table of jackknife test results based on AUC ratio

  • SHAP (tibble) A table of Shapley values for the training and test datasets separately

References

See Also

plot.VariableAnalysis, print.VariableAnalysis, explain in fastshap

Examples


# Using a pseudo presence-only occurrence dataset of
# virtual species provided in this package
library(dplyr)
library(sf)
library(stars)
library(itsdm)

data("occ_virtual_species")
obs_df <- occ_virtual_species %>% filter(usage == "train")
eval_df <- occ_virtual_species %>% filter(usage == "eval")
x_col <- "x"
y_col <- "y"
obs_col <- "observation"

# Format the observations
obs_train_eval <- format_observation(
  obs_df = obs_df, eval_df = eval_df,
  x_col = x_col, y_col = y_col, obs_col = obs_col,
  obs_type = "presence_only")

env_vars <- system.file(
  'extdata/bioclim_tanzania_10min.tif',
  package = 'itsdm') %>% read_stars() %>%
  slice('band', c(1, 5, 12, 16))

# Fit the model with imperfect_presence mode
mod <- isotree_po(
  obs_mode = "imperfect_presence",
  obs = obs_train_eval$obs,
  obs_ind_eval = obs_train_eval$eval,
  variables = env_vars, ntrees = 10,
  sample_size = 0.8, ndim = 2L,
  seed = 123L, nthreads = 1,
  response = FALSE,
  spatial_response = FALSE,
  check_variable = FALSE)

# Evaluate variable importance and plot the results
var_analysis <- variable_analysis(
  model = mod$model,
  pts_occ = mod$observation,
  pts_occ_test = mod$independent_test,
  variables = mod$variables)
plot(var_analysis)



itsdm documentation built on July 9, 2023, 6:45 p.m.