ensemble_modelling: Build an ensemble SDM that assembles multiple algorithms

View source: R/ensemble_modelling.R


Build an ensemble SDM that assembles multiple algorithms

Description

Build an ensemble SDM that assembles multiple algorithms for a single species. The function takes as inputs an occurrence data frame made of presence/absence or presence-only records and a raster object for data extraction and projection. The function returns an S4 Ensemble.SDM class object containing the habitat suitability map, the binary map, the between-algorithm variance map and the associated evaluation tables (model evaluation, algorithm evaluation, algorithm correlation matrix and variable importance).

Usage

ensemble_modelling(
  algorithms,
  Occurrences,
  Env,
  Xcol = "Longitude",
  Ycol = "Latitude",
  Pcol = NULL,
  rep = 10,
  name = NULL,
  save = FALSE,
  path = getwd(),
  cores = 0,
  parmode = "replicates",
  PA = NULL,
  cv = "holdout",
  cv.param = c(0.7, 1),
  final.fit.data = "all",
  bin.thresh = "SES",
  metric = NULL,
  thresh = 1001,
  axes.metric = "Pearson",
  uncertainty = TRUE,
  tmp = FALSE,
  SDM.projections = FALSE,
  ensemble.metric = c("AUC"),
  ensemble.thresh = c(0.75),
  weight = TRUE,
  verbose = TRUE,
  GUI = FALSE,
  ...
)

Arguments

algorithms

character. A character vector specifying the algorithm name(s) to be run (see details below).

Occurrences

data frame. Occurrences table (can be processed first by load_occ).

Env

raster object. RasterStack object of environmental variables (can be processed first by load_var).

Xcol

character. Name of the column in the occurrence table containing the Longitude or X coordinates.

Ycol

character. Name of the column in the occurrence table containing the Latitude or Y coordinates.

Pcol

character. Name of the column in the occurrence table specifying whether a record is a presence or an absence: a value of 1 indicates presence and a value of 0 indicates absence. If NULL, a presence-only dataset is assumed.

rep

integer. Number of repetitions for each algorithm.

name

character. Optional name given to the final Ensemble.SDM produced (by default 'Ensemble.SDM').

save

logical. If TRUE, the ensemble SDM is automatically saved.

path

character. If save is TRUE, the path to the directory in which the ensemble SDM will be saved.

cores

integer. Specify the number of CPU cores used to do the computing. You can use detectCores() to automatically detect the number of available cores.

parmode

character. Parallelization mode: along 'algorithms' or 'replicates'. Defaults to 'replicates'.

PA

list(nb, strat) defining the pseudo-absence selection strategy used in the case of a presence-only dataset. If PA is NULL, a recommended PA selection strategy is used depending on the algorithm (see details below).

cv

character. Method of cross-validation used to evaluate the ensemble SDM (see details below).

cv.param

numeric. Parameters associated to the method of cross-validation used to evaluate the ensemble SDM (see details below).

final.fit.data

Strategy used for fitting the final (evaluated) Algorithm.SDMs: 'holdout' uses the same training and test data as in the (last) evaluation, 'all' trains the model with all data (i.e. no test data), and a numeric value between 0 and 1 samples a custom training fraction (the left-out fraction is set aside as test data).

bin.thresh

character. Classification threshold (threshold) used to binarize model predictions into presence/absence and compute the confusion matrix (see details below).

metric

(deprecated) character. Classification threshold (SDMTools::optim.thresh) used to binarize model predictions into presence/absence and compute the confusion matrix (see details below). This argument is kept only for backwards compatibility; if possible, please use bin.thresh instead.

thresh

(deprecated) integer. Number of equally spaced thresholds in the interval 0-1 (SDMTools::optim.thresh). Only needed when metric is set.

axes.metric

Metric used to evaluate variable relative importance (see details below).

uncertainty

logical. If TRUE, generates an uncertainty map and an algorithm correlation matrix.

tmp

logical or character. If FALSE, no temporary rasters are written (this could quickly fill up your working memory if many replicates are modelled). If TRUE, temporary rasters are written to the 'tmp' directory of your R environment. If a character string, temporary rasters are written to that custom path. Beware: if you close R, temporary files will be deleted. To avoid any loss you can save your ensemble SDM with save.model. Depending on the number, resolution and extent of the models, temporary files can take up a lot of disk space.

SDM.projections

logical. If FALSE (default), the Algorithm.SDMs inside the 'sdms' slot will not contain projections (for memory saving purposes).

ensemble.metric

character. Metric(s) used to select the best SDMs that will be included in the ensemble SDM (see details below).

ensemble.thresh

numeric. Threshold(s) associated with the metric(s) used to compute the selection.

weight

logical. If TRUE, SDMs are weighted using the ensemble metric or, alternatively, the mean of the selection metrics.

verbose

logical. If TRUE, allows the function to print text in the console.

GUI

logical. Do not take this argument into account (parameter for the user interface).

...

additional parameters for the algorithm modelling function (see details below).

Details

algorithms

'all' calls all the following algorithms. Algorithms include Generalized linear model (GLM), Generalized additive model (GAM), Multivariate adaptive regression splines (MARS), Generalized boosted regression model (GBM), Classification tree analysis (CTA), Random forest (RF), Maximum entropy (MAXENT), Artificial neural network (ANN), and Support vector machines (SVM). Each algorithm has its own parameters settable with the ... argument (see the corresponding algorithm section below).

"PA"

list with two values: nb, the number of pseudo-absences selected, and strat, the strategy used to select them (either random or disk selection). Defaults follow the recommendations of Barbet-Massin et al. (2012) (see references below).

cv

Cross-validation method used to split the occurrence dataset for evaluation: with holdout, data are partitioned into a training set and an evaluation set using a fraction (cv.param[1]) and the operation can be repeated cv.param[2] times; with k-fold, data are partitioned into k (cv.param[1]) folds, each fold being used once as the evaluation set and k-1 times in the training set, with the operation repeated cv.param[2] times; with LOO (leave-one-out), each point is successively used as the evaluation data. A usage sketch combining these options is given at the end of this section.

metric

Choice of the metric used to compute the binary map threshold and the confusion matrix (by default SES as recommended by Liu et al. (2005), see reference below): Kappa maximizes the Kappa, CCR maximizes the proportion of correctly predicted observations, TSS (True Skill Statistic) maximizes the sum of sensitivity and specificity, SES uses the sensitivity-specificity equality, LW uses the lowest occurrence prediction probability, ROC minimizes the distance between the ROC plot (receiver operating characteristic curve) and the upper left corner (1,1).

axes.metric

Metric used to evaluate the variable relative importance (difference between a full model and one with each variable successively omitted): Pearson (computes a simple Pearson's correlation r between predictions of the full model and the one without a variable, and returns the score 1-r: the higher the value, the more influence the variable has on the model), AUC, Kappa, sensitivity, specificity, and prop.correct (proportion of correctly predicted occurrences).

ensemble.metric

Ensemble metric(s) used to select SDMs: AUC, Kappa, sensitivity, specificity, and prop.correct (proportion of correctly predicted occurrences).

"..."

See algorithm in detail section
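
To see how these options combine in a single call, the sketch below sets a pseudo-absence strategy, a repeated k-fold cross-validation and an AUC/Kappa-based ensemble selection. It uses the 'Env' and 'Occurrences' example datasets shipped with SSDM (see the Examples section); all numeric values are purely illustrative, not recommendations.

library(SSDM)
data(Env)
data(Occurrences)
Occ <- subset(Occurrences, Occurrences$SPECIES == 'elliptica')

ESDM <- ensemble_modelling(c('GLM', 'RF'), Occ, Env, rep = 5,
                           Xcol = 'LONGITUDE', Ycol = 'LATITUDE',  # presence-only (Pcol = NULL)
                           PA = list(nb = 1000, strat = 'random'), # pseudo-absence selection
                           cv = 'k-fold', cv.param = c(5, 2),      # 5 folds, repeated twice
                           ensemble.metric = c('AUC', 'Kappa'),    # SDM selection metrics
                           ensemble.thresh = c(0.75, 0.5),         # one threshold per metric
                           weight = TRUE)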

Value

an S4 Ensemble.SDM class object viewable with the plot.model function.

Generalized linear model (GLM)

Uses the glm function from the package 'stats'. You can set parameters by supplying glm.args=list(arg1=val1,arg2=val2) (see glm for all settable arguments). The following parameters have defaults:

test

character. Test used to evaluate the SDM, default 'AIC'.

control

list (created with glm.control). Contains parameters for controlling the fitting process. Default is glm.control(epsilon = 1e-08, maxit = 500). 'epsilon' is a numeric and defines the positive convergence tolerance (eps). The iterations converge when |dev - dev_old|/(|dev| + 0.1) < eps. 'maxit' is an integer giving the maximal number of IWLS (Iteratively Weighted Least Squares) iterations.
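
For example, these defaults could be overridden through ... roughly as follows; a minimal sketch, assuming Occ and Env hold the single-species occurrence table and environmental stack loaded in the sketch at the end of the Details section.

# glm.control() comes from the base 'stats' package
ESDM <- ensemble_modelling('GLM', Occ, Env,
                           Xcol = 'LONGITUDE', Ycol = 'LATITUDE',
                           glm.args = list(test = 'AIC',
                                           control = glm.control(epsilon = 1e-08,
                                                                 maxit = 500)))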

Generalized additive model (GAM)

Uses the gam function from the package 'mgcv'. You can set parameters by supplying gam.args=list(arg1=val1,arg2=val2) (see gam for all settable arguments). The following parameters have defaults:

test

character. Test used to evaluate the model, default 'AIC'.

control

list (created with gam.control). Contains parameters for controlling the fitting process. Default is gam.control(epsilon = 1e-08, maxit = 500). 'epsilon' is a numeric used for judging the conversion of the GLM IRLS (Iteratively Reweighted Least Squares) loop. 'maxit' is an integer giving the maximum number of IRLS iterations to perform.
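
A hedged sketch of overriding these defaults (Occ and Env as loaded in the sketch at the end of the Details section; mgcv must be installed, and the maxit value is illustrative):

# gam.control() is from the 'mgcv' package
ESDM <- ensemble_modelling('GAM', Occ, Env,
                           Xcol = 'LONGITUDE', Ycol = 'LATITUDE',
                           gam.args = list(test = 'AIC',
                                           control = mgcv::gam.control(maxit = 500)))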

Multivariate adaptive regression splines (MARS)

Uses the earth function from the package 'earth'. You can set parameters by supplying mars.args=list(arg1=val1,arg2=val2) (see earth for all settable arguments). The following parameters have defaults:

degree

integer. Maximum degree of interaction (Friedman's mi); 1 means build an additive model (i.e., no interaction terms). By default, set to 2.
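
For instance, an additive MARS model (no interaction terms) could be requested like this; a sketch, with Occ and Env as loaded in the sketch at the end of the Details section.

ESDM <- ensemble_modelling('MARS', Occ, Env,
                           Xcol = 'LONGITUDE', Ycol = 'LATITUDE',
                           mars.args = list(degree = 1))  # additive model, no interactions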

Generalized boosted regression model (GBM)

Uses the gbm function from the package 'gbm'. You can set parameters by supplying gbm.args=list(arg1=val1,arg2=val2) (see gbm for all settable arguments). The following parameters have defaults:

distribution

character. Automatically detected from the format of the presence column in the occurrence dataset.

n.trees

integer. The total number of trees to fit. This is equivalent to the number of iterations and the number of basis functions in the additive expansion. By default, set to 2500.

n.minobsinnode

integer. Minimum number of observations in the trees' terminal nodes. Note that this is the actual number of observations, not the total weight. By default, set to 1.

cv.folds

integer. Number of cross-validation folds to perform. If cv.folds>1 then gbm - in addition to the usual fit - will perform a cross-validation. By default, set to 3.

shrinkage

numeric. A shrinkage parameter applied to each tree in the expansion (also known as learning rate or step-size reduction). By default, set to 0.001.

bag.fraction

numeric. Fraction of the training set observations randomly selected to propose the next tree in the expansion.

train.fraction

numeric. Training fraction used to fit the first gbm. The remainder is used to compute out-of-sample estimates of the loss function. By default, set to 1 (since evaluation/holdout is done with SSDM::evaluate).

n.cores

integer. Number of cores to use for parallel computation of the CV folds. By default, set to 1. If you intend to use this, please set cores = 0 in ensemble_modelling to avoid conflicts.
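
A sketch overriding some of these defaults (Occ and Env as loaded in the sketch at the end of the Details section; the values are illustrative, and n.cores is left at 1 to avoid the conflict mentioned above):

ESDM <- ensemble_modelling('GBM', Occ, Env,
                           Xcol = 'LONGITUDE', Ycol = 'LATITUDE',
                           gbm.args = list(n.trees = 1000, shrinkage = 0.01,
                                           cv.folds = 3, n.cores = 1))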

Classification tree analysis (CTA)

Uses the rpart function from the package 'rpart'. You can set parameters by supplying cta.args=list(arg1=val1,arg2=val2) (see rpart for all settable arguments). The following parameters have defaults:

control

list (created with rpart.control). Contains parameters for controlling the rpart fit. The default is rpart.control(minbucket = 1, xval = 3). 'minbucket' is an integer giving the minimum number of observations in any terminal node. 'xval' is an integer defining the number of cross-validations.
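
For example (Occ and Env as loaded in the sketch at the end of the Details section; rpart.control() is from the 'rpart' package and the values are illustrative):

ESDM <- ensemble_modelling('CTA', Occ, Env,
                           Xcol = 'LONGITUDE', Ycol = 'LATITUDE',
                           cta.args = list(control = rpart::rpart.control(minbucket = 5,
                                                                          xval = 5)))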

Random Forest (RF)

Uses the randomForest function from the package 'randomForest'. You can set parameters by supplying rf.args=list(arg1=val1,arg2=val2) (see randomForest for all settable arguments). The following parameters have defaults:

ntree

integer. Number of trees to grow. This should not be set to too small a number, to ensure that every input row gets predicted at least a few times. By default, set to 2500.

nodesize

integer. Minimum size of terminal nodes. Setting this number larger causes smaller trees to be grown (and thus take less time). By default, set to 1.
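
A sketch of overriding these defaults, assuming the Random Forest arguments follow the same naming pattern as the other algorithms (rf.args); Occ and Env as loaded in the sketch at the end of the Details section, values illustrative:

ESDM <- ensemble_modelling('RF', Occ, Env,
                           Xcol = 'LONGITUDE', Ycol = 'LATITUDE',
                           rf.args = list(ntree = 1000, nodesize = 5))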

Maximum Entropy (MAXENT)

Uses the maxent function from the package 'dismo'. Make sure that you have correctly installed the maxent.jar file (available at https://biodiversityinformatics.amnh.org/open_source/maxent/) in the folder ~\R\library\version\dismo\java. As with the other algorithms, you can set parameters by supplying maxent.args=list(arg1=val1,arg2=val2). Mind that arguments are passed from dismo to the MAXENT software again as an argument list (see maxent for more details). No specific defaults are set with this method.
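
A hedged sketch, assuming maxent.jar is installed as described above and that additional MaxEnt flags are forwarded through dismo::maxent()'s 'args' argument (the flag shown is illustrative); Occ and Env as loaded in the sketch at the end of the Details section:

ESDM <- ensemble_modelling('MAXENT', Occ, Env,
                           Xcol = 'LONGITUDE', Ycol = 'LATITUDE',
                           maxent.args = list(args = c('betamultiplier=1.5')))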

Artificial Neural Network (ANN)

Uses the nnet function from the package 'nnet'. You can set parameters by supplying ann.args=list(arg1=val1,arg2=val2) (see nnet for all settable arguments). The following parameters have defaults:

size

integer. Number of units in the hidden layer. By default, set to 6.

maxit

integer. Maximum number of iterations, default 500.
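
For example (Occ and Env as loaded in the sketch at the end of the Details section; the values are illustrative):

ESDM <- ensemble_modelling('ANN', Occ, Env,
                           Xcol = 'LONGITUDE', Ycol = 'LATITUDE',
                           ann.args = list(size = 10, maxit = 1000))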

Support vector machines (SVM)

Uses the svm function from the package 'e1071'. You can set parameters by supplying svm.args=list(arg1=val1,arg2=val2) (see svm for all settable arguments). The following parameters have defaults:

type

character. Regression/classification type SVM should be used with. By default, set to "eps-regression".

epsilon

numeric. Epsilon parameter in the insensitive loss function, default 1e-08.

cross

integer. If an integer value k>0 is specified, a k-fold cross-validation on the training data is performed to assess the quality of the model: the accuracy rate for classification and the Mean Squared Error for regression. By default, set to 3.

kernel

character. The kernel used in training and predicting. By default, set to "radial".

gamma

numeric. Parameter needed for all kernels, default 1/(length(data) -1).
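
For example, the documented defaults written out explicitly (Occ and Env as loaded in the sketch at the end of the Details section):

ESDM <- ensemble_modelling('SVM', Occ, Env,
                           Xcol = 'LONGITUDE', Ycol = 'LATITUDE',
                           svm.args = list(type = 'eps-regression', kernel = 'radial',
                                           epsilon = 1e-08, cross = 3))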

Warning

Depending on the raster object resolution, the process can be more or less time- and memory-consuming.

References

M. Barbet-Massin, F. Jiguet, C. H. Albert & W. Thuiller (2012) "Selecting pseudo-absences for species distribution models: how, where and how many?" Methods in Ecology and Evolution 3:327-338 http://onlinelibrary.wiley.com/doi/10.1111/j.2041-210X.2011.00172.x/full

C. Liu, P. M. Berry, T. P. Dawson & R. G. Pearson (2005) "Selecting thresholds of occurrence in the prediction of species distributions." Ecography 28:385-393 http://www.researchgate.net/publication/230246974_Selecting_Thresholds_of_Occurrence_in_the_Prediction_of_Species_Distributions

See Also

modelling to build SDMs with a single algorithm, stack_modelling to build SSDMs.

Examples

## Not run: 
# Loading data
data(Env)
data(Occurrences)
Occurrences <- subset(Occurrences, Occurrences$SPECIES == 'elliptica')

# ensemble SDM building
ESDM <- ensemble_modelling(c('CTA', 'MARS'), Occurrences, Env, rep = 1,
                          Xcol = 'LONGITUDE', Ycol = 'LATITUDE',
                          ensemble.thresh = c(0.6))

# Results plotting
plot(ESDM)

## End(Not run)

