modelling: Build an SDM using a single algorithm


Description

This is a function to build an SDM with one algorithm for a single species. The function takes as inputs an occurrence data frame made of presence/absence or presence-only records and a raster object for data extraction and projection. The function returns an S4 Algorithm.SDM class object containing the habitat suitability map, the binary map and the evaluation table.

Usage

modelling(algorithm, Occurrences, Env, Xcol = "Longitude",
  Ycol = "Latitude", Pcol = NULL, name = NULL, PA = NULL,
  cv = "holdout", cv.param = c(0.7, 2), thresh = 1001,
  metric = "SES", axes.metric = "Pearson", select = FALSE,
  select.metric = c("AUC"), select.thresh = c(0.75), verbose = TRUE,
  GUI = FALSE, folder_tmp = NULL, ...)

Arguments

algorithm

character. Choice of the algorithm to be run (see details below).

Occurrences

data frame. Occurrence table (can be processed first by load_occ).

Env

raster object. Raster object of environmental variable (can be processed first by load_var).

Xcol

character. Name of the column in the occurrence table containing Longitude or X coordinates.

Ycol

character. Name of the column in the occurrence table containing Latitude or Y coordinates.

Pcol

character. Name of the column in the occurrence table specifying whether a row is a presence or an absence. A value of 1 is a presence and a value of 0 is an absence. If NULL, a presence-only dataset is assumed.

name

character. Optional name given to the final SDM produced (by default 'Algorithm.SDM').

PA

list(nb, strat) defining the pseudo-absence selection strategy used for a presence-only dataset. If PA is NULL, the recommended pseudo-absence selection strategy for the chosen algorithm is used (see details below).

cv

character. Method of cross-validation used to evaluate the SDM (see details below).

cv.param

numeric. Parameters associated to the method of cross-validation used to evaluate the SDM (see details below).

thresh

numeric. A single integer value representing the number of equal interval threshold values between 0 and 1 (see optim.thresh).

metric

character. Metric used to compute the binary map threshold (see details below).

axes.metric

character. Metric used to evaluate variable relative importance (see details below).

select

logical. If set to TRUE, models are evaluated before being projected and are discarded if they do not meet the selection criteria (see details below).

select.metric

character. Metric(s) used to pre-select SDMs that reach a sufficient quality (see details below).

select.thresh

numeric. Threshold(s) associated with the metric(s) used to compute the selection.

verbose

logical. If set to true, allows the function to print text in the console.

GUI

logical. Leave this argument at its default; it is a parameter reserved for the user interface.

folder_tmp

character. Folder where temporary files are saved.

...

additional parameters for the algorithm modelling function (see details below).

Details

algorithm

'all' runs all available algorithms directly. Currently, available algorithms include Generalized linear model (GLM), Generalized additive model (GAM), Multivariate adaptive regression splines (MARS), Generalized boosted regressions model (GBM), Classification tree analysis (CTA), Random forest (RF), Maximum entropy (MAXENT), Artificial neural network (ANN), and Support vector machines (SVM). Each algorithm has its own parameters, which can be set through ... (see each algorithm section below for its parameters).

'PA'

list with two values: nb, the number of pseudo-absences selected, and strat, the strategy used to select pseudo-absences (either random selection or disk selection). The defaults follow the recommendations of Barbet-Massin et al. (2012) (see references).

cv

Cross-validation method used to split the occurrence dataset for evaluation. holdout: data are partitioned into a training set and an evaluation set using a fraction (cv.param[1]), and the operation can be repeated cv.param[2] times. k-fold: data are partitioned into k (cv.param[1]) folds, each fold being used k-1 times in the training set and once as the evaluation set, and the operation can be repeated cv.param[2] times. LOO (Leave One Out): each point is successively taken as evaluation data.
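
As a rough illustration of the holdout scheme, the base R sketch below splits a toy occurrence table with a training fraction of 0.7 and repeats the split twice, mirroring the default cv.param = c(0.7, 2). The helper name holdout_split and the toy data are assumptions made for this example; this is not the package's internal code.

```r
# Hypothetical helper (not part of the package): one random holdout partition.
holdout_split <- function(data, fraction) {
  n_train <- round(nrow(data) * fraction)
  train_idx <- sample(seq_len(nrow(data)), n_train)
  list(train = data[train_idx, , drop = FALSE],
       eval  = data[-train_idx, , drop = FALSE])
}

# Toy occurrence table: 10 presences and 10 absences.
set.seed(1)
toy <- data.frame(Longitude = runif(20), Latitude = runif(20),
                  Presence = rep(c(1, 0), each = 10))

# cv.param[1] is the training fraction, cv.param[2] the number of repetitions.
cv.param <- c(0.7, 2)
splits <- lapply(seq_len(cv.param[2]),
                 function(i) holdout_split(toy, cv.param[1]))
```

Each element of splits holds one training set and one evaluation set; a model would be fitted on train and scored on eval for every repetition.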

metric

Choice of the metric used to compute the binary map threshold and the confusion matrix (by default SES, as recommended by Liu et al. (2005), see reference below): Kappa maximizes the Kappa, CCR maximizes the proportion of correctly predicted observations, TSS (True Skill Statistic) maximizes the sum of sensitivity and specificity, SES uses the sensitivity-specificity equality, LW uses the lowest occurrence prediction probability, ROC minimizes the distance between the ROC (receiver operating characteristic) curve and the upper left corner of the plot (0,1).
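
To make the SES criterion concrete, the sketch below scans equally spaced candidate thresholds between 0 and 1 (their number playing the role of thresh) and keeps the one where sensitivity and specificity are closest. The helper ses_threshold and the toy observations are assumptions for illustration; the package delegates this step to its own threshold routine (see optim.thresh), which may differ in detail.

```r
# Hypothetical helper: threshold where sensitivity and specificity are closest.
ses_threshold <- function(obs, pred, thresh = 1001) {
  candidates <- seq(0, 1, length.out = thresh)
  gap <- vapply(candidates, function(t) {
    predicted_presence <- pred >= t
    sensitivity <- sum(predicted_presence & obs == 1) / sum(obs == 1)
    specificity <- sum(!predicted_presence & obs == 0) / sum(obs == 0)
    abs(sensitivity - specificity)
  }, numeric(1))
  # On ties, the lowest candidate threshold is returned.
  candidates[which.min(gap)]
}

# Toy data: observed presence/absence and predicted habitat suitability.
obs  <- c(1, 1, 1, 1, 0, 0, 0, 0)
pred <- c(0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1)
threshold <- ses_threshold(obs, pred)

# Binary map values for the toy predictions.
binary <- as.integer(pred >= threshold)
```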

axes.metric

Choice of the metric used to evaluate the variable relative importance (difference between a full model and one with each variable successively omitted): Pearson (computes a simple Pearson's correlation r between predictions of the full model and those of the model without a variable, and returns the score 1-r: the higher the value, the more influence the variable has on the model), AUC, Kappa, sensitivity, specificity, and prop.correct (proportion of correctly predicted occurrences).
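
The Pearson score can be sketched in base R as follows: fit a full model, refit it with each variable omitted in turn, and report 1-r between the two sets of predictions. The toy data, the variable names and the use of a plain binomial glm are assumptions for this example only, not the package's implementation.

```r
# Toy data: presence depends strongly on temperature and not on noise.
set.seed(42)
toy <- data.frame(temperature = runif(200), noise = runif(200))
toy$presence <- rbinom(200, 1, plogis(6 * toy$temperature - 3))

# Full model and its predictions.
full <- glm(presence ~ temperature + noise, family = binomial, data = toy)
full_pred <- predict(full, type = "response")

# Refit with each variable omitted in turn and score it as 1 - r.
variables <- c("temperature", "noise")
importance <- vapply(variables, function(v) {
  reduced <- glm(reformulate(setdiff(variables, v), response = "presence"),
                 family = binomial, data = toy)
  1 - cor(full_pred, predict(reduced, type = "response"))
}, numeric(1))
```

In this toy setting, dropping temperature changes the predictions far more than dropping noise, so temperature receives the larger score.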

select.metric

Selection metric(s) used to select SDMs: AUC, Kappa, sensitivity, specificity, and prop.correct (proportion of correctly predicted occurrences).
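
A minimal sketch of how such a pre-selection could work, assuming that a model is kept only when every selected metric reaches its corresponding threshold in select.thresh. The evaluation scores below are made up for illustration, and the exact rule applied by the package is not confirmed here.

```r
# Hypothetical evaluation scores for one fitted model.
evaluation <- c(AUC = 0.82, Kappa = 0.55)

# Metrics and thresholds, paired by position.
select.metric <- c("AUC", "Kappa")
select.thresh <- c(0.75, 0.6)

# Assumed rule: keep the model only if every metric reaches its threshold.
keep <- all(evaluation[select.metric] >= select.thresh)
```

Here the model passes on AUC but fails on Kappa, so under this assumed rule it would not be kept.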

'...'

See the algorithm-specific sections below.

Value

An S4 Algorithm.SDM class object viewable with the plot.model method.

Generalized linear model (GLM)

Uses the glm function from the package 'stats'. You can set the following parameters (see glm for more details):

test

character. Test used to evaluate the SDM, default 'AIC'.

epsilon

numeric. Positive convergence tolerance eps; the iterations converge when |dev - dev_old|/(|dev| + 0.1) < eps. By default, set to 10e-08.

maxit

numeric. Integer giving the maximum number of IWLS (Iteratively reWeighted Least Squares) iterations, default 500.

Generalized additive model (GAM)

Uses the gam function from the package 'mgcv'. You can set the following parameters (see gam for more details):

test

character. Test used to evaluate the model, default 'AIC'.

epsilon

numeric. Used to judge convergence of the GLM IRLS (Iteratively Reweighted Least Squares) loop, default 10e-08.

maxit

numeric. Maximum number of IRLS iterations to perform, default 500.

Multivariate adaptive regression splines (MARS)

Uses the earth function from the package 'earth'. You can set the following parameters (see earth for more details):

degree

integer. Maximum degree of interaction (Friedman's mi); 1 means an additive model is built (i.e., no interaction terms). By default, set to 2.

Generalized boosted regressions model (GBM)

Uses the gbm function from the package 'gbm'. You can set the following parameters (see gbm for more details):

trees

integer. The total number of trees to fit. This is equivalent to the number of iterations and the number of basis functions in the additive expansion. By default, set to 2500.

final.leave

integer. Minimum number of observations in the terminal nodes of the trees. Note that this is the actual number of observations, not the total weight. By default, set to 1.

algocv

integer. Number of cross-validations, default 3.

thresh.shrink

numeric. Shrinkage parameter applied to each tree in the expansion, also known as the learning rate or step-size reduction. By default, set to 1e-03.

Classification tree analysis (CTA)

Uses the rpart function from the package 'rpart'. You can set the following parameters (see rpart for more details):

final.leave

integer. The minimum number of observations in any terminal node, default 1.

algocv

integer. Number of cross-validations, default 3.

Random Forest (RF)

Uses the randomForest function from the package 'randomForest'. You can set the following parameters (see randomForest for more details):

trees

integer. Number of trees to grow. This should not be set to a too small number, to ensure that every input row gets predicted at least a few times. By default, set to 2500.

final.leave

integer. Minimum size of terminal nodes. Setting this number larger causes smaller trees to be grown (and thus take less time). By default, set to 1.

Maximum Entropy (MAXENT)

Uses the maxent function from the package 'dismo'. Make sure that the maxent.jar file, available at https://www.cs.princeton.edu/~schapire/maxent/, is correctly installed in the folder ~\R\library\version\dismo\java (see maxent for more details).

Artificial Neural Network (ANN)

Uses the nnet function from the package 'nnet'. You can set the following parameters (see nnet for more details):

maxit

integer. Maximum number of iterations, default 500.

Support vector machines (SVM)

Uses the svm function from the package 'e1071'. You can set the following parameters (see svm for more details):

epsilon

float. Epsilon parameter in the insensitive loss function, default 1e-08.

algocv

integer. If an integer value k>0 is specified, a k-fold cross-validation on the training data is performed to assess the quality of the model: the accuracy rate for classification and the Mean Squared Error for regression. By default, set to 3.

Warning

Run time and memory use depend on the resolution of the raster object: the finer the resolution, the more time and memory the process requires.

References

M. Barbet-Massin, F. Jiguet, C. H. Albert, & W. Thuiller (2012) 'Selecting pseudo-absences for species distribution models: how, where and how many?' Methods in Ecology and Evolution 3:327-338 http://onlinelibrary.wiley.com/doi/10.1111/j.2041-210X.2011.00172.x/full

C. Liu, P. M. Berry, T. P. Dawson, & R. G. Pearson (2005) 'Selecting thresholds of occurrence in the prediction of species distributions.' Ecography 28:385-393 http://www.researchgate.net/publication/230246974_Selecting_Thresholds_of_Occurrence_in_the_Prediction_of_Species_Distributions

See Also

ensemble_modelling to build ensemble SDMs, stack_modelling to build SSDMs.

Examples

# Loading data
data(Env)
data(Occurrences)
Occurrences <- subset(Occurrences, Occurrences$SPECIES == 'elliptica')

# SDM building
SDM <- modelling('GLM', Occurrences, Env, Xcol = 'LONGITUDE', Ycol = 'LATITUDE')

# Results plotting
## Not run: 
plot(SDM)

## End(Not run)

hugocalcad/LigthSSDM documentation built on June 22, 2019, 12:43 a.m.