knitr::opts_chunk$set(
    echo = TRUE,
    fig.align = 'center'
)

Introduction

A core feature of the enmSdm package is a set of "train" functions that calibrate species distribution models. These functions include trainBrt (boosted regression trees), trainGam (generalized additive models), trainGlm (generalized linear models), trainLars (least angle regression), trainMaxEnt and trainMaxNet (Maxent), trainNs (natural splines), and trainRf (random forests).

But why would you want to use any of these functions when you could instead call perhaps more familiar functions like glm() or gam() directly? Unlike those functions, the train functions go through the work of "tuning" the model so that it fits the data well without being overfit. How this is done depends on the algorithm and is described in this document.

How can you get started? First, copy the code that follows into R. This code will load a set of specimen records of lemurs in Madagascar, then a series of climate rasters. It then matches the records to climate data and generates a set of background sites to be used as a "contrast" to the presence records in the models. Then, simply skip to the section on your preferred modeling algorithm to understand what it does and how to use it.

Please note that this vignette is not intended to show you the best way to do distribution or niche modeling. For example, we will not worry about potential bias in the data or correlation between predictors.

library(dismo)
library(enmSdm)

# names of longitude/latitude fields
ll <- c('longitude', 'latitude')

# specimen records of lemurs from GBIF
data(lemurs)
brownLemur <- lemurs[lemurs$species == 'Eulemur fulvus', ]
# download boundary of Madagascar
madBorder <- getData('GADM', country='MDG', level=0)

# download climate data for Madagascar
worldClim <- getData('worldclim', var='bio', res=5)
# crop climate data to Madagascar
madClim <- crop(worldClim, madBorder)
# what have we got?
plot(madClim)
par(oma=rep(0, 4), mar=rep(0, 4))
plot(madBorder)
points(brownLemur[ , ll])

# match presence records with climate
presEnv <- extract(madClim, brownLemur[ , ll])
brownLemur <- cbind(brownLemur, presEnv)
brownLemur$presBg <- 1

# generate random background sites and extract climate
bgCoords <- randomPoints(madClim, n=10000)
bgEnv <- extract(madClim, bgCoords)
colnames(bgCoords) <- ll
bg <- cbind(bgCoords, bgEnv)
bg <- as.data.frame(bg)
bg$presBg <- 0

# combine data on presence records and background sites
fields <- c('presBg', 'longitude', 'latitude', paste0('bio', 1:19))
presBg <- rbind(brownLemur[ , fields], bg[ , fields])

# remove any rows with NA
presBg <- presBg[complete.cases(presBg), ]

head(presBg)
tail(presBg)

# define predictors
# to speed things up we will just use:
# mean annual temperature (BIO1)
# temperature seasonality (BIO4)
# maximum temperature of the warmest month (BIO5)
# mean annual precipitation (BIO12)
# precipitation seasonality (BIO15)
# precipitation of the driest quarter (BIO17)
preds <- paste0('bio', c(1, 4, 5, 12, 15, 17))

Boosted regression trees (BRTs)/gradient boosting machines (GBMs)

BRTs use a series of classification and regression trees to divide a predictor space. The first tree uses the response specified by the modeler, and subsequent trees use the residuals from the previous tree (Elith et al. 2008). By "pasting" thousands of trees together a smooth curve can be approximated.

BRTs have several parameters that can be varied to improve fit to the training data or generality when predicting withheld data. These include the learning rate (how much each tree contributes to the overall model and thus how many trees are included in the optimal model), tree complexity (number of levels in each tree, which determines the degree of interactions between variables: none, two-way, three-way, etc.), bag fraction (proportion of data randomly withheld from each tree, which helps reduce overfitting), and maximum number of trees (total complexity of the model and degree of fit to the observed data). The trainBrt function performs a grid search across all possible combinations of these parameters, then returns the model with the smallest deviance. Personally, I only ever test the learning rate and tree complexity, keeping the bag fraction at about 0.6 or 0.7 (following Elith et al. 2008) and the maximum number of trees at 6000-10,000 or so (smaller values reduce run time but may not yield the optimal model).

The trainBrt function uses the gbm.step function from the dismo package to select a model with the optimal number of trees, up to the maximum allowed by the maxTrees argument (in trainBrt). Occasionally the model will not converge. In this case trainBrt attempts to find a model that does converge by manipulating the learning rate, tree complexity, maximum number of trees, and step size (the number of trees by which the maximum number of trees is decremented to find the optimal model). In a few cases none of these attempts will work, in which case I have found that you can sometimes get a model to converge if you just try again--after all, BRTs have a random component, so a new start may get it out of a stuck place.

Here's how to train a BRT using trainBrt. To speed things up I'll use just one learning rate, two tree complexities, and a small maximum number of trees.

set.seed(123)
out <- trainBrt(
    presBg,
    resp='presBg',
    preds=preds,
    learningRate=c(0.001),
    treeComplexity = c(1, 3),
    maxTrees=4000,
    out=c('model', 'tuning'),
    verbose=TRUE
)

summary(out$model)
plot(out$model)
out$tuning
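
If you want a more thorough (and slower) search, you can supply several values for each tuning parameter. The sketch below assumes the bag fraction is exposed through a bagFraction argument--check help('trainBrt', package='enmSdm') for the exact argument names.

out <- trainBrt(
    presBg,
    resp='presBg',
    preds=preds,
    learningRate=c(0.0001, 0.001, 0.01),
    treeComplexity=c(1, 2, 3),
    bagFraction=0.6, # assumed argument name
    maxTrees=8000,
    out=c('model', 'tuning'),
    verbose=TRUE
)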

To make predictions you can use the predict or predictEnmSdm functions:

predictions <- predict(out$model, newdata=presBg, n.trees=out$model$gbm.call$n.trees, type='response')
predictions <- predictEnmSdm(out$model, newdata=presBg)

Generalized additive models (GAMs)

Generalized additive models are similar to generalized linear models except that they apply "smoothers" (non-linear functions) to the predictors during the model fitting procedure. This means the models can potentially fit highly complex shapes. GLMs can do this, too, but the user must know (or guess) beforehand what the form of the curves will be. GAMs more or less automate the process of identifying the complexity of the curves. GAMs can also "glue" smoothers together, so you could, for example, have a very wiggly response in one part of a predictor's range and a nearly flat response in another.

By default, the trainGam function applies shrinkage to the model coefficients. Shrinkage is a type of variable selection that shrinks the coefficients of terms with little influence toward 0.

For single predictors the trainGam function uses cubic splines as smoothers. For interactions between predictors trainGam uses tensor products, which are appropriate when predictors are on different scales (e.g., temperature in degrees C and precipitation in mm), although this can be changed by the user. See help('smooth.terms', package='mgcv') for more information.
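
To see what these pieces look like outside of trainGam, here is a minimal mgcv sketch (illustrative only, not trainGam's internal code): bs='cs' requests a cubic spline with shrinkage, and te() builds a tensor-product smooth between predictors on different scales.

library(mgcv)

# a shrinkage cubic spline for one predictor plus a tensor-product
# "interaction" smooth between a temperature and a precipitation variable
gamSketch <- gam(
    presBg ~ s(bio1, bs='cs') + te(bio4, bio12),
    data=presBg,
    family=binomial
)
summary(gamSketch)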

The trainGam function solves two potential challenges. First, we may have a sample size that is relatively small compared to the number of predictors/terms we'd like to test. Second, we probably don't know a priori what the best model is across all potential predictors and terms.

In the first step, the model construction phase, the trainGam function constructs a series of all possible simple models using just one or two terms (including an intercept-only model). It then ranks the models from most to least parsimonious using AICc. For example, say you provided the model two predictors, x1 and x2. The function would assess the following models (I will use s(z) to indicate the smoothed version of z and te(z1, z2) to indicate an "interaction" (tensor product) smooth):

y ~ s(x1)
y ~ s(x2)
y ~ te(x1, x2)

These are then ranked using AICc. Assume that their order is then:

1 y ~ te(x1, x2)
2 y ~ s(x2)
3 y ~ s(x1)

In the second step, the model selection phase, the trainGam function identifies the most parsimonious model from across the set of predictors/terms while not letting the number of candidate models grow too large. For example, assume we have 28 presences. The function constructs a "full" model by adding terms in the order listed above such that there will be at least 10 presences per term (which can be changed using the presPerTermInitial argument). So the full model will include te(x1, x2) and s(x2) but no others, because we don't have enough presences. The full model will thus be y ~ te(x1, x2) + s(x2). The function then tests all possible submodels (including the intercept-only model). In this case there are only 4 models (two with one term apiece, the full model, and the intercept-only model). It then ranks these, removes models with fewer than 20 presences per term (which can be changed using the presPerTermFinal argument), and selects the model with the lowest AICc.

By default the function returns just this "best" model, but it can also return all models tested in the selection step and/or a table with AICc values for each model tested in the selection step.

Here is how you would implement trainGam in R:

out <- trainGam(
    presBg,
    resp='presBg',
    preds=preds,
    out=c('model', 'tuning'),
    verbose=TRUE
)

summary(out$model)
out$tuning

So in this run the best model includes tensor product ("interaction") smooths between pairs of the temperature predictors.

To make predictions you can use the predict or predictEnmSdm functions:

predictions <- predict(out$model, newdata=presBg, type='response')
predictions <- predictEnmSdm(out$model, newdata=presBg)
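
To make a map, you can predict directly to the climate rasters. This is a sketch using raster::predict, which passes type='response' through to mgcv's predict method; it works here because the raster layer names match the predictor names.

rastPred <- raster::predict(madClim, out$model, type='response')
plot(rastPred)
plot(madBorder, add=TRUE)
points(brownLemur[ , ll], pch=16, cex=0.5)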

Generalized linear models (GLMs)

GLMs are similar to normal linear regression models except that the response does not need to be normally distributed. In the case of distribution and niche modeling the most common type of response is binary (1 for presence, 0 for non-presence). By default, the trainGlm function assumes the response is binary, although this can be changed by the user.

The trainGlm function is designed to solve two problems. First, we may not know what predictors or set of terms best relate to the response. Second, we may have a small sample size relative to the number of predictors and terms we are interested in testing. Thus, trainGlm proceeds in two phases: a model construction phase, which selects the best set of predictors and associated model terms, followed by a model selection phase, in which the best overall model is identified. Note that we're using "terms" to mean any component of the predictor side of the model, such as x or x^2 or x1 * x2. Both steps respect marginality, which means that if a higher-order term like a squared term or an interaction term is included in a model, then the simpler linear terms of which it is composed are also included.

The model construction phase begins by creating a series of simple models. By default, each simple model is composed of either a single linear term (plus intercept), a linear term plus its quadratic term and intercept, or two linear terms plus their interaction plus the intercept. For example, if we told the function to use the terms x1 and x2, then it would construct models of the form:

y ~ x1
y ~ x2
y ~ x1 + I(x1^2)
y ~ x2 + I(x2^2)
y ~ x1 + x2 + x1:x2

In this case there aren't too many models at all, but if we had even a few more predictors we might have a lot of potential terms.

Then, the function calculates AICc for each simple model and ranks the simple models from most to least parsimonious (smallest to largest AICc). The function then constructs a "full model" by adding terms from the best set of simple models while ensuring that there are at least 10 occurrence records per term (not including the intercept; you can change the number of presence records per term using the presPerTermInitial argument).

For example, say our simple models were ranked in this order, from lowest to highest AICc:

1 y ~ x1 + I(x1^2)
2 y ~ x2
3 y ~ x1
4 y ~ x2 + I(x2^2)
5 y ~ x1 + x2 + x1:x2

and let's assume that we had 47 presence records, meaning we would want at most 4 terms (not including the intercept) in our initial model. In this case we include the first two terms (x1 and its square). That takes up the first 20 presence records. Then we include x2 from the second simple model, which all together takes up 30 presence records, leaving 17 presence records to use. The third model has just x1, but we already have that term. The fourth model has x2 and I(x2^2); we already included x2, so we'll just add I(x2^2). Now we have only 7 presence records left, which isn't enough to add another term. Thus our full model is y ~ x1 + I(x1^2) + x2 + I(x2^2).

The model construction process evaluates each predictor and its terms in a piecemeal fashion, but predictors can act more or less influentially depending on what other predictors appear in the model. Thus, in the model selection phase each possible submodel that can be constructed from the full model (including an intercept-only model) is evaluated using AICc. By default, the best of these models is returned, although you can also return all possible submodels, and/or a table displaying AICc for each submodel.
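
As a hand-rolled illustration of the ranking step (not trainGlm's internal code), you can fit two candidate simple models yourself and compare them by AICc, where AICc = AIC + 2k(k + 1)/(n - k - 1), k is the number of estimated parameters, and n is the sample size.

# two candidate simple models
m1 <- glm(presBg ~ bio1 + I(bio1^2), data=presBg, family=binomial)
m2 <- glm(presBg ~ bio12, data=presBg, family=binomial)

# small-sample corrected AIC
aicc <- function(m) {
    k <- length(coef(m))
    n <- nobs(m)
    AIC(m) + (2 * k * (k + 1)) / (n - k - 1)
}

sort(c(quadTemp=aicc(m1), linearPrecip=aicc(m2)))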

Here is how you would implement trainGlm in R. Note that to obviate problems associated with predictors on wildly different scales, we will scale each predictor so it has a mean of 0 and unit variance.

scaledPresBg <- presBg
scaledPresBg[ , preds] <- scale(scaledPresBg[ , preds])

out <- trainGlm(
        scaledPresBg,
        resp='presBg',
        preds=preds,
        out=c('model', 'tuning'),
        verbose=TRUE
)

summary(out$model)
out$tuning

Now suppose we wanted to exclude certain terms from consideration, such as bio4:bio5.

out <- trainGlm(
    scaledPresBg,
    resp='presBg',
    preds=preds,
    verboten=c('bio4:bio5'),
    out=c('model', 'tuning')
)

summary(out$model)
out$tuning

Notice that we no longer have models with bio4:bio5 (or bio5:bio4--the order doesn't matter). If you didn't want bio4 and bio5 to appear in the same model ever (even if they didn't interact), you could use the argument verbotenCombos=list(c('bio4', 'bio5')).
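
For example, a sketch of that call (same structure as above, just swapping in verbotenCombos):

out <- trainGlm(
    scaledPresBg,
    resp='presBg',
    preds=preds,
    verbotenCombos=list(c('bio4', 'bio5')),
    out=c('model', 'tuning')
)

summary(out$model)
out$tuning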

Finally, note that sometimes you might get warnings that look like:

Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred 

This typically means that the presences and non-presences can be perfectly separated by at least one variable. This doesn't sound like a bad thing--after all, we want to find which predictors most separate the presences from non-presences. However, for a GLM to provide a stable model, there must be some "intermingling" of presences and non-presences (odd, isn't it?). There are several ways you can address this (including not using a GLM), but one method is to employ Firth's correction (Firth 1993, with a subsequent correction in 1995). To do this we will use the brglm2 package (available on CRAN), which is one of several that apply Firth's correction. In this example it isn't appropriate because we haven't received this warning, but we'll use it anyway to show how it's done. Note that Firth's correction can increase run time a lot, and it will change the coefficient values, even if the final model is structurally the same as the model without the correction.

library(brglm2)
out <- trainGlm(
    scaledPresBg,
    resp='presBg',
    preds=preds,
    method='brglmFit',
    out=c('model', 'tuning')
)

Ironically, applying Firth's correction created a lot of warnings about separation (not shown here--see for yourself in R)!

summary(out$model)
out$tuning

To make predictions you can use the predict or predictEnmSdm functions:

predictions <- predict(out$model, scaledPresBg, type='response')
predictions <- predictEnmSdm(out$model, scaledPresBg)

Least angle regression (LARS)

Least angle regression, or LARS, is a form of generalized linear modeling that does model and variable selection using regularization. Theoretically, it is particularly suited to cases where there is strong correlation between predictors and/or the number of predictors is large compared to the number of occurrences. In fact, LARS works even when the number of predictors is larger than the number of occurrences (a "p >> N" problem; Efron et al. 2004).

LARS operates by first setting all coefficients to 0. Then, it finds the one predictor with the highest degree of correlation (positive or negative) with the response. The coefficient of this variable is then increased (or decreased) by a small amount. The model is then used to make a prediction, after which residuals are calculated. The procedure repeats, with the next coefficient (which may be the same as in the previous step) chosen such that increasing (or decreasing) it increases the correlation between the chosen predictor and the residuals. Note that coefficient values can increase and then decrease as new variables enter the model.

The trainLars function tunes the degree of penalty applied to including a particular variable and to the magnitude of its coefficient. The penalty is determined using an elastic net, which lets the model range from a full least absolute shrinkage and selection operator (LASSO) penalty to a full ridge penalty. Both LASSO and ridge are forms of regularization: LASSO attempts to reduce the sum of the absolute values of the coefficients, whereas ridge regression attempts to reduce the sum of their squares. The elastic net tests a gradient of penalties between LASSO and ridge, then selects the best one based on cross-validation. The result is typically a sparse model in which most coefficients are 0 or very small, and only the most "important" variables have non-negligible coefficients.
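
A minimal sketch of one common parameterization of the elastic net penalty (for illustration only): alpha = 1 gives a pure LASSO penalty, and alpha near 0 approaches a pure ridge penalty.

# penalty on a vector of coefficients beta for a given mixing parameter
# alpha and overall penalty strength lambda
elasticNetPenalty <- function(beta, alpha, lambda=1) {
    lambda * (alpha * sum(abs(beta)) + ((1 - alpha) / 2) * sum(beta^2))
}

elasticNetPenalty(beta=c(0.5, -1.2, 0), alpha=1)    # LASSO-like
elasticNetPenalty(beta=c(0.5, -1.2, 0), alpha=0.01) # nearly ridge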

The trainLars function relies on the grpreg and grpregOverlap packages (Zheng & Breheny 2016). As this is being written (in late 2020), it appears one or both of these packages have not been updated on CRAN for some time, so they probably can't be installed automatically. To install them, go to the packages' pages on CRAN (grpreg: https://cran.r-project.org/package=grpreg; grpregOverlap: https://cran.r-project.org/package=grpregOverlap), download the source (.tar.gz) file, then manually install it in R. The grpregOverlap function allows grouping variables together such that their coefficients are increased/decreased as a group. trainLars takes advantage of this to construct quadratic and interaction terms plus quadratic-by-interaction terms (e.g., x1 * x2^2), then enforce marginality constraints such that increasing/decreasing a coefficient of a quadratic or interaction term also alters the related linear terms. Note that this assumes the linear and non-linear terms have the same sign of relationship with the response.

Training a LARS model can take a while, so to speed up the demonstration we'll test just four ratios of ridge-to-LASSO penalty and use just three predictors:

# ridge/LASSO gradient
alphas <- c(0.01, 0.33, 0.66, 1)
presBgSparse <- presBg[ , c('presBg', 'bio1', 'bio12', 'bio15')]
out <- trainLars(presBgSparse, alphas=alphas, verbose=TRUE)
summary(out)

We can make two kinds of predictions using a LARS model, one using all terms (as you would for any other kind of model) and a partial response prediction, which uses just term(s) with the variable(s) of interest.

fullPred <- predictLars(out, presBg, type='response')
partialPred <- predictLars(out, presBg, preds='bio1', type='response')

# plot partial response to BIO1
plot(presBg$bio1, partialPred)

Maxent

Maximization of information entropy is a statistical method for finding probability distributions that reflect only--and no more than--what is known. The Maxent species distribution model first finds the probability of each environmental variable (or a transformation thereof) given a presence, then uses Bayes' theorem to invert the probability to yield a prediction of the probability of presence (or an index thereof; Phillips et al. 2006; Phillips & Dudík 2008). For further reading on how Maxent works see Elith et al. (2011), Merow et al. (2013), and Phillips et al. (2017).

Maxent uses several tunable parameters and settings, including a "master regularization" parameter which controls the complexity (smoothness) of models and whether or not different transformations of predictors are used. The latter include linear, quadratic, and two-way interactions, plus hinge transformations (i.e., a flat line, then a linear increase or decrease, followed by a flat line again) and threshold transformations (i.e., a flat line, then a "step" up or down followed by a flat line again). Maxent also allows for categorical variables. Maxent only uses categorical variables to estimate offsets from a model-wide intercept (i.e., you can't use a categorical variable interacting with a continuous variable).

The trainMaxEnt and trainMaxNet functions evaluate all possible combinations of the regularization parameter and predictor transformations. They then rank the models by AICc, remove those with more parameters than the number of presences, and return either the "best" model (lowest AICc), all models, and/or a table with model structure and AICc. The default transformations used by both functions include linear, quadratic, interaction, and hinge. trainMaxNet requires you to use the linear transformation in every model, but you can turn that behavior on or off in trainMaxEnt (by default it is on). Thus, by default, both functions will evaluate all possible models that meet these criteria:

1. Regularization parameter values of 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 7.5, and 10;
2. Have linear transformations;
3. All possible combinations of quadratic, two-way interaction, and hinge features.
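
If you want to restrict which feature classes are tested, both functions expose this through an argument. The sketch below assumes it is named classes and accepts maxnet-style codes (l = linear, q = quadratic, p = product/interaction, h = hinge, t = threshold)--check help('trainMaxEnt', package='enmSdm') for the exact interface.

# hypothetical: test only linear and quadratic features across a small
# regularization gradient
entLQ <- trainMaxEnt(
    presBg,
    resp='presBg',
    preds=preds,
    regMult=c(0.5, 1, 2),
    classes='lq', # assumed argument name and coding
    verbose=TRUE
)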

trainMaxEnt and trainMaxNet are very similar and by default should result in the same set of models. However, there are some differences. trainMaxEnt uses the older Java program to do the work (Maxent versions <=3.3.3k), whereas trainMaxNet uses the R package maxnet (Maxent versions >=3.4.0). Thus, to get trainMaxEnt to work you need to copy a Java file to a specific folder in the dismo package. See the directions in the maxent function (help('maxent', 'dismo')) for how to do this. If you don't do this, you will get an error that looks like this:

Error in .getMeVersion() : file missing: C:/Ecology/Drive/R/libraries/dismo/java/maxent.jar. Please download it here: http://www.cs.princeton.edu/~schapire/maxent/
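
The folder in which dismo expects maxent.jar can be found with:

# the folder in which dismo looks for maxent.jar
system.file('java', package='dismo')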

Here's how to use the trainMaxEnt and trainMaxNet functions. To speed things up we will use just two values of the master regularization parameter.

ent <- trainMaxEnt(
    presBg,
    resp='presBg',
    preds=preds,
    regMult=c(1, 2),
    verbose=TRUE
)

net <- trainMaxNet(
    presBg,
    resp='presBg',
    preds=preds,
    regMult=c(1, 2),
    verbose=TRUE
)

To make predictions you can use the predict function (either model), predictMaxEnt (for trainMaxEnt models only), or the predictEnmSdm function (either model):

# default: logistic output
predEnt <- raster::predict(ent, presBg)

# default: cloglog output; see ?maxnet for more
predNet <- predict(net, newdata=presBg)

# default: cloglog output
predEnt <- predictMaxEnt(ent, presBg)

# default: cloglog output
predEnt <- predictEnmSdm(ent, presBg)

# default: cloglog output
predNet <- predictEnmSdm(net, presBg)

Natural splines (NSs)

Natural splines are very similar to generalized additive models (GAMs), except that they are "clamped" so they produce straight lines (in the space of the predictors) outside the range of the training data. This is different from the "clamping" done by Maxent, boosted regression tree models, and random forests--in this case, the prediction can continue to increase or decrease, depending on what was happening just inside the range of the training data. This is a nice feature because you get the high complexity of a GAM where you have information on the form of the response (i.e., within the range of the training data), but extrapolate in a highly predictable manner outside it.
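
To see what a natural spline looks like outside of trainNs, here is a minimal sketch (illustrative only) that fits one via the splines package in a binomial GLM:

library(splines)

# a natural spline of mean annual temperature with 3 degrees of freedom
nsSketch <- glm(
    presBg ~ ns(bio1, df=3),
    data=presBg,
    family=binomial
)
summary(nsSketch)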

The trainNs function uses two steps to solve two potential problems. First, in many cases you may have few occurrence records compared to the number of predictors you'd like to include in the model. Thus, during the model construction phase, trainNs selects among predictors by constructing univariate models with different degrees of freedom, then ranking them by AICc. The combinations of degrees of freedom and variable(s) with the smallest AICc are selected for a "full" model, such that there must be (by default) at least 20 occurrences per predictor (which can be changed with the argument presPerTermInitial). Then, in the model selection phase, trainNs evaluates all possible combinations of predictors from the full model such that each model has at least 10 occurrences per predictor (changed with the argument presPerTermFinal). These are then ranked by AICc. By default, the function returns the best of these models, but it can also return a list of all possible models constructed from the full model and/or a table with model structure and AICc.

The "wiggliness" of the natural splines is determined by the argument df (degrees of freedom). By default, trainNs will test each variable with 1, 2, 3, or 4 degrees of freedom.

out <- trainNs(
    presBg,
    resp='presBg',
    preds=preds,
    out=c('model', 'tuning'),
    verbose=TRUE
)

summary(out$model)
out$tuning

You can make predictions using the predict or predictEnmSdm functions:

predictions <- predict(out$model, presBg, type='response')
predictions <- predictEnmSdm(out$model, presBg)

Random forests (RFs)

Random forests are constructed from sets of classification and regression trees (Breiman 2001). Each tree divides the predictor space into quadrants, and each quadrant is assigned an outcome (e.g., "present" or "absent"). To generate a prediction the model averages across the predictions from all trees. To reduce overfitting, each tree is fit with a random subset of the data (called "bagging") and a random subset of predictors.

The trainRf function calls the tuneRF function in the randomForest package. This latter function finds the optimal number of predictors to be chosen (at random) to construct each tree.
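
Roughly, this is the kind of search trainRf delegates. The standalone sketch below uses randomForest directly (it is not trainRf's exact internal call):

library(randomForest)

# search over mtry (the number of predictors tried at each split) and
# return the forest fit with the best value
rfSketch <- tuneRF(
    x=presBg[ , preds],
    y=as.factor(presBg$presBg),
    doBest=TRUE
)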

Here's how to use the trainRf function:

out <- trainRf(
    presBg,
    resp='presBg',
    preds=preds
)

To make predictions you can use the predict function. Note that there are some weirdnesses if the original response in the model was binary (1/0) because it has to be converted to a factor:

predictions <- predict(out, newdata=presBg, type='prob')
predictions <- unlist(predictions)
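
If the object returned by predict is the class-probability matrix from randomForest (one column per class), then the probability of presence is the column named '1'. This is a hedged note--check the structure of your own output:

predictions <- predict(out, newdata=presBg, type='prob')
presProb <- predictions[ , '1']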

You can also use the predictEnmSdm function in the enmSdm package:

predictions <- predictEnmSdm(out, newdata=presBg)

Literature cited

Breiman, L. 2001. Random forests. Machine Learning 45:5-32.

Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. 2004. Least angle regression. The Annals of Statistics 32:407-451.

Elith, J., J.R. Leathwick, and T. Hastie. 2008. A working guide to boosted regression trees. Journal of Animal Ecology 77:802-813.

Elith, J., S.J. Phillips, T. Hastie, M. Dudík, Y.E. Chee, and C.J. Yates. 2011. A statistical explanation of MaxEnt for ecologists. Diversity and Distributions 17:43-57.

Firth, D. 1993. Bias reduction of maximum likelihood estimates. Biometrika 80:27-38. (Note correction in Biometrika 1995.)

Firth, D. 1995. Bias reduction of maximum likelihood estimates. Biometrika 82:667. (Correction to the 1993 Biometrika article.)

Merow, C., Smith, M.J., and Silander, J.A. Jr. 2013. A practical guide to MaxEnt for modeling species' distributions: What it does, and why inputs and settings matter. Ecography 36:1058-1069.

Phillips, S.J. and Dudík, M. 2008. Modeling species distributions with Maxent: New extensions and a comprehensive evaluation. Ecography 31:161-175.

Phillips, S.J., Anderson, R.P., and Schapire, R.E. 2006. Maximum entropy modeling of species geographic distributions. Ecological Modelling 190:231-259.

Phillips, S.J., Anderson, R.P., Dudík, M., Schapire, R.E., and Blair, M.E. 2017. Opening the black box: An open-source release of Maxent. Ecography 40:887-893.

Zheng, Y. and Breheny, P. 2016. Overlapping group logistic regression with applications to genetic pathway selection. Cancer Informatics 15:179-187.


