ENMeval Vignette

library(knitr)
knitr::opts_chunk$set(collapse=TRUE, message=FALSE, warning=FALSE, comment="#>")

Introduction {#intro}

ENMeval is an R package that performs automated runs and evaluations of ecological niche models, and currently only implements Maxent. ENMeval was made for those who want to "tune" their models to maximize predictive ability and avoid overfitting, or in other words, optimize model complexity to balance goodness-of-fit and predictive ability. The primary function, ENMevaluate, does all the heavy lifting and returns several items including a table of evaluation statistics and, for each setting combination (here, colloquially: runs), a model object and a raster layer showing the model prediction across the study extent. There are also options for calculating niche overlap between predictions, running in parallel to speed up computation, and more. For a more detailed description of the package, check out the open-access publication:

Muscarella, R., Galante, P. J., Soley-Guardia, M., Boria, R. A., Kass, J. M., Uriarte, M. and Anderson, R. P. (2014), ENMeval: An R package for conducting spatially independent evaluations and estimating optimal model complexity for Maxent ecological niche models. Methods in Ecology and Evolution, 5: 1198–1205.

Data Acquisition & Pre-processing {#data}

In this vignette, we briefly demonstrate acquisition and pre-processing of input data for ENMeval. There are a number of other excellent tutorials on these steps, some of which we compiled in the Resources section.

We'll start by reading in an occurrence dataset for Bradypus variegatus, the Brown-throated sloth. We'll go ahead and load the ENMeval package. (We are using spocc to download occurrence records).

if (!require('spocc')) install.packages('spocc', repos="https://cran.us.r-project.org")
if (!require('ENMeval')) install.packages('ENMeval', repos="https://cran.us.r-project.org")

library(ENMeval)

# Search GBIF for occurrence data.
occs <- readRDS("bvariegatus.rds")

# Remove duplicate rows (Note that you may or may not want to do this).
occs <- occs[!duplicated(occs),]

We are going to model the climatic niche suitability for our focal species using climate data from WorldClim. WorldClim has a range of variables available at various resolutions; for simplicity, here we'll use the 9 bioclimatic variables at 10 arcmin resolution (about 20 km across at the equator) included in the dismo package. These climatic data are based on 50-year averages from 1950-2000. Now's also a good time to load the package, as it includes all the downstream dependencies (raster, dismo, etc.).

library(raster)

# First, load some predictor rasters from the dismo folder:
files <- list.files(path=paste(system.file(package='dismo'), '/ex', sep=''), pattern='grd', full.names=TRUE)

# Put the rasters into a RasterStack:
envs <- stack(files)

# Plot first raster in the stack, bio1
plot(envs[[1]], main=names(envs)[1])

# Add points for all the occurrence points onto the raster
points(occs)

# There are some points all the way to the south-east, far from all others. Let's say we know that this represents a subpopulation that we don't want to include, and want to remove these points from the analysis. We can find them by first sorting the occs table by latitude.
head(occs[order(occs$latitude),])

# We see there are two such points, and we can find them by specifying a logical statement that says to find all records with latitude less than -20.
index <- which(occs$latitude < (-20))

# Next, let's subset our dataset to remove them by using the negative assignment on the index vector.
occs <- occs[-index,]

# Let's plot our new points over the old ones to see what a good job we did.
points(occs, col='red')

Next, we will specify the background extent by cropping (or "clipping" in ArcGIS terms) our global predictor variable rasters to a smaller region. Since our models will compare the environment at occurrence (or, presence) localities to the environment at background localities, we need to sample random points from a background extent. To help ensure we don't include areas that are suitable for our species but are unoccupied due to limitations like dispersal constraints, we will conservatively define the background extent as an area surrounding our occurrence localities. We will do this by buffering a bounding box that includes all occurrence localities. Some other methods of background extent delineation (e.g., minimum convex hulls) are more conservative because they better characterize the geographic space holding the points. In any case, this is one of the many things that you will need to carefully consider for your own study.

library(sp)

# Make a SpatialPoints object
occs.sp <- SpatialPoints(occs)

# Get the bounding box of the points
bb <- bbox(occs.sp)

# Add 5 degrees to each bound by stretching each bound by 10, as the resolution is 0.5 degree.
bb.buf <- extent(bb[1]-10, bb[3]+10, bb[2]-10, bb[4]+10)

# Crop environmental layers to match the study extent
envs.backg <- crop(envs, bb.buf)

We may also, however, want to remove the Caribbean islands (for example) from our background extent. For this, we can use tools from the maptools package, which is not automatically loaded with ENMeval.

if (!require('maptools')) install.packages('maptools', repos="https://cran.us.r-project.org/")
if (!require('rgeos')) install.packages('rgeos', repos="https://cran.us.r-project.org/")
library(maptools)
library(rgeos)

# Get a simple world countries polygon
data(wrld_simpl)

# Get polygons for Central and South America
ca.sa <- wrld_simpl[wrld_simpl@data$SUBREGION==5 | wrld_simpl@data$SUBREGION==13,]

# Both spatial objects have the same geographic coordinate system with slightly different specifications, so just name the coordinate reference system (crs) for ca.sa with that of
# envs.backg to ensure smooth geoprocessing.
crs(envs.backg) <- crs(ca.sa)

# Mask envs by this polygon after buffering a bit to make sure not to lose coastline.
ca.sa <- gBuffer(ca.sa, width=1)
envs.backg <- mask(envs.backg, ca.sa)

# Let's check our work. We should see Central and South America without the Carribbean.
plot(envs.backg[[1]], main=names(envs.backg)[1])
points(occs)

In the next step, we'll sample 10,000 random points from the background (note that the number of background points is also a consideration you should make with respect to your own study).

library(dismo)

# Randomly sample 10,000 background points from one background extent raster (only one per cell without replacement). Note: Since the raster has <10,000 pixels, you'll get a warning and all pixels will be used for background. We will be sampling from the biome variable because it is missing some grid cells, and we are trying to avoid getting background points with NA.
bg <- randomPoints(envs.backg[[9]], n=10000)
bg <- as.data.frame(bg)

# Notice how we have pretty good coverage (every cell).
plot(envs.backg[[1]], legend=FALSE)
points(bg, col='red')

Partitioning Occurrences for Evaluation {#partition}

A run of ENMevaluate begins by using one of six methods to partition occurrence localities into testing and training bins (folds) for k-fold cross-validation (Fielding and Bell 1997; Peterson et al. 2011). Generally, the data partitioning step is done within the main 'ENMevaluate' function call. In this section, we illustrate the different options.

  1. Block
  2. Checkerboard1
  3. Checkerboard2
  4. k-1 Jackknife
  5. Random k-fold
  6. User-defined

The first three partitioning methods are variations of what Radosavljevic and Anderson (2014) referred to as 'masked geographically structured' data partitioning. Basically, these methods partition both occurrence records and background points into evaluation bins based on some spatial rules. The intention is to reduce spatial-autocorrelation between points that are included in the testing and training bins, which can overinflate model performance, at least for data sets that result from biased sampling (Veloz 2009; Hijmans 2012; Wenger and Olden 2012).

1. Block {#block}

First, the 'block' method partitions data according to the latitude and longitude lines that divide the occurrence localities into four bins of (insofar as possible) equal numbers. Both occurrence and background localities are assigned to each of the four bins based on their position with respect to these lines. The resulting object is a list of two vectors that supply the bin designation for each occurrence and background point.

blocks <- get.block(occs, bg)
str(blocks)

plot(envs.backg[[1]], col='gray', legend=FALSE)
points(occs, pch=21, bg=blocks$occ.grp)

2. Checkerboard1 {#cb1}

The next two partitioning methods are variants of a 'checkerboard' approach to partition occurrence localities. These generate checkerboard grids across the study extent and partition the localities into bins based on where they fall in the checkerboard. In contrast to the block method, both checkerboard methods subdivide geographic space equally but do not ensure a balanced number of occurrence localities in each bin. For these methods, the user needs to provide a raster layer on which to base the underlying checkerboard pattern. Here we simply use the predictor variable RasterStack. Additionally, the user needs to define an aggregation.factor. This value tells the number of grids cells to aggregate when making the underlying checkerboard pattern.

The Checkerboard1 method partitions the points into k=2 bins using a simple checkerboard pattern.

``` {r part.ck1, fig.width = 5, fig.height = 5} check1 <- get.checkerboard1(occs, envs, bg, aggregation.factor=5)

plot(envs.backg[[1]], col='gray', legend=FALSE) points(occs, pch=21, bg=check1$occ.grp)

The partitioning method is more clearly illustrated by looking at the background points:

points(bg, pch=21, bg=check1$bg.grp)

We can change the aggregation factor to better illustrate how this partitioning method works:

check1.large <- get.checkerboard1(occs, envs, bg, aggregation.factor=30) plot(envs.backg[[1]], col='gray', legend=FALSE) points(bg, pch=21, bg=check1.large$bg.grp) points(occs, pch=21, bg=check1.large$occ.grp, col='white', cex=1.5)

#### 3. Checkerboard2 {#cb2}
The Checkerboard2 method partitions the data into k=4 bins. This is done by aggregating the input raster at two scales. Presence and background points are assigned to a bin with respect to where they fall in checkerboards of both scales.

``` {r part.ck2, fig.width = 5, fig.height = 5}
check2 <- get.checkerboard2(occs, envs, bg, aggregation.factor=c(5,5))

plot(envs.backg[[1]], col='gray', legend=FALSE)
points(bg, pch=21, bg=check2$bg.grp)
points(occs, pch=21, bg=check2$occ.grp, col='white', cex=1.5)

4. k-1 Jackknife {#jack}

The next two methods differ from the first three in that (i) they do not partition the background points into different groups, and (ii) they do not account for spatial autocorrelation between testing and training localities. Primarily when working with relatively small data sets (e.g. < ca. 25 presence localities), users may choose a special case of k-fold cross-validation where the number of bins (k) is equal to the number of occurrence localities (n) in the data set (Pearson et al. 2007; Shcheglovitova and Anderson 2013). This is referred to as the k-1 jackknife. This method will take prohibitively long times for computation when the number of presence localities is medium to large.

``` {r part.jk, fig.width = 5, fig.height = 5} jack <- get.jackknife(occs, bg)

plot(envs.backg[[1]], col='gray', legend=FALSE) points(occs, pch=21, bg=jack$occ.grp) # note that colors are repeated here

#### 5. Random k-fold {#rand}
The 'random k-fold' method partitions occurrence localities randomly into a user specified number of (k) bins. This method is equivalent to the 'cross-validate' partitioning scheme available in the current version of the Maxent software GUI.  

For instance, let's partition the data into five evaluation bins:
``` {r part.rand, fig.width = 5, fig.height = 5}
random <- get.randomkfold(occs, bg, k=5)

plot(envs.backg[[1]], col='gray', legend=FALSE)
points(occs, pch=21, bg=random$occ.grp)

6. User-defined {#user}

For maximum flexibility, the last partitioning method is designed so that users can define a priori partitions. This provides a flexible way to conduct spatially-independent cross-validation with background masking. For example, perhaps we would like to partition points based on a k-means clustering routine.

``` {r part.user1, fig.width = 5, fig.height = 5} ngrps <- 10 kmeans <- kmeans(occs, ngrps) occ.grp <- kmeans$cluster

plot(envs.backg[[1]], col='gray', legend=FALSE) points(occs, pch=21, bg=occ.grp)

When using the user-defined partitioning method, we need to supply ENMevaluate with group identifiers for both occurrence points AND background points. If we want to use all background points for each group, we can set the background to zero.

``` {r part.user2, fig.width = 5, fig.height = 5}
bg.grp <- rep(0, nrow(bg))

plot(envs.backg[[1]], col='gray', legend=FALSE)
points(bg, pch=16, bg=bg.grp)

Alternatively, we may think of various ways to partition background data. This depends on the goals of the study but we might, for example, find it reasonable to partition background by clustering around the centroids of the occurrence clusters.

``` {r part.user3, fig.width = 5, fig.height = 5} centers <- kmeans$center d <- pointDistance(bg, centers, lonlat=T) bg.grp <- apply(d, 1, function(x) which(x==min(x)))

plot(envs.backg[[1]], col='gray', legend=FALSE) points(bg, pch=21, bg=bg.grp)

Choosing among these data partitioning methods depends on the research objectives and the characteristics of the study system. Refer to the [Resources](#resources) section for additional considerations on appropriate partitioning for evaluation.

## Running ENMeval {#eval}
Once you decide which method of data partitioning you would like to use, you are ready to start building models. We now move on to the main function in ENMeval: `ENMevaluate`.

- [Initial considerations](#eval.consid)
- [Exploring the results (the ENMevaluate object)](#eval.explore)

#### Initial considerations {#eval.consid}
The two main parameters to define when calling `ENMevaluate` are (1) the range of regularization multiplier values and (2) the combinations of feature class to consider. The ***regularization multiplier*** (RM) determines the penalty for adding parameters to the model. Higher RM values impose a stronger penalty on model complexity and thus result in simpler (*flatter*) model predictions. The ***feature classes*** determine the potential shape of the response curves. A model that is only allowed to include linear feature classes will most likely be simpler than a model that is allowed to include all possible feature classes. Much more description of these parameters is available in the [Resources](#resources) section. For the purposes of this vignette, we demonstrate simply how to adjust these parameters. The following section deals with comparing the outputs of each model.

Unless you supply the function with background points (which is recommended in many cases), you will need to define how many background points should be used with the 'n.bg' argument. If any of your predictor variables are categorical (e.g., biomes), you will need to define which layer(s) these are using the 'categoricals' argument.

ENMevaluate builds a separate model for each unique combination of RM values and feature class combinations. For example, the following call will build and evaluate 2 models. One with RM=1 and another with RM=2, both allowing only linear features.

```r
data(eval2)

``` {r enmeval1a, eval=FALSE} eval1 <- ENMevaluate(occs, envs, bg, method='checkerboard2', RMvalues=c(1,2), fc=c('L'), algorithm='maxent.jar')

We may, however, want to compare a wider range of models that can use a wider variety of feature classes:

``` {r enmeval1b, eval=FALSE}
eval2 <- ENMevaluate(occ=occs, env=envs, bg.coords=bg, method='checkerboard2', RMvalues=c(1,2), fc=c('L','LQ','LQP'), algorithm='maxent.jar')

When building many models, the command may take a long time to run. Of course this depends on the size of your dataset and the computer you are using. When working on big projects, running the command in parallel can be faster.

``` {r enmeval2par, eval=FALSE} eval2.par <- ENMevaluate(occs, envs, bg, method='checkerboard2', RMvalues=c(1,2), fc=c('L','LQ','LQP'), parallel=TRUE, algorithm='maxent.jar')

Another way to save time at this stage is to turn off the option that generates model predictions across the full study extent (rasterPreds). Note, however, that the full model predictions are needed for calculating AICc values so those are returned as NA in the results table when the `rasterPreds` argument is set to FALSE.

``` {r enmeval3, eval=FALSE}
eval3 <- ENMevaluate(occs, envs, bg, method='checkerboard2', RMvalues=c(1,2), fc=c('L','LQ','LQP'), rasterPreds=FALSE, algorithm='maxent.jar')

We can also calculate one of two niche overlap statistics while running ENMevaluate by setting the niche.overlap argument, which supports Moran's I or Schoener's D. Note that you can also calculate this value at a later stage using the separate calc.niche.overlap function.

``` {r enmeval4, results='hide'} overlap <- calc.niche.overlap(eval2@predictions, stat='D')

``` {r enmeval5}
overlap

The bin.output argument determines if separate evaluation statistics for each testing bin are included in the results file. If bin.output=FALSE, only the mean and variance of evaluation statistics across k bins is returned.

Exploring the results {#eval.explore}

Now let's take a look at the ENMeval object in more detail. It contains the following slots:

Let's first examine the structure of the object: ``` {r stuff} eval2

str(eval2, max.level=3)

The first slot tells which algorithm was used (maxent.jar or maxnet) and which version of the software.
``` {r stuff1}
eval2@algorithm

The next slot holds the table of evaluation metrics. We can use this to, for example, select the 'best' model based on one or more of our evaluation criteria. Let's find the model settings that resulted in the lowest AICc. ``` {r stuff2} eval2@results

eval2@results[which(eval2@results$delta.AICc==0),]

Now let's access the RasterStack of the model predictions.  Note that these predictions are in the 'raw' output format.
``` {r stuff3}
eval2@predictions

Now plot the model with the lowest AICc: ``` {r stuff4, fig.width = 5, fig.height = 5} plot(eval2@predictions[[which(eval2@results$delta.AICc==0)]], main="Relative occurrence rate")

If we used the 'maxent.jar' algorithm, we can also access a list of Maxent model objects, which (as all lists) can be subset with double brackets (e.g. `results@eval2[[1]]`). The Maxent model objects provide access to various elements of the model (including the lambda file). The model objects can also be used for predicting models into other time periods or geographic areas. Note that the html file that is created when Maxent is run is **not** kept.

*(Stay tuned for an update on the vignette focusing on the output from the 'maxnet' algorithm)*

Let's look at the model object for our "AICc optimal" model:
```r
aic.opt <- eval2@models[[which(eval2@results$delta.AICc==0)]]

aic.opt

The "results" slot shows the Maxent model statistics:

aic.opt@results

You can use the var.importance function to get a data.frame of two variable importance metrics: percent contribution and permutation importance. See the function help file for more information on these metrics.

var.importance(aic.opt)

The "lambdas" slot shows which variables were included in the model. After the variable name, the next number is the variable coefficient, then the min and max of that variable for the inut data. If the coefficient is 0, that variable was not included in the model. You will likely find the syntax to be cryptic and the information is not stored in a very user-friendly way. Fortunately, John Baumgartner has developed a useful function to parse this file into a more user-friendly data.frame. See the parse_lambdas.R function in his rmaxent package for more details.

aic.opt@lambdas

Finally, the ENMevaluate object also remembers which occurrence partitioning method you used:

eval2@partition.method

Plotting results {#plot}

Plotting options in R are extremely flexible and here we demonstrate some key tools to explore the results of an ENMevaluate object graphically.

ENMeval has a built-in plotting function (eval.plot) to visualize the results of different models. It requires the results table of the ENMevaluation object. By default, it plots delta.AICc values.

``` {r plot.res, fig.width = 5, fig.height = 5} eval.plot(eval2@results)

You can choose which evaluation metric to plot, and you can also include error bars, if relevant.
``` {r plot.res2, fig.width = 5, fig.height = 5}
eval.plot(eval2@results, 'Mean.AUC', var='Var.AUC')

You can also plot the permutation importance or percent contribution. ``` {r plot.res3, fig.width = 5, fig.height = 5} df <- var.importance(aic.opt) barplot(df$permutation.importance, names.arg=df$variable, las=2, ylab="Permutation Importance")

#### Plotting model predictions {#plot.preds}
If you generated raster predictions of the models (i.e., `rasterpreds=T`), you can easily plot them. For example, let's look at the first two models included in our analysis - remember that the output values are in Maxent's 'raw' units.  

``` {r plot.pred1, fig.width = 5, fig.height = 5, mar=c(2,2,1,0)}
plot(eval2@predictions[[1]], legend=F)

# Now add the occurrence and background points, colored by evaluation bins:
points(eval2@bg.pts, pch=3, col=eval2@bg.grp, cex=0.5)
points(eval2@occ.pts, pch=21, bg=eval2@occ.grp)

Let's see how model complexity changes the predictions in our example. We'll compare the model predictions of the model with only linear feature classes and with the highest regularization multiplier value we used (i.e., fc='L', RM=2) versus the model with all feature class combination and the lowest regularization multiplier value we used (i.e., fc='LQP', RM=1).

``` {r plot.pred2, fig.width = 5, fig.height = 2.5}

bisect the plotting area to make two columns

par(mfrow=c(1,2), mar=c(2,2,1,0))

plot(eval2@predictions[['L_2']], ylim=c(-30,20), xlim=c(-90,-40), legend=F, main='L_2 prediction')

plot(eval2@predictions[['LQP_1']], ylim=c(-30,20), xlim=c(-90,-40), legend=F, main='LQP_1 prediction')

#### Plotting response curves {#plot.resp}
We can also plot the response curves of our model to see how different input variables influence our model predictions.  (Note that, as with the `dismo::maxent` function, using this function requires that the maxent.jar file be installed in the `dismo` package java folder).

``` {r response_curves, eval=FALSE}
  response(eval2@models[[1]])

Downstream Analyses (under construction) {#downstream}

Below is a running list of other things we plan to add to this vignette. Feel free to let us know if there are particular things you would like to see added.

Resources (under construction) {#resources}

Web Resources
General Guides
Model Evaluation
Some Empirical Examples


Try the ENMeval package in your browser

Any scripts or data that you put into this service are public.

ENMeval documentation built on Jan. 13, 2021, 8:08 p.m.