To cite the package 'MaxentVariableSelection' in publications, use:
Jueterbock, A., Smolina, I., Coyer, J. A., & Hoarau, G. (2016). The fate of the Arctic seaweed Fucus distichus under climate change: an ecological niche modeling approach. Ecology and Evolution 6(6) 1712-1724.
Complex niche models show low performance in identifying the most important range-limiting environmental variables and in transferring habitat suitability to novel environmental conditions [@Warren2011; @Warren2014]. This vignette demonstrates how to constrain complexity and increase performance of Maxent niche models by identifying the most important set of uncorrelated environmental variables and by fine-tuning Maxent's regularization multiplier.
Users of this package should be familiar with Maxent Niche Modelling. If you are not, you can find a great tutorial here.
To test for the importance of environmental variables in discriminating suitable from non-suitable habitat, these variables must be available as ASCII grids (in ESRI's .asc format). All variables must have the same extent and resolution.
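The requirement that all grids share extent and resolution can be checked before running anything. The following is a minimal sketch in base R, assuming the usual ESRI ASCII header lines (`ncols`, `nrows`, `xllcorner`, `yllcorner`, `cellsize`); the helper names `read_asc_header`, `grids_match`, and `write_demo_grid` are hypothetical and not part of this package, and the demo grids are made up.

```r
# Read the header of an ESRI ASCII grid into a named numeric vector.
# Assumes header lines of the form "<key> <value>" at the top of the file.
read_asc_header <- function(path, n = 6) {
  lines <- readLines(path, n = n)
  parts <- strsplit(trimws(lines), "[[:space:]]+")
  values <- vapply(parts, function(p) as.numeric(p[2]), numeric(1))
  names(values) <- tolower(vapply(parts, function(p) p[1], character(1)))
  values
}

# TRUE if all grids share dimensions, origin, and cell size.
grids_match <- function(paths) {
  keys <- c("ncols", "nrows", "xllcorner", "yllcorner", "cellsize")
  headers <- lapply(paths, function(p) read_asc_header(p)[keys])
  all(vapply(headers, function(h) isTRUE(all.equal(h, headers[[1]])), logical(1)))
}

# Demonstration with tiny made-up grids written to a temporary folder
write_demo_grid <- function(path, cellsize) {
  writeLines(c("ncols 2", "nrows 2", "xllcorner -180", "yllcorner -90",
               paste("cellsize", cellsize), "NODATA_value -9999",
               "1 2", "3 4"), path)
}
demo_dir <- tempfile("grids")
dir.create(demo_dir)
grid_a <- file.path(demo_dir, "a.asc")
grid_b <- file.path(demo_dir, "b.asc")
grid_c <- file.path(demo_dir, "c.asc")
write_demo_grid(grid_a, 90)
write_demo_grid(grid_b, 90)
write_demo_grid(grid_c, 45)

grids_match(c(grid_a, grid_b))  # identical headers
grids_match(c(grid_a, grid_c))  # cell size differs
```

Grids that use `xllcenter`/`yllcenter` instead of the corner keys would need the `keys` vector adjusted.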
For example, biologically meaningful variables are freely available from the WorldClim dataset [@Hijmans2005] for terrestrial systems and from the Bio-ORACLE dataset [@Tyberghein2012] for marine systems.
This package runs Maxent to evaluate the performance of models built with different subsets of environmental variables. The Maxent program, however, doesn't come with this R package. It has to be downloaded separately from here. The program is free but you will have to enter your name, institution, and email prior to downloading. This package has been tested with Maxent version 3.3.3k.
Geo-coded background and occurrence records have to be provided in a format that is referred to in the Maxent tutorial as SWD format. The data sets Backgrounddata.csv and Occurrencedata.csv, which were used in [@Jueterbock2015] and are included in this package, exemplify the required content.
For each location, each file lists the longitude and latitude values, as well as the values for each environmental variable.
Let's have a look at the content of Occurrencedata.csv:
occurrencelocations <- system.file("extdata", "Occurrencedata.csv", package="MaxentVariableSelection")
occurrencelocations <- read.csv(occurrencelocations, header=TRUE)
head(occurrencelocations)
For each of 98 occurrence records of the macroalga Fucus distichus (Fd), this dataset contains longitude and latitude values, as well as values for four environmental variables (calcite, parmean, salinity, and sstmax).
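The layout described above (species, longitude, latitude, then one column per environmental variable) can be verified with a few lines of base R. This is only a sketch: `check_swd` and `same_variables` are hypothetical helpers, not functions of this package, and the two demo rows contain invented values rather than real records.

```r
# Check the column layout: species label first, then longitude and
# latitude, then at least one numeric environmental variable.
check_swd <- function(df) {
  if (ncol(df) < 4) return(FALSE)
  lon <- df[[2]]
  lat <- df[[3]]
  coords_ok <- is.numeric(lon) && is.numeric(lat) &&
    isTRUE(all(lon >= -180 & lon <= 180)) &&
    isTRUE(all(lat >= -90 & lat <= 90))
  vars_ok <- all(vapply(df[-(1:3)], is.numeric, logical(1)))
  coords_ok && vars_ok
}

# Occurrence and background files should list the same variables.
same_variables <- function(occ, bg) {
  identical(names(occ)[-(1:3)], names(bg)[-(1:3)])
}

# Invented demo rows that follow the described layout
occ <- data.frame(species = "Fd",
                  longitude = c(10.5, 15.2), latitude = c(60.1, 69.7),
                  calcite = c(0.001, 0.002), parmean = c(30.1, 25.4),
                  salinity = c(33.5, 34.1), sstmax = c(12.3, 9.8))
bg <- occ
bg$species <- "background"

check_swd(occ)
same_variables(occ, bg)
```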
Such files are easy to create with the extract function of the R package 'raster' [@Hijmans2015] once we have longitude and latitude values of the occurrence records.
You can install the 'raster' package with the following command:
install.packages('raster')
Let's extract variables from the Bio-ORACLE database [@Tyberghein2012] for each of the 98 occurrence records of Fucus distichus. Downloading the variables (from here) may take a while as the rasters sum to about 160 MB in size. The variables are compressed as RAR files. Extract them and save them in a folder called BioORACLEVariables. The following R code shows how to obtain the value of these variables for each of the occurrence records:
# Load the raster package
library(raster)

# Load the occurrence records
occurrencelocations <- system.file("extdata", "Occurrencedata.csv", package="MaxentVariableSelection")
occurrencelocations <- read.csv(occurrencelocations, header=TRUE)
LonLatData <- occurrencelocations[,c(2,3)]

# Then load the environmental variables into R with the help of the
# stack function of the 'raster' package. You cannot just copy the
# following line but have to adjust the filepath to your own.
files <- list.files("/home/alj/Downloads/BioORACLEVariables", pattern='\\.asc$', full.names=TRUE)
Grids <- raster::stack(files)

# Extract the variables for all occurrence locations
VariablesAtOccurrencelocations <- raster::extract(Grids, LonLatData)

# Combine the extracted values with the longitude and latitude values
Outfile <- as.data.frame(cbind("Fucusdistichus", LonLatData, VariablesAtOccurrencelocations))
colnames(Outfile) <- c("species", "longitude", "latitude", colnames(VariablesAtOccurrencelocations))

# Write this table to a csv file. write.csv sets the separator, decimal
# mark, and column names itself and warns if they are passed explicitly.
write.csv(Outfile, file = "VariablesAtOccurrencelocations.csv", na = "NA", row.names = FALSE)
The central function of this package is called VariableSelection. The basic usage of this function is:
VariableSelection(maxent, outdir, gridfolder,
occurrencelocations, backgroundlocations, additionalargs,
contributionthreshold, correlationthreshold, betamultiplier)
The following sections guide you through an example with the possible settings of each argument.
maxent
Specify the file path to the maxent.jar file, which has to be downloaded from here. Note that you cannot just copy the following line but instead have to specify the file path to maxent.jar on your own computer.
maxent <- ("/home/alj/Downloads/maxent.jar")
If you are working on a Windows platform, your filepath might look more like the following, where ... should be replaced by the hierarchical file structure that leads to maxent.jar on your computer:
maxent <- ("C:/.../maxent.jar")
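Because a wrong path only surfaces once the run has started, it can help to test the setting first. The sketch below is not part of this package; `maxent_ready` is a hypothetical helper, and it assumes that Maxent, being a Java program, is started through a `java` executable found on the system PATH.

```r
# Returns TRUE only if the path points to an existing .jar file and a
# 'java' executable can be located (assumption: Java is called via PATH).
maxent_ready <- function(maxent_path) {
  jar_found <- file.exists(maxent_path) && grepl("\\.jar$", maxent_path)
  java_found <- nzchar(unname(Sys.which("java")))
  if (!jar_found) message("maxent.jar not found at: ", maxent_path)
  if (!java_found) message("No 'java' executable found on the PATH")
  jar_found && java_found
}

# A path that does not exist fails the check
maxent_ready(file.path(tempdir(), "no_such_folder", "maxent.jar"))
```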
outdir
Filepath to the output directory, including the name of the directory, like:
outdir <- ("/home/alj/Downloads/OutputDirectory")
Again, you need to adjust the filepath to the correct path and folder on your own computer. All result files will be written into this folder. Please don't put important files in this folder as all files but the output files of the VariableSelection function will be deleted from this folder.
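Since files in the output directory can be deleted, one cautious approach is to use a fresh or empty folder only. The following sketch uses a hypothetical helper, `prepare_outdir`, that is not part of this package; it creates the folder if it is missing and declines to reuse a folder that already contains files.

```r
# Create the output directory if needed; return FALSE (with a warning)
# if it already exists and contains files that could be deleted.
prepare_outdir <- function(path) {
  if (!dir.exists(path)) {
    dir.create(path, recursive = TRUE)
    return(TRUE)
  }
  existing <- list.files(path, all.files = TRUE, no.. = TRUE)
  if (length(existing) > 0) {
    warning("Output directory is not empty: ", path)
    return(FALSE)
  }
  TRUE
}

# Demonstration with a temporary folder instead of a real output path
demo_outdir <- file.path(tempfile("maxentrun"), "OutputDirectory")
prepare_outdir(demo_outdir)
```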
gridfolder
Here, you specify the filepath to the folder containing all your ASCII grids of environmental variables that you consider to be potentially relevant in setting distribution limits of your target species. All variables must have the same extent and resolution.
If you are puzzled what grids I am referring to, have a look at the section ASCII Grids of environmental variables above.
As with the maxent and outdir arguments, you cannot just copy the following line, as the location of the ASCII grids is likely different on your computer. Adjust the filepath to the folder where you stored your ASCII grids. In this example I provide the filepath to the Bio-ORACLE rasters that I downloaded from here.
gridfolder <- ("/home/alj/Downloads/BioORACLEVariables")
occurrencelocations
Here, you need to specify the filepath to the csv file of occurrence locations (see the section Files of occurrence and background locations above for the required file format). An example file of occurrence locations for the macroalga Fucus distichus is included in this package. You load it with:
occurrencelocations <- system.file("extdata", "Occurrencedata.csv", package="MaxentVariableSelection")
Instead, if the file with your occurrence locations is stored on your own computer, set here the filepath to it. For example:
occurrencelocations <- "/home/alj/Downloads/Occurrencedata.csv"
backgroundlocations
The same applies for the backgroundlocations argument as for the occurrencelocations argument.
If you want to use the background locations that were used in [@Jueterbock2015], type:
backgroundlocations <- system.file("extdata", "Backgrounddata.csv", package="MaxentVariableSelection")
Instead, if you have your own csv file with background locations, specify the filepath here, similar to:
backgroundlocations <- "/home/alj/Downloads/Backgrounddata.csv"
additionalargs
Maxent arguments can be specified with the additionalargs argument.
You find an overview of possible arguments at the end of the Maxent help file.
The following settings are preset in this package to calculate information criteria and AUC values. Don't change these values in the additionalargs string:
- autorun=true: starts running with the startup of the program.
- plots=false: does not create any plots.
- writeplotdata=false: does not write output data that are required to plot response curves.
- visible=false: does not show the Maxent user interface but runs the program from the command line.
- randomseed=true: the background data will be split into different test/train partitions with each new run.
- writebackgroundpredictions=false: does not write .csv files with predictions at background points.
- replicates=10: number of replicate runs; this is only set for the calculation of AUC values. The calculation of AICc values doesn't require partitioning the occurrence sites into test and training data and thus does not require replicate runs.
- replicatetype=subsample: uses random test/training data for replicated sample sets; this is only set for the calculation of AUC values.
- randomtestpoints=50: 50 percent of presence points are randomly set aside as test data; this is only set for the calculation of AUC values.
- redoifexists: redoes the analysis if output files exist already.
- writemess=false: does not write multidimensional similarity surfaces (MESS), which inform on novel climate conditions in the projection layers.
- writeclampgrid=false: does not write a grid on the spatial distribution of clamping.
- askoverwrite=false: does not ask if the data shall be overwritten if they exist already.
- pictures=false: does not create .png images of the output grids.
- outputgrids=false: does not write output grids for all replicate runs but only the summary grids; this is only set for the calculation of AUC values.
- outputformat=raw: writes output grids with raw probabilities of presence; this is only set for the calculation of information criteria, like AICc.

Instead, the additional Maxent arguments you might want to set are the features to be used, by selecting true or false for the arguments linear, quadratic, product, threshold, and hinge.
For this example, we are using only hinge features, with the following setting:
additionalargs="nolinear noquadratic noproduct nothreshold noautofeature"
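The string above switches off every feature type except hinge. A small helper can build such a string from the list of features to keep. This is a sketch only: `feature_args` is a hypothetical function, and the flag `nohinge` is an assumption based on the `no` prefix pattern of the flags shown above.

```r
# Build an additionalargs string that disables all feature types not
# listed in 'keep', and disables automatic feature selection.
feature_args <- function(keep) {
  all_features <- c("linear", "quadratic", "product", "threshold", "hinge")
  unknown <- setdiff(keep, all_features)
  if (length(unknown) > 0) {
    stop("Unknown feature type: ", paste(unknown, collapse = ", "))
  }
  dropped <- setdiff(all_features, keep)
  # paste0("no", character(0)) would return "no", so guard the empty case
  flags <- if (length(dropped) > 0) paste0("no", dropped) else character(0)
  paste(c(flags, "noautofeature"), collapse = " ")
}

additionalargs <- feature_args("hinge")
additionalargs
# [1] "nolinear noquadratic noproduct nothreshold noautofeature"
```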
contributionthreshold
This sets the threshold of model contribution below which environmental variables are excluded from the Maxent model. Model contributions range from 0% to 100% and reflect the importance of environmental variables in limiting the distribution of the target species.
In this example, we set the threshold to 5%, which means that all variables will be excluded when they contribute less than 5% to the model:
contributionthreshold <- 5
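To illustrate the effect of this threshold, the sketch below applies it to the contribution scores that this vignette reports for one of the example models (calcite 55.6, parmean 0.03, salinity 1.4, sstmax 42.9). It only mimics the filtering rule and is not the package's internal code.

```r
# Contribution scores (in percent) as reported in the example results
contributions <- c(calcite = 55.6044, parmean = 0.0332,
                   salinity = 1.4447, sstmax = 42.9176)
contributionthreshold <- 5

# Variables contributing less than the threshold are dropped
kept <- names(contributions)[contributions >= contributionthreshold]
kept
# [1] "calcite" "sstmax"
```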
correlationthreshold
This sets the threshold of the absolute value of Pearson's correlation coefficient (the threshold ranges from 0 to 1) above which environmental variables are regarded as correlated (based on values at all background locations). Of the correlated variables, only the variable with the highest contribution score will be kept; all other correlated variables will be excluded from the Maxent model. Correlated variables should be removed because they may reflect the same environmental conditions and can lead to overly complex or overpredicted models. Also, models compiled with correlated variables might give wrong predictions in scenarios where the correlations between the variables differ.
Here, we are setting the threshold to 0.9:
correlationthreshold <- 0.9
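The correlation rule can be tried out on simulated background values. In the sketch below everything is invented for illustration: the variable `sstmean`, the simulated values, and the contribution scores. It shows the rule itself (drop variables whose absolute correlation with the top contributor exceeds the threshold), not the package's internal code.

```r
set.seed(1)
n <- 200
sstmax <- rnorm(n, mean = 10, sd = 3)
background <- data.frame(
  sstmax  = sstmax,
  sstmean = sstmax - 2 + rnorm(n, sd = 0.3),  # nearly a copy of sstmax
  calcite = runif(n)                          # unrelated to temperature
)
contributions <- c(sstmax = 50, sstmean = 30, calcite = 20)  # invented scores
correlationthreshold <- 0.9

# Correlations of all variables with the variable of highest contribution
top <- names(which.max(contributions))
r <- cor(background)[top, ]

# Variables too strongly correlated with the top contributor are dropped
correlated <- setdiff(names(r)[abs(r) > correlationthreshold], top)
kept <- setdiff(names(contributions), correlated)
correlated
kept
```

Using `abs(r)` covers strong negative correlations as well as strong positive ones.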
betamultiplier
This argument sets the values of beta multipliers (regularization multipliers) for which variable selection shall be performed. The smaller this value, the more closely the projected distribution will fit the training data set. Models that are overfitted to the occurrence data are poorly transferable to novel environments and, thus, not appropriate for projecting distribution changes under environmental change. Performance will be compared between models created with the beta values given in this betamultiplier vector. Thus, providing a range of beta values from 1 (the default in Maxent) to 15 or so will help you to spot the optimal beta multiplier for your specific model.
Here, we are testing betamultipliers from 2 to 6 with steps of 0.5 with the following setting:
betamultiplier=seq(2,6,0.5)
With the argument settings that we specified above, we can start the variable selection. The following function call starts the model-selection procedure:
library("MaxentVariableSelection")
VariableSelection(maxent,
                  outdir,
                  gridfolder,
                  occurrencelocations,
                  backgroundlocations,
                  additionalargs,
                  contributionthreshold,
                  correlationthreshold,
                  betamultiplier)
The run will take a while (ca. one hour). The resulting output files will be saved in the folder that you specified in the argument outdir.
The following paragraphs explain what steps are actually taken in this procedure to identify the set of most relevant environmental variables. First, an initial Maxent model is compiled with all provided environmental variables. Then, all variables with a relative contribution score below the value set with contributionthreshold are excluded. Next, those variables are removed that correlate with the variable of highest contribution (at a correlation coefficient > correlationthreshold or < -correlationthreshold). The remaining set of variables is then used to compile a new Maxent model. Again, variables with low contribution scores are removed, and remaining variables that are correlated with the variable of second-highest contribution are discarded. This process is repeated until a set of uncorrelated variables is left that all have a model contribution above the value set with contributionthreshold. These steps are performed for the range of beta multipliers specified with betamultiplier.
The performance of each Maxent model is assessed with the sample-size-adjusted Akaike information criterion (AICc) [@Akaike1974], the area under the receiver operating characteristic curve (AUC) estimated from test data [@Fielding1997], and the difference between AUC values from training and test data (AUC.Diff), which estimates model overfitting. AICc values are estimated from single models that include all occurrence sites, based on code in ENMTools [@Warren2010]. AUC values, instead, are averaged over ten replicate runs which differ in the set of 50% test data that are randomly sub-sampled from the occurrence sites and withheld from model construction.
Maximization of test AUC values generally favors models that can discriminate well between presence and absence sites [@Fielding1997; @Warren2011]. This, however, bears the risk of overpredicting the realized niche of a species [@Jimenez2012]. Instead, minimization of AICc values generally favors models that recognize the fundamental niche of a species and that are better transferable to future climate scenarios [@Warren2011]. The optimal set of variables along with the best beta multiplier is then identified as the model of highest AUC or lowest AICc value (the results of the VariableSelection function give you both options).
The results are saved in the directory that was specified in the argument outdir. If you were patient enough to wait for the example variable selection to finish (see above), you will find seven tables and four figures in the output directory.
The output gives you the opportunity to decide whether you choose AUC.Test values or AICc values as performance criterion to identify the model of highest performance. If we choose AICc values as performance criterion, the following result files are most relevant:
The figure ModelSelectionAICc_MarkedMinAICc.png shows that the model with beta-multiplier 2.5 and with only 2 environmental variables shows highest performance (lowest AICc).
Instead, if you decide to use AUC.Test values as performance criterion, the most relevant figure is ModelSelectionAUCTest_MarkedMaxAUCTest.png. If you compare the two figures, you find that the model with maximum AUC.Test value differs from the model with minimum AICc value. Although both models chose 2 variables, the model with maximum AUC.Test value had a betamultiplier of 3.5, while the model with minimum AICc value had a betamultiplier of 2.5.
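The two criteria simply pick different rows of the same performance table. The sketch below shows this with a small table whose column names follow the result tables of this vignette. Only the AICc and AUC.Test values of model 4 are taken from the example results; all other rows and model numbers are invented placeholders chosen to mirror the outcome described above.

```r
performance <- data.frame(
  Model          = c(2, 4, 6, 8),           # model numbers partly invented
  betamultiplier = c(2, 2.5, 3, 3.5),
  variables      = c(3, 2, 2, 2),
  AICc           = c(2941.10, 2935.43, 2938.20, 2939.75),  # placeholders except model 4
  AUC.Test       = c(0.74, 0.75, 0.75, 0.76)               # placeholders except model 4
)

# Lowest AICc versus highest test AUC
best_aicc <- performance[which.min(performance$AICc), ]
best_auc  <- performance[which.max(performance$AUC.Test), ]

best_aicc$betamultiplier
best_auc$betamultiplier
```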
The figures also show that the maximum number of variables in any of the models is four, although the gridfolder contained many more variables. Only those four variables that were extracted for the occurrencelocations and for the backgroundlocations were included in the initial Maxent models. If you would like to identify whether variables other than these four (calcite, parmean, sstmax, and salinity) are important in setting distribution limits, they first have to be extracted for the occurrencelocations and the backgroundlocations as described above in the section 'Files of occurrence and background locations'.
The table ModelWithMinAICc.txt lists performance indicators of the model with lowest AICc value. This table is a subset of the table ModelPerformance.txt, which lists performance indicators of all created Maxent models. The first five columns show that the best performing model is model number 4, with a betamultiplier of 2.5 and only 2 variables:
|Model|betamultiplier|variables|samples|parameters|loglikelihood|
|:-----:|:--------------:|:---------:|:-------:|:----------:|:-------------:|
|4 | 2.5| 2| 97| 10| -1456.44|
The following six columns provide the performance indicators of this model:
| AIC | AICc | BIC |AUC.Test|AUC.Train|AUC.Diff|
|:-------:|:-------:|:-------:|:--------:|:---------:|:--------:|
|2932.87|2935.43|2958.62| 0.75| 0.78| 0.035|
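The information criteria in this table can be reproduced from the log-likelihood, the number of parameters, and the number of samples listed for model 4, using the standard textbook formulas. This is a sketch for checking the numbers, not the code that the package runs; the results agree with the table to within about 0.01, which is consistent with the log-likelihood being rounded to two decimals.

```r
loglikelihood <- -1456.44
parameters    <- 10   # number of model parameters (k)
samples       <- 97   # number of occurrence samples (n)

# AIC  = 2k - 2 ln(L)
aic <- 2 * parameters - 2 * loglikelihood
# AICc = AIC + 2k(k + 1) / (n - k - 1), the small-sample correction
aicc <- aic + (2 * parameters * (parameters + 1)) / (samples - parameters - 1)
# BIC  = k ln(n) - 2 ln(L)
bic <- parameters * log(samples) - 2 * loglikelihood

round(c(AIC = aic, AICc = aicc, BIC = bic), 2)
```

The correction term grows quickly as the number of parameters approaches the number of samples, which is why AICc penalizes complex models fitted to few occurrence records.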
The table ModelWithMaxAUCTest.txt, instead, lists performance indicators of the model with maximum AUC.Test value.
The table VariableSelectionMinAICc.txt shows contributions of and correlations between environmental variables for those models that led directly to the model of lowest AICc value. This table and the table VariableSelectionMaxAUCTest.txt are subsets of VariableSelectionProcess.txt, which lists variable contributions and correlations for all created Maxent models.
The content of VariableSelectionMinAICc.txt is:
|Test |Contributions|Correlation|Contributions|Correlation|
|:--------------|-------------:|-----------:|-------------:|-----------:|
|Model | 3| 3| 4| 4|
|betamultiplier| 2.5| 2.5| 2.5| 2.5|
|calcite | 55.6044000| 1.0000000| 58.9015000| 0.1314525|
|parmean | 0.0332| NA| NA| NA|
|salinity | 1.4447| NA| NA| NA|
|sstmax | 42.9176000| 0.1314525| 41.0985000| 1.0000000|
Here, the variable selection process that led to the model with minimum AICc value started with model 3. Two other models with betamultiplier 2 were created before. Since the contributionthreshold was set to 5, all variables with a lower contribution (parmean and salinity) were excluded. That's why they have NA values from the first correlation test on (NA stands for 'not available'). The remaining variables (calcite and sstmax) were kept in the end because the correlation between them was below the correlationthreshold of 0.9.
In conclusion, the optimal model settings (based on the performance-indicator AICc) are: 1) a beta-multiplier of 2.5; and 2) a subset of two environmental variables: calcite and sstmax.