# R/ENMevaluate.R

#' @title Tune ecological niche model (ENM) settings and calculate evaluation statistics
#' @description \code{ENMevaluate()} is the primary function for the \pkg{ENMeval} package. This
#' function builds ecological niche models iteratively across a range of user-specified tuning 
#' settings. Users can choose to evaluate models with cross validation or a full-withheld testing 
#' dataset. \code{ENMevaluate()} returns an \code{ENMevaluation} object with slots containing 
#' evaluation statistics for each combination of settings and for each cross validation fold therein, as
#' well as raster predictions for each model when raster data is input. The evaluation statistics in the 
#' results table should aid users in identifying model settings that balance fit and predictive ability. See
#' the extensive vignette for fully worked examples: 
#' <https://jamiemkass.github.io/ENMeval/articles/ENMeval-2.0-vignette.html>.
#' 
#' @param occs matrix / data frame: occurrence records with two columns for longitude and latitude 
#' of occurrence localities, in that order. If specifying predictor variable values
#' assigned to presence/background localities (without inputting raster data), this table should also have 
#' one column for each predictor variable. See Details for important distinctions between running the function
#' with and without rasters.
#' @param envs RasterStack: environmental predictor variables. These should be in same geographic projection as occurrence data.
#' @param bg matrix / data frame: background records with two columns for longitude and latitude of 
#' background (or pseudo-absence) localities, in that order. If NULL, points will be randomly sampled across \code{envs} 
#' with the number specified by argument \code{n.bg}. If specifying predictor variable values
#' assigned to presence/background localities (without inputting raster data), this table should also have 
#' one column for each predictor variable. See Details for important distinctions between running the function
#' with and without rasters.
#' @param tune.args named list: model settings to be tuned (e.g., for Maxent models: \code{list(fc = c("L","Q"), rm = 1:3)}).
#' @param partitions character: name of partitioning technique. Currently available options are
#' the nonspatial partitions "randomkfold" and "jackknife"; the spatial partitions "block",
#' "checkerboard1", and "checkerboard2"; "testing" for partitioning with fully withheld data (see 
#' argument occs.testing); the "user" option (see argument user.grp); and "none" for no partitioning 
#' (see \code{?partitions} for details).
#' @param algorithm character: name of the algorithm used to build models. Currently one of "maxnet",
#' "maxent.jar", or "bioclim", else the name from a custom ENMdetails implementation.
#' @param partition.settings named list: used to specify certain settings for partitioning schema.
#' See Details and ?partitions for descriptions of these settings.
#' @param other.settings named list: used to specify extra settings for the analysis. 
#' All of these settings have internal defaults, so if they are not specified the analysis will be run 
#' with default settings. See Details for descriptions of these settings, including how to specify arguments
#' for maxent.jar.
#' @param categoricals character vector: name or names of categorical environmental variables. If not specified,
#' all predictor variables will be treated as continuous unless they are factors. If categorical variables
#' are already factors, specifying names of such variables in this argument is not needed.
#' @param doClamp boolean: if TRUE (default), model prediction extrapolations will be restricted to the upper and lower
#' bounds of the predictor variables. Clamping avoids extreme predictions for environment values outside
#' the range of the training data. If free extrapolation is a study aim, this should be set to FALSE, but
#' for most applications leaving this at the default of TRUE is advisable to avoid unrealistic predictions. 
#' When predictor variables are input, they are clamped internally before making model predictions when clamping is on.
#' When no predictor variables are input and data frames of variable values are used instead (SWD format),
#' validation data is clamped before making model predictions when clamping is on.
#' @param clamp.directions named list: specifies the direction ("left" for minimum, "right" for maximum) 
#' of clamping for predictor variables -- (e.g., \code{list(left = c("bio1","bio5"), right = c("bio10","bio15"))}).
#' @param user.enm ENMdetails object: a custom ENMdetails object used to build models. 
#' This is an alternative to specifying \code{algorithm} with a character string.
#' @param user.grp named list: specifies user-defined partition groups, where \code{occs.grp} = vector of partition group 
#' (fold) for each occurrence locality, intended for user-defined partitions, and \code{bg.grp} = same vector for 
#' background (or pseudo-absence) localities.
#' @param occs.testing matrix / data frame: a fully withheld testing dataset with two columns for longitude and latitude 
#' of occurrence localities, in that order; used when \code{partitions = "testing"}. These occurrences are used only 
#' for evaluation, not for model training, and thus no cross validation is performed.
#' @param taxon.name character: name of the focal species or taxon. This is used primarily for annotating
#' the ENMevaluation object and output metadata (rmm), but not necessary for analysis.
#' @param n.bg numeric: the number of background (or pseudo-absence) points to randomly sample over the environmental  
#' raster data (default: 10000) if background records were not already provided.
#' @param overlap boolean: if TRUE, calculate niche overlap statistics (Warren \emph{et al.} 2008).
#' @param overlapStat character: niche overlap statistics to be calculated -- 
#' "D" (Schoener's D) and/or "I" (Hellinger's I) -- see ?calc.niche.overlap for more details.
#' @param user.val.grps matrix / data frame: user-defined validation record coordinates and predictor variable values. 
#' This is used internally by \code{ENMnulls()} to force each null model to evaluate with empirical validation data,
#' and does not have any current use when running \code{ENMevaluate()} independently.
#' @param user.eval function: custom function for specifying performance metrics not included in \pkg{ENMeval}.
#' The function must first be defined and then input as the argument \code{user.eval}. 
#' This function should have a single argument called \code{vars}, which is a list that includes different data 
#' that can be used to calculate the metric. See Details below and the vignette for a worked example.
#' @param rmm rangeModelMetadata object: if specified, \code{ENMevaluate()} will write metadata details for the analysis into
#' this object, but if not, a new \code{rangeModelMetadata} object will be generated and included in the output
#' \code{ENMevaluation} object.
#' @param parallel boolean: if TRUE, run with parallel processing.
#' @param numCores numeric: number of cores to use for parallel processing. If NULL, all available cores will be used.
#' @param parallelType character: either "doParallel" or "doSNOW" (default: "doSNOW").
#' @param updateProgress boolean: if TRUE, use shiny progress bar. This is only for use in shiny apps.
#' @param quiet boolean: if TRUE, silence all function messages (but not errors).
#' @param occ,env,bg.coords,RMvalues,fc,occ.grp,bg.grp,method,bin.output,rasterPreds,clamp,progbar These arguments from previous versions 
#' are retained for backward compatibility to avoid unnecessary errors in older scripts, but they will be permanently 
#' deprecated in a later version.
#' 
#' @details There are a few methodological details in the implementation of ENMeval >=2.0.0 that are important to mention.
#' There is also a brief discussion of some points relevant to null models in ?ENMnulls.
#' 
#' 1. By default, validation AUC is calculated with respect to the full background (training + validation).
#' This approach follows Radosavljevic & Anderson (2014). This setting can be changed by assigning 
#' other.settings$validation.bg to "partition", which will calculate AUC with respect 
#' to the validation background only. The default value for other.settings$validation.bg is "full".
#' 
#' 2. The continuous Boyce index (always) and AICc (when no raster is provided) are not calculated using 
#' the predicted values of the RasterStack delineating the full study extent, but instead using the predicted
#' values for the background records. This decision to use the background only for calculating the continuous 
#' Boyce index was made to simplify the code and improve running time. The decision for AICc was made in order
#' to allow AICc calculations for datasets that do not include raster data. See ?calc.aicc for more details,
#' and for caveats when calculating AICc without raster data (mainly, that if the background does not 
#' adequately represent the occurrence records, users should use the raster approach, for reasons explained
#' in the calc.aicc documentation). For both metrics, if the background records are a good representation 
#' of the study extent, there should not be much difference between this approach using the background 
#' data and the approach that uses rasters.
#' 
#' 3. When running \code{ENMevaluate()} without raster data, and instead adding the environmental predictor values
#' to the occurrence and background data tables, users may notice some differences in the results. Occurrence records
#' that share a raster grid cell are automatically removed when raster data is provided, but without raster data
#' this functionality cannot operate, and thus any such duplicate occurrence records can remain in the training data.
#' The Java implementation of Maxent (maxent.jar) should automatically remove these records, but neither the R implementation 
#' \code{maxnet} nor the \code{bioclim()} function from the R package \code{dismo} does so. Therefore,  
#' it is up to the user to remove such records before running \code{ENMevaluate()} when raster data are not included.
#' 
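The caveat above -- that records sharing a grid cell must be removed by hand when no rasters are input -- can be sketched in base R. This is a minimal illustration, not ENMeval's internal code (which uses raster cell numbers via \code{raster::extract()}); the resolution value, grid alignment, and column names here are hypothetical.

```r
# Sketch: drop occurrences that fall in the same grid cell, assuming square
# cells of known resolution `res` aligned to the coordinate origin.
dedupe_by_cell <- function(occs, res = 0.5) {
  # integer cell index per record, derived from the coordinates
  cell.id <- paste(floor(occs[, 1] / res), floor(occs[, 2] / res))
  occs[!duplicated(cell.id), , drop = FALSE]
}

occs <- data.frame(lon = c(10.1, 10.2, 11.9), lat = c(5.1, 5.2, 5.1))
dedupe_by_cell(occs, res = 0.5)  # rows 1 and 3 remain; row 2 shares a cell with row 1
```

With real raster data this step is unnecessary, as \code{ENMevaluate()} removes cell duplicates internally.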
#' Below are descriptions of the parameters used in the other.settings, partition.settings, and user.eval arguments.
#' 
#' For other.settings, the options are:\cr*
#' abs.auc.diff - boolean: if TRUE, take absolute value of AUCdiff (default: TRUE)\cr*
#' pred.type - character: specifies which prediction type should be used to generate maxnet or 
#' maxent.jar prediction rasters (default: "cloglog").\cr*
#' validation.bg - character: either "full" to calculate training and validation AUC and CBI 
#' for cross-validation with respect to the full background (default), or "partition" (meant for 
#' spatial partitions only) to calculate each with respect to the partitioned background only 
#' (i.e., training occurrences are compared to training background, and validation occurrences 
#' compared to validation background).\cr*
#' other.args - named list: any additional model arguments not specified for tuning; this can
#' include arguments for maxent.jar, which are described in the software's Help file.\cr
#' 
#' For partition.settings, the current options are:\cr*
#' orientation - character: one of "lat_lon" (default), "lon_lat", "lat_lat", or "lon_lon" (required for block partition).\cr* 
#' aggregation.factor - numeric vector: one or two numbers specifying the factor with which to aggregate the envs 
#' raster to assign partitions (default: 2; required for the checkerboard partitions).\cr*
#' kfolds - numeric: the number of folds (i.e., partitions) for random partitions (default: 5).\cr
#' 
#' For the block partition, the orientation specifications are abbreviations for "latitude" and "longitude", 
#' and they determine the order and orientations with which the block partitioning function creates the partition groups. 
#' For example, "lat_lon" will split the occurrence localities first by latitude, then by longitude. For the checkerboard 
#' partitions, the aggregation factor specifies how much to aggregate the existing cells in the envs raster
#' to make new spatial partitions. For example, checkerboard1 with an aggregation factor value of 2 will make the grid cells 
#' 4 times larger and then assign occurrence and background records to partition groups based on which cell they are in. 
#' The checkerboard2 partition is hierarchical, so cells are first aggregated to define groups like checkerboard1, but a 
#' second aggregation is then made to separate the resulting 2 bins into 4 bins. For checkerboard2, two different numbers 
#' can be used to specify the two levels of the hierarchy, or if a single number is inserted, that value will be used 
#' for both levels.
#' 
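The checkerboard aggregation logic described above can be sketched with simple index arithmetic. This is a conceptual mimic, not ENMeval's implementation (which aggregates the envs raster itself): it assumes hypothetical zero-based cell row/column indices and derives a checkerboard1-style group from the parity of the aggregated indices.

```r
# Sketch of checkerboard1-style group assignment for an aggregation factor of 2:
# cells are merged into 2x2 blocks, and each block gets group 1 or 2 in an
# alternating (checkerboard) pattern based on row + column parity.
checkerboard1_grp <- function(col, row, aggregation.factor = 2) {
  agg.col <- floor(col / aggregation.factor)
  agg.row <- floor(row / aggregation.factor)
  ((agg.col + agg.row) %% 2) + 1
}

checkerboard1_grp(col = 0, row = 0)  # group 1
checkerboard1_grp(col = 2, row = 0)  # group 2 (adjacent aggregated block)
```

Checkerboard2 would apply a second round of the same aggregation on top of the first to yield four groups.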
#' For user.eval, the variables accessible to your custom function are listed below. 
#' See the vignette for a worked example.\cr*
#' enm - ENMdetails object\cr*
#' occs.train.z - data frame: predictor variable values for training occurrences\cr*
#' occs.val.z - data frame: predictor variable values for validation occurrences\cr*
#' bg.train.z - data frame: predictor variable values for training background\cr*
#' bg.val.z - data frame: predictor variable values for validation background\cr*
#' mod.k - Model object for current partition (k)\cr*
#' nk - numeric: number of folds (i.e., partitions)\cr*
#' other.settings - named list: other settings specified in ENMevaluate()\cr*
#' partitions - character: name of the partition method (e.g., "block")\cr*
#' occs.train.pred - numeric: predictions made by mod.k for training occurrences\cr*
#' occs.val.pred - numeric: predictions made by mod.k for validation occurrences\cr*
#' bg.train.pred - numeric: predictions made by mod.k for training background\cr*
#' bg.val.pred - numeric: predictions made by mod.k for validation background
#' 
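As a sketch of how these vars fields can be combined, a user.eval function might compute a minimum-training-presence omission rate from the training and validation predictions. The one-row data frame return format follows the vignette's worked example; the metric and the function name \code{mtp.omission} are illustrative, not part of the package.

```r
# Hypothetical user.eval function: proportion of validation occurrences
# predicted below the lowest prediction for any training occurrence.
mtp.omission <- function(vars) {
  mtp <- min(vars$occs.train.pred)            # minimum training presence threshold
  omission <- mean(vars$occs.val.pred < mtp)  # validation points falling below it
  data.frame(mtp.omission = omission)
}

# usage (sketch): ENMevaluate(occs, envs, bg, ..., user.eval = mtp.omission)
```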
#' @references 
#' 
#' Muscarella, R., Galante, P. J., Soley-Guardia, M., Boria, R. A., Kass, J. M., Uriarte, M., & Anderson, R. P. (2014). ENMeval: An R package for conducting spatially independent evaluations and estimating optimal model complexity for Maxent ecological niche models. \emph{Methods in Ecology and Evolution}, \bold{5}: 1198-1205. \doi{10.1111/2041-210X.12261}
#' 
#' Warren, D. L., Glor, R. E., Turelli, M. & Funk, D. (2008) Environmental niche equivalency versus conservatism: quantitative approaches to niche evolution. \emph{Evolution}, \bold{62}: 2868-2883. \doi{10.1111/j.1558-5646.2008.00482.x}
#' 
#' @return An ENMevaluation object. See ?ENMevaluation for details and description of the columns
#' in the results table.
#'
#' @importFrom magrittr %>%
#' @importFrom foreach %dopar%
#' @importFrom grDevices rainbow
#' @importFrom methods new slot validObject
#' @importFrom stats pnorm predict quantile runif sd
#' @importFrom utils citation combn packageVersion setTxtProgressBar txtProgressBar
#'
#'
#' @examples
#' \dontrun{
#' occs <- read.csv(file.path(system.file(package="dismo"), "ex/bradypus.csv"))[,2:3]
#' envs <- raster::stack(list.files(path=file.path(system.file(package="dismo"), "ex"), 
#'                                  pattern="grd", full.names=TRUE))
#' occs.z <- cbind(occs, raster::extract(envs, occs))
#' occs.z$biome <- factor(occs.z$biome)
#' bg <- as.data.frame(dismo::randomPoints(envs, 1000))
#' names(bg) <- names(occs)
#' bg.z <- cbind(bg, raster::extract(envs, bg))
#' bg.z$biome <- factor(bg.z$biome)
#' 
#' # set other.settings -- pred.type is only for Maxent models
#' os <- list(abs.auc.diff = FALSE, pred.type = "cloglog", validation.bg = "partition")
#' # set partition.settings -- here's an example for the block method
#' # see Details for the required settings for other partition methods
#' ps <- list(orientation = "lat_lat")
#' 
#' # here's a run with maxnet -- note the tune.args for feature classes (fc)
#' # and regularization multipliers (rm), as well as the designation of the
#' # categorical variable we are using (this can be a vector if multiple
#' # categorical variables are used)
#' e.maxnet <- ENMevaluate(occs, envs, bg, 
#' tune.args = list(fc = c("L","LQ","LQH","H"), rm = 1:5), 
#' partitions = "block", other.settings = os, partition.settings = ps,
#' algorithm = "maxnet", categoricals = "biome", overlap = TRUE)
#' 
#' # print the tuning results
#' eval.results(e.maxnet)
#' 
#' # there is currently no native function to make raster model predictions for
#' # maxnet models, but ENMeval can be used to make them like this:
#' # here's an example where we make a prediction based on the L2 model
#' # (feature class: Linear, regularization multiplier: 2) for our envs data
#' mods.maxnet <- eval.models(e.maxnet)
#' pred.L2 <- enm.maxnet@predict(mods.maxnet$fc.L_rm.2, envs, os)
#' raster::plot(pred.L2)
#' 
#' # here's a run with maxent.jar -- note that if the R package rJava cannot 
#' # install or load, or if other issues with Java exist on your computer, 
#' # maxent.jar will not function
#' e.maxent.jar <- ENMevaluate(occs, envs, bg, 
#' tune.args = list(fc = c("L","LQ","LQH","H"), rm = 1:5), 
#' partitions = "block", other.settings = os, partition.settings = ps,
#' algorithm = "maxent.jar", categoricals = "biome", overlap = TRUE)
#' 
#' # print the tuning results
#' eval.results(e.maxent.jar)
#' # raster predictions can be made for maxent.jar models with dismo or ENMeval
#' mods.maxent.jar <- eval.models(e.maxent.jar)
#' pred.L2 <- dismo::predict(mods.maxent.jar$fc.L_rm.2, envs, args = "outputform=cloglog")
#' pred.L2 <- enm.maxent.jar@predict(mods.maxent.jar$fc.L_rm.2, envs, os)
#' raster::plot(pred.L2)
#' 
#' # this will give you the percent contribution (not deterministic) and
#' # permutation importance (deterministic) values of variable importance for
#' # Maxent models, and it only works with maxent.jar
#' eval.variable.importance(e.maxent.jar)
#' 
#' # here's a run with BIOCLIM -- note that 1) we need to remove the categorical
#' # variable here because this algorithm only takes continuous variables, and
#' # that 2) the way BIOCLIM makes predictions is getting tuned (as opposed to the
#' # way the model is fit like maxnet or maxent.jar), namely, the tails of the 
#' # distribution that are ignored when predicting (see ?dismo::bioclim)
#' e.bioclim <- ENMevaluate(occs, envs[[-9]], bg,
#' tune.args = list(tails = c("low", "high", "both")),
#' partitions = "block", other.settings = os, partition.settings = ps,
#' algorithm = "bioclim", overlap = TRUE)
#' 
#' # print the tuning results
#' eval.results(e.bioclim)
#' # make raster predictions with dismo or ENMeval
#' mods.bioclim <- eval.models(e.bioclim)
#' # note: the models for low, high, and both are actually all the same, and
#' # the only difference for tuning is how they are predicted during
#' # cross-validation
#' pred.both <- dismo::predict(mods.bioclim$tails.both, envs, tails = "both")
#' os <- c(os, list(tails = "both"))
#' pred.both <- enm.bioclim@predict(mods.bioclim$tails.both, envs, os)
#' raster::plot(pred.both)
#' 
#' # please see the vignette for more examples of model tuning, 
#' # partitioning, plotting functions, and null models
#' # https://jamiemkass.github.io/ENMeval/articles/ENMeval-2.0-vignette.html
#' }
#' 
#' @export 

ENMevaluate <- function(occs, envs = NULL, bg = NULL, tune.args = NULL, partitions = NULL, algorithm = NULL, 
                        partition.settings = NULL, 
                        other.settings = NULL, 
                        categoricals = NULL, doClamp = TRUE, clamp.directions = NULL,
                        user.enm = NULL, user.grp = NULL, occs.testing = NULL, taxon.name = NULL, 
                        n.bg = 10000, overlap = FALSE, overlapStat = c("D", "I"), 
                        user.val.grps = NULL, user.eval = NULL, rmm = NULL,
                        parallel = FALSE, numCores = NULL, parallelType = "doSNOW", updateProgress = FALSE, quiet = FALSE, 
                        # legacy arguments
                        occ = NULL, env = NULL, bg.coords = NULL, RMvalues = NULL, fc = NULL, occ.grp = NULL, bg.grp = NULL,
                        method = NULL, bin.output = NULL, rasterPreds = NULL, clamp = NULL, progbar = NULL) {
  
  # legacy argument handling so ENMevaluate doesn't break for older code
  all.legacy <- list(occ, env, bg.coords, RMvalues, fc, occ.grp, bg.grp, method, bin.output, rasterPreds, clamp, progbar)
  if(sum(sapply(all.legacy, function(x) !is.null(x))) > 0) {
    if(quiet != TRUE) message("* Running ENMeval v2.0.4 with legacy arguments. These will be phased out in the next version.")
  }
  if(!is.null(occ)) occs <- occ
  if(!is.null(env)) envs <- env
  if(!is.null(bg.coords)) bg <- bg.coords
  if(!is.null(method)) partitions <- method
  if(!is.null(clamp)) doClamp <- clamp
  if(!is.null(rasterPreds)) {
    stop('The argument "rasterPreds" was deprecated. Please leave this argument at its default NULL. If you want to avoid generating model prediction rasters, include predictor variable values in the occurrence and background data frames (SWD format). See Details in ?ENMevaluate for more information.')
  }
  if(!is.null(RMvalues)) tune.args$rm <- RMvalues
  if(!is.null(fc)) tune.args$fc <- fc
  if(!is.null(occ.grp) & !is.null(bg.grp)) {
    user.grp <- list(occs.grp = occ.grp, bg.grp = bg.grp)
    if(quiet != TRUE) message('Warning: These arguments were deprecated and replaced with the argument "user.grp".')
  }
  if((!is.null(occ.grp) & is.null(bg.grp)) | (is.null(occ.grp) & !is.null(bg.grp))) {
    stop('For user partitions, please input both arguments "occ.grp" and "bg.grp". Warning: These are legacy arguments that were replaced with the argument "user.grp".')
  }
  
  if(is.null(algorithm) & is.null(user.enm)) {
    stop('* Please select a model name (argument "algorithm") or specify a user model (argument "user.enm").')
  }
  
  # record start time
  start.time <- proc.time()
  
  # check if ecospat is installed, and if not, prevent CBI calculations
  if((requireNamespace("ecospat", quietly = TRUE))) {
    ecospat.use <- TRUE
  }else{
    message("Package ecospat is not installed, so Continuous Boyce Index (CBI) cannot be calculated.")
    ecospat.use <- FALSE
  }
  
  ######################## #
  # INITIAL DATA CHECKS ####
  ######################## #
  
  # coerce occs and bg to df
  occs <- as.data.frame(occs)
  if(!is.null(bg)) bg <- as.data.frame(bg)
  # extract species name and coordinates
  
  # fill in these arguments with defaults if they are NULL
  if(is.null(partition.settings)) partition.settings <- list(orientation = "lat_lon", 
                                                             aggregation.factor = 2, 
                                                             kfolds = 5)
  # append defaults after any user-specified settings; because $ matches the
  # first occurrence of a name in a list, user-specified values take precedence
  other.settings <- c(other.settings, list(abs.auc.diff = TRUE, 
                                           pred.type = "cloglog", 
                                           validation.bg = "full"))
  # add whether to use ecospat to other.settings to avoid multiple calls to require()
  other.settings <- c(other.settings, ecospat.use = ecospat.use)
  
  # make sure taxon name column is not included
  if(inherits(occs[,1], "character") | (!is.null(bg) && inherits(bg[,1], "character"))) stop("* If first column of input occurrence or background data is the taxon name, remove it and instead include the 'taxon.name' argument. The first two columns must be the longitude and latitude of the occurrence/background localities.")
  
  if(is.null(taxon.name)) {
    if(quiet != TRUE) message(paste0("*** Running initial checks... ***\n"))
  }else{
    if(quiet != TRUE) message(paste0("*** Running initial checks for ", taxon.name, " ... ***\n"))
  }
  
  ## general argument checks
  all.partitions <- c("jackknife", "randomkfold", "block", "checkerboard1", 
                      "checkerboard2", "user", "testing", "none")
  
  if(is.null(partitions) || !(partitions %in% all.partitions)) {
    stop("Please enter an accepted partition method.")
  }
  
  if(partitions == "testing" & is.null(occs.testing)) {
    stop("If doing testing evaluations, please provide testing data (occs.testing).")
  }
  
  if((partitions == "checkerboard1" | partitions == "checkerboard2") & is.null(envs)) {
    stop('For checkerboard partitioning, predictor variable rasters "envs" are required.')
  }
  
  if(partitions == "randomkfold") {
    if(is.null(partition.settings$kfolds)) {
      stop('For random k-fold partitioning, a numeric, non-zero value of "kfolds" is required.')  
    }else{
      if(partition.settings$kfolds == 0) {
        stop('For random k-fold partitioning, a numeric, non-zero value of "kfolds" is required.')  
      }
    }
  }
  
  # make sure the testing dataset exists and is non-empty when partitions = "testing"
  if(partitions == "testing") {
    if(is.null(occs.testing) || nrow(occs.testing) == 0) {
      stop('If performing fully withheld testing, enter occs.testing dataset and assign partitions to "testing".')
    }
  }
  
  if(is.null(tune.args) & overlap == TRUE) {
    if(quiet != TRUE) message('* As no tuning arguments were specified, turning off niche overlap.')
    overlap <- FALSE
  }
  
  # make sure occs and bg are data frames with identical column names
  if(!is.null(bg) && !identical(names(occs), names(bg))) {
    stop('Datasets "occs" and "bg" have different column names. Please make them identical and try again.')
  }
  
  if(doClamp == FALSE & !is.null(clamp.directions)) {
    stop("If specifying clamp directions, please make doClamp = TRUE.")
  }
  
  # if a vector of tuning arguments is numeric, make sure it is sorted (for results table and plotting)
  tune.args.num <- which((sapply(tune.args, class) %in% c("numeric", "integer")) & sapply(tune.args, length) > 1)
  for(i in tune.args.num) {
    tune.args[[i]] <- sort(tune.args[[i]])
  }
  
  # make sure that validation.bg is only "partition" if the partitions are spatial
  if(!(partitions %in% c("block","checkerboard1","checkerboard2","user")) & other.settings$validation.bg == "partition") {
    stop('If using non-spatial partitions, please set validation.bg to "full". The "partition" option only makes sense when partitions represent different regions of the study extent. See ?ENMevaluate for details.')
  }else if(partitions == "user" & other.settings$validation.bg == "partition") {
    message('* Please make sure that the user-specified partitions are spatial, else validation.bg should be set to "full". The "partition" option only makes sense when partitions represent different regions of the study extent. See ?ENMevaluate for details.')
  }
  
  # choose a built-in ENMdetails object matching the input model name
  # unless the model is chosen by the user
  if(is.null(user.enm)) {
    enm <- lookup.enm(algorithm)
  }else{
    enm <- user.enm
  }
  
  # get unique values of user partition groups to make sure they all remain after occurrence data processing
  if(!is.null(user.grp)) {
    user.grp.uniq <- unique(c(user.grp$occs.grp, user.grp$bg.grp))
  }
  
  # algorithm-specific errors
  enm@errors(occs, envs, bg, tune.args, partitions, algorithm, partition.settings, other.settings, 
             categoricals, doClamp, clamp.directions)

  
  ########################################################### #
  # ASSEMBLE COORDINATES AND ENVIRONMENTAL VARIABLE VALUES ####
  ########################################################### #
  
  # if environmental rasters are input as predictor variables
  if(!is.null(envs)) {
    # make sure envs is a RasterStack -- if RasterLayer, maxent.jar crashes
    envs <- raster::stack(envs)
    envs.z <- raster::values(envs)
    envs.naMismatch <- sum(apply(envs.z, 1, function(x) !all(is.na(x)) & !all(!is.na(x))))
    if(envs.naMismatch > 0) {
      if(quiet != TRUE) message(paste0("* Found ", envs.naMismatch, " raster cells that were NA for one or more, but not all, predictor variables. Converting these cells to NA for all predictor variables."))
      envs.names <- names(envs)
      envs <- raster::stack(raster::calc(envs, fun = function(x) if(sum(is.na(x)) > 0) x * NA else x))
      names(envs) <- envs.names
    }
    # if no background points specified, generate random ones
    if(is.null(bg)) {
      if(quiet != TRUE) message(paste0("* Randomly sampling ", n.bg, " background points ..."))
      bg <- as.data.frame(dismo::randomPoints(envs, n = n.bg))
      names(bg) <- names(occs)
    }
    
    # remove cell duplicates
    occs.cellNo <- raster::extract(envs, occs, cellnumbers = TRUE)
    occs.dups <- duplicated(occs.cellNo[,1])
    if(sum(occs.dups) > 0) if(quiet != TRUE) message(paste0("* Removed ", sum(occs.dups), " occurrence localities that shared the same grid cell."))
    occs <- occs[!occs.dups,]
    if(!is.null(user.grp)) user.grp$occs.grp <- user.grp$occs.grp[!occs.dups]
    
    # bind coordinates to predictor variable values for occs and bg
    occs.z <- raster::extract(envs, occs)
    bg.z <- raster::extract(envs, bg)
    occs <- cbind(occs, occs.z)
    bg <- cbind(bg, bg.z)
  }else{
    # if no bg included, stop
    if(is.null(bg)) {
      stop("* If inputting variable values without rasters, please make sure to input background coordinates with values as well as occurrences.")
    }
    # for occ and bg coordinates with environmental predictor values (SWD format)
    if(quiet != TRUE) message("* Variable values were input along with coordinates and not as raster data, so no raster predictions can be generated and AICc is calculated with background data for Maxent models.")
    # make sure both occ and bg have predictor variable values
    if(ncol(occs) < 3 | ncol(bg) < 3) stop("* If inputting variable values without rasters, please make sure these values are included in the occs and bg tables following the coordinates.")
  }
  
  # if NA predictor variable values exist for occs or bg, remove these records and modify user.grp accordingly
  occs.z.na <- which(rowSums(is.na(occs)) > 0)
  if(length(occs.z.na) > 0) {
    if(quiet != TRUE) message(paste0("* Removed ", length(occs.z.na), " occurrence points with NA predictor variable values."))
    occs <- occs[-occs.z.na,]
    if(!is.null(user.grp)) user.grp$occs.grp <- user.grp$occs.grp[-occs.z.na]
  }
  
  bg.z.na <- which(rowSums(is.na(bg)) > 0)
  if(length(bg.z.na) > 0) {
    if(quiet != TRUE) message(paste0("* Removed ", length(bg.z.na), " background points with NA predictor variable values."))
    bg <- bg[-bg.z.na,]
    if(!is.null(user.grp)) user.grp$bg.grp <- user.grp$bg.grp[-bg.z.na]
  }
  
  # make main df with coordinates and predictor variable values
  d <- rbind(occs, bg)
  
  # add presence-background identifier for occs and bg
  d$pb <- c(rep(1, nrow(occs)), rep(0, nrow(bg)))
  
  # if user-defined partitions, assign grp variable before filtering out records with NA predictor variable values
  # for all other partitioning methods, grp assignments occur after filtering
  if(!is.null(user.grp)) {
    d[d$pb == 1, "grp"] <- as.numeric(as.character(user.grp$occs.grp))
    d[d$pb == 0, "grp"] <- as.numeric(as.character(user.grp$bg.grp))
    if(!all(user.grp.uniq %in% d$grp)) stop("Removal of cell duplicates caused one or more user partition groups to be missing. Please make sure all partition groups are represented by at least one non-duplicate occurrence record.")
    d$grp <- factor(d$grp)
  }
  
  ############################################ #
  # ADD TESTING DATA (IF INPUT) ####
  ############################################ #
  
  # add testing data to main df if provided
  if(partitions == "testing") {
    if(!is.null(envs)) {
      occs.testing.z <- as.data.frame(raster::extract(envs, occs.testing))
      occs.testing.z <- cbind(occs.testing, occs.testing.z)
    }else{
      occs.testing.z <- occs.testing
    }
  }else{
    occs.testing.z <- NULL
  }
  
  ################################# #
  # ASSIGN CATEGORICAL VARIABLES ####
  ################################# #
  
  # find factor rasters or columns and identify them as categoricals
  if(!is.null(envs)) {
    categoricals <- unique(c(categoricals, names(envs)[which(raster::is.factor(envs))]))
  }else{
    categoricals <- unique(c(categoricals, names(occs)[which(sapply(occs, is.factor))]))
  }
  if(length(categoricals) == 0) categoricals <- NULL
  
  # if categoricals argument was specified, convert these columns to factor class
  if(!is.null(categoricals)) {
    for(i in 1:length(categoricals)) {
      if(quiet != TRUE) message(paste0("* Assigning variable ", categoricals[i], " to categorical ..."))
      d[, categoricals[i]] <- as.factor(d[, categoricals[i]])
      if(!is.null(user.val.grps)) user.val.grps[, categoricals[i]] <- factor(user.val.grps[, categoricals[i]], levels = levels(d[, categoricals[i]]))
      if(!is.null(occs.testing.z)) occs.testing.z[, categoricals[i]] <- factor(occs.testing.z[, categoricals[i]], levels = levels(d[, categoricals[i]]))
    }
  }
  
  # record the categoricals designation in other.settings to feed into other functions
  other.settings$categoricals <- categoricals
  
  ################# #
  # CLAMPING ####
  ################# #
  if(doClamp == TRUE) {
    # when predictor variable rasters are input
    if(!is.null(envs)) {
      # assign both clamp directions to all variables if none are set
      if(is.null(clamp.directions)) {
        clamp.directions$left <- names(envs)
        clamp.directions$right <- names(envs)
      }
      # record clamp directions in other.settings
      other.settings$clamp.directions <- clamp.directions
      # run function to clamp predictor variable rasters
      envs <- clamp.vars(orig.vals = envs, ref.vals = rbind(occs.z, bg.z), 
                         left = clamp.directions$left, 
                         right = clamp.directions$right, 
                         categoricals = categoricals)
      if(quiet != TRUE) message("* Clamping predictor variable rasters...")
    }else{
      # if no predictor variable rasters are input, assign both clamp directions
      # to all variable names (columns besides the first two, which should be
      # coordinates) if none are set
      # during cross-validation, validation data will be clamped using the
      # clamp.vars() function
      if(is.null(clamp.directions)) {
        clamp.directions$left <- names(d[, 3:(ncol(d)-1)])
        clamp.directions$right <- names(d[, 3:(ncol(d)-1)])
      }
    }
  }
  # record clamping choice as FALSE in other.settings regardless of doClamp
  # selection -- this is because the clamping is done on the predictor rasters 
  # or values directly, so that internally all model predictions are made with 
  # clamping off
  # when enm.predict() functions are run externally, users can specify
  # other.settings$doClamp to turn on clamping functionality
  other.settings$doClamp <- FALSE
  
  ###################### #
  # ASSIGN PARTITIONS ####
  ###################### #
  
  # unpack occs and bg records for partitioning
  d.occs <- d %>% dplyr::filter(pb == 1) %>% dplyr::select(1:2)
  d.bg <- d %>% dplyr::filter(pb == 0) %>% dplyr::select(1:2)
  
  # partition occs based on selected partition method
  grps <- switch(partitions, 
                 jackknife = get.jackknife(d.occs, d.bg),
                 randomkfold = get.randomkfold(d.occs, d.bg, partition.settings$kfolds),
                 block = get.block(d.occs, d.bg, partition.settings$orientation),
                 checkerboard1 = get.checkerboard1(d.occs, envs, d.bg, partition.settings$aggregation.factor),
                 checkerboard2 = get.checkerboard2(d.occs, envs, d.bg, partition.settings$aggregation.factor),
                 user = NULL,
                 testing = list(occs.grp = rep(0, nrow(d.occs)), bg.grp = rep(0, nrow(d.bg))),
                 none = list(occs.grp = rep(0, nrow(d.occs)), bg.grp = rep(0, nrow(d.bg))))
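  # Illustrative sketch (not run; uses the objects defined just above): each
  # get.* partition function returns a list of integer fold assignments for
  # occurrence and background records, e.g.:
  #   grps <- get.block(d.occs, d.bg, orientation = "lat_lon")
  #   str(grps)  # List of 2: $occs.grp and $bg.grp (fold IDs 1-4 for "block")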
  
  
  # choose a user message reporting on partition choice
  parts.message <- switch(partitions,
                          jackknife = "* Model evaluations with k-1 jackknife (leave-one-out) cross validation...",
                          randomkfold = paste0("* Model evaluations with random ", partition.settings$kfolds, "-fold cross validation..."),
                          block =  paste0("* Model evaluations with spatial block (4-fold) cross validation and ", partition.settings$orientation, " orientation..."),
                          checkerboard1 = "* Model evaluations with checkerboard (2-fold) cross validation...",
                          checkerboard2 = "* Model evaluations with hierarchical checkerboard (4-fold) cross validation...",
                          user = paste0("* Model evaluations with user-defined ", length(unique(user.grp$occs.grp)), "-fold cross validation..."),
                          testing = "* Model evaluations with testing data...",
                          none = "* Skipping model evaluations (only calculating full model statistics)...")
  if(quiet != TRUE) message(parts.message)
  
  # if jackknife partitioning, do not calculate CBI because there are too few validation occurrences
  # per partition (n=1) to have a meaningful result
  if(partitions == "jackknife") other.settings$cbi.cv <- FALSE else other.settings$cbi.cv <- TRUE
  
  # record user partition settings and do same check as above if partitions are identical to jackknife
  if(partitions == "user") {
    user.nk <- length(unique(user.grp$occs.grp))
    partition.settings$kfolds <- user.nk
    if(user.nk == nrow(d[d$pb==1,])) other.settings$cbi.cv <- FALSE else other.settings$cbi.cv <- TRUE
  }
  
  # add these values as the 'grp' column
  if(!is.null(grps)) d$grp <- factor(c(grps$occs.grp, grps$bg.grp))
  
  ################# #
  # MESSAGE ####
  ################# #
  # print model-specific message
  if(is.null(taxon.name)) {
    if(quiet != TRUE) message(paste("\n*** Running ENMeval v2.0.4 with", enm@msgs(tune.args, other.settings), "***\n"))
  }else{
    if(quiet != TRUE) message(paste("\n*** Running ENMeval v2.0.4 for", taxon.name, "with", enm@msgs(tune.args, other.settings), "***\n"))
  }
  
  ################# #
  # MODEL TUNING #### 
  ################# #
  
  # make table for all tuning parameter combinations
  tune.tbl <- expand.grid(tune.args, stringsAsFactors = FALSE) %>% tibble::as_tibble()
  # make tune.tbl NULL, not an empty table, if no settings are specified
  # this makes it easier to use tune.i as a parameter in function calls
  # when tune.args does not exist
  if(nrow(tune.tbl) == 0) tune.tbl <- NULL
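  # For illustration (not run): tune.args = list(fc = c("L", "LQ"), rm = 1:2)
  # expands to a 4-row table, one row per combination of settings:
  #   expand.grid(list(fc = c("L", "LQ"), rm = 1:2), stringsAsFactors = FALSE)
  #   #   fc rm
  #   # 1  L  1
  #   # 2 LQ  1
  #   # 3  L  2
  #   # 4 LQ  2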
  
  if(parallel) {
    results <- tune.parallel(d, envs, enm, partitions, tune.tbl, doClamp, 
                             other.settings, partition.settings, user.val.grps, 
                             occs.testing.z, numCores, parallelType, user.eval, 
                             algorithm, quiet)  
  }else{
    results <- tune.regular(d, envs, enm, partitions, tune.tbl, doClamp, 
                            other.settings, partition.settings, user.val.grps, 
                            occs.testing.z, updateProgress, user.eval, 
                            algorithm, quiet)
  }
  
  ##################### #
  # ASSEMBLE RESULTS #### 
  ##################### #
  
  # flatten all training statistics data frames from results list into a single data frame
  train.stats.all <- dplyr::bind_rows(lapply(results, function(x) x$train.stats))
  # flatten all validation statistics data frames from results list into a single data frame
  # (there are no validation stats if no partitions were chosen)
  val.stats.all <- dplyr::bind_rows(lapply(results, function(x) x$cv.stats))
  
  if(is.null(tune.tbl)) {
    # if no tuned settings, the "tune name" is the model name
    tune.names <- enm@name
  }else{
    # define tuned settings names and bind them to the tune table
    tune.tbl <- dplyr::mutate_all(tune.tbl, as.factor)
    tune.names <- train.stats.all$tune.args
    tune.tbl$tune.args <- factor(tune.names, levels = tune.names)
  }
  # gather all full models into list and name them
  mod.full.all <- lapply(results, function(x) x$mod.full)
  names(mod.full.all) <- tune.names
  
  # gather all model prediction rasters into a stack and name them
  # if envs is null, make an empty stack
  if(!is.null(envs)) {
    mod.full.pred.all <- raster::stack(sapply(results, function(x) x$mod.full.pred))
    names(mod.full.pred.all) <- tune.names
  }else{
    mod.full.pred.all <- raster::stack()
  }
  
  # define the number of partition groups (the value of "k") for occurrences
  # nk is 0 for partitions "testing" and "none"
  # for all other partitioning methods, nk is the number of unique occurrence groups
  if(partitions %in% c("testing", "none")) {
    nk <- 0
  }else{
    nk <- length(unique(d[d$pb == 1, "grp"]))
  }
  
  # if partitions were specified
  if(nk > 0) {
    # define number of settings (plus the tune.args field)
    nset <- ifelse(!is.null(tune.tbl), ncol(tune.tbl), 0)
    
    # if jackknife cross-validation (leave-one-out), correct variance for
    # non-independent samples (Shcheglovitova & Anderson 2013)
    if(partitions == "jackknife") {
      sum.list <- list(avg = mean, sd = ~sqrt(corrected.var(., nk)))
    }else{
      sum.list <- list(avg = mean, sd = sd)
    } 
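    # corrected.var() implements that correction -- approximately
    # sum((x - mean(x))^2) * (nk - 1) / nk, i.e. the jackknife variance
    # estimator scaled for the nk non-independent replicates (see the
    # corrected.var() definition for the exact form)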
    
    # if there is only one partition, or if evaluating on a fully withheld testing dataset, do not take summary statistics
    if(nk == 1 | partitions == "testing") sum.list <- list(function(x) {x})
    
    # if tune.tbl exists, make tune.args column a factor to keep order after using dplyr functions
    if(!is.null(tune.tbl)) val.stats.all$tune.args <- factor(val.stats.all$tune.args, levels = tune.names)
    
    # calculate summary statistics
    cv.stats.sum <- val.stats.all %>% 
      dplyr::group_by(tune.args) %>%
      dplyr::select(-fold) %>% 
      dplyr::summarize_all(sum.list) %>%
      dplyr::ungroup() 
    
    # change names (replace _ with .)
    names(cv.stats.sum) <- gsub("(.*)_(.*)", "\\1.\\2", names(cv.stats.sum))
    # order columns alphabetically
    cv.stats.sum <- cv.stats.sum[, order(colnames(cv.stats.sum))]
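    # e.g. (illustration): summarize_all() produces names like "auc.val_avg";
    # gsub("(.*)_(.*)", "\\1.\\2", "auc.val_avg") returns "auc.val.avg"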
    
    # if tune.tbl exists
    if(!is.null(tune.tbl)) {
      # make tune.args column in training stats factor too for smooth join
      train.stats.all$tune.args <- factor(train.stats.all$tune.args, levels = tune.names)
      # eval.stats is the join of tune.tbl, training stats, and cv stats
      eval.stats <- tune.tbl %>% dplyr::left_join(train.stats.all, by = "tune.args") %>%
        dplyr::left_join(cv.stats.sum, by = "tune.args")
    }else{
      # if tune.tbl does not exist, eval.stats is the binding of training stats to cv stats
      train.stats.all$tune.args <- NULL
      cv.stats.sum$tune.args <- NULL
      eval.stats <- dplyr::bind_cols(train.stats.all, cv.stats.sum)
    }
  }else{
    # make tune.args column in training stats factor too for smooth join
    train.stats.all$tune.args <- factor(train.stats.all$tune.args, levels = tune.names)
    # if no partitions assigned, eval.stats is the join of tune.tbl to training stats
    eval.stats <- dplyr::left_join(tune.tbl, train.stats.all, by = "tune.args") 
    if(nrow(val.stats.all) > 0) eval.stats <- dplyr::left_join(eval.stats, val.stats.all, by = "tune.args")
    if("fold" %in% names(eval.stats)) eval.stats <- eval.stats %>% dplyr::select(-fold)
  }
  
  # calculate number of non-zero parameters in model
  ncoefs <- sapply(mod.full.all, enm@ncoefs)
  
  # calculate AICc
  if((enm@name == "maxnet" | enm@name == "maxent.jar")) {
    pred.type.raw <- switch(enm@name, maxnet = "exponential", maxent.jar = "raw")
    aic.settings <- other.settings
    aic.settings$pred.type <- pred.type.raw
    if(!is.null(envs)) {
      pred.all.raw <- raster::stack(lapply(mod.full.all, enm@predict, envs, aic.settings))
    }else{
      pred.all.raw <- NULL
    }
    occs.pred.raw <- dplyr::bind_rows(lapply(mod.full.all, enm@predict, occs[,-c(1,2)], aic.settings))
    aic <- aic.maxent(occs.pred.raw, ncoefs, pred.all.raw)
    eval.stats <- dplyr::bind_cols(eval.stats, aic)
  }
  
  # add ncoef column
  eval.stats$ncoef <- ncoefs
  
  if(is.null(taxon.name)) taxon.name <- ""
  if(is.null(tune.tbl)) tune.tbl <- data.frame()
  if(is.null(occs.testing.z)) occs.testing.z <- data.frame()
  if(partitions != "block") partition.settings$orientation <- NULL
  if(partitions != "checkerboard1" & partitions != "checkerboard2") partition.settings$aggregation.factor <- NULL
  if(partitions != "randomkfold") partition.settings$kfolds <- NULL
  if(is.null(partition.settings) | length(partition.settings) == 0) partition.settings <- list()
  if(is.null(clamp.directions)) clamp.directions <- list()
  
  # get variable importance for all models
  variable.importance.all <- lapply(mod.full.all, enm@variable.importance)
  
  # remove the doClamp = FALSE recorded in other.settings to avoid confusion
  other.settings$doClamp <- NULL
  
  # assemble the ENMevaluation object
  e <- ENMevaluation(algorithm = enm@name, tune.settings = as.data.frame(tune.tbl),
                     results = as.data.frame(eval.stats), results.partitions = val.stats.all,
                     predictions = mod.full.pred.all, models = mod.full.all, 
                     variable.importance = variable.importance.all,
                     partition.method = partitions, partition.settings = partition.settings,
                     other.settings = other.settings, doClamp = doClamp, clamp.directions = clamp.directions, 
                     taxon.name = as.character(taxon.name),
                     occs = d[d$pb == 1, 1:(ncol(d)-2)], occs.testing = occs.testing.z, occs.grp = factor(d[d$pb == 1, "grp"]),
                     bg = d[d$pb == 0, 1:(ncol(d)-2)], bg.grp = factor(d[d$pb == 0, "grp"]),
                     rmm = list())
  
  # add the rangeModelMetadata object to the ENMevaluation object
  # write to existing RMM if input by user
  e@rmm <- buildRMM(e, envs, rmm)
  
  # if niche overlap selected, calculate and add the resulting matrix to results
  if(overlap == TRUE) {
    nr <- raster::nlayers(e@predictions)
    if(nr == 0) {
      if(quiet != TRUE) message("Warning: cannot calculate niche overlap without model prediction rasters.")
    }else if(nr == 1) {
      if(quiet != TRUE) message("Warning: only 1 model prediction raster found. Need at least 2 rasters to calculate niche overlap. Increase number of tuning arguments and run again.") 
    }else{
      for(ovStat in overlapStat) {
        if(quiet != TRUE) message(paste0("Calculating niche overlap for statistic ", ovStat, "..."))
        # turn negative values to 0 for niche overlap calculations
        predictions.noNegs <- raster::calc(e@predictions, function(x) {x[x<0] <- 0; x})
        overlap.mat <- calc.niche.overlap(predictions.noNegs, ovStat, quiet)
        e@overlap[[ovStat]] <- overlap.mat
      }
    }
  }
  
  # calculate time expended and print message
  timed <- proc.time() - start.time
  t.min <- floor(timed[3] / 60)
  t.sec <- timed[3] - (t.min * 60)
  if(quiet != TRUE) message(paste("ENMevaluate completed in", t.min, "minutes", round(t.sec, 1), "seconds."))
  
  return(e)
}
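
# Example usage (illustrative only; `occs.df`, `bg.df`, and `env.rasters` are
# placeholders for user-supplied data):
#   library(ENMeval)
#   e <- ENMevaluate(occs = occs.df, envs = env.rasters, bg = bg.df,
#                    algorithm = "maxnet", partitions = "block",
#                    tune.args = list(fc = c("L", "LQ"), rm = 1:2))
#   eval.results(e)  # evaluation statistics for each settings combination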

