View source: R/trainByCrossValid.r
trainByCrossValid — R Documentation
This function is an extension of any of the "trainXYZ" functions for calibrating species distribution and ecological niche models. It uses the chosen "trainXYZ" function to calibrate and evaluate a suite of models via cross-validation. The models are evaluated against withheld data to determine the optimal settings for a "final" model calibrated on all available data.
trainByCrossValid(
  data,
  resp = names(data)[1],
  preds = names(data)[2:ncol(data)],
  folds = dismo::kfold(data),
  trainFx = enmSdm::trainGlm,
  ...,
  metrics = c("logLoss", "cbi", "auc", "fpb", "tss", "msss", "mdss",
    "minTrainPres", "trainSe95", "trainSe90"),
  weightEvalTrain = TRUE,
  weightEvalTest = TRUE,
  na.rm = FALSE,
  out = c("models", "tuning"),
  verbose = 1
)
data
: Data frame or matrix. Environmental predictors (and no other fields) for presences and background sites.

resp
: Character or integer. Name or column index of the response variable. Default is to use the first column in data.

preds
: Character vector or integer vector. Names of columns or column indices of the predictors. Default is to use the second and subsequent columns in data.

folds
: Either a numeric vector, or a matrix or data frame. If a vector (as returned by the default, dismo::kfold(data)), each element assigns the corresponding row of data to a fold, with each unique value defining one fold.

trainFx
: Function. The "trainXYZ" function to use. The supported functions are listed in the See Also section.

...
: Arguments to pass to the "trainXYZ" function.

metrics
: Character vector. Names of the evaluation metrics to calculate (see the defaults in the usage above).

weightEvalTrain
: Logical. If TRUE (default), site weights (if supplied via ...) are applied when calculating evaluation metrics against the training data.

weightEvalTest
: Logical. If TRUE (default), site weights (if supplied via ...) are applied when calculating evaluation metrics against the withheld test data.

na.rm
: Logical. If TRUE, remove cases with NA values before calibration and evaluation. Default is FALSE.

out
: Character. Indicates the type of value returned. If 'models', return the fitted model objects; if 'tuning', return the per-fold evaluation statistics; if both (default), return both.

verbose
: Numeric. If 0, show no progress updates. If > 0, show minimal progress updates for this function only. If > 1, show detailed progress for this function. If > 2, show detailed progress for this function plus detailed progress from the "trainXYZ" function.
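As a minimal sketch of the vector form of the folds argument, a fold identifier can be assigned to each row of data using only base R (this mimics the default dismo::kfold(data); the variable names n and k are illustrative):

```r
# A numeric fold vector: each row of `data` gets an integer fold identifier,
# and each unique value defines one fold. Here, k roughly equal folds.
set.seed(1)
n <- 12   # number of rows in `data`
k <- 3    # number of folds
folds <- sample(rep(seq_len(k), length.out = n))
table(folds)  # each of the 3 folds holds 4 rows
```

Passing such a vector as folds lets you control fold membership directly, for example to keep spatially clustered sites in the same fold.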
In some cases models do not converge (e.g., boosted regression trees and generalized additive models sometimes suffer from this issue). When this happens the model is skipped, but a data frame with the k-fold and the model number within the fold is returned in the $meta element of the output. If all models converged, this data frame will be empty.
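The $meta element can thus be inspected after a run to see whether any models were skipped. The sketch below mocks that convergence record; the column names k and model are illustrative assumptions, not confirmed by this page:

```r
# Mock of the convergence record in the output's $meta element; the
# column names `k` and `model` are assumptions for illustration only.
out <- list(meta = data.frame(k = integer(0), model = integer(0)))

if (nrow(out$meta) == 0L) {
  message('All models converged.')
} else {
  print(out$meta)  # k-fold and within-fold number of each skipped model
}
```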
A list object with several named elements:

meta
: Meta-data on the model call.

folds
: The folds object.

models
(if 'models' is in argument out): A list of model objects, one per data fold.

tuning
(if 'tuning' is in argument out): One data frame per k-fold, each containing evaluation statistics for all candidate models in the fold.
Fielding, A.H. and J.F. Bell. 1997. A review of methods for the assessment of prediction errors in conservation presence/absence models. Environmental Conservation 24:38-49.
Le Rest, K., Pinaud, D., Monestiez, P., Chadoeuf, J., and Bretagnolle, V. 2014. Spatial leave-one-out cross-validation for variable selection in the presence of spatial autocorrelation. Global Ecology and Biogeography 23:811-820.
Wunderlich, R.F., Lin, P-Y., Anthony, J., and Petway, J.R. 2019. Two alternative evaluation metrics to replace the true skill statistic in the assessment of species distribution models. Nature Conservation 35:97-116.
trainBrt, trainCrf, trainGam, trainGlm, trainLars, trainMaxEnt, trainMaxNet, trainNs, trainRf
## Not run: 
set.seed(123)

### contrived example
# generate training/testing data
n <- 10000
x1 <- seq(-1, 1, length.out=n) + rnorm(n)
x2 <- seq(10, 0, length.out=n) + rnorm(n)
x3 <- rnorm(n)
y <- 2 * x1 + x1^2 - 10 * x2 - x1 * x2
y <- statisfactory::invLogitAdj(y, 0.001)
presAbs <- as.integer(runif(n) > (1 - y))
data <- data.frame(presAbs=presAbs, x1=x1, x2=x2, x3=x3)

model <- trainGlm(data, verbose=TRUE)
summary(model) # most parsimonious model

folds <- dismo::kfold(data, 3)
out <- trainByCrossValid(data, folds=folds, verbose=1)
str(out, 1)
summaryByCrossValid(out)
str(out, 1)

head(out$tuning[[1]])
head(out$tuning[[2]])
head(out$tuning[[3]])

# can do the following for each fold (3 of them)
lapply(out$models[[1]], coefficients)
sapply(out$models[[1]], logLik)
sapply(out$models[[1]], AIC)

# select model for k = 1 with greatest CBI
top <- which.max(out$tuning[[1]]$cbiTest)
summary(out$models[[1]][[top]])

# in fold k = 1, which models perform well but are not overfit?
plot(out$tuning[[1]]$cbiTrain, out$tuning[[1]]$cbiTest,
  col='white', main='Model Numbers for k = 1')
abline(0, 1, col='red')
numModels <- nrow(out$tuning[[1]])
text(out$tuning[[1]]$cbiTrain, out$tuning[[1]]$cbiTest, labels=1:numModels)
usr <- par('usr')
x <- usr[1] + 0.9 * (usr[4] - usr[3])
y <- usr[3] + 0.1 * (usr[4] - usr[3])
text(x, y, labels='overfit', col='red', xpd=NA)
x <- usr[1] + 0.1 * (usr[4] - usr[3])
y <- usr[3] + 0.9 * (usr[4] - usr[3])
text(x, y, labels='suspicious', col='red', xpd=NA)

# other algorithms
# boosted regression trees (with "fast" set of parameters... not recommended
# for normal use)
brt <- trainByCrossValid(data, folds=folds, verbose=2, trainFx=trainBrt,
  maxTrees=2000, treeComplexity=2, learningRate=c(0.01, 0.001))

# MaxEnt with "fast" set of settings (not recommended for normal use)
mx <- trainByCrossValid(data, folds=folds, verbose=2, trainFx=trainMaxEnt,
  regMult=c(1, 2), classes='lp')

## End(Not run)