trainByCrossValid: R Documentation
View source: R/trainByCrossValid.r
This function extends any of the "trainXYZ" functions for calibrating species distribution and ecological niche models. It uses the chosen "trainXYZ" function to calibrate and evaluate a suite of candidate models via cross-validation. Each model is evaluated against withheld data to determine the optimal settings for a "final" model calibrated with all available data.
trainByCrossValid(
data,
resp = names(data)[1],
preds = names(data)[2:ncol(data)],
folds = dismo::kfold(data),
trainFx = enmSdm::trainGlm,
...,
metrics = c("logLoss", "cbi", "auc", "fpb", "tss", "msss", "mdss", "minTrainPres",
"trainSe95", "trainSe90"),
weightEvalTrain = TRUE,
weightEvalTest = TRUE,
na.rm = FALSE,
out = c("models", "tuning"),
verbose = 1
)
data: Data frame or matrix. Environmental predictors (and no other fields) for presences and background sites.

resp: Character or integer. Name or column index of the response variable. Default is the first column of data.

preds: Character or integer vector. Names or column indices of the predictors. Default is the second and subsequent columns of data.

folds: Either a numeric vector, or a matrix or data frame. If a vector, each element assigns the corresponding row of data to a fold. If a matrix or data frame, each column defines one fold, indicating which rows of data are used for training and testing in that fold.

trainFx: Function. The "trainXYZ" function to use. Currently the supported functions are trainBrt, trainCrf, trainGam, trainGlm, trainLars, trainMaxEnt, trainMaxNet, trainNs, and trainRf.

...: Arguments to pass to the "trainXYZ" function.

metrics: Character vector. Names of the evaluation metrics to calculate for each candidate model. The default calculates "logLoss", "cbi", "auc", "fpb", "tss", "msss", "mdss", "minTrainPres", "trainSe95", and "trainSe90".

weightEvalTrain: Logical. If TRUE (default), apply any site weights used in model calibration when evaluating against training data.

weightEvalTest: Logical. If TRUE (default), apply any site weights used in model calibration when evaluating against test (withheld) data.

na.rm: Logical. If TRUE, remove cases with NA values before calculating evaluation statistics. Default is FALSE.

out: Character vector. Type of value returned. If it contains "models", the output includes a list of model objects for each fold; if it contains "tuning", the output includes one data frame of evaluation statistics per fold. The default, c("models", "tuning"), returns both.

verbose: Numeric. If 0, show no progress updates. If > 0, show minimal progress updates for this function only. If > 1, show detailed progress for this function. If > 2, show detailed progress for this function plus detailed progress for the "trainXYZ" function.
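The default value of folds can be reproduced directly; a minimal sketch, assuming the dismo package is installed (it supplies the default via dismo::kfold):

```r
library(dismo)

# kfold() assigns each row of the data to one of k folds (here k = 3,
# matching the example below); the result is an integer vector with one
# fold number per row of the data frame.
d <- data.frame(presAbs = rbinom(30, 1, 0.5), x1 = rnorm(30))
f <- kfold(d, k = 3)
table(f)  # roughly 10 rows per fold
```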
In some cases models do not converge (e.g., boosted regression trees and generalized additive models sometimes suffer from this issue). When this happens the model is skipped, but a data frame recording the k-fold and the model number within the fold is returned in the $meta element of the output. If all models converged, this data frame is empty.
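A quick way to check for skipped models after a run is to inspect the $meta element; a minimal sketch using a mock meta data frame (the column names here are hypothetical, for illustration only):

```r
# Mock of the $meta element described above; a real run's out$meta is
# analogous, and is empty when every model converged.
out <- list(meta = data.frame(k = integer(0), model = integer(0)))

if (nrow(out$meta) == 0L) {
  message('All models converged.')
} else {
  print(out$meta)  # fold and model number of each skipped model
}
```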
A list object with several named elements:
meta: Meta-data on the model call.
folds: The folds object.
models (if 'models' is in argument out): A list of model objects, one list per data fold.
tuning (if 'tuning' is in argument out): One data frame per k-fold, each containing evaluation statistics for all candidate models in the fold.
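The per-fold tuning data frames make it easy to locate each fold's best candidate model; a minimal sketch using mock tuning frames (the cbiTest column follows the examples below):

```r
# Mock 'out' with the shape described above: one tuning data frame per fold,
# one row per candidate model in that fold.
out <- list(tuning = list(
  data.frame(model = 1:3, cbiTest = c(0.2, 0.8, 0.5)),
  data.frame(model = 1:3, cbiTest = c(0.6, 0.1, 0.7))
))

# Row index of the top model (highest test CBI) within each fold
best <- sapply(out$tuning, function(t) which.max(t$cbiTest))
best  # fold 1: model 2; fold 2: model 3
```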
Fielding, A.H. and J.F. Bell. 1997. A review of methods for the assessment of prediction errors in conservation presence/absence models. Environmental Conservation 24:38-49.
Le Rest, K., Pinaud, D., Monestiez, P., Chadoeuf, J., and Bretagnolle, V. 2014. Spatial leave-one-out cross-validation for variable selection in the presence of spatial autocorrelation. Global Ecology and Biogeography 23:811-820.
Wunderlich, R.F., Lin, P-Y., Anthony, J., and Petway, J.R. 2019. Two alternative evaluation metrics to replace the true skill statistic in the assessment of species distribution models. Nature Conservation 35:97-116.
trainBrt, trainCrf, trainGam, trainGlm, trainLars, trainMaxEnt, trainMaxNet, trainNs, trainRf
## Not run:
set.seed(123)
### contrived example
# generate training/testing data
n <- 10000
x1 <- seq(-1, 1, length.out=n) + rnorm(n)
x2 <- seq(10, 0, length.out=n) + rnorm(n)
x3 <- rnorm(n)
y <- 2 * x1 + x1^2 - 10 * x2 - x1 * x2
y <- statisfactory::invLogitAdj(y, 0.001)
presAbs <- as.integer(runif(n) > (1 - y))
data <- data.frame(presAbs=presAbs, x1=x1, x2=x2, x3=x3)
model <- trainGlm(data, verbose=TRUE)
summary(model) # most parsimonious model
folds <- dismo::kfold(data, 3)
out <- trainByCrossValid(data, folds=folds, verbose=1)
str(out, 1)
summaryByCrossValid(out)
head(out$tuning[[1]])
head(out$tuning[[2]])
head(out$tuning[[3]])
# can do following for each fold (3 of them)
lapply(out$models[[1]], coefficients)
sapply(out$models[[1]], logLik)
sapply(out$models[[1]], AIC)
# select model for k = 1 with greatest CBI
top <- which.max(out$tuning[[1]]$cbiTest)
summary(out$models[[1]][[top]])
# in fold k = 1, which models perform well but are not overfit?
plot(out$tuning[[1]]$cbiTrain, out$tuning[[1]]$cbiTest, col='white',
main='Model Numbers for k = 1')
abline(0, 1, col='red')
numModels <- nrow(out$tuning[[1]])
text(out$tuning[[1]]$cbiTrain, out$tuning[[1]]$cbiTest, labels=1:numModels)
usr <- par('usr')
x <- usr[1] + 0.9 * (usr[2] - usr[1])
y <- usr[3] + 0.1 * (usr[4] - usr[3])
text(x, y, labels='overfit', col='red', xpd=NA)
x <- usr[1] + 0.1 * (usr[2] - usr[1])
y <- usr[3] + 0.9 * (usr[4] - usr[3])
text(x, y, labels='suspicious', col='red', xpd=NA)
# other algorithms
# boosted regression trees (with "fast" set of parameters... not recommended
# for normal use)
brt <- trainByCrossValid(data, folds=folds, verbose=2, trainFx=trainBrt,
maxTrees=2000, treeComplexity=2, learningRate=c(0.01, 0.001))
# MaxEnt with "fast" set of settings (not recommended for normal use)
mx <- trainByCrossValid(data, folds=folds, verbose=2, trainFx=trainMaxEnt,
regMult=c(1, 2), classes='lp')
## End(Not run)