summaryByCrossValid: Summarize distribution/niche model cross-validation object

View source: R/summaryByCrossValid.r

summaryByCrossValid {enmSdm}    R Documentation

Summarize distribution/niche model cross-validation object

Description

This function summarizes models calibrated using the trainByCrossValid function. It returns aspects of the best models across k-folds (the particular aspects reported depend on the kind of model used).
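A minimal sketch of the typical workflow (an illustration only; it assumes a data frame data whose first column is the presence/absence response, as in the Examples below):

folds <- dismo::kfold(data, 3)               # assign rows to 3 folds
out <- trainByCrossValid(data, folds=folds)  # returns a crossValid object
summaryByCrossValid(out)                     # summarize best models across folds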

Usage

summaryByCrossValid(
  x,
  trainFxName = "trainGlm",
  metric = "cbiTest",
  decreasing = TRUE
)

Arguments

x

An object of class crossValid (which is also a list). Note that the object must include a sublist named tuning.

trainFxName

Character, name of the function used to train the SDM (examples: 'trainGlm', 'trainMaxEnt', 'trainBrt').

metric

Metric by which to select the best model in each k-fold. This can be any of the columns that appear in the data frames in x$tuning (or any columns added manually), but typically it is one of the following plus either Train, Test, or Delta (e.g., 'logLossTrain', 'logLossTest', or 'logLossDelta'); see the sketch at the end of this section:

  • 'logLoss': Log loss.

  • 'cbi': Continuous Boyce Index (CBI). Calculated with contBoyce.

  • 'auc': Area under the receiver operating characteristic curve (AUC). Calculated with aucWeighted.

  • 'tss': Maximum value of the True Skill Statistic. Calculated with tssWeighted.

  • 'msss': Sensitivity and specificity calculated at the threshold that maximizes sensitivity (true presence prediction rate) plus specificity (true absence prediction rate).

  • 'mdss': Sensitivity (se) and specificity (sp) calculated at the threshold that minimizes the difference between sensitivity and specificity.

  • 'minTrainPres': Sensitivity and specificity at the greatest threshold at which all training presences are classified as "present".

  • 'trainSe95' and/or 'trainSe90': Sensitivity at the threshold that ensures either 95% or 90% of training presences are classified as "present".

decreasing

Logical. If TRUE (default), for each k-fold sort models by the value listed in metric in decreasing order (the highest value connotes "best", the lowest "worst"). If FALSE, sort in increasing order so the lowest value of metric is treated as best.
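To make the metric and decreasing arguments concrete, a brief sketch (out is assumed to be a crossValid object from trainByCrossValid, as in the Examples; a given column such as 'aucTest' exists only if that metric was computed during cross-validation):

stopifnot('tuning' %in% names(out))   # the required sublist (see argument x)

# rank each fold's models by test-sample AUC (higher is better)
summaryByCrossValid(out, metric='aucTest')

# log loss is better when lower, so sort in increasing order
summaryByCrossValid(out, metric='logLossTest', decreasing=FALSE)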

Value

Data frame with statistics on the best set of models across k-folds. Depending on the model algorithm, this could be (a short sketch follows this list):

  • BRTs (boosted regression trees): Learning rate, tree complexity, and bag fraction.

  • GLMs (generalized linear models): Frequency of use of each term in the best models.

  • Maxent: Frequency of times each specific combination of feature classes was used in the best models plus mean master regularization multiplier for each feature set.

  • NSs (natural splines): Data frame, one row per fold and one column per predictor, with values representing the maximum degrees of freedom used for each variable in the best model of each fold.
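For example, with the default trainFxName = 'trainGlm', the summary tabulates how often each model term appears among the best models across folds. A quick way to inspect whatever is returned (a sketch; the exact columns depend on the training function):

summ <- summaryByCrossValid(out)   # data frame of statistics on best models
str(summ)                          # columns differ by model algorithm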

See Also

trainByCrossValid

Examples

## Not run: 
set.seed(123)
### contrived example
# generate training/testing data
n <- 10000
x1 <- seq(-1, 1, length.out=n) + rnorm(n)
x2 <- seq(10, 0, length.out=n) + rnorm(n)
x3 <- rnorm(n)
y <- 2 * x1 + x1^2 - 10 * x2 - x1 * x2
y <- statisfactory::probitAdj(y, 0)
y <- y^3
presAbs <- runif(n) < y
data <- data.frame(presAbs=presAbs, x1=x1, x2=x2, x3=x3)

model <- trainGlm(data)
summary(model)

folds <- dismo::kfold(data, 3)
out <- trainByCrossValid(data, folds=folds, verbose=1)

summaryByCrossValid(out)

str(out, 1)
head(out$tuning[[1]])
head(out$tuning[[2]])
head(out$tuning[[3]])

# can do the following for each fold (there are 3)
lapply(out$models[[1]], coefficients)
sapply(out$models[[1]], logLik)
sapply(out$models[[1]], AIC)

# select model for k = 1 with greatest CBI
top <- which.max(out$tuning[[1]]$cbiTest)
summary(out$models[[1]][[top]])

# in fold k = 1, which models perform well but are not overfit?
plot(out$tuning[[1]]$cbiTrain, out$tuning[[1]]$cbiTest, pch='.',
		main='Model Numbers for k = 1')
abline(0, 1, col='red')
numModels <- nrow(out$tuning[[1]])
text(out$tuning[[1]]$cbiTrain, out$tuning[[1]]$cbiTest, labels=1:numModels)
usr <- par('usr')
# bottom-right corner: models that score much better on training than test data
x <- usr[1] + 0.9 * (usr[2] - usr[1])
y <- usr[3] + 0.1 * (usr[4] - usr[3])
text(x, y, labels='overfit', col='red', xpd=NA)
# top-left corner: models that score better on test than training data
x <- usr[1] + 0.1 * (usr[2] - usr[1])
y <- usr[3] + 0.9 * (usr[4] - usr[3])
text(x, y, labels='suspicious', col='red', xpd=NA)

## End(Not run)
