knitr::opts_chunk$set(
  echo = TRUE, 
  message = FALSE,
  warning = FALSE,
  fig.show = 'asis',
  out.width = '75%',
  fig.align = 'center',
  fig.width = 7,
  fig.height = 7
)

Outlines

This library has been created for model selection to predict species distribution (Biomass, density or presence/absence). Its final aim is to produce maps of predicted distributions. However, the core is a k-fold cross-validation model selection procedure that can be applied to any kind of model, provided that parameters of the selection are well defined. We use the classical cars dataset to show the different possibilities of the library. The variable of interest is mpg. This is a continuous numerical variable, only positive. This is in accordance with our data modelling case datatype = "ContPosNull" (continuous positive or null) and the model types associated (modeltypes). For other types of data, refer to the documentation (in particular, options in modelselect_opt) to correctly set your modelling parameters.
Note that most figures of the vignette are saved in "inst" so that model selection is not run during vignette building. However, code in the vignettes can be run on your own computer and should return the same outputs. To do so, open and run this Rmd file: r system.file("Covar_Selection", "Covar_Selection.Rmd", package = "SDMSelect").

Load libraries and other setup

library(SDMSelect)
set.seed(50)

# Temp directory for saving all outputs
tmpdir <- paste0(tempdir(), "/out_CovarSelection")
dir.create(tmpdir)

Dataset

data <- dplyr::mutate_at(mtcars, 8:11, as.character)
knitr::kable(head(data))

Modify the dataset to work with the library

In this case, this only consists in changing observation column name to dataY and append factor column names with factor_ for compatibility with following functions.

data.new <- Prepare_dataset(
  x = data, var = 1, cov = 2:ncol(data),
  datatype = "Cont", na.rm = TRUE
)
knitr::kable(head(data.new))

Covariates correlation

Due to identifiability issues, highly correlated environmental covariates should not be included in the same models. Correlations between all possible pairs of covariates are tested using the Spearman's rho coefficient. Here, a rho value exceeding 0.7 considers covariates too correlated to be included in the same model. However, the following cross-validation procedure also guarantees too correlated covariates not to be included in the same models as gain in prediction may be low. Thus, the present covariates correlation selection step mainly allows the number of tested models to be reduced.

corSpearman <- Param_corr(
  x = data.new, rm = 1, thd = 0.7, visual = FALSE,
  plot = TRUE, saveWD = tmpdir, img.size = 5)
knitr::include_graphics(file.path(tmpdir, "Covariate_correlation_crop.png"))

Find the best combination of covariates

Set options for presence-absence models. Have a look at options documentation ?modelselect_opt.

modelselect_opt(RESET = TRUE)
modelselect_opt$Max_nb_Var <- 5
modelselect_opt$datatype <- "ContPosNull"
modelselect_opt$modeltypes <- modelselect_opt$modeltypes[c(1, 5, 11)]

Procedure selects combination of covariates in each iteration from one covariate to the maximum defined (modelselect_opt("Max_nb_Var") = r modelselect_opt("Max_nb_Var")). This reproduces the procedure separately for all model types defined (modelselect_opt("modeltypes") = r paste(modelselect_opt("modeltypes"), collapse = ", ")). Here, for the example, I only run the model for three "modeltypes" (r paste(modelselect_opt$modeltypes[c(1, 5, 11)], collapse = ", ")).
The output of findBestModel function is the link to a zipped file of all outputs saved in the saveWD directory.

res.file <- findBestModel(
  x = data.new, datatype = "ContPosNull", 
  corSpearman = corSpearman, saveWD = tmpdir, 
  verbose = 1)

Order models according to quality of prediction

Model selection was realised separately for each distribution tested. The exact same k-fold cross-validation datasets have been used to keep the best model at each step of the iteration procedure. All indices of goodness of fit can thus be compared among the distributions tested with paired statistical tests. This allows to order all models tested.

BestModels <- ModelOrder(saveWD = res.file, plot = TRUE)
# To add in /inst for vignettes
readr::write_rds(BestModels, "BestModels.rds")

Two tables are available:

knitr::kable(BestModels$VeryBestModels_crossV, row.names = FALSE)
knitr::kable(BestModels$BestForModeltypes, row.names = FALSE)

Predictions of the best model

We can choose one model, the best one here, and create a set of figure outputs to explore what it is predicting. Figures are saved in saveWD and its compressed version (its path is the output of the function, here stored in res.file)

Num.Best <- BestModels$VeryBestModels_crossV$Num[1]
res.file <- ModelResults(saveWD = tmpdir, plot = TRUE, 
                 Num = Num.Best, Marginals = TRUE)

In the case of a model with positive values as here, the outputs are the following. If the best model is with a log-transformation, figures are mainly created in the log-scale

knitr::include_graphics(paste0(tmpdir, "/", basename(tmpdir), "-Param_Exp_Deviance_", Num.Best, ".png"))
# Save in inst
file.copy(paste0(tmpdir, "/", basename(tmpdir), "-Param_Exp_Deviance_", Num.Best, ".png"), "Param_Exp_Deviance.png", overwrite = TRUE)
knitr::include_graphics(paste0(tmpdir, "/", basename(tmpdir), "-Residual-Analysis_", Num.Best, ".png"))
# Save in inst
file.copy(paste0(tmpdir, "/", basename(tmpdir), "-Residual-Analysis_", Num.Best, ".png"), "Residual-Analysis.png", overwrite = TRUE)
knitr::include_graphics(paste0(tmpdir, "/", basename(tmpdir), "-CovForMeanPred_Marginals_", Num.Best, ".png"))
# Save in inst
file.copy(paste0(tmpdir, "/", basename(tmpdir), "-CovForMeanPred_Marginals_", Num.Best, ".png"),
          "CovForMeanPred_Marginals.png", overwrite = TRUE)
knitr::include_graphics(paste0(tmpdir, "/", basename(tmpdir), "-Obs-Pred_", Num.Best, ".png"))
# Save in inst
file.copy(paste0(tmpdir, "/", basename(tmpdir), "-Obs-Pred_", Num.Best, ".png"),
          "Obs-Pred.png", overwrite = TRUE)


statnmap/SDMSelect documentation built on April 1, 2021, 2:01 p.m.