build_DT: Build ecological Diagnostic Tool

Description Usage Arguments Details Value See Also

View source: R/build_DT.R


This function allows to build an ecological Diagnostic Tool (DT) that predicts an Impairment Probability for one or several anthropogenic pressures.


build_DT(metrics, pressures, low = "low", impaired = "impaired",
  pathDT, params = list(num.trees = 500, mtry = 25, sample.fraction =
  0.632, min.node.size = 25), CVfolds = 5, nIter = 1L, nCores = 1L,
  trainingFrac = 1, samplingUnit = NULL, calibPopSize = 100,
  calibGenNb = 1, seed = 20181025)



a data frame with samples in rows and biological metrics in columns


a data frame with samples in rows and pressure information in columns (one per pressure category). The table is filled with quality classes (e.g. low or impaired)

low, impaired

character vectors with the labels of the pressure classes (in pressures) corresponding to low impact and impaired situations, respectively.


character string, the path where the built models will be saved


a named list with the values, one or two (minimum, maximum) of the following parameters: - num.trees: Number of trees to grow; - mtry: Number of variables randomly sampled as candidates at each split; - sample.fraction: Proportion of samples to draw; - min.node.size: Minimum size of terminal nodes.


an integer indicating the number of parts made from the training data set and used to calibrate the model hyper-parameters.


an integer indicating the number of ranger RF models created for each pressure type. nIter larger than 1 allow to estimate prediction uncertainty and improve model robustness.


an integer indicating the number of CPU cores available to parallelize the calibration step


a number between 0 and 1 indicating which propotion of the data set will be used to train the model


a vector with a length equal to the number of rows of metrics and pressures indicating to which group each observation belongs to. The training and test data sets will be obtained by sampling these groups and not the observations (except if samplingUnit = NULL, the default).


numeric. The size of the population used by the genetic algorithm used to calibrate the parameters.


numeric. The number of generations used by the genetic algorithm used to calibrate the parameters. (calibGenNb + 1) * calibPopSize gives the total number of iterations performed by the calibration algorithm.


numeric. The seed used for the random number generator


The function takes as input two tables: one with a categorical description (quality classes) of samples by one or several anthropogenic pressures and the second with the values of biological metrics calculated from the community data from the same samples.

For each pressure (i.e. column in the pressures table), a model is built and saved in the directory given by the pathDT argument. The whole set of models (DT units) saved in this directory constitute the DT.

Each DT unit is a probability random forest model built using the ranger function to predict the probability of a community being impaired by the pressure considered based on the biological metrics exhibited by the communities.

For each DT unit, the given metrics and pressures tables are splitted in training and test data sets. This is performed using the trainingFrac argument that specify the proportion of the data (once observations with missing pressure are removed) from each pressure level (low or impaired) that are used to constitute the training data (stratified sampling). By default, trainingFrac refers to the observations (rows of metrics and pressures) but if a grouping vector (e.g. site ID) is given to the argument samplingUnit, then this the training data set is built by sampling among samplingUnit and not among the rows. If a site has observations with different pressure levels (low or impaired), then the level occuring with highest frequency is allocated to the site.

The hyper-parameters of the ranger model are given in the params argument that could accept one or several values per parameter. If several values are given, an optimization procedure using tuneParamsMultiCrit is performed to identify the parameter set exhibiting the best trade-off between performance (AUC) and execution time. Two optimization algorithms are implemented: a grid search and a genetic optimization algorithm. If the argument calibGenNb is larger than one then the genetic algorithm is used and the space search for each parameter is determined by the minimum and maximum values given in params. If calibGenNb is smaller or equal to 1, then a grid search testing all the params value combinations is performed.


nothing, the models and used data are saved as .rda objects in the directory corresponding to the pathDT argument.

See Also

ranger tuneParamsMultiCrit

CedricMondy/ecodiag documentation built on Nov. 7, 2018, 2:30 p.m.