perform_grid_evaluation    R Documentation

Perform a grid evaluation of parameters to tune the classification framework

Description

The performance of the framework, measured as Sensitivity (the rate of relevant records found over all relevant records) and Efficiency (one minus the ratio of manually reviewed records over the total number of records), is strongly impacted by a number of parameters. This function evaluates the framework over a grid of parameter combinations to help tune them.
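
For illustration, the two metrics can be computed from a fully reviewed data set; a minimal sketch, assuming hypothetical logical columns Relevant (ground-truth label) and Reviewed (manually checked), which are not prescribed by the package:

# df: one row per record, with hypothetical logical columns
# Relevant (ground truth) and Reviewed (manually checked)
sensitivity <- sum(df$Relevant & df$Reviewed) / sum(df$Relevant)
efficiency <- 1 - sum(df$Reviewed) / nrow(df)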

Usage

perform_grid_evaluation(
  records,
  sessions_folder = "Grid_Search",
  prev_classification = records,
  resample = c(FALSE, TRUE),
  n_init = c(50, 100, 250, 500),
  n_models = c(1, 5, 10, 20, 40, 60),
  pos_mult = c(1, 10, 20),
  pred_quants = list(c(0.1, 0.5, 0.9), c(0.05, 0.5, 0.95), c(0.01, 0.5, 0.99)),
  limits = list(stop_after = 4, pos_target = NULL, labeling_limit = NULL)
)

Arguments

records

A fully labelled Annotation data set (a data frame or a path to an Excel/CSV file).

sessions_folder

The path to a folder where the grid search results will be stored.

prev_classification

An Annotation data set or file with labelled records. The labels in this data set will be used as ground truth for the records file, but the records themselves will not be used.

n_init

A vector of candidate sizes for the initial training set. The initial training set simulates the initial manual labelling of records used to train the model; it is generated from the records data set by selecting records in descending order.

pos_mult, n_models, resample, pred_quants

A vector of values to test for each parameter; for pred_quants, a list of quantile vectors. See enrich_annotation_file() for more details.

limits

The conditions under which a Classification/Review (CR) cycle is stopped. See enrich_annotation_file().

Details

These parameters pertain to the framework only and are independent of the specific Bayesian classification method used (which itself has other, method-specific parameters). The parameters are the following:

  • n_init: The number of records in the manually labeled initial training set.

  • n_models: The number of models trained and then averaged to stabilize the posterior predictive distribution (PPD).

  • resample: Whether to bootstrap the data between model retrainings if the number of models is more than one.

  • pos_mult: Oversampling rate of the positive labeled records.

  • pred_quants: The quantiles used to summarise the records' PPD and build the Uncertainty zone.

Check enrich_annotation_file() for more insight into their influence on the framework and the classification results. Since all records are pre-labelled, the manual review phase is performed automatically.
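
To illustrate the role of pred_quants, each record's PPD is reduced to a lower bound, a median, and an upper bound; a minimal sketch with simulated posterior samples (the actual summary is produced internally by enrich_annotation_file()):

# Simulated posterior samples of a record's probability of relevance
ppd_samples <- rbeta(4000, shape1 = 2, shape2 = 5)

# Summarise the PPD at the chosen quantiles; wider intervals (e.g.
# c(0.01, 0.5, 0.99)) enlarge the Uncertainty zone, sending more
# records to manual review
quantile(ppd_samples, probs = c(0.01, 0.5, 0.99))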

The algorithm starts from a fully labelled Annotation set and performs a Classification/Review cycle for each combination of parameters.

A great number of files will be created (40 GB with the default grid parameters for an input records file with 1200 labelled records), with one session folder for each parameter combination. Therefore, be sure to have enough disk space before starting. Also, keep in mind that a full search may require many days, even on powerful computers.
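
It may help to count the parameter combinations before launching the search; a minimal sketch over the default grid (the Cartesian product of the parameter values, with the pred_quants sets represented by their index):

grid <- expand.grid(
  resample = c(FALSE, TRUE),
  n_init = c(50, 100, 250, 500),
  n_models = c(1, 5, 10, 20, 40, 60),
  pos_mult = c(1, 10, 20),
  pred_quants = 1:3 # index of the quantile vector
)
nrow(grid) # 432 combinations, i.e. 432 Classification/Review sessions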

Value

A message with the number of parameter combinations evaluated.

Examples

## Not run: 

# First, the user needs to manually label a significant number of records; we
# suggest one thousand or more. The new record file can be stored anywhere,
# but putting it into the grid search folder is good practice.

records <- file.path("Grid_Search", "Classification_data.xlsx")

Grid_search <- perform_grid_evaluation(
  records,
  sessions_folder = "Grid_Search",
  prev_classification = records,
  ## Model parameters (can be changed by users)
  resample = c(FALSE, TRUE),
  n_init = c(50, 100, 250, 500),
  n_models = c(1, 5, 10, 20, 40, 60),
  pos_mult = c(1, 10, 20),
  pred_quants = list(
    c(.1, .5, .9),
    c(.05, .5, .95),
    c(.01, .5, .99)
  )
)

## End(Not run)
