perform_grid_evaluation | R Documentation
The performance of the framework, measured as Sensitivity (the rate of relevant records found over all relevant records) and Efficiency (one minus the ratio of manually reviewed records to total records), is strongly affected by a number of parameters.
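To make the two metrics concrete, here is a minimal base-R sketch (not part of the package) computing them from hypothetical counts of a finished screening session; all four counts below are invented for illustration:

```r
# Hypothetical counts from one screening session (illustration only)
n_relevant_total <- 80   # relevant records in the whole data set
n_relevant_found <- 76   # relevant records found by the framework
n_reviewed       <- 300  # records that were manually reviewed
n_records        <- 1200 # total records in the data set

# Sensitivity: relevant records found over all relevant records
sensitivity <- n_relevant_found / n_relevant_total

# Efficiency: one minus the ratio of manually reviewed records
efficiency <- 1 - n_reviewed / n_records

round(c(Sensitivity = sensitivity, Efficiency = efficiency), 2)
```

A good parameter combination pushes both numbers towards one: finding nearly all relevant records while reviewing only a small fraction of the data set.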
perform_grid_evaluation(
  records,
  sessions_folder = "Grid_Search",
  prev_classification = records,
  resample = c(FALSE, TRUE),
  n_init = c(50, 100, 250, 500),
  n_models = c(1, 5, 10, 20, 40, 60),
  pos_mult = c(1, 10, 20),
  pred_quants = list(
    c(0.1, 0.5, 0.9),
    c(0.05, 0.5, 0.95),
    c(0.01, 0.5, 0.99)
  ),
  limits = list(stop_after = 4, pos_target = NULL, labeling_limit = NULL)
)
records
: A fully labelled Annotation data set (a data frame or a path to an Excel / CSV file).

sessions_folder
: A path to the folder where the grid search results are stored.

prev_classification
: An Annotation data set or file with labelled records. The labels in this data set are used as the ground truth for the automatic review phase.

n_init
: A vector of numbers enumerating the size of the initial training set. The initial training set simulates the initial manual labelling of records used to train the model. It is generated by the

pos_mult, n_models, resample, pred_quants
: A vector of values for each parameter. For their meaning, see the Details section below.

limits
: The conditions on which a Classification/Review (CR) cycle is stopped. See enrich_annotation_file().
These parameters relate to the framework only and are independent of the specific Bayesian classification method used (which itself has other specific parameters). The parameters are the following:
n_init
: The number of records in the manually labelled initial training set.

n_models
: The number of models trained and then averaged to stabilize the posterior predictive distribution (PPD).

resample
: Whether to bootstrap the data between model retrainings if the number of models is more than one.

pos_mult
: The oversampling rate of the positively labelled records.

pred_quants
: The quantiles used to summarise the records' PPD and build the Uncertainty zone.
Check enrich_annotation_file() for more insight into their influence on the framework and the classification results. Since all records are pre-labelled, the manual review phase is performed automatically.
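As an illustration of what one pred_quants setting does, the sketch below summarises a simulated PPD for a single record with the c(0.05, 0.5, 0.95) triplet. The beta-distributed draws are hypothetical stand-ins for real posterior samples, and the uncertainty-zone rule itself is implemented by the package; this only shows the quantile summary:

```r
# Illustration only: summarise one record's simulated PPD with a
# pred_quants triplet (lower bound, median, upper bound).
set.seed(1)
ppd_sample <- rbeta(4000, 2, 5)  # hypothetical posterior draws of P(relevant)

q <- quantile(ppd_sample, probs = c(0.05, 0.5, 0.95))
q  # the interval between the outer quantiles brackets the record's prediction
```

Wider outer quantiles (e.g. c(0.01, 0.5, 0.99)) yield wider intervals, so more records overlap the Uncertainty zone and are sent to review; narrower ones trade review effort for a higher risk of missing relevant records.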
The algorithm starts from a fully labelled Annotation set and performs a Classification/Review cycle for each combination of parameters.
A great number of files will be created (about 40 GB with the default grid
parameters for an input records
file with 1200 labelled records), one
session folder for each parameter combination. Therefore, be sure to have
enough disk space before starting. Also, keep in mind that a full search may
require many days, even on powerful computers.
A message with the number of parameter combinations evaluated.
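The size of the default grid follows directly from the Cartesian product of the parameter vectors, and can be checked up front in base R before committing disk space; each row of this grid corresponds to one session folder (the pred_quants triplets are indexed here for counting only):

```r
# Count the parameter combinations in the default grid (one session each)
grid <- expand.grid(
  resample    = c(FALSE, TRUE),
  n_init      = c(50, 100, 250, 500),
  n_models    = c(1, 5, 10, 20, 40, 60),
  pos_mult    = c(1, 10, 20),
  pred_quants = seq_len(3)  # the three quantile triplets, by index
)

nrow(grid)  # 2 * 4 * 6 * 3 * 3 = 432 combinations
```

Trimming any one vector shrinks the search multiplicatively, which is the easiest way to keep disk usage and run time manageable.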
## Not run:
# First, the user needs to manually label a significant number of records; we
# suggest one thousand or more. The new record file can be stored anywhere,
# but putting it into the grid search folder is a better practice.
records <- file.path("Grid_Search", "Classification_data.xlsx")

Grid_search <- perform_grid_evaluation(
  records,
  sessions_folder = "Grid_Search",
  prev_classification = records,

  ## Model parameters (can be changed by users)
  resample = c(FALSE, TRUE),
  n_init = c(50, 100, 250, 500),
  n_models = c(1, 5, 10, 20, 40, 60),
  pos_mult = c(1, 10, 20),
  pred_quants = list(
    c(.1, .5, .9),
    c(.05, .5, .95),
    c(.01, .5, .99)
  )
)
## End(Not run)