
enrich_annotation_file    R Documentation

Enrich an Annotation data set with predictions based on a Bayesian classification model

Description

This function is the central component of the framework. It takes as input a session name or the path to an Annotation data frame and trains a Bayesian model to predict the labels of all records.

Usage

enrich_annotation_file(
  session_name,
  file = NULL,
  DTM = NULL,
  pos_mult = 10,
  n_models = 10,
  resample = FALSE,
  pred_quants = c(0.01, 0.5, 0.99),
  sessions_folder = getOption("baysren.sessions_folder", "Sessions"),
  pred_batch_size = 5000,
  autorun = TRUE,
  stop_on_unreviewed = TRUE,
  dup_session_action = c("fill", "add", "replace", "stop"),
  limits = list(stop_after = 4, pos_target = NULL, labeling_limit = NULL),
  compute_performance = FALSE,
  use_prev_labels = TRUE,
  prev_classification = NULL,
  save_samples = TRUE,
  rebuild = FALSE,
  ...
)

Arguments

session_name

A session name, which also identifies a subfolder of sessions_folder. The function will automatically retrieve the output of the last CR iteration and continue the cycle if the conditions in limits are not fulfilled.

file

As an alternative to session_name, a direct path to an Annotation file can be used.

DTM

A path to a Document Term Matrix (DTM) as produced by create_training_set(). If NULL, one of two things happens: if the current CR iteration is a replication and an existing DTM is present in the session_name folder, that DTM is used; if the iteration is not a replication or no backup DTM exists, a new one is generated from the Annotation data.

pos_mult

A model parameter. Defines the oversampling rate of positive records before training. A higher number increases sensitivity at the cost of lower efficiency (more records to review manually) and longer training times.

n_models

A model parameter. The Bayesian model is run multiple times and the generated PPDs are averaged, creating an ensemble PPD. A higher number of models decreases uncertainty and increases efficiency, but greatly increases computation time.

resample

A model parameter. Whether to bootstrap the training data before modelling. It makes sense only if n_models >> 1; otherwise it amounts to losing data.

pred_quants

A model parameter. The levels of the PPD uncertainty intervals used to build the Uncertainty Zone. A larger uncertainty interval increases sensitivity but decreases efficiency. The middle level is only used to provide a descriptive point estimate of each record's PPD; it plays no role in defining the record labels.

sessions_folder

The path to the folder where all the sessions are stored.

pred_batch_size

Since creating the PPD carries a large memory burden, records are split into batches before computing predictions. Decrease this number if the function crashes due to Java memory problems.

autorun

If no unreviewed uncertain records are present, start a new CR iteration automatically.

stop_on_unreviewed

Raise an error if there are uncertain records that have not yet been manually reviewed (i.e. "*" in the Rev_prediction_new column). Set it to FALSE if the number of records to review manually is too high. In general we suggest reviewing no more than 250 records per iteration, since this is usually enough information for the model.

dup_session_action

Similar to the homonymous argument of create_session(). The default fill tells the function to create a session folder with the data in file if the session does not exist, and otherwise to do nothing and keep adding Annotation files to the session at each iteration. As in create_session(), add creates a new session, replace replaces the current session content, and stop raises an error.

limits

A list of conditions that prevent the function from running; if any is met, the function returns NULL with a message. They are especially useful if autorun is TRUE. stop_after: number of consecutive CR iterations without new positive predictions; pos_target: number of positive matches to find; labeling_limit: maximum fraction of all records that can be manually reviewed.
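For instance, assuming an existing session named "Session1", custom stopping conditions could be passed as a named list:

```r
# Hypothetical call: stop after 3 iterations without new positives, or once
# 100 positive matches are found, or once 25% of records have been reviewed.
enrich_annotation_file(
  "Session1",
  limits = list(stop_after = 3, pos_target = 100, labeling_limit = 0.25)
)
```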

compute_performance

Use estimate_performance() to estimate the total posterior sensitivity and efficiency based on a surrogate logistic model. It is very expensive to compute, so it is turned off by default.

use_prev_labels

If a Rev_previous column is present in the data, or a new one is created using prev_classification, use its labels to resolve uncertain record labels.

prev_classification

A data frame of already labelled records, used to automatically resolve uncertain record labels during a CR iteration. Only used if use_prev_labels is TRUE.

save_samples

Whether to save the record PPDs to disk. These (at least those related to the last CR iteration) are necessary to automatically create a new search query with extract_rules().

rebuild

Every time the function is run, a backup of the Bayesian model outputs (the PPD samples and the variable importance) is saved in the file "Model_backup.rds". If rebuild is FALSE and such a file exists, the backed-up copy is used instead of refitting the model; this allows skipping model retraining if a previous run failed before finishing. Set it to TRUE to force refitting.

...

Additional parameters passed to compute_BART_model().

Details

The model is trained on a coalesced outcome column Target, built from the initial manual labels in Rev_manual and the previously predicted labels in Rev_prediction that were manually reviewed. The labels can only be positive (y) or negative (n).

The output is another Annotation data frame with an extra column Predicted_label which contains the labels predicted by the model (see below). A Rev_prediction_new column is also created, which contains predictions that require manual review, while Rev_prediction stores the previously reviewed predictions.

Following the paradigm of "Active Learning", records in the Rev_prediction_new column will have a "*" label, which means that they need manual review. The user will need to change each "*" into a positive y or negative n label.

After the uncertain records in the output Annotation file are reviewed, the function can be called again, establishing a Classification/Review (CR) cycle. If no new positive record labels are found, the function should still be called a number of times (defined by the limits$stop_after argument) to produce "replication" CR iterations. These are needed to avoid missing uncertain records due to the stochasticity of the Bayesian model.

The Bayesian model generates a posterior predictive distribution (PPD) of the positive label for each record (saved in the Samples folder of the session). The PPDs are used to define the "Uncertainty Zone", which is delimited by the lowermost of the PPD credibility interval boundaries (at the chosen levels) among the positive labelled records, and by the uppermost of the interval boundaries among the manually reviewed negative labelled records. A record is labelled as positive/negative in the Predicted_label column if both of its uncertainty interval boundaries lie above/below the Uncertainty Zone. Otherwise, it is labelled as uncertain and will require manual review. If a prediction in Predicted_label contradicts one in Rev_manual, it will need to be reviewed by the user.
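As an illustrative sketch (not the package's internal code), the labelling rule can be written as follows, where low and up are toy PPD quantiles at pred_quants[1] and pred_quants[3] for five records and target holds the coalesced manual labels (NA = not reviewed):

```r
low    <- c(0.70, 0.75, 0.10, 0.40, 0.05)
up     <- c(0.95, 0.98, 0.30, 0.80, 0.20)
target <- c("y",  "y",  "n",  NA,   "n")

uz <- range(
  min(low[target %in% "y"]), # lowermost boundary among positive records
  max(up[target %in% "n"])   # uppermost boundary among reviewed negatives
)

# Positive/negative only if both interval boundaries fall above/below the
# zone; anything overlapping the zone is uncertain and requires review.
Predicted_label <- ifelse(low > uz[2], "y", ifelse(up < uz[1], "n", "*"))
```

In this toy example the fourth record, whose interval (0.40, 0.80) straddles the zone, comes out uncertain and would be flagged in Rev_prediction_new.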

If just a session name is passed, the function will automatically identify the last Annotation file to use, determine whether this is a new CR iteration or a replication, and label the output file accordingly.

The function takes a number of parameters which may impact the final CR cycle sensitivity and efficiency (fraction of total records that are manually reviewed): pos_mult, n_models, resample, pred_quants. The proposed default values should work in most cases; perform_grid_evaluation() can be used to find the best parameter combination for a new data set if the user believes there is room for improvement, but the evaluation of the grid is extremely computationally intensive.

The function saves a number of output files on disk.

Value

An Annotation data frame with a number of extra columns: Rev_prediction and Rev_prediction_new, which contain, respectively, the manually reviewed predictions from previous CR iterations and the new predictions that require review; Predicted_label, which stores the labels predicted using the Uncertainty Zone mechanism; and Pred_Low, Pred_Med, and Pred_Up, which describe each record's PPD at the uncertainty levels defined in the pred_quants argument.

Examples

## Not run: 

# A simple call using a session name will automatically pick up the right
# annotation file (the initial one or an already classified one if existing)
# and start a CR iteration.

enrich_annotation_file("Session1")

# Alternatively, an Annotation file can be passed manually (discouraged)

records_file <- get_session_files("Session1")$Records

enrich_annotation_file("Session1", file = records_file)
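
# The tuning parameters can also be set explicitly; e.g., a (hypothetical)
# higher-sensitivity run with more ensemble models, stronger oversampling of
# positives, and wider uncertainty intervals (all increase computation time).

enrich_annotation_file(
  "Session1",
  pos_mult = 20,
  n_models = 20,
  pred_quants = c(0.005, 0.5, 0.995)
)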

## End(Not run)

bakaburg1/BaySREn documentation built on March 30, 2022, 12:16 a.m.