
enrich_annotation_file    R Documentation

Enrich an Annotation data set with predictions based on a Bayesian classification model

Description

This function is the central component of the framework. It takes as input a session name or the path to an Annotation data frame and trains a Bayesian model to predict the labels of all records.

Usage

enrich_annotation_file(
  session_name,
  file = NULL,
  DTM = NULL,
  pos_mult = 10,
  n_models = 10,
  resample = FALSE,
  pred_quants = c(0.01, 0.5, 0.99),
  sessions_folder = getOption("baysren.sessions_folder", "Sessions"),
  pred_batch_size = 5000,
  autorun = TRUE,
  stop_on_unreviewed = TRUE,
  dup_session_action = c("fill", "add", "replace", "stop"),
  limits = list(stop_after = 4, pos_target = NULL, labeling_limit = NULL),
  compute_performance = FALSE,
  use_prev_labels = TRUE,
  prev_classification = NULL,
  save_samples = TRUE,
  rebuild = FALSE,
  ...
)

Arguments

session_name

A session name, which also identifies a subfolder of sessions_folder. The function will automatically retrieve the output of the last CR iteration and continue the cycle if the conditions in limits are not fulfilled.

file

As an alternative to session_name, a direct path to an Annotation file can be used.

DTM

A path to a Document Term Matrix (DTM) as produced by create_training_set(). If NULL, one of two things happens: if the current CR iteration is a replication and an existing DTM is present in the session_name folder, that DTM is used; if the iteration is not a replication or no backup DTM exists, a new one is generated from the Annotation data.

pos_mult

A model parameter. Defines the oversampling rate of positive records before training. A higher number increases sensitivity at the cost of lower efficiency (more records to review manually) and longer training times.

n_models

A model parameter. The Bayesian model is run multiple times and the generated PPDs are averaged, creating an ensemble PPD. A higher number of models decreases uncertainty and increases efficiency, but greatly increases computation time.

resample

A model parameter. Whether to bootstrap the training data before modelling. It makes sense only if n_models >> 1; otherwise it amounts to losing data.

pred_quants

A model parameter. The levels of the PPD uncertainty intervals used to build the Uncertainty Zone. A larger uncertainty interval increases sensitivity but decreases efficiency. The middle level is only used to provide a descriptive point estimate of each record's PPD; it plays no role in defining the record labels.

sessions_folder

The path to the folder where all the sessions are stored.

pred_batch_size

Since creating the PPD carries a large memory burden, records are split into batches before computing predictions. Decrease this number if the function crashes due to Java memory problems.

autorun

If no unreviewed uncertain records are present, start a new CR iteration automatically.

stop_on_unreviewed

Raise an error if there are uncertain records that have not yet been manually reviewed (i.e. "*" in the Rev_prediction_new column). Set it to FALSE if the number of records to review manually is too high. In general we suggest reviewing no more than 250 records per iteration, since this is usually enough information for the model.

dup_session_action

Similar to the homonymous argument of create_session(). The default fill tells the function to create a session folder with the data in file if the session does not exist, and otherwise to do nothing and keep adding Annotation files to the session at each iteration. As in create_session(), add creates a new session, replace replaces the current session content, and stop raises an error.

limits

A list of conditions that prevent the function from running; if any is met, the function returns NULL with a message. They are especially useful if autorun is TRUE. stop_after: number of consecutive CR iterations without new positive predictions; pos_target: number of positive matches to find; labeling_limit: maximum fraction of all records that can be manually reviewed.
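For instance, assuming an existing session named "Session1", custom stopping conditions could be passed as a named list:

```r
# Hypothetical call: stop after 3 iterations without new positives, or once
# 100 positive matches are found, or once 25% of records have been reviewed.
enrich_annotation_file(
  "Session1",
  limits = list(stop_after = 3, pos_target = 100, labeling_limit = 0.25)
)
```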

compute_performance

Use estimate_performance() to estimate the total posterior sensitivity and efficiency based on a surrogate logistic model. It is very expensive to compute, so it is turned off by default.

use_prev_labels

If a Rev_previous column is present in the data, or a new one is created using prev_classification, use its labels to resolve uncertain record labels.

prev_classification

A data frame of already labelled records, used to automatically resolve uncertain record labels during a CR iteration. Only used if use_prev_labels is TRUE.

save_samples

Whether to save the record PPDs to disk. These (at least those related to the last CR iteration) are necessary to automatically create a new search query with extract_rules().

rebuild

Every time the function is run, a backup of the Bayesian model outputs (the PPD samples and the variable importance) is saved in the file "Model_backup.rds". If rebuild is FALSE and such a file exists, the backed-up copy is used instead of refitting the model; this allows skipping model retraining if a previous run failed before finishing. Set it to TRUE to force refitting.

...

Additional parameters passed to compute_BART_model().

Details

The model is trained on a coalesced outcome column Target, built from the initial manual labels in Rev_manual and the previously predicted labels in Rev_prediction that were manually reviewed. The labels can only be positive (y) or negative (n).

The output is another Annotation data frame with an extra column Predicted_label which contains the labels predicted by the model (see below). A Rev_prediction_new column is also created, which contains predictions that require manual review, while Rev_prediction stores the previously reviewed predictions.

Following the paradigm of "Active Learning", records in the Rev_prediction_new column will have a "*" label, which means that they need manual review. The user will need to change each "*" into a positive y or negative n label.

After the uncertain records in the output Annotation file are reviewed, the function can be called again, establishing a Classification/Review (CR) cycle. If no new positive record labels are found, the function should still be called a number of times (defined by the limits$stop_after argument) to produce "replication" CR iterations. These are needed to avoid missing uncertain records due to the stochasticity of the Bayesian model.

The Bayesian model generates a posterior predictive distribution (PPD) of the positive label for each record (saved in the Samples folder of the session). The PPDs are used to define the "Uncertainty Zone", which is delimited by the lowermost of the PPD credibility interval boundaries (at the chosen levels) among the positive labelled records, and by the uppermost of the interval boundaries among the manually reviewed negative labelled records. A record is labelled as positive/negative in the Predicted_label column if both of its uncertainty interval boundaries lie above/below the Uncertainty Zone. Otherwise, it is labelled as uncertain and will require manual review. If a prediction in Predicted_label contradicts one in Rev_manual, it will need to be reviewed by the user.
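As an illustrative sketch (not the package's internal code), the labelling rule can be written as follows, where low and up are toy PPD quantiles at pred_quants[1] and pred_quants[3] for five records and target holds the coalesced manual labels (NA = not reviewed):

```r
low    <- c(0.70, 0.75, 0.10, 0.40, 0.05)
up     <- c(0.95, 0.98, 0.30, 0.80, 0.20)
target <- c("y",  "y",  "n",  NA,   "n")

uz <- range(
  min(low[target %in% "y"]), # lowermost boundary among positive records
  max(up[target %in% "n"])   # uppermost boundary among reviewed negatives
)

# Positive/negative only if both interval boundaries fall above/below the
# zone; anything overlapping the zone is uncertain and requires review.
Predicted_label <- ifelse(low > uz[2], "y", ifelse(up < uz[1], "n", "*"))
```

In this toy example the fourth record, whose interval (0.40, 0.80) straddles the zone, comes out uncertain and would be flagged in Rev_prediction_new.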

If just a session name is passed, the function will automatically identify the last Annotation file to use, determine whether this is a new CR iteration or a replication, and label the output file accordingly.

The function takes a number of parameters which may impact the final CR cycle sensitivity and efficiency (fraction of total records that are manually reviewed): pos_mult, n_models, resample, pred_quants. The proposed default values should work in most cases; perform_grid_evaluation() can be used to find the best parameter combination for a new data set if the user believes there is room for improvement, but the evaluation of the grid is extremely computationally intensive.

The function saves a number of output files on disk.

Value

An Annotation data frame with a number of extra columns: Rev_prediction and Rev_prediction_new, which contain, respectively, the manually reviewed predictions from previous CR iterations and the new predictions that require review; Predicted_label, which stores the labels predicted using the Uncertainty Zone mechanism; and Pred_Low, Pred_Med, and Pred_Up, which describe each record's PPD at the uncertainty levels defined in the pred_quants argument.

Examples

## Not run: 

# A simple call using a session name will automatically pick up the right
# annotation file (the initial one or an already classified one if existing)
# and start a CR iteration.

enrich_annotation_file("Session1")

# Alternatively, an Annotation file can be passed manually (discouraged)

records_file <- get_session_files("Session1")$Records

enrich_annotation_file("Session1", file = records_file)
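
# The tuning parameters can also be set explicitly; e.g., a (hypothetical)
# higher-sensitivity run with more ensemble models, stronger oversampling of
# positives, and wider uncertainty intervals (all increase computation time).

enrich_annotation_file(
  "Session1",
  pos_mult = 20,
  n_models = 20,
  pred_quants = c(0.005, 0.5, 0.995)
)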

## End(Not run)

bakaburg1/BaySREn documentation built on March 30, 2022, 12:16 a.m.