extract_rules: Extract screening rules from an Annotation data set

View source: R/Rule_building.R

extract_rules    R Documentation

Description

Starting from a Document Term Matrix (DTM) and a posterior predictive distribution (PPD) matrix produced by the Bayesian classification engine, a decision tree algorithm extracts rules that partition a subset of draws from the PPD. Note that generating the rules may take a long time.
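
The idea of partitioning PPD draws with decision trees can be sketched on toy data. This is only an illustration under stated assumptions, not the package's internal implementation: the term names, the layout of the PPD matrix (one row per record, one column per draw), the choice of one regression tree per draw and the rpart settings are all hypothetical.

```r
library(rpart)

# Toy DTM: 8 records, binary term indicators (hypothetical terms).
dtm <- data.frame(
  cancer = c(1, 1, 1, 1, 0, 0, 0, 0),
  trial  = c(1, 1, 0, 0, 1, 1, 0, 0)
)

# Toy PPD matrix (assumed layout): one row per record, one column per draw.
ppd <- cbind(
  draw1 = c(0.9, 0.8, 0.7, 0.9, 0.2, 0.1, 0.3, 0.1),
  draw2 = c(0.8, 0.9, 0.6, 0.8, 0.1, 0.2, 0.2, 0.3)
)

# Assumption: fit one regression tree per draw, predicting the drawn
# probability from the terms. Cross-validation is disabled (xval = 0)
# because the toy data set is tiny.
trees <- lapply(colnames(ppd), function(d) {
  rpart(y ~ ., data = cbind(dtm, y = ppd[, d]), method = "anova",
        control = rpart.control(minsplit = 2, minbucket = 1,
                                cp = 0.01, xval = 0))
})

# Collect the split conditions that lead to each leaf of the first tree.
fit <- trees[[1]]
leaves <- rownames(fit$frame)[fit$frame$var == "<leaf>"]
rules <- path.rpart(fit, nodes = as.numeric(leaves), print.it = FALSE)
```

Each element of `rules` is a character vector describing the path from the root to one leaf, which is the raw material a rule such as "term present / term absent" could be built from.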

Usage

extract_rules(
  session_name,
  rebuild_dtm = FALSE,
  vimp.threshold = 1.25,
  n.trees = 800,
  sessions_folder = getOption("baysren.sessions_folder", "Sessions"),
  save_path = file.path(sessions_folder, session_name, "rule_data.rds"),
  ...
)
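
A call might look like the following sketch. The session name "Session1" and the reduced n.trees value are hypothetical, and the code assumes the function is exported by an installed package named BaySREn; the guard makes the snippet a no-op when the package or the session folder is missing.

```r
# Resolve the sessions folder the same way the documented default does.
sessions_folder <- getOption("baysren.sessions_folder", "Sessions")
session_name <- "Session1"  # hypothetical session identifier
session_path <- file.path(sessions_folder, session_name)

# Only run when the package (assumed name: BaySREn) and the session exist.
if (requireNamespace("BaySREn", quietly = TRUE) && dir.exists(session_path)) {
  rule_data <- BaySREn::extract_rules(
    session_name,
    n.trees = 200,  # fewer draws than the default 800, to save time
    sessions_folder = sessions_folder,
    save_path = file.path(session_path, "rule_data.rds")
  )
  str(rule_data, max.level = 1)
}
```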

Arguments

session_name

A session identifier corresponding to a folder inside the sessions_folder folder.

rebuild_dtm

Whether to use the last DTM stored in the session_name folder (FALSE) or rebuild it from the last Annotation file (TRUE).

vimp.threshold

A threshold on the standardized variable importance score, used to filter out less relevant terms in the DTM.

n.trees

How many draws to use from the PPD matrix to build decision trees. Raising this value strongly increases computation time but also increases the sensitivity of the rules found.

sessions_folder

Where to find the sessions folders.

save_path

Since generating the rules is computationally intensive, it is advisable to save the output to a .rds file placed inside the session_name folder. The user only needs to provide the name of the file.

...

Additional arguments passed to rpart::rpart.control().
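
How extra arguments can reach rpart::rpart.control() is sketched below with a hypothetical wrapper, fit_tree(), and made-up toy data; it mirrors the documented forwarding pattern but is not the package's own code. The control arguments used (maxdepth, minsplit, xval) are standard rpart.control() parameters.

```r
library(rpart)

# Hypothetical wrapper: everything in `...` is handed to rpart.control().
fit_tree <- function(formula, data, ...) {
  rpart(formula, data = data, method = "class",
        control = rpart.control(...))
}

# Deterministic toy data: term_a mostly separates the two labels.
toy <- data.frame(
  label  = factor(rep(c("y", "n"), each = 20)),
  term_a = c(rep(1, 18), 0, 0, rep(0, 18), 1, 1),
  term_b = rep(c(0, 1), times = 20)
)

# Limit the tree to a single split level through the forwarded arguments.
shallow <- fit_tree(label ~ ., toy, maxdepth = 1, minsplit = 5, xval = 0)
```

Shallower trees yield shorter, more readable rules at the cost of coarser partitions, which is the kind of trade-off these arguments let the user tune.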

Details

To reduce computation time, the algorithm can use only a subset of the terms in the DTM and of the samples in the PPD matrix. For the terms, the vimp.threshold argument sets a threshold that retains only the most relevant features in the DTM; for the samples, n.trees sets how many PPD draws are used. Before being used, terms in the DTM that appear in multiple fields of the citation records are aggregated, and only their overall presence in the record is stored.
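
The aggregation of terms across record fields can be illustrated with a small sketch. The "FIELD__term" column naming and the toy values are assumptions made for this example only; the package may store fields differently.

```r
# Toy DTM: rows are records, columns are field-prefixed terms.
# The "FIELD__term" naming is a hypothetical convention for this sketch.
dtm <- data.frame(
  TITLE__cancer = c(1, 0, 0, 1),
  ABSTR__cancer = c(0, 1, 0, 1),
  ABSTR__trial  = c(0, 0, 1, 1)
)

# Strip the field prefix to recover the bare term behind each column.
terms <- sub("^[A-Z]+__", "", names(dtm))

# For each unique term, flag the records where it appears in any field.
aggregated <- sapply(unique(terms), function(tm) {
  as.integer(rowSums(dtm[, terms == tm, drop = FALSE]) > 0)
})
```

Here `aggregated` is a matrix with one column per distinct term, holding 1 when the term occurs anywhere in the record and 0 otherwise.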

Value

A list with:

SpecificDTM

The DTM with the less relevant terms filtered out and terms appearing in multiple record fields aggregated.

DTM

The full DTM with the predicted classification.

rules

A data frame reporting the selected rules with the average PPD.
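
Because the result is an ordinary list, it can be cached and reloaded with base R. The sketch below uses a mock list with the three documented element names; the contents and column names are placeholders, not the real output structure.

```r
# Mock result with the three documented elements; contents are placeholders.
rule_data <- list(
  SpecificDTM = data.frame(cancer = c(1, 0)),
  DTM         = data.frame(cancer = c(1, 0), trial = c(0, 1)),
  rules       = data.frame(rule = "cancer >= 0.5", stringsAsFactors = FALSE)
)

# Save to a temporary .rds file and read it back without recomputing.
cache_file <- tempfile(fileext = ".rds")
saveRDS(rule_data, cache_file)
reloaded <- readRDS(cache_file)
names(reloaded)
```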


bakaburg1/BaySREn documentation built on March 30, 2022, 12:16 a.m.