enrich_annotation_file | R Documentation
This function is the central component of the framework. It takes as input a session name or the path to an Annotation data frame and trains a Bayesian model to predict the labels of all records.
enrich_annotation_file(
  session_name,
  file = NULL,
  DTM = NULL,
  pos_mult = 10,
  n_models = 10,
  resample = FALSE,
  pred_quants = c(0.01, 0.5, 0.99),
  sessions_folder = getOption("baysren.sessions_folder", "Sessions"),
  pred_batch_size = 5000,
  autorun = TRUE,
  stop_on_unreviewed = TRUE,
  dup_session_action = c("fill", "add", "replace", "stop"),
  limits = list(stop_after = 4, pos_target = NULL, labeling_limit = NULL),
  compute_performance = FALSE,
  use_prev_labels = TRUE,
  prev_classification = NULL,
  save_samples = TRUE,
  rebuild = FALSE,
  ...
)
session_name: A session name, which also identifies a subfolder of sessions_folder where the session files are stored.

file: As an alternative to passing just a session name, the path to an Annotation data frame to use directly.

DTM: A path to a Document Term Matrix (DTM) as produced by the framework's DTM-building step.

pos_mult: A model parameter. Defines the oversampling rate of positive records before training. A higher number increases sensitivity at the cost of lower efficiency (more records to manually review) and longer training times.

n_models: A model parameter. The Bayesian model is run multiple times and the generated PPDs are averaged, creating an ensemble PPD. A higher number of models decreases uncertainty and increases efficiency, but greatly increases computation times.

resample: A model parameter. Whether to bootstrap the training data before modelling. It makes sense only when more than one model is trained (n_models > 1).

pred_quants: A model parameter. The levels of the PPD uncertainty intervals used to build the Uncertainty Zone. A larger uncertainty interval increases sensitivity but decreases efficiency. The middle level is only used to provide a descriptive point estimate of the record PPDs; it is not used to define the record labels.

sessions_folder: The path to the folder where all the sessions are stored.

pred_batch_size: Since creating the PPD carries a large memory burden, records are separated into batches before computing predictions. Decrease this number if the function crashes due to Java memory problems.

autorun: If no unreviewed uncertain records are present, start a new CR iteration automatically.

stop_on_unreviewed: Raise an error if there are uncertain records (marked by "*") that have not yet been manually reviewed.

dup_session_action: Similar to the argument of the same name used elsewhere in the framework; defines what to do when the session already exists ("fill", "add", "replace", or "stop").

limits: A list of conditions that would prevent the function from running a new CR iteration; the function returns early when one of them is met.

compute_performance: Whether to compute and store classification performance summaries for the iteration.

use_prev_labels: If TRUE, labels from a previous classification (see prev_classification) are used to automatically solve uncertain record labels.

prev_classification: A data frame with already labelled records, used to automatically solve uncertain record labels during a CR iteration. Only used if use_prev_labels is TRUE.

save_samples: Whether to save the PPDs of the records on disk. These are necessary (at least those related to the last CR iteration) to create a new search query automatically.

rebuild: Every time the function is run, a backup of the Bayesian model outputs (the PPD samples and the variable importance) is saved in the file "Model_backup.rds". If TRUE, the model is rebuilt from scratch instead of reusing this backup.

...: Additional parameters passed to the underlying model-fitting function.
The model is trained on a coalesced outcome column, Target, built from the initial manual labels in Rev_manual and the previously predicted labels in Rev_prediction that were manually reviewed. The labels can only be positive ("y") or negative ("n").
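The coalescing step above can be sketched as follows. This is only an illustration of the described logic, not the package's internal code; the column names are those documented on this page.

```r
# Illustrative sketch: build the training outcome by taking the initial
# manual label when present, falling back to the manually reviewed
# prediction from previous iterations.
Annotations$Target <- ifelse(
  !is.na(Annotations$Rev_manual),
  Annotations$Rev_manual,      # initial manual labels
  Annotations$Rev_prediction   # reviewed predicted labels
)
```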
The output is another Annotation data frame with an extra column, Predicted_label, which contains the labels predicted by the model (see later). A Rev_prediction_new column is also created, which contains predictions that require manual review, together with Rev_prediction, which stores the previously reviewed predictions.
Based on the paradigm of "Active Learning", records in the Rev_prediction_new column will have a "*" label, which means that they need manual review. The user will need to change each "*" into a positive ("y") or negative ("n") label.
After the uncertain records in the output Annotation file are reviewed, the function can be called again, establishing a Classification/Review (CR) cycle. If no new positive record labels are found, the function should still be called a number of times (defined by the limits$stop_after argument) to produce "replication" CR iterations. These are needed to avoid missing uncertain records due to the stochasticity of the Bayesian model.
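In outline, a CR cycle therefore looks like the following (the session name is illustrative):

```r
# 1. First iteration: trains the model and flags uncertain records with "*".
enrich_annotation_file("Session1")

# 2. Manually review the "*" labels in the output Annotation file (changing
#    each into "y" or "n"), then call again to run the next CR iteration.
enrich_annotation_file("Session1")

# 3. Once no new positives appear, keep calling until limits$stop_after
#    consecutive "replication" iterations have been produced.
enrich_annotation_file("Session1", limits = list(stop_after = 4))
```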
The Bayesian model generates a posterior predictive distribution (PPD) of the positive label for each record (saved into Samples in the session folder). The PPDs are used to define the "Uncertainty Zone", which is delimited by the lowermost of the PPDs' credibility interval boundaries (at the levels chosen in pred_quants) among the positive labelled records, and the uppermost of the interval boundaries among the manually reviewed negative labelled records. A record is labelled as positive/negative in the Predicted_label column if both its uncertainty interval boundaries are above/below the Uncertainty Zone; otherwise, it is labelled as uncertain and will require manual review. If a prediction in Predicted_label contradicts one in Rev_manual, it will need to be reviewed by the user.
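The Uncertainty Zone logic can be sketched in a few lines. This is an illustration based on the description above, not the package's internal implementation; it uses the Pred_Low/Pred_Up interval boundaries and the Target labels documented on this page.

```r
# Illustrative sketch of the Uncertainty Zone labelling rule.
UZ_low <- min(df$Pred_Low[df$Target %in% "y"], na.rm = TRUE) # lowermost boundary among positives
UZ_up  <- max(df$Pred_Up[df$Target %in% "n"], na.rm = TRUE)  # uppermost boundary among reviewed negatives

df$Predicted_label <- ifelse(
  df$Pred_Low > UZ_up, "y",         # whole interval above the zone -> positive
  ifelse(df$Pred_Up < UZ_low, "n",  # whole interval below the zone -> negative
         "*")                       # overlaps the zone -> uncertain, needs review
)
```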
If just a session name is passed, the function will automatically identify the last Annotation file to use, determine whether it is a new CR iteration or a replication, and label the output file accordingly.
The function takes a number of parameters which may have an impact on the final CR cycle's sensitivity and efficiency (the fraction of total records that are manually reviewed): pos_mult, n_models, resample, pred_quants. Default values are proposed which should work in most cases; perform_grid_evaluation() can be used to evaluate the best parameter combination for a new data set if the user believes there is margin for improvement, but the evaluation of the grid is extremely computationally intensive.
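As a rough illustration, a more sensitivity-oriented configuration could be tried by hand using the arguments documented above; the values here are arbitrary examples, not recommendations.

```r
# Illustrative only: trade efficiency and computation time for sensitivity.
enrich_annotation_file(
  "Session1",
  pos_mult    = 20,                    # stronger oversampling of positives
  n_models    = 20,                    # larger ensemble, less uncertainty
  resample    = TRUE,                  # bootstrap training data across models
  pred_quants = c(0.005, 0.5, 0.995)   # wider uncertainty intervals
)
```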
The function saves a number of output files on disk.

It produces an Annotation data frame with a number of extra columns: Rev_prediction and Rev_prediction_new, which contain respectively the manually reviewed predictions from previous CR iterations and the new predictions that require review; Predicted_label, which stores the labels predicted using the Uncertainty Zone mechanism; and Pred_Low, Pred_Med, and Pred_Up, which describe a record's PPD at the uncertainty levels defined in the pred_quants argument.
## Not run: 
# A simple call using a session name will automatically pick up the right
# annotation file (the initial one or an already classified one if existing)
# and start a CR iteration.
enrich_annotation_file("Session1")

# Alternatively, an Annotation file can be passed manually (discouraged)
records_file <- get_session_files("Session1")$Records
enrich_annotation_file("Session1", file = records_file)
## End(Not run)