active_label_wrapper: Active Learning EM Algorithm

View source: R/functions_active.R

active_label_wrapper    R Documentation

Active Learning EM Algorithm

Description

Active learning for the weighted EM algorithm. After the initial EM algorithm converges, an oracle is queried for labels of the documents about which the EM algorithm is most uncertain. This process repeats until the maximum number of iterations is reached or no documents remain in the window of uncertainty.
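The loop described above can be sketched in base R as follows. This is only an illustration under stated assumptions: the toy feature `x`, the helper names `fit_posterior` and `entropy`, and the simple logistic scorer that stands in for the weighted EM step are all hypothetical and are not the package's implementation. Only the stopping logic (at most `max_active` rounds, at most `max_query` queries per round, stop when nothing exceeds `bound`) follows the description.

```r
set.seed(1)
x <- c(rnorm(20, mean = -1), rnorm(20, mean = 1))  # hypothetical 1-D document feature
truth <- rep(c(0, 1), each = 20)                   # labels only the oracle knows
labeled <- c(1, 21)                                # initial labeled set, one per class
max_active <- 5
max_query <- 3
bound <- 0.1

# Stand-in for the EM step (an assumption, not the package's model):
# a logistic score around the midpoint of the labeled class means.
fit_posterior <- function(x, y, labeled) {
  m0 <- mean(x[labeled][y[labeled] == 0])
  m1 <- mean(x[labeled][y[labeled] == 1])
  plogis((m1 - m0) * (x - (m0 + m1) / 2))
}

# Binary Shannon entropy in bits, with 0 * log2(0) treated as 0.
entropy <- function(p) {
  h <- -(p * log2(p) + (1 - p) * log2(1 - p))
  h[is.nan(h)] <- 0
  h
}

for (iter in seq_len(max_active)) {
  post <- fit_posterior(x, truth, labeled)
  h <- entropy(post)
  h[labeled] <- 0                          # already labeled: nothing to ask
  uncertain <- which(h > bound)
  if (length(uncertain) == 0) break        # window of uncertainty is empty
  ranked <- uncertain[order(h[uncertain], decreasing = TRUE)]
  query <- head(ranked, max_query)         # most uncertain documents first
  labeled <- c(labeled, query)             # oracle reveals truth[query]
}
```

After the loop, `labeled` holds the indices of every document whose label was requested, including the two initial ones.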

Usage

active_label_wrapper(
  docs,
  labels = c(0, 1),
  doc_name = "text",
  index_name = "id",
  labels_name = NULL,
  lambda = 1,
  n_class = 2,
  n_cluster = 2,
  init_index = NULL,
  handlabel = TRUE,
  bound = 0,
  max_active = 5,
  init_size = 10,
  max_query = 10,
  lazy_eval = FALSE,
  force_list = FALSE,
  counter_on = TRUE,
  query_type = "basic_entropy",
  which_out_test = NULL,
  seed = NA,
  fixed_words = NULL,
  dfms = NULL,
  export_all_em = FALSE,
  export_all = FALSE,
  log_ratio_threshold = 0.001,
  log_ratio_conv_type = "maximand",
  mu = 1e-04,
  tau = 1e-04,
  regions = "both",
  lambda_decay = FALSE,
  ld_rate = 0.2,
  tune_lambda = FALSE,
  tune_lambda_prop_init = 0.1,
  tune_lambda_range = seq(0, 1, 0.1),
  tune_lambda_k = 10,
  tune_lambda_parallel = TRUE,
  NB_init = TRUE,
  export_val_stats_only = FALSE,
  model_name = "Model",
  agg_type = "best",
  n_cluster_collapse_type = "simple",
  beta = NA,
  active_eta_query = FALSE,
  keywords_list = list(NA, NA),
  keywords_scheme = NA,
  true_eta = NA,
  gamma = NA,
  validation_mode = FALSE,
  cont_metadata_varnames = NA,
  binary_metadata_varnames = NA,
  contextual_varnames = NA,
  mc_iter = NA,
  save_file_name = NA,
  save_directory = NA,
  load_saved = NA,
  ...
)

Arguments

docs

[matrix] Matrix of labeled and unlabeled documents, where each row has index values and a nested Matrix of word tokens.

labels

[vector] Vector of character strings indicating classification options for labeling.

doc_name

[character] Character string indicating the variable in 'docs' that denotes the text of the documents to be classified.

index_name

[character] Character string indicating the variable in 'docs' that denotes the index value of the document to be classified.

labels_name

[character] Character string indicating the variable in 'docs' that denotes the already known labels of the documents. Set to NULL by default.

lambda

[numeric] Numeric value between 0 and 1. Used to weight unlabeled documents.
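As a rough illustration of what down-weighting unlabeled documents means, the sketch below computes a class prior in which labeled documents count fully and unlabeled documents contribute their current posterior scaled by lambda. The function name `weighted_prior` and the toy data are hypothetical, and the update shown is the textbook weighted-EM form, assumed here rather than taken from the package source.

```r
# Hypothetical toy data: 1 = positive class, 0 = negative class.
labeled_y <- c(1, 0, 0, 1)           # known labels
unlabeled_post <- c(0.9, 0.2, 0.6)   # current P(class = 1) for unlabeled docs

weighted_prior <- function(labeled_y, unlabeled_post, lambda) {
  # Labeled documents count fully; unlabeled ones are scaled by lambda.
  num <- sum(labeled_y) + lambda * sum(unlabeled_post)
  den <- length(labeled_y) + lambda * length(unlabeled_post)
  num / den
}

prior_l0 <- weighted_prior(labeled_y, unlabeled_post, lambda = 0)  # labeled docs only
prior_l1 <- weighted_prior(labeled_y, unlabeled_post, lambda = 1)  # equal weighting
```

With lambda = 0 the unlabeled documents are ignored entirely; with lambda = 1 they count as much as labeled ones.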

n_class

[numeric] Number of classes to be considered.

handlabel

[logical] Boolean logical value indicating whether to initiate user-input script. If set to FALSE, and if labels_name is provided, the script queries the document label directly from the column denoted by labels_name.
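A minimal sketch of the non-interactive case (handlabel = FALSE with labels_name supplied): the "oracle" is just a lookup in the label column. The helper `query_oracle` and the toy data frame are hypothetical and only illustrate the idea; the package's internal lookup may differ.

```r
# Hypothetical documents with known labels stored in a column.
docs <- data.frame(
  id = c("a", "b", "c"),
  text = c("first doc", "second doc", "third doc"),
  label = c(1, 0, 1),
  stringsAsFactors = FALSE
)

query_oracle <- function(docs, ids, index_name, labels_name) {
  # Match the queried index values to rows, then read their labels.
  docs[[labels_name]][match(ids, docs[[index_name]])]
}

answers <- query_oracle(docs, ids = c("c", "b"),
                        index_name = "id", labels_name = "label")
```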

bound

[numeric] Minimum bound of entropy to call for additional labelling.

max_active

[numeric] Value of maximum allowed active learning iterations.

init_size

[numeric] Value of maximum allowed iterations within the EM algorithm.

max_query

[numeric] Maximum number of documents queried in each EM iteration.
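To make the interplay of bound and max_query concrete, here is a hedged sketch of entropy-based selection: compute the entropy of each posterior, keep documents above the bound, and take the most uncertain ones up to the query limit. The helpers `binary_entropy` and `select_queries` and the toy posteriors are hypothetical; the package's own query types may rank or sample differently.

```r
# Hypothetical posterior P(class = 1) for five unlabeled documents.
post <- c(d1 = 0.50, d2 = 0.99, d3 = 0.70, d4 = 0.05, d5 = 0.45)

binary_entropy <- function(p) {
  # Shannon entropy in bits; 0 * log2(0) is treated as 0.
  h <- -(p * log2(p) + (1 - p) * log2(1 - p))
  h[is.nan(h)] <- 0
  h
}

select_queries <- function(post, bound, max_query) {
  h <- binary_entropy(post)
  candidates <- h[h > bound]                     # window of uncertainty
  ranked <- sort(candidates, decreasing = TRUE)  # most uncertain first
  names(head(ranked, max_query))
}

to_label <- select_queries(post, bound = 0.5, max_query = 2)
```

Here three documents exceed the bound, and the two with the highest entropy are returned for labeling.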

lazy_eval

[logical] If 'lazy_eval == TRUE', convergence is measured by comparing changes in log likelihood across model iterations rather than by directly computing the maximand.

force_list

[logical] Switch indicating whether to force the filtering of documents with no entropy. Set to FALSE by default.

counter_on

[logical] Switch indicating whether the progress of each sequence of the EM algorithm is reported. By default set to TRUE.

query_type

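[string] String indicating which type of uncertainty sampling to use. Options are "standard_entropy", "normalized_entropy", "tiered_entropy", or "tiered_entropy_weighted". Note that the default shown in Usage is "basic_entropy", and the log_ratio_* arguments refer to 'query_type="log_ratio"'.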

which_out_test

[vector] Vector of document index labels identifying the documents to be used for out-of-sample validation of the learned model. Set to NULL by default. If a vector of labels is provided, the function returns an additional element containing classification likelihoods for all documents identified by the vector.

seed

[numeric] Sets seed for model.

fixed_words

[matrix] Matrix of fixed words with class probabilities, where ncol is the number of classes.

dfms

[matrix] Option to manually supply a dfm from quanteda.

export_all_em

[logical] Switch indicating whether to export the model. If TRUE, the function exports a list of lists containing all predictions.

export_all

[logical] Switch indicating whether to export model predictions from each stage of the algorithm.

log_ratio_threshold

[numeric] Threshold at which convergence is declared when using 'query_type="log_ratio"'.

log_ratio_conv_type

[string] If 'query_type="log_ratio"', this supplies the way that convergence is estimated. Set to 'maximand' by default.

mu

[numeric] Parameter for error acceptance with 'query_type="log_ratio"'.

tau

[numeric] Parameter for error acceptance with 'query_type="log_ratio"'.

regions

[string] Can be set to "both", "pos", or "neg" to sample from certain regions during log ratio sampling.

lambda_decay

[logical] Determines whether lambda value decays over active learning iterations or not.

ld_rate

[float] If 'lambda_decay == TRUE', sets the rate at which decay occurs.
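The documentation does not state the functional form of the decay. Purely as an illustration, the sketch below assumes a geometric schedule in which lambda shrinks by a factor of (1 - ld_rate) at each active iteration; the function `decay_lambda` and this formula are assumptions, not the package's definition.

```r
# Assumed geometric decay: lambda_t = lambda * (1 - ld_rate)^t.
decay_lambda <- function(lambda, ld_rate, iter) {
  lambda * (1 - ld_rate)^iter
}

# Lambda over the first five active iterations (iteration 0 is the start).
schedule <- decay_lambda(lambda = 1, ld_rate = 0.2, iter = 0:4)
```

Under this assumed schedule, unlabeled documents carry less weight in later rounds, when more hand labels are available.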

tune_lambda

[logical] Logical value indicating whether to tune lambda values with cross-validation over active learning iterations.

tune_lambda_prop_init

[numeric] Float value indicating the proportion of documents whose labels are supplied, rather than assigned by EM, during lambda tuning.

tune_lambda_range

[vector] Vector of float values, indicating the range of lambda values to search over when tuning lambda at each active iteration.

tune_lambda_k

[integer] Integer value indicating what k-fold level to cross validate at when tuning lambda.

NB_init

[boolean] Indicates whether each active iteration should start with a naive Bayes step in the EM or be initialized with the model predictions from the previous active iteration.

export_val_stats_only

[boolean] Indicates whether to export only the validation statistics from model runs.

model_name

[string] Model name string for exporting when 'export_val_stats_only == TRUE'.

agg_type

[string] Indicating how to aggregate model predictions.

n_cluster_collapse_type

[string] Indicates how to collapse multiple clusters into a binary class. By default, set to "simple", which takes the negative class probability as one minus the positive class probability. Can also be set to "max_neg", which calculates the normalized ratio of the positive cluster to the largest negative cluster.
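The two collapse rules can be sketched as follows. This is one reading of the description, not the package's code: it assumes the first column of the posterior matrix is the positive cluster, and it interprets "normalized ratio" as pos / (pos + largest negative). The function `collapse_clusters` and the toy matrix are hypothetical.

```r
# Hypothetical posterior over three clusters for two documents:
# column 1 is the positive cluster, columns 2-3 are negative clusters.
post <- rbind(c(0.6, 0.3, 0.1),
              c(0.2, 0.5, 0.3))

collapse_clusters <- function(post, type = c("simple", "max_neg")) {
  type <- match.arg(type)
  pos <- post[, 1]
  if (type == "max_neg") {
    # Renormalize against the single largest negative cluster.
    max_neg <- apply(post[, -1, drop = FALSE], 1, max)
    pos <- pos / (pos + max_neg)
  }
  cbind(pos = pos, neg = 1 - pos)
}

simple_out <- collapse_clusters(post, "simple")
max_neg_out <- collapse_clusters(post, "max_neg")
```

Under these assumptions, "max_neg" raises the positive probability whenever the negative mass is spread over several clusters.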

beta

[numeric] Prior parameter for eta.

active_eta_query

[boolean] Indicates whether to query oracle for eta tuning.

cont_metadata_varnames

[vector] Vector of continuous metadata variable names.

binary_metadata_varnames

[vector] Vector of binary metadata variable names.

...

Additional parameters to pass to 'get_dfm' and 'EM()' and 'get_uncertain_docs()'.

init_index

[vector] Vector that indicates which documents to use to initialize the algorithm. By default set to NULL, which causes a random subset of the documents to be selected.

quantileBreaks

[vector] Vector of break points to distinguish entropy zones. The first value is the break point between the first and second tier, the second is the break point between the second and third tier.

sampleProps

[vector] Vector of sampling proportions for each entropy zone. The first value is the proportion of max_query to be sampled from the high entropy region, the second value is the proportion to be sampled from the middle entropy region, and the third value is the proportion to be sampled from the lowest entropy region.
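A hedged sketch of tiered sampling follows. It assumes the break points are quantile probabilities of the entropy distribution, ordered from the high/middle boundary to the middle/low boundary, and that each proportion is applied to max_query and capped at the tier size. The function `tiered_sample`, its snake_case argument names, and the toy entropies are hypothetical, not the package's implementation.

```r
# Hypothetical entropies for ten unlabeled documents.
entropy <- c(0.98, 0.95, 0.91, 0.72, 0.65, 0.55, 0.31, 0.22, 0.12, 0.05)
names(entropy) <- paste0("d", 1:10)

tiered_sample <- function(entropy, quantile_breaks, sample_props, max_query) {
  # Cut points taken as quantiles of the entropy distribution (assumed).
  cuts <- quantile(entropy, probs = quantile_breaks)
  tier <- ifelse(entropy >= cuts[1], "high",
                 ifelse(entropy >= cuts[2], "mid", "low"))
  n_per_tier <- round(sample_props * max_query)
  names(n_per_tier) <- c("high", "mid", "low")
  picks <- lapply(names(n_per_tier), function(tr) {
    pool <- names(entropy)[tier == tr]
    # Sample without replacement, capped at the pool size.
    pool[sample.int(length(pool), min(length(pool), n_per_tier[[tr]]))]
  })
  setNames(picks, names(n_per_tier))
}

set.seed(42)
picked <- tiered_sample(entropy, quantile_breaks = c(0.75, 0.25),
                        sample_props = c(0.5, 0.3, 0.2), max_query = 10)
```

Sampling from all three tiers, rather than only the most uncertain documents, keeps some confidently classified documents in the labeled set.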

supervise

[logical] TRUE if supervised; FALSE if unsupervised.

contextual_varnames

[vector] Vector of contextual metadata variable names.

Value

[list] List containing the labeled document matrix, prior weights, word likelihoods, and a vector of user-labeled document IDs.


activetext/activeR documentation built on May 31, 2024, 10:21 a.m.