View source: R/functions_active.R
active_label_wrapper | R Documentation |
Active learning for the weighted-EM algorithm. After the initial EM algorithm converges, the oracle is queried for labels of the documents that the EM algorithm was most unsure of. This process iterates until the maximum number of active iterations is reached or no documents remain in the window of uncertainty.
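The query step described above can be sketched in base R. This is a minimal illustration, not the package's implementation: it assumes binary Shannon entropy as the uncertainty measure, and the helper names 'binary_entropy' and 'select_queries' are hypothetical. The 'bound' and 'max_query' arguments mirror the parameters of the same name documented below.

```r
# Binary Shannon entropy (in bits) of a positive-class probability.
# Terms with probability 0 contribute 0 by convention.
binary_entropy <- function(p) {
  h <- function(q) ifelse(q > 0, -q * log2(q), 0)
  h(p) + h(1 - p)
}

# Hypothetical helper: choose which documents to send to the oracle.
# Documents whose entropy exceeds `bound` are ranked from most to least
# uncertain, and at most `max_query` of them are returned.
select_queries <- function(ids, p_pos, bound = 0, max_query = 10) {
  ent <- binary_entropy(p_pos)
  uncertain <- which(ent > bound)
  ranked <- uncertain[order(ent[uncertain], decreasing = TRUE)]
  ids[head(ranked, max_query)]
}

ids   <- c("d1", "d2", "d3", "d4", "d5")
p_pos <- c(0.99, 0.52, 0.10, 0.45, 0.80)  # made-up posterior probabilities

queried <- select_queries(ids, p_pos, bound = 0.5, max_query = 2)
print(queried)

# When every document is classified with certainty, nothing is queried,
# which corresponds to the "no documents in the window of uncertainty" stop.
none_left <- select_queries(ids, c(0, 1, 0, 1, 1))
```

In this sketch the loop would stop either when 'select_queries' returns an empty vector or when the iteration count reaches 'max_active'.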
active_label_wrapper(
docs,
labels = c(0, 1),
doc_name = "text",
index_name = "id",
labels_name = NULL,
lambda = 1,
n_class = 2,
n_cluster = 2,
init_index = NULL,
handlabel = TRUE,
bound = 0,
max_active = 5,
init_size = 10,
max_query = 10,
lazy_eval = FALSE,
force_list = FALSE,
counter_on = TRUE,
query_type = "basic_entropy",
which_out_test = NULL,
seed = NA,
fixed_words = NULL,
dfms = NULL,
export_all_em = FALSE,
export_all = FALSE,
log_ratio_threshold = 0.001,
log_ratio_conv_type = "maximand",
mu = 1e-04,
tau = 1e-04,
regions = "both",
lambda_decay = FALSE,
ld_rate = 0.2,
tune_lambda = FALSE,
tune_lambda_prop_init = 0.1,
tune_lambda_range = seq(0, 1, 0.1),
tune_lambda_k = 10,
tune_lambda_parallel = TRUE,
NB_init = TRUE,
export_val_stats_only = FALSE,
model_name = "Model",
agg_type = "best",
n_cluster_collapse_type = "simple",
beta = NA,
active_eta_query = FALSE,
keywords_list = list(NA, NA),
keywords_scheme = NA,
true_eta = NA,
gamma = NA,
validation_mode = FALSE,
cont_metadata_varnames = NA,
binary_metadata_varnames = NA,
contextual_varnames = NA,
mc_iter = NA,
save_file_name = NA,
save_directory = NA,
load_saved = NA,
...
)
docs |
[matrix] Matrix of labeled and unlabeled documents, where each row has index values and a nested Matrix of word tokens. |
labels |
[vector] Vector of character strings indicating classification options for labeling. |
doc_name |
[character] Character string indicating the variable in 'docs' that denotes the text of the documents to be classified. |
index_name |
[character] Character string indicating the variable in 'docs' that denotes the index value of the document to be classified. |
labels_name |
[character] Character string indicating the variable in |
lambda |
[numeric] Numeric value between 0 and 1. Used to weight unlabeled documents. |
n_class |
[numeric] Number of classes to be considered. |
handlabel |
[logical] Boolean logical value indicating whether to initiate user-input script.
If set to |
bound |
[numeric] Minimum bound of entropy to call for additional labelling. |
max_active |
[numeric] Value of maximum allowed active learning iterations. |
init_size |
[numeric] Value of maximum allowed iterations within the EM algorithm. |
max_query |
[numeric] Maximum number of documents queried in each EM iteration. |
lazy_eval |
[logical] If |
force_list |
[logical] Switch indicating whether to force the filtering of documents with
no entropy. Set to |
counter_on |
[logical] Switch indicating whether the progress of each sequence of the EM algorithm
is reported. By default set to |
query_type |
[string] String indicating which type of uncertainty sampling to use. Options are |
which_out_test |
[vector] Vector of document index labels used to identify documents to be used for
out of sample validation of the learned model. Set to |
seed |
[numeric] Sets seed for model. |
fixed_words |
[matrix] Matrix of fixed words with class probabilities, where ncol is the number of classes. |
dfms |
[matrix] Option to manually supply a dfm from quanteda. |
export_all_em |
[logical] Switch indicating whether to export the model. If TRUE, the function exports a list of lists containing all predictions. |
export_all |
[logical] Switch indicating whether to export model predictions from each stage of the algorithm. |
log_ratio_threshold |
[numeric] Threshold at which convergence is declared when using 'query_type="log_ratio"'. |
log_ratio_conv_type |
[string] If 'query_type="log_ratio"', this supplies the way that convergence is estimated. Set to 'maximand' by default. |
mu |
Parameter for error acceptance with 'query_type="log_ratio"'. |
tau |
Parameter for error acceptance with 'query_type="log_ratio"'. |
regions |
[string] Can be set to "both", "pos", or "neg" to sample from certain regions during log ratio sampling. |
lambda_decay |
[logical] Determines whether lambda value decays over active learning iterations or not. |
ld_rate |
[float] If 'lambda_decay == TRUE', sets the rate at which decay occurs. |
tune_lambda |
[logical] Logical value indicating whether to tune lambda values with cross validation over active learning iterations. |
tune_lambda_prop_init |
[numeric] Float value indicating the proportion of documents whose labels are supplied, rather than assigned by EM, during lambda tuning. |
tune_lambda_range |
[vector] Vector of float values, indicating the range of lambda values to search over when tuning lambda at each active iteration. |
tune_lambda_k |
[integer] Integer value indicating what k-fold level to cross validate at when tuning lambda. |
NB_init |
[boolean] Indicates whether each active iteration should start with a naive step in the EM or whether to initialize with model predictions from previous active iteration. |
export_val_stats_only |
[boolean] Indicates whether to export only the validation stats from model runs. |
model_name |
[string] Model name string for exporting when 'export_val_stats_only == TRUE'. |
agg_type |
[string] Indicating how to aggregate model predictions. |
n_cluster_collapse_type |
[string] Indicates how to collapse multiple clusters into a binary class. By default, set to "simple", which takes the negative class probability as the difference between the positive class probability and 1. Can also be set to "max_neg", which calculates the normalized ratio of the positive cluster to the largest negative cluster. |
beta |
[numeric] Prior parameter for eta. |
active_eta_query |
[boolean] Indicates whether to query oracle for eta tuning. |
cont_metadata_varnames |
Vector of continuous metadata varnames |
binary_metadata_varnames |
Vector of binary metadata varnames |
... |
Additional parameters to pass to 'get_dfm' and 'EM()' and 'get_uncertain_docs()'. |
initIndex |
[vector] Vector that indicates which documents to use to initialize the
algorithm. By default set to |
quantileBreaks |
[vector] Vector of break points to distinguish entropy zones. The first value is the break point between the first and second tier, the second is the break point between the second and third tier. |
sampleProps |
[vector] Vector of sampling proportions for each entropy zone. The first value is
the proportion of |
supervise |
[logical] TRUE if supervised, FALSE if unsupervised. |
contextual_metadata_varnames |
Vector of contextual metadata varnames |
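The roles of 'lambda', 'lambda_decay', and 'ld_rate' can be illustrated with a short sketch. The decay schedule the package actually uses is not stated on this page, so the geometric schedule below is an assumption chosen for illustration, and 'decay_lambda' is a hypothetical helper. The word counts are made up.

```r
# Hypothetical schedule (an assumption, not the package's formula):
# lambda shrinks geometrically with each active learning iteration.
decay_lambda <- function(lambda, ld_rate, iter) {
  lambda * (1 - ld_rate)^iter
}

# Lambda at active iterations 0 through 4, using the default ld_rate = 0.2.
lambda_path <- decay_lambda(lambda = 1, ld_rate = 0.2, iter = 0:4)
print(round(lambda_path, 4))

# Lambda down-weights unlabeled documents: labeled word counts enter in
# full, while counts from unlabeled documents are scaled by lambda.
labeled_counts   <- c(economy = 12, election = 3)
unlabeled_counts <- c(economy = 40, election = 25)
weighted_counts  <- labeled_counts + lambda_path[3] * unlabeled_counts
print(weighted_counts)
```

Under this assumed schedule, unlabeled documents carry less weight as more hand labels accumulate.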
[list] List containing the labeled document matrix, prior weights, word likelihoods, and a vector of user-labeled document IDs.
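The two 'n_cluster_collapse_type' options can be sketched as follows. This is one reading of the argument description, not the package's code: the helper 'collapse_clusters' is hypothetical, and treating column 1 as the positive cluster is an assumption.

```r
# Hypothetical helper: collapse per-cluster probabilities to two classes.
# `probs` has one row per document; column 1 is assumed to be the positive
# cluster and the remaining columns are negative clusters.
collapse_clusters <- function(probs, type = c("simple", "max_neg")) {
  type <- match.arg(type)
  pos <- probs[, 1]
  if (type == "simple") {
    # Negative class probability is 1 minus the positive probability.
    neg <- 1 - pos
  } else {
    # Compare the positive cluster with the largest negative cluster only.
    neg <- apply(probs[, -1, drop = FALSE], 1, max)
  }
  pos_norm <- pos / (pos + neg)  # normalize so the two classes sum to 1
  cbind(pos = pos_norm, neg = 1 - pos_norm)
}

# Made-up probabilities for two documents over three clusters.
probs <- rbind(c(0.4, 0.3, 0.3),
               c(0.2, 0.7, 0.1))

collapsed_simple  <- collapse_clusters(probs, "simple")
collapsed_max_neg <- collapse_clusters(probs, "max_neg")
print(collapsed_simple)
print(collapsed_max_neg)
```

With "max_neg", a document split evenly across several negative clusters is pulled toward the positive class, because only the strongest negative cluster competes with the positive one.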