get_uncertain_docs: Get Uncertain Documents

get_uncertain_docs    R Documentation

Get Uncertain Documents

Description

Returns the documents that the previous iteration of the EM algorithm was least certain about.
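
As a rough illustration of what "least certain" can mean, the sketch below computes the Shannon entropy of each document's predicted class probabilities and ranks documents by it. The data frame, its column names, and the helper `shannon_entropy` are hypothetical and are not taken from the package, whose internal computation may differ.

```r
# Hypothetical class probabilities for five documents (two classes).
# Column names and values are made up for illustration.
probs <- data.frame(
  id = c("d1", "d2", "d3", "d4", "d5"),
  p_pos = c(0.98, 0.55, 0.10, 0.50, 0.80),
  stringsAsFactors = FALSE
)
probs$p_neg <- 1 - probs$p_pos

# Shannon entropy in bits; zero probabilities are dropped so that
# 0 * log2(0) is treated as 0.
shannon_entropy <- function(p) {
  p <- p[p > 0]
  -sum(p * log2(p))
}

probs$entropy <- apply(probs[, c("p_pos", "p_neg")], 1, shannon_entropy)

# Rank documents from most to least uncertain and keep the top two.
ranked <- probs[order(probs$entropy, decreasing = TRUE), ]
most_uncertain <- ranked$id[1:2]
```

Documents whose probabilities sit near 0.5 get the highest entropy, so they are the ones a labeler would be asked about first.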

Usage

get_uncertain_docs(
  docs,
  bound,
  max_query,
  index_name,
  hand_labeled_index,
  force_list = F,
  query_type = "basic_entropy",
  quantileBreaks = c(75, 20),
  sampleProps = c(0.5, 0.3, 0.2),
  mu = 0.001,
  tau = 0.001,
  regions = "both",
  dfm = NULL,
  seed = NULL,
  n_cluster = NULL
)

Arguments

docs

[matrix] Matrix of labeled and unlabeled documents.

bound

[numeric] The choice of lower bound for entropy-based uncertainty selection.

max_query

[numeric] Maximum number of uncertain documents that can be queried.

index_name

[character] Character string naming the variable in 'docs' that holds the index value of each document.

hand_labeled_index

[vector] Vector of index values for the hand-labeled documents in 'docs'.

force_list

[logical] Switch indicating whether to force the filtering of documents with no entropy. Defaults to FALSE.

query_type

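[string] String indicating which type of uncertainty sampling to use. Options are "standard_entropy", "normalized_entropy", "tiered_entropy", or "tiered_entropy_weighted". Note that the default shown under Usage is "basic_entropy", which is not among the options listed here.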

quantileBreaks

[vector] Vector of two break points that separate the entropy zones. The first value is the break point between the first and second tiers; the second is the break point between the second and third tiers.

sampleProps

[vector] Vector of sampling proportions, one per entropy zone. The first value is the proportion of max_query to be sampled from the high-entropy region, the second is the proportion to be sampled from the middle-entropy region, and the third is the proportion to be sampled from the low-entropy region.

n_cluster

[int] Number of clusters.
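
The interaction between max_query, quantileBreaks, and sampleProps can be sketched as follows. This is a minimal illustration, not the package's implementation: it assumes the break points are percentiles of the entropy distribution, and the variables `entropy`, `ids`, `tier`, and `queried` are hypothetical.

```r
set.seed(1)  # for a reproducible illustration only

# Hypothetical entropy scores for 100 unlabeled documents.
entropy <- runif(100)
ids <- seq_along(entropy)

max_query <- 10
quantile_breaks <- c(75, 20)      # assumed to be percentiles
sample_props <- c(0.5, 0.3, 0.2)  # high, middle, low entropy zones

# Convert the percentile breaks into entropy cut points.
cuts <- unname(quantile(entropy, probs = quantile_breaks / 100))

# Assign each document to a tier: 1 = high, 2 = middle, 3 = low entropy.
tier <- ifelse(entropy >= cuts[1], 1L, ifelse(entropy >= cuts[2], 2L, 3L))

# Number of documents to draw from each tier.
n_per_tier <- round(max_query * sample_props)

# Sample ids within each tier, never asking for more than the tier holds.
queried <- unlist(lapply(1:3, function(k) {
  pool <- ids[tier == k]
  pool[sample.int(length(pool), min(n_per_tier[k], length(pool)))]
}))
```

With these values, half of the queried documents come from the top quarter of the entropy distribution, while the remainder are spread over the middle and low zones so that the labeled set is not made up only of borderline cases.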

Value

[vector] Vector of id values of documents that the EM algorithm is uncertain about.
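
To show the shape of such a return value, the sketch below applies a lower entropy bound, drops ids that are already hand labeled, and keeps at most max_query ids. It assumes that hand-labeled documents are excluded from the query, which the text above does not state; all names and scores are hypothetical.

```r
# Hypothetical document ids with entropy scores (made up for illustration).
scores <- c(d1 = 0.14, d2 = 0.99, d3 = 0.47, d4 = 1.00, d5 = 0.72, d6 = 0.95)

hand_labeled <- c("d4")  # assumed to be skipped because it is already labeled
bound <- 0.5             # keep only documents at or above this entropy
max_query <- 2           # cap on the number of ids returned

# Drop hand-labeled documents, then apply the lower bound.
candidates <- scores[!(names(scores) %in% hand_labeled)]
candidates <- candidates[candidates >= bound]

# Keep the max_query most uncertain ids.
ranked <- sort(candidates, decreasing = TRUE)
uncertain_ids <- names(head(ranked, max_query))
```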


activetext/activeR documentation built on May 31, 2024, 10:21 a.m.