textmodel_newsmap R Documentation

Semi-supervised Bayesian multinomial model for geographical document classification

Description

Train a Newsmap model to predict the geographical focus of documents, using labels given by a dictionary.

Usage

textmodel_newsmap(
  x,
  y,
  label = c("all", "max"),
  smooth = 1,
  drop_label = TRUE,
  verbose = quanteda_options("verbose"),
  entropy = c("none", "global", "local", "average"),
  ...
)

Arguments

x

a dfm or fcm created by quanteda::dfm() or quanteda::fcm().

y

a dfm or a sparse matrix that records class membership of the documents. It can be created by applying quanteda::dfm_lookup() to x.

label

if "max", uses only the labels with the maximum value in each row of y; if "all" (the default), uses all labels.

smooth

a value added to the frequency of words to smooth likelihood ratios.

drop_label

if TRUE, drops empty columns of y and ignores their labels.

verbose

if TRUE, shows progress of training.

entropy

[experimental] the scheme used to compute the entropy that regularizes likelihood ratios. The entropy of features is computed over labels if "global", or over documents with the same labels if "local". Local entropy is averaged if "average". See the Details.

...

additional arguments passed to internal functions.

Details

Newsmap learns associations between words and classes as likelihood ratios based on the features in x and the labels in y. Large likelihood ratios tend to concentrate in a small number of features, but the entropy of their frequencies over labels or documents helps to disperse the distribution.
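
The idea of smoothed likelihood ratios can be illustrated with a minimal base R sketch. This is not the package's internal code: the toy matrices, the variable names (llr, scores, predicted) and the scoring rule (summing the ratios of the features in each document) are assumptions made for illustration only, and the entropy regularization is left out.

```r
# Toy document-feature counts: 4 documents x 4 features (hypothetical data).
x <- matrix(c(2, 0, 1, 0,
              1, 0, 2, 0,
              0, 3, 1, 0,
              0, 1, 1, 1),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("doc", 1:4),
                            c("dublin", "seoul", "minister", "vote")))

# Toy label matrix: which class the dictionary matched in each document.
y <- matrix(c(1, 0,
              1, 0,
              0, 1,
              0, 1),
            nrow = 4, byrow = TRUE,
            dimnames = list(rownames(x), c("ie", "kr")))

smooth <- 1  # plays the same role as the `smooth` argument

# Smoothed log-likelihood ratio of each feature for each class:
# its relative frequency in documents carrying the label versus
# its relative frequency in all other documents.
llr <- sapply(colnames(y), function(k) {
  in_class <- y[, k] > 0
  f1 <- colSums(x[in_class, , drop = FALSE]) + smooth
  f0 <- colSums(x[!in_class, , drop = FALSE]) + smooth
  log(f1 / sum(f1)) - log(f0 / sum(f0))
})
llr <- t(llr)  # classes in rows, features in columns

# Assumed scoring rule: sum the ratios of the features each document
# contains, then pick the class with the highest score.
scores <- x %*% t(llr)
predicted <- colnames(scores)[max.col(scores, ties.method = "first")]
```

In this toy setup, features that occur only under one label (such as "dublin" or "seoul") receive the largest absolute ratios, which is the concentration effect described above.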

References

Kohei Watanabe. 2018. "Newsmap: semi-supervised approach to geographical news classification." Digital Journalism 6(3): 294-309.

Examples

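library(quanteda)
library(newsmap)  # provides textmodel_newsmap() and data_dictionary_newsmap_en

text_en <- c(text1 = "This is an article about Ireland.",
             text2 = "The South Korean prime minister was re-elected.")

# Tokenize, then look up geographical keywords in the dictionary to obtain labels
toks_en <- tokens(text_en)
label_toks_en <- tokens_lookup(toks_en, data_dictionary_newsmap_en, levels = 3)
label_dfm_en <- dfm(label_toks_en)

# Build the feature dfm, preserving the original case
feat_dfm_en <- dfm(toks_en, tolower = FALSE)

# Train the model and predict the class of each document
model_en <- textmodel_newsmap(feat_dfm_en, label_dfm_en)
predict(model_en)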


koheiw/Newsmap documentation built on April 14, 2024, 3:26 a.m.