In newsmap: Semi-Supervised Model for Geographical Document Classification

textmodel_newsmap

R Documentation

Semi-supervised Bayesian multinomial model for geographical document classification

Description

Trains a Newsmap model to predict the geographical focus of documents, using labels given by a dictionary.

Usage

textmodel_newsmap(
  x,
  y,
  label = c("all", "max"),
  smooth = 1,
  drop_label = TRUE,
  verbose = quanteda_options("verbose"),
  entropy = c("none", "global", "local", "average"),
  ...
)

Arguments

`x`	a dfm or fcm created by `quanteda::dfm()` or `quanteda::fcm()`.
`y`	a dfm or a sparse matrix that records class membership of the documents. It can be created by applying `quanteda::dfm_lookup()` to `x`.
`label`	if "max", uses only the labels with the maximum value in each row of `y`.
`smooth`	a value added to the frequency of words to smooth likelihood ratios.
`drop_label`	if `TRUE`, drops empty columns of `y` and ignores their labels.
`verbose`	if `TRUE`, shows progress of training.
`entropy`	[experimental] the scheme used to compute the entropy that regularizes likelihood ratios. The entropy of features is computed over labels if `"global"`, or over documents with the same labels if `"local"`. Local entropy is averaged if `"average"`. See Details.
`...`	additional arguments passed to internal functions.
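
The effect of `label = "max"` and `drop_label = TRUE` on `y` can be sketched in base R. This is only an illustration of the descriptions above, not the package's internal code: the matrix `y`, its values, and the handling of ties (all tied maxima are kept) are assumptions made for the example.

```r
# Toy label matrix: rows are documents, columns are labels.
# Names and counts are made up for illustration.
y <- matrix(
  c(3, 1, 0,
    2, 2, 0),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("text1", "text2"), c("ie", "gb", "kr"))
)

# label = "max" (sketch): keep only the largest value(s) in each row.
# Assumption: tied maxima are all kept.
row_max <- apply(y, 1, max)
y_max <- y * (y == row_max)

# drop_label = TRUE (sketch): remove labels that no document carries.
y_kept <- y_max[, colSums(y_max) > 0, drop = FALSE]

print(y_max)
print(y_kept)
```

In this toy case `text1` keeps only `ie`, `text2` keeps both tied labels, and the empty `kr` column is dropped.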

Details

Newsmap learns the association between words and classes as likelihood ratios, based on the features in `x` and the labels in `y`. Large likelihood ratios tend to concentrate in a small number of features, but regularizing them with the entropy of feature frequencies over labels or documents (see `entropy`) helps to disperse the distribution.
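
The idea of smoothed likelihood ratios can be sketched in base R on a toy table. This is a minimal illustration, not the package's internal code: the `counts` matrix, its values, and the exact form of the ratio (a feature's smoothed relative frequency under one label against its smoothed relative frequency under all other labels) are assumptions made for the example. The entropy here is the plain Shannon entropy of each feature's frequency over labels; how the package applies it as a regularizer is not shown.

```r
# Toy label-by-feature counts (made-up names and numbers).
counts <- matrix(
  c(4, 0, 1,
    0, 3, 1),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("ie", "kr"), c("Dublin", "Seoul", "minister"))
)
smooth <- 1  # plays the role of the `smooth` argument

# Smoothed log-likelihood ratio of each feature for each label:
# relative frequency within the label vs. within all other labels.
llr <- t(sapply(rownames(counts), function(lab) {
  inside  <- counts[lab, ] + smooth
  outside <- colSums(counts[rownames(counts) != lab, , drop = FALSE]) + smooth
  log(inside / sum(inside)) - log(outside / sum(outside))
}))

# Shannon entropy of each feature's frequency over labels
# (zero counts contribute nothing).
feature_entropy <- apply(counts, 2, function(f) {
  p <- f[f > 0] / sum(f)
  -sum(p * log(p))
})

print(round(llr, 3))
print(round(feature_entropy, 3))
```

Features tied to a single label ("Dublin", "Seoul") get large ratios and zero entropy, while a feature spread evenly over labels ("minister") gets ratios near zero and the highest entropy.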

References

Kohei Watanabe. 2018. "Newsmap: semi-supervised approach to geographical news classification." Digital Journalism 6(3): 294-309.

Examples

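library(quanteda)
library(newsmap)

# Two short documents, each about a different country
text_en <- c(text1 = "This is an article about Ireland.",
             text2 = "The South Korean prime minister was re-elected.")

# Label the documents using level 3 of the English seed dictionary
toks_en <- tokens(text_en)
label_toks_en <- tokens_lookup(toks_en, data_dictionary_newsmap_en, levels = 3)
label_dfm_en <- dfm(label_toks_en)

# Build the feature dfm without lowercasing
feat_dfm_en <- dfm(toks_en, tolower = FALSE)

# Train on the features and labels, then predict for the training documents
model_en <- textmodel_newsmap(feat_dfm_en, label_dfm_en)
predict(model_en)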

newsmap documentation built on May 29, 2024, 7:09 a.m.