In wordmap: Feature Extraction and Document Classification with Noisy Labels

textmodel_wordmap

R Documentation

A model for multinomial feature extraction and document classification

Description

Wordmap is a model for multinomial feature extraction and document classification. Its naive Bayesian algorithm allows users to train the model on a large corpus with noisy labels given by document meta-data or keyword matching.

Usage

textmodel_wordmap(
  x,
  y,
  label = c("all", "max"),
  smooth = 0.01,
  boolean = FALSE,
  drop_label = TRUE,
  entropy = c("none", "global", "local", "average"),
  residual = FALSE,
  verbose = quanteda_options("verbose"),
  ...
)

Arguments

`x`	a dfm or fcm created by `quanteda::dfm()` or `quanteda::fcm()`.
`y`	a dfm or a sparse matrix that records the class membership of the documents. It can be created by applying `quanteda::dfm_lookup()` to `x`.
`label`	if "max", uses only the labels with the maximum value in each row of `y`; if "all", uses all labels.
`smooth`	the amount of smoothing in computing coefficients. When `smooth = 0.01`, 1% of the mean frequency of words in each class is added to smooth likelihood ratios.
`boolean`	if `TRUE`, only considers the presence or absence of features in each document, to limit the impact of words repeated in a few documents.
`drop_label`	if `TRUE`, drops empty columns of `y` and ignores their labels.
`entropy`	the scheme used to compute the entropy that regularizes likelihood ratios. The entropy of features is computed over labels if "global", or over documents with the same labels if "local". Local entropy is averaged if "average". See the details.
`residual`	if `TRUE`, a residual class is added to `y`. It is named "other" but can be changed via `base::options(wordmap_residual_name)`.
`verbose`	if `TRUE`, shows progress of training.
`...`	additional arguments passed to internal functions.
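
To make the role of `smooth` concrete, the following base R sketch computes smoothed log likelihood ratios on a toy example. It is an illustration under stated assumptions, not the package's implementation: the matrices `x` and `y` are hypothetical data, and the exact normalization used by `textmodel_wordmap()` may differ.

```r
# Toy feature counts: rows are documents, columns are words (hypothetical data)
x <- matrix(c(3, 0, 1,
              2, 1, 0,
              0, 4, 1,
              0, 3, 2),
            ncol = 3, byrow = TRUE,
            dimnames = list(paste0("doc", 1:4), c("trade", "army", "people")))

# Class membership: documents 1-2 are "economy", documents 3-4 are "security"
y <- matrix(c(1, 0,
              1, 0,
              0, 1,
              0, 1),
            ncol = 2, byrow = TRUE,
            dimnames = list(rownames(x), c("economy", "security")))

smooth <- 0.01

log_ratio <- sapply(colnames(y), function(cl) {
  in_class <- y[, cl] > 0
  # Word frequencies inside and outside the class
  f_in <- colSums(x[in_class, , drop = FALSE])
  f_out <- colSums(x[!in_class, , drop = FALSE])
  # Add a fraction of the mean word frequency so that zero counts do not
  # produce infinite ratios (an assumed reading of `smooth`)
  f_in <- f_in + smooth * mean(f_in)
  f_out <- f_out + smooth * mean(f_out)
  # Log ratio of the word's relative frequency inside vs. outside the class
  log((f_in / sum(f_in)) / (f_out / sum(f_out)))
})

round(log_ratio, 2)
```

Without the smoothing step, "trade" would have a zero count outside "economy" and its ratio would be infinite; with it, every score stays finite.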

Details

Wordmap learns associations between words in `x` and classes in `y` based on likelihood ratios. Large likelihood ratios tend to concentrate in a small number of features, but the entropy of feature frequencies over labels or documents helps to disperse the distribution.
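
As a rough illustration of the entropy of feature frequencies over labels, the base R sketch below computes the Shannon entropy of each word's distribution across classes. The matrix `freq` and the helper `entropy_bits()` are hypothetical names introduced for this example; how the package turns these values into weights is not shown here and should not be inferred from this sketch.

```r
# Class-wise word frequencies: rows are classes, columns are words
# (hypothetical data)
freq <- matrix(c(5, 1, 4,
                 0, 7, 4),
               ncol = 3, byrow = TRUE,
               dimnames = list(c("economy", "security"),
                               c("trade", "army", "people")))

# Shannon entropy (in bits) of a vector of counts;
# zero counts are dropped so that 0 * log(0) is treated as 0
entropy_bits <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Entropy of each word's distribution over labels
h <- apply(freq, 2, entropy_bits)
round(h, 3)
```

Here "trade" occurs under a single label, so its entropy is 0, while "people" is split evenly between the two labels and reaches the maximum of 1 bit. Words that are spread over many labels therefore receive high entropy, and words tied to one label receive low entropy.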

A residual class is created internally by adding a new column to `y`. The column is set to 1 if the other values in the same row are all zero (i.e. `rowSums(y) == 0`), and to 0 otherwise. It is useful when users cannot create an exhaustive dictionary that covers all the categories.
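
The construction of the residual column can be sketched in base R as follows. The matrix `y` is hypothetical toy data, and the column is named "other" to match the default described for `residual`; the package's internal code may differ.

```r
# Toy class-membership matrix: 4 documents, 2 classes (hypothetical data)
y <- matrix(c(1, 0,
              0, 2,
              0, 0,
              0, 0),
            ncol = 2, byrow = TRUE,
            dimnames = list(paste0("doc", 1:4), c("economy", "security")))

# Residual column: 1 where the document matched no class, otherwise 0
other <- as.numeric(rowSums(y) == 0)

# Append the residual column to the class-membership matrix
y_res <- cbind(y, other = other)
y_res
```

Documents 3 and 4 match no dictionary class, so they are the only rows assigned to the residual class.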

Value

Returns a fitted textmodel_wordmap object with the following elements:

`model`	a matrix that records the association between classes and features.
`data`	the original input of `x`.
`feature`	the feature set in `x`.
`class`	the class labels in `y`.
`concatenator`	the concatenator in `x`.
`entropy`	the scheme to compute entropy weights.
`boolean`	the use of the Boolean transformation of `x`.
`call`	the command used to execute the function.
`version`	the version of the wordmap package.

References

Watanabe, Kohei (2018). "Newsmap: semi-supervised approach to geographical news classification". doi:10.1080/21670811.2017.1293487. Digital Journalism.

Watanabe, Kohei & Zhou, Yuan (2020). "Theory-Driven Analysis of Large Corpora: Semisupervised Topic Classification of the UN Speeches". doi:10.1177/0894439320907027. Social Science Computer Review.

Examples

library(quanteda)
library(wordmap)

# split into sentences
corp <- corpus_reshape(data_corpus_ungd2017, to = "sentences")

# tokenize and remove English stopwords
toks <- tokens(corp, remove_punct = TRUE)
toks <- tokens_remove(toks, stopwords("en"))

# apply seed dictionary
toks_dict <- tokens_lookup(toks, data_dictionary_topic)

# form dfm
dfmt_feat <- dfm(toks)
dfmt_dict <- dfm(toks_dict)

# fit wordmap model
map <- textmodel_wordmap(dfmt_feat, dfmt_dict)
coef(map)
predict(map)

wordmap documentation built on June 29, 2025, 9:06 a.m.