textmodel_wordmap: A model for multinomial feature extraction and document...

textmodel_wordmap    R Documentation

A model for multinomial feature extraction and document classification

Description

Wordmap is a model for multinomial feature extraction and document classification. Its naive Bayesian algorithm allows users to train the model on a large corpus with noisy labels given by document meta-data or keyword matching.

Usage

textmodel_wordmap(
  x,
  y,
  label = c("all", "max"),
  smooth = 1,
  boolean = FALSE,
  drop_label = TRUE,
  entropy = c("none", "global", "local", "average"),
  residual = FALSE,
  verbose = quanteda_options("verbose"),
  ...
)

Arguments

x

a dfm or fcm created by quanteda::dfm() or quanteda::fcm().

y

a dfm or a sparse matrix that records the class membership of the documents. It can be created by applying quanteda::dfm_lookup() to x.

label

if "max", uses only labels for the maximum value in each row of y.

smooth

a value added to the frequency of words to smooth likelihood ratios.

boolean

if TRUE, considers only the presence or absence of features in each document, which limits the impact of words repeated many times in a few documents.

drop_label

if TRUE, drops empty columns of y and ignores their labels.

entropy

the scheme for computing the entropy used to regularize likelihood ratios. The entropy of features is computed over labels if "global", or over documents with the same label if "local". Local entropy is averaged if "average". See the Details.

residual

if TRUE, a residual class is added to y. It is named "other" but can be changed via base::options(wordmap_residual_name).

verbose

if TRUE, shows progress of training.

...

additional arguments passed to internal functions.
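The effect of boolean and label = "max" can be illustrated on plain matrices. The sketch below uses base R only; the toy matrices x and y, and the handling of ties and empty rows, are assumptions made for illustration, and the package's internal implementation may differ.

```r
# Toy document-feature counts (hypothetical data): 3 documents, 3 features
x <- matrix(c(5, 0, 1,
              0, 2, 0,
              1, 1, 9),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("doc", 1:3), c("tax", "war", "peace")))

# boolean = TRUE: keep only presence (1) or absence (0) of each feature
x_bool <- (x > 0) * 1

# Toy label matrix (hypothetical): dictionary hits per class
y <- matrix(c(3, 1,
              0, 0,
              1, 4),
            nrow = 3, byrow = TRUE,
            dimnames = list(rownames(x), c("economy", "security")))

# label = "max": keep only the largest value in each row of y
# (assumption: rows with no hits stay all zero; ties keep every maximum)
row_max <- apply(y, 1, max)
y_max <- y * (y == row_max & row_max > 0)
```

Here doc1 keeps only its "economy" hits, doc3 keeps only its "security" hits, and doc2, which matched nothing, stays empty.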

Details

Wordmap learns associations between words in x and classes in y based on likelihood ratios. Large likelihood ratios tend to concentrate in a small number of features, but the entropy of their frequencies over labels or documents helps to disperse the distribution.
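The idea can be sketched in base R. This is not the package's exact computation: it assumes a Newsmap-style score (the log ratio of a smoothed feature probability inside a class to its probability outside the class) and a normalized Shannon entropy of each feature's frequency over labels, in the spirit of entropy = "global". The toy counts and all variable names are hypothetical, and how the package combines entropy with the ratios is not shown.

```r
# Hypothetical frequency table: rows are classes, columns are features
freq <- matrix(c(20,  1, 5,
                  2, 15, 5),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("economy", "security"),
                               c("tax", "war", "people")))
smooth <- 1  # value added to frequencies, as in the smooth argument

# Smoothed probability of each feature inside each class
inside <- freq + smooth
p_in <- inside / rowSums(inside)

# Smoothed probability of each feature outside each class
total <- matrix(colSums(freq), nrow = nrow(freq), ncol = ncol(freq),
                byrow = TRUE, dimnames = dimnames(freq))
outside <- total - freq + smooth
p_out <- outside / rowSums(outside)

# Log likelihood ratio: positive when a feature is more typical of the class
llr <- log(p_in / p_out)

# Normalized entropy of each feature's frequency over labels:
# near 0 when concentrated in one label, 1 when spread evenly
p_label <- sweep(freq, 2, colSums(freq), "/")
h <- -colSums(ifelse(p_label > 0, p_label * log(p_label), 0)) / log(nrow(freq))
```

In this toy table, "tax" scores positively for "economy" and "war" negatively, while "people", which is spread evenly over both labels, has the highest entropy.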

A residual class is created internally by adding a new column to y. The column is given 1 if the other values in the same row are all zero (i.e. rowSums(y) == 0); otherwise 0. It is useful when users cannot create an exhaustive dictionary that covers all the categories.
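The rule above can be reproduced on a plain matrix. The following base-R sketch uses a hypothetical label matrix and is only an illustration of the rowSums(y) == 0 condition, not the package's internal code.

```r
# Hypothetical label matrix: rows are documents, columns are classes
y <- matrix(c(1, 0,
              0, 0,
              0, 2),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("doc", 1:3), c("economy", "security")))

# Residual column: 1 where a document matches no class, otherwise 0
other <- as.numeric(rowSums(y) == 0)
y_res <- cbind(y, other = other)
```

Only doc2, which has no dictionary matches, is assigned to the residual class.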

Value

Returns a fitted textmodel_wordmap object with the following elements:

model

a matrix that records the association between classes and features.

data

the original input of x.

feature

the feature set in x.

class

the class labels in y.

concatenator

the concatenator in x.

entropy

the scheme to compute entropy weights.

boolean

whether the Boolean transformation of x was used.

call

the command used to execute the function.

version

the version of the wordmap package.

References

Watanabe, Kohei (2018). "Newsmap: semi-supervised approach to geographical news classification". Digital Journalism. doi:10.1080/21670811.2017.1293487.

Watanabe, Kohei & Zhou, Yuan (2020). "Theory-Driven Analysis of Large Corpora: Semisupervised Topic Classification of the UN Speeches". Social Science Computer Review. doi:10.1177/0894439320907027.

Examples

library(quanteda)
library(wordmap)

# split into sentences
corp <- corpus_reshape(data_corpus_ungd2017, to = "sentences")

# tokenize and remove stopwords
toks <- tokens(corp, remove_punct = TRUE)
toks <- tokens_remove(toks, stopwords("en"))

# apply seed dictionary
toks_dict <- tokens_lookup(toks, data_dictionary_topic)

# form dfm
dfmt_feat <- dfm(toks)
dfmt_dict <- dfm(toks_dict)

# fit wordmap model
map <- textmodel_wordmap(dfmt_feat, dfmt_dict)
coef(map)
predict(map)


wordmap documentation built on Oct. 21, 2024, 1:07 a.m.