Wordmap is a model for multinomial feature extraction and document classification. Its naive Bayesian algorithm allows users to train the model on a large corpus with noisy labels given by document metadata or keyword matching.
textmodel_wordmap(
x,
y,
label = c("all", "max"),
smooth = 1,
boolean = FALSE,
drop_label = TRUE,
verbose = quanteda_options("verbose"),
entropy = c("none", "global", "local", "average"),
...
)
x 
a dfm or fcm created by 
y 
a dfm or a sparse matrix that records class membership of the
documents. It can be created applying 
label 
if "max", uses only labels for the maximum value in each row of

smooth 
a value added to the frequency of words to smooth likelihood ratios. 
boolean 
if 
drop_label 
if 
verbose 
if 
entropy 
[experimental] the scheme to compute the entropy to
regularize likelihood ratios. The entropy of features is computed over
labels if 
... 
additional arguments passed to internal functions. 
Wordmap learns associations between words and classes as likelihood
ratios based on the features in x and the labels in y. Large likelihood
ratios tend to concentrate in a small number of features, but the entropy
of their frequencies over labels or documents helps to disperse the
distribution.
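The idea can be illustrated with a minimal base R sketch. It is not the package's internal code: the toy matrices, the variable names (freq_in, freq_out, llr, ent) and the exact formula are assumptions made for illustration only. It computes a smoothed log-likelihood ratio for each word and class, and the entropy of each word's frequency over labels.

```r
# Hypothetical toy data: document-feature counts (rows: documents, columns: words)
x <- matrix(c(2, 0, 1,
              1, 0, 0,
              0, 3, 1,
              0, 1, 2),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("d", 1:4), c("tax", "war", "people")))

# Hypothetical noisy class membership (rows: documents, columns: labels)
y <- matrix(c(1, 0,
              1, 0,
              0, 1,
              0, 1),
            nrow = 4, byrow = TRUE,
            dimnames = list(rownames(x), c("economy", "security")))

smooth <- 1  # value added to word frequencies, as in the smooth argument

# Word frequencies inside and outside each class (labels x words)
freq_in <- t(y) %*% x
freq_out <- t(1 - y) %*% x

# Smoothed word probabilities, normalized within each row (class)
p_in <- (freq_in + smooth) / rowSums(freq_in + smooth)
p_out <- (freq_out + smooth) / rowSums(freq_out + smooth)

# Log-likelihood ratio: positive values mark words associated with the class
llr <- log(p_in / p_out)

# Entropy of each word's frequency over labels (0 * log(0) treated as 0)
p_label <- sweep(freq_in, 2, colSums(freq_in), "/")
ent <- -colSums(ifelse(p_label > 0, p_label * log(p_label), 0))

print(round(llr, 2))
print(round(ent, 2))
```

In this toy example "tax" and "war" each occur under a single label, so their entropy is zero, while "people" is spread over both labels and has higher entropy. How the package combines entropy with the likelihood ratios is not shown here.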
Returns a fitted textmodel_wordmap object with the following elements:
model 
a matrix that records the association between classes and features. 
data 
the original input of 
feature 
the feature set in the model. 
concatenator 
the
concatenator in 
entropy 
the type of entropy weights used. 
boolean 
the use of the Boolean transformation of 
call 
the command used to execute the function. 
version 
the version of the wordmap package. 
Watanabe, Kohei (2018). "Newsmap: semi-supervised approach to geographical news classification". Digital Journalism. doi:10.1080/21670811.2017.1293487.
Watanabe, Kohei & Zhou, Yuan (2020). "Theory-Driven Analysis of Large Corpora: Semisupervised Topic Classification of the UN Speeches". Social Science Computer Review. doi:10.1177/0894439320907027.
require(quanteda)
require(wordmap) # provides textmodel_wordmap()
# split into sentences
corp <- corpus_reshape(data_corpus_ungd2017)
# tokenize
toks <- tokens(corp, remove_punct = TRUE) %>%
    tokens_remove(stopwords("en"))
# apply seed dictionary
toks_dict <- tokens_lookup(toks, data_dictionary_topic)
# form dfm
dfmt_feat <- dfm(toks)
dfmt_dict <- dfm(toks_dict)
# fit wordmap model
map <- textmodel_wordmap(dfmt_feat, dfmt_dict)
coef(map)
predict(map)
