textmodel_wordmap | R Documentation |
Wordmap is a model for multinomial feature extraction and document classification. Its naive Bayesian algorithm allows users to train the model on a large corpus with noisy labels given by document meta-data or keyword matching.
textmodel_wordmap(
  x,
  y,
  label = c("all", "max"),
  smooth = 1,
  boolean = FALSE,
  drop_label = TRUE,
  entropy = c("none", "global", "local", "average"),
  residual = FALSE,
  verbose = quanteda_options("verbose"),
  ...
)
x |
a dfm or fcm created by |
y |
a dfm or a sparse matrix that records the class membership of the
documents. It can be created applying |
label |
if "max", uses only labels for the maximum value in each row of
|
smooth |
a value added to the frequency of words to smooth likelihood ratios. |
boolean |
if |
drop_label |
if |
entropy |
the scheme used to compute the entropy that
regularizes likelihood ratios. The entropy of features is computed over
labels if |
residual |
if |
verbose |
if |
... |
additional arguments passed to internal functions. |
Wordmap learns associations between words in x and classes in y based on
likelihood ratios. Large likelihood ratios tend to concentrate in a small
number of features, but the entropy of their frequencies over labels or
documents helps to disperse the distribution.
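As a rough illustration of the idea, the base R sketch below computes smoothed log-likelihood ratios for a toy feature matrix and a toy label matrix. It is a simplified assumption of how such ratios can be formed, not the package's internal code: the toy data, the variable names, and the exact normalization are hypothetical, and entropy weighting is left out.

```r
# Toy counts: rows are documents, columns are features (hypothetical data)
x <- matrix(c(2, 0, 1,
              0, 3, 1,
              1, 0, 0,
              0, 2, 2),
            ncol = 3, byrow = TRUE,
            dimnames = list(paste0("doc", 1:4), c("trade", "army", "nation")))

# Toy class membership: rows are documents, columns are classes
y <- matrix(c(1, 0,
              0, 1,
              1, 0,
              0, 1),
            ncol = 2, byrow = TRUE,
            dimnames = list(rownames(x), c("economy", "security")))

smooth <- 1  # value added to every frequency, as in the smooth argument

# Feature counts in documents with and without each label (classes x features)
in_class  <- crossprod((y > 0) * 1, x)
out_class <- crossprod((y == 0) * 1, x)

# Smoothed relative frequencies inside and outside each class
p_in  <- (in_class + smooth) / rowSums(in_class + smooth)
p_out <- (out_class + smooth) / rowSums(out_class + smooth)

# Log-likelihood ratio of each feature for each class
lr <- log(p_in) - log(p_out)
round(lr, 2)
```

Positive values mark features that are relatively more frequent in documents carrying a label than in the remaining documents; negative values mark the opposite.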
A residual class is created internally by adding a new column to y. The
column is given 1 if the other values in the same row are all zero
(i.e. rowSums(y) == 0); otherwise 0. It is useful when users cannot create
an exhaustive dictionary that covers all the categories.
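The residual column described above can be sketched in base R as follows. This is only an illustration on a dense toy matrix; the data and the column name "other" are assumptions made for the example, and the package may build the column differently on sparse input.

```r
# Toy class-membership matrix: 4 documents, 2 classes (hypothetical data)
y <- matrix(c(1, 0,
              0, 2,
              0, 0,
              1, 1),
            ncol = 2, byrow = TRUE,
            dimnames = list(paste0("doc", 1:4), c("economy", "security")))

# 1 for documents that match no class at all, 0 otherwise
other <- as.numeric(rowSums(y) == 0)

# Append the flag as an extra column ("other" is an assumed name)
y_res <- cbind(y, other = other)
y_res
```

Here only doc3 has no label from the dictionary, so it is the only document assigned to the residual class.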
Returns a fitted textmodel_wordmap object with the following elements:
model |
a matrix that records the association between classes and features. |
data |
the original input of |
feature |
the feature set in |
class |
the class labels in |
concatenator |
the concatenator in |
entropy |
the scheme to compute entropy weights. |
boolean |
the use of the Boolean transformation of |
call |
the command used to execute the function. |
version |
the version of the wordmap package. |
Watanabe, Kohei (2018). "Newsmap: semi-supervised approach to geographical news classification". Digital Journalism. doi:10.1080/21670811.2017.1293487.
Watanabe, Kohei & Zhou, Yuan (2020). "Theory-Driven Analysis of Large Corpora: Semisupervised Topic Classification of the UN Speeches". Social Science Computer Review. doi:10.1177/0894439320907027.
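library(quanteda)
library(wordmap)  # provides textmodel_wordmap() and the example data

# split into sentences
corp <- corpus_reshape(data_corpus_ungd2017)

# tokenize and remove stopwords (native pipe requires R >= 4.1)
toks <- tokens(corp, remove_punct = TRUE) |>
  tokens_remove(stopwords("en"))

# apply seed dictionary
toks_dict <- tokens_lookup(toks, data_dictionary_topic)

# form dfm of features and dfm of seed-dictionary labels
dfmt_feat <- dfm(toks)
dfmt_dict <- dfm(toks_dict)

# fit wordmap model
map <- textmodel_wordmap(dfmt_feat, dfmt_dict)
coef(map)     # features most strongly associated with each class
predict(map)  # predicted class for each document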