dfm_lookup | R Documentation |
Apply a dictionary to a dfm by looking up all dfm features for matches in a
set of dictionary values, and replace those features with a count of
the dictionary's keys. If exclusive = FALSE, the behaviour is to
apply a "thesaurus", where each value match is replaced by the dictionary
key, converted to capitals if capkeys = TRUE (so that the replacements
are easily distinguished from features that were terms found originally in
the document).
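The difference between the default exclusive lookup and the "thesaurus" behaviour can be seen in a small sketch. The dictionary and sentence below are illustrative only, and the object names (dict_small, dfmat_small, dfmat_excl, dfmat_thes) are hypothetical.

```r
library("quanteda")

# A small illustrative dictionary (hypothetical, not part of the package)
dict_small <- dictionary(list(christmas = c("Christmas", "Santa", "holiday"),
                              tax = "tax*"))
dfmat_small <- dfm(tokens("My Christmas was ruined by your opposition tax plan."))

# exclusive = TRUE (the default): only the dictionary keys remain as features
dfmat_excl <- dfm_lookup(dfmat_small, dict_small)
featnames(dfmat_excl)

# exclusive = FALSE: matched features are replaced by the key, which is
# capitalised because capkeys defaults to !exclusive; other features are kept
dfmat_thes <- dfm_lookup(dfmat_small, dict_small, exclusive = FALSE)
featnames(dfmat_thes)
```

In the second call, the upper-case keys make it easy to tell which features came from the dictionary and which were original terms.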
dfm_lookup(
x,
dictionary,
levels = 1:5,
exclusive = TRUE,
valuetype = c("glob", "regex", "fixed"),
case_insensitive = TRUE,
capkeys = !exclusive,
nomatch = NULL,
verbose = quanteda_options("verbose")
)
x |
the dfm to which the dictionary will be applied |
dictionary |
a dictionary-class object |
levels |
levels of entries in a hierarchical dictionary that will be applied |
exclusive |
if |
valuetype |
the type of pattern matching: |
case_insensitive |
logical; if |
capkeys |
if |
nomatch |
an optional character naming a new feature that will contain
the counts of features of |
verbose |
print status messages if |
If using dfm_lookup with dictionaries containing multi-word
values, matches will only occur if the features themselves are multi-word
or formed from n-grams. A better way to match dictionary values that include
multi-word patterns is to apply tokens_lookup() to the tokens,
and then construct the dfm.
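As a minimal sketch of this advice, the following compares the two routes for a dictionary with a multi-word value. The dictionary and sentence are illustrative only, and the object names (dict_mw, toks_mw, dfmat_from_dfm, dfmat_from_toks) are hypothetical.

```r
library("quanteda")

# Illustrative dictionary with one multi-word value ("United States")
dict_mw <- dictionary(list(country = c("United States", "Sweden")))
toks_mw <- tokens("Does the United States or Sweden have more progressive taxation?")

# Route 1: look up on the dfm. The features are single tokens ("united",
# "states"), so the multi-word value has nothing to match against.
dfmat_from_dfm <- dfm_lookup(dfm(toks_mw), dict_mw)

# Route 2: look up on the tokens, where the two-token sequence can be
# matched, and only then construct the dfm.
dfmat_from_toks <- dfm(tokens_lookup(toks_mw, dict_mw))

dfmat_from_dfm
dfmat_from_toks
```

Comparing the "country" counts in the two results shows whether the multi-word value was picked up.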
See also: dfm_replace
dict <- dictionary(list(christmas = c("Christmas", "Santa", "holiday"),
opposition = c("Opposition", "reject", "notincorpus"),
taxglob = "tax*",
taxregex = "tax.+$",
country = c("United_States", "Sweden")))
dfmat <- dfm(tokens(c("My Christmas was ruined by your opposition tax plan.",
"Does the United_States or Sweden have more progressive taxation?")))
dfmat
# glob format
dfm_lookup(dfmat, dict, valuetype = "glob")
dfm_lookup(dfmat, dict, valuetype = "glob", case_insensitive = FALSE)
# regex v. glob format: note that "united_states" is a regex match for "tax*"
dfm_lookup(dfmat, dict, valuetype = "glob")
dfm_lookup(dfmat, dict, valuetype = "regex", case_insensitive = TRUE)
# fixed format: no pattern matching
dfm_lookup(dfmat, dict, valuetype = "fixed")
dfm_lookup(dfmat, dict, valuetype = "fixed", case_insensitive = FALSE)
# show unmatched tokens
dfm_lookup(dfmat, dict, nomatch = "_UNMATCHED")