View source: R/tokens_lookup.R
tokens_lookup | R Documentation |
Convert tokens into equivalence classes defined by values of a dictionary object.
tokens_lookup(
x,
dictionary,
levels = 1:5,
valuetype = c("glob", "regex", "fixed"),
case_insensitive = TRUE,
capkeys = !exclusive,
exclusive = TRUE,
nomatch = NULL,
append_key = FALSE,
separator = "/",
concatenator = concat(x),
nested_scope = c("key", "dictionary"),
apply_if = NULL,
verbose = quanteda_options("verbose")
)
x |
the tokens object to which the dictionary will be applied |
dictionary |
the dictionary-class object that will be applied to x |
levels |
integers specifying the levels of entries in a hierarchical
dictionary that will be applied. The top level is 1, and subsequent levels
describe lower nesting levels. Values may be combined, even if these
levels are not contiguous, e.g. |
valuetype |
the type of pattern matching: |
case_insensitive |
logical; if |
capkeys |
if |
exclusive |
if |
nomatch |
an optional character naming a new key for tokens that are not
matched to any dictionary value. If |
append_key |
if |
separator |
a character to separate tokens and keys when |
concatenator |
the concatenation character that will connect the words making up the multi-word sequences. |
nested_scope |
how to treat matches from different dictionary keys that
are nested. When one value is nested within another, such as "a b" being
nested within "a b c", then |
apply_if |
logical vector of length |
verbose |
print status messages if |
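As a rough illustration of how exclusive, capkeys and nomatch interact, the sketch below applies one small dictionary three ways. The sentence and the dictionary are made up for this illustration and are not part of the package examples.

```r
library("quanteda")

# Hypothetical one-sentence input and a two-key dictionary
toks_demo <- tokens("The law protects liberty and freedom.", remove_punct = TRUE)
dict_demo <- dictionary(list(law = "law*", freedom = c("free*", "libert*")))

# Default: exclusive = TRUE keeps only the matched keys;
# capkeys = !exclusive, so the keys are not upper-cased here
res_excl <- tokens_lookup(toks_demo, dict_demo)

# exclusive = FALSE leaves unmatched tokens in place;
# capkeys then defaults to TRUE, so keys appear in upper case
res_incl <- tokens_lookup(toks_demo, dict_demo, exclusive = FALSE)

# nomatch replaces every unmatched token with the supplied key
res_nomatch <- tokens_lookup(toks_demo, dict_demo, nomatch = "NONE")

print(res_excl)
print(res_incl)
print(res_nomatch)
```

Comparing the three printed objects shows which tokens are dropped, kept, or relabelled under each setting.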
Dictionary values may consist of sequences, and there are different methods of counting key matches based on values that are nested or that overlap.
When two different keys in a dictionary are nested matches of one another,
the nested_scope options provide the choice of matching each key's values
independently (the "key" option), or just counting the longest match (the
"dictionary" option). Values that are nested within the same key are always
counted as a single match. See the last example below, comparing "New York"
and "New York Times", for these two different behaviours.
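The same-key case described above can be sketched as follows. The key name place and the object names are hypothetical; the point is only that "New York" nested inside "New York Times" under one key should count once, not twice.

```r
library("quanteda")

# Both values sit under the same (hypothetical) key
dict_same <- dictionary(list(place = c("New York Times", "New York")))
toks_ny <- tokens("The New York Times is a New York paper.")

# "New York Times" contains "New York", but because both values belong
# to one key, the nested pair is counted as a single match
res_same <- tokens_lookup(toks_ny, dict_same)

print(res_same)
print(ntoken(res_same))
```

The token count reflects one match for the newspaper name and one for the later stand-alone city name.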
Overlapping values, such as "a b" and "b a", are currently always considered
as separate matches if they are in different keys, or as one match if the
overlap is within the same key.
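A minimal sketch of the overlap rule, using the "a b" and "b a" values from the text on the three-token input "a b a". The key names key1 and key2 are placeholders chosen for this illustration.

```r
library("quanteda")

toks_ab <- tokens("a b a")

# Overlapping values in different keys: each key is matched separately
dict_diff <- dictionary(list(key1 = "a b", key2 = "b a"))
res_diff <- tokens_lookup(toks_ab, dict_diff)

# Overlapping values in the same key: treated as one match
dict_one <- dictionary(list(key1 = c("a b", "b a")))
res_one <- tokens_lookup(toks_ab, dict_one)

print(res_diff)
print(res_one)
```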
Note on apply_if: this applies the dictionary lookup only to documents that
match the logical condition. When exclusive = TRUE (the default), however,
this means that empty documents will be returned for those not meeting the
condition, since no lookup will be applied and hence no tokens replaced by
matching keys.
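The note above can be illustrated with a two-document sketch in which the lookup is applied to the first document only. The texts and dictionary are invented for this example, and it assumes an installed quanteda version whose tokens_lookup() has the apply_if argument shown in the usage.

```r
library("quanteda")

# Two hypothetical documents; only d1 meets the condition
toks_cond <- tokens(c(d1 = "free speech and liberty",
                      d2 = "freedom of the press"))
dict_cond <- dictionary(list(freedom = c("free*", "libert*")))
cond <- c(TRUE, FALSE)

# exclusive = TRUE (default): per the note, d2 is expected to come back empty
res_cond <- tokens_lookup(toks_cond, dict_cond, apply_if = cond)

# exclusive = FALSE: unmatched tokens are kept, so skipped documents
# are not emptied by the exclusive filtering
res_cond_incl <- tokens_lookup(toks_cond, dict_cond, apply_if = cond,
                               exclusive = FALSE)

print(ntoken(res_cond))
print(res_cond_incl)
```

If the untouched documents need to keep their tokens, setting exclusive = FALSE is one way to avoid the empty results described in the note.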
See also: tokens_replace
library("quanteda")

toks1 <- tokens(data_corpus_inaugural)
dict1 <- dictionary(list(country = "united states",
                         law = c("law*", "constitution"),
                         freedom = c("free*", "libert*")))
dfm(tokens_lookup(toks1, dict1, valuetype = "glob", verbose = TRUE))
dfm(tokens_lookup(toks1, dict1, valuetype = "glob", verbose = TRUE, nomatch = "NONE"))
dict2 <- dictionary(list(country = "united states",
                         law = c("law", "constitution"),
                         freedom = c("freedom", "liberty")))
# dfm(applyDictionary(toks1, dict2, valuetype = "fixed"))
dfm(tokens_lookup(toks1, dict2, valuetype = "fixed"))
# hierarchical dictionary example
txt <- c(d1 = "The United States has the Atlantic Ocean and the Pacific Ocean.",
         d2 = "Britain and Ireland have the Irish Sea and the English Channel.")
toks2 <- tokens(txt)
dict3 <- dictionary(list(US = list(Countries = c("States"),
                                   oceans = c("Atlantic", "Pacific")),
                         Europe = list(Countries = c("Britain", "Ireland"),
                                       oceans = list(west = "Irish Sea",
                                                     east = "English Channel"))))
tokens_lookup(toks2, dict3, levels = 1)
tokens_lookup(toks2, dict3, levels = 2)
tokens_lookup(toks2, dict3, levels = 1:2)
tokens_lookup(toks2, dict3, levels = 3)
tokens_lookup(toks2, dict3, levels = c(1,3))
tokens_lookup(toks2, dict3, levels = c(2,3))
# show unmatched tokens
tokens_lookup(toks2, dict3, nomatch = "_UNMATCHED")
# nested matching differences
dict4 <- dictionary(list(paper = "New York Times", city = "New York"))
toks4 <- tokens("The New York Times is a New York paper.")
tokens_lookup(toks4, dict4, nested_scope = "key", exclusive = FALSE)
tokens_lookup(toks4, dict4, nested_scope = "dictionary", exclusive = FALSE)