| dictionary | R Documentation |
Create a quanteda dictionary object for pattern matching on tokens, dfm, and fcm objects.
dictionary(
  x,
  file = NULL,
  format = NULL,
  separator = " ",
  tolower = TRUE,
  tokenize = FALSE,
  levels = 1:100,
  encoding = "utf-8"
)
x |
a named list of valuetype patterns or an existing dictionary object. See examples. This argument should be omitted if file is supplied. |
file |
file identifier for a foreign dictionary. |
format |
character identifier for the format of the foreign dictionary. If not supplied, the format is guessed from the dictionary file's extension. Available options are: "wordstat", "LIWC", "yoshikoder", "lexicoder", "YAML". |
separator |
the character in between multi-word dictionary values. This defaults to " ". |
tolower |
if TRUE, convert all dictionary values to lowercase. |
tokenize |
logical; defaults to FALSE. |
levels |
integers specifying the levels of entries in x to select. |
encoding |
additional optional encoding value for reading in imported dictionaries. This uses the iconv labels for encoding. See the "Encoding" section of the help for file. |
A dictionary object can include multi-word expressions whose words are delimited by separator. When the dictionary is applied to a tokens object, these expressions match both sequences of separate tokens and compounded tokens.
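The matching behaviour described above can be sketched as follows. This is a minimal illustration, assuming quanteda is installed; the sentence and the single-key dictionary are hypothetical and not part of the original examples.

```r
library(quanteda)

# Hypothetical toy text and dictionary, for illustration only
toks <- tokens("Public health and health care matter.")
dict <- dictionary(list(health = c("health care", "public health")))

# Separate tokens: each multi-word value is matched as a token sequence
toks_sep <- tokens_lookup(toks, dictionary = dict)

# Compounded tokens: join the phrases first, then apply the same dictionary
toks_comp <- tokens_compound(toks, pattern = dict)
toks_comp_lookup <- tokens_lookup(toks_comp, dictionary = dict)

as.list(toks_sep)
as.list(toks_comp_lookup)
```

Both lookups use the same dictionary object, so no second dictionary is needed for text that has already been compounded.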
Dictionary objects can be subset using [ and [[, which operate in the same way as the equivalent list operators. If dictionary() is applied to an existing dictionary object, it is possible to select levels.
Dictionary objects can be coerced from and to lists using as.dictionary()
and as.list(), and checked using is.dictionary().
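A short sketch of the round trip between dictionaries and lists, assuming quanteda is installed. The two-key dictionary is hypothetical. The list-to-dictionary step uses dictionary() here, since the description of x states that it accepts a named list.

```r
library(quanteda)

# Hypothetical two-key dictionary, for illustration only
dict <- dictionary(list(tax = c("tax", "taxes"), economy = "econom*"))

is.dictionary(dict)        # check that the object is a dictionary

lis <- as.list(dict)       # coerce to a plain named list
is.dictionary(lis)         # a plain list is not a dictionary
names(lis)

dict2 <- dictionary(lis)   # rebuild a dictionary from the named list
is.dictionary(dict2)
```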
Currently supported input file formats are the WordStat, LIWC, Lexicoder v2 and v3, and Yoshikoder formats. The import using the LIWC format works with all currently available dictionary files supplied as part of the LIWC 2001, 2007, and 2015 software (see References).
A dictionary class object, essentially a specially classed named list of characters.
WordStat dictionaries page, from Provalis Research https://provalisresearch.com/products/content-analysis-software/wordstat-dictionary/.
Pennebaker, J.W., Chung, C.K., Ireland, M., Gonzales, A., & Booth, R.J. (2007). The development and psychometric properties of LIWC2007. [Software manual]. Austin, TX (https://www.liwc.app/).
Yoshikoder page, from Will Lowe https://conjugateprior.org/software/yoshikoder/.
Lexicoder format, https://www.snsoroka.com/data-lexicoder/
as.dictionary(),
as.list(), is.dictionary()
corp <- corpus_subset(data_corpus_inaugural, Year > 2000)
toks <- tokens(corp)
dict <- dictionary(list(
  tax = c("tax", "taxes", "taxing"),            # fixed patterns
  economy = list("econom*",                     # glob patterns
                 job = c("work*", "job*")),     # nested keys
  health = c("health care", "public health")    # multi-word expressions
))

# compound tokens
tokens_compound(toks, pattern = dict) |>
  dfm() |>
  dfm_select(dict)

# look up dictionary keys
tokens_lookup(toks, dictionary = dict, levels = 1) |>
  dfm()

# subset a dictionary
dict[1:2]
dict[c("economy")]

# update a dictionary
dictionary(dict, levels = 2)

## Not run: 
dfmat <- dfm(tokens(data_corpus_inaugural))

# import the Laver-Garry dictionary from Provalis Research
download.file("https://provalisresearch.com/Download/LaverGarry.zip",
              tf <- tempfile(), mode = "wb")
unzip(tf, exdir = (td <- tempdir()))
dict_lg <- dictionary(file = file.path(td, "LaverGarry.cat"))
dfm_lookup(dfmat, dict_lg)

# import a LIWC formatted dictionary from http://www.moralfoundations.org
download.file("http://bit.ly/37cV95h", tf <- tempfile())
dict_liwc <- dictionary(file = tf, format = "LIWC")
dfm_lookup(dfmat, dict_liwc)

## End(Not run)