tCorpus-cash-code_dictionary: Dictionary lookup
In corpustools: Managing, Querying and Analyzing Tokenized Text

tCorpus$code_dictionary

R Documentation

Dictionary lookup

Description

Add a column to the token data that contains a code (the query label) for each token that matches the dictionary.

Usage

## R6 method for class tCorpus. Use as tc$method (where tc is a tCorpus object).

code_dictionary(...)

Arguments

`dict`	A dictionary. Can be either a data.frame or a quanteda dictionary. If a data.frame is given, it must have a column named "string" that contains the dictionary terms (or use the string_col argument to name a different column). All other columns are added to the tCorpus $tokens data. Each row has a single string, which can be a single word or a sequence of words separated by whitespace (e.g., "not bad"), and can contain the common ? and * wildcards. If a quanteda dictionary is given, it is automatically converted to this type of data.frame with the `melt_quanteda_dict` function. The conversion can also be done manually for more control over labels.
`token_col`	The feature in tc that contains the token text.
`string_col`	If dict is a data.frame, the name of the column in dict that contains the dictionary lookup string.
`sep`	A regular expression for separating multi-word lookup strings (default is " ", which is what quanteda dictionaries use). For example, if the dictionary contains "Barack Obama", sep should be " " so that it matches the consecutive tokens "Barack" and "Obama". Some dictionaries, however, write this as "Barack+Obama", in which case sep = '\\+' should be used.
`case_sensitive`	logical, should lookup be case sensitive?
`column`	The name of the column added to $tokens. [column]_id contains the unique id of the match. If a quanteda dictionary is given, the label for the match is in the column named [column]. If a dictionary has multiple levels, these are added as [column]_l[level].
`use_wildcards`	Use the wildcards * (any number including none of any character) and ? (one or none of any character). If FALSE, exact string matching is used. (":-)" versus ":" "-" ")"). This is only behind the scenes for the dictionary lookup, and will not affect tokenization in the corpus.
`ascii`	If TRUE, convert text to ASCII before matching.
`verbose`	If TRUE, report progress.
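
The sep and use_wildcards descriptions above can be illustrated with a small base R sketch. It does not call corpustools and is not the package's implementation. The helper names split_lookup_string and wildcard_to_regex are hypothetical, introduced here only to show how a lookup string is split into terms by a sep regular expression, and how the * and ? wildcards, as described above, correspond to regular expressions.

```r
# Illustrative sketch only: NOT how corpustools implements the lookup.
# Both helper names are hypothetical.

# Split a multi-word lookup string into its terms using the `sep` regex.
split_lookup_string <- function(string, sep = " ") {
  strsplit(string, split = sep)[[1]]
}

# Translate a wildcard term into an anchored regular expression:
#   *  -> any number (including none) of any character
#   ?  -> one or none of any character
wildcard_to_regex <- function(term) {
  # Escape regex metacharacters other than * and ?
  term <- gsub("([.+^$|(){}\\[\\]\\\\])", "\\\\\\1", term, perl = TRUE)
  term <- gsub("*", ".*", term, fixed = TRUE)
  term <- gsub("?", ".?", term, fixed = TRUE)
  paste0("^", term, "$")
}

split_lookup_string("Barack Obama")               # "Barack" "Obama"
split_lookup_string("Barack+Obama", sep = "\\+")  # "Barack" "Obama"

grepl(wildcard_to_regex("ugl*"), c("ugly", "ugliest", "ug"))      # TRUE TRUE FALSE
grepl(wildcard_to_regex("colo?r"), c("color", "colour", "colr"))  # TRUE TRUE FALSE
grepl(wildcard_to_regex(":)"), ":)")                              # TRUE
```

Note that sep is itself a regular expression, which is why the literal plus sign has to be escaped as '\\+'.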

Value

The tCorpus object, with the dictionary match column(s) added to $tokens.

Examples

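library(corpustools)

# Dictionary as a data.frame: lookup terms go in "string";
# other columns (here "sentiment") are added to the token data.
dict = data.frame(string = c('good','bad','ugl*','nice','not pret*', ':)', ':('),
                  sentiment = c(1,-1,-1,1,-1,1,-1))
tc = create_tcorpus(c('The good, the bad and the ugly, is nice :) but not pretty :('))
tc$code_dictionary(dict)
tc$tokens   # inspect the token data with the added columns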

corpustools documentation built on Aug. 8, 2025, 6:08 p.m.