tCorpus$replace_dictionary: Replace tokens with dictionary match

tCorpus$replace_dictionary        R Documentation

Replace tokens with dictionary match

Description

Uses search_dictionary, and replaces tokens that match a dictionary lookup term with the dictionary code. Multi-token matches (e.g., "Barack Obama") will become single tokens. Multiple lookup terms per code can be used to deal with alternatives such as "Barack Obama", "president Obama" and "Obama".

This method can also be used to concatenate ASCII symbols into emoticons, given a dictionary of emoticons.
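
For example, a minimal sketch (using the data.frame format described under the dict argument below) of a dictionary where several lookup terms map to one code:

obama_dict = data.frame(
  string = c('Barack Obama', 'president Obama', 'Obama'),
  code   = c('Obama',        'Obama',           'Obama'))

tc = create_tcorpus('Barack Obama, better known as president Obama')
tc$replace_dictionary(obama_dict)
tc$tokens   ## both multi-token matches are now single "Obama" tokens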

Usage:

## R6 method for class tCorpus. Use as tc$method (where tc is a tCorpus object).

replace_dictionary(...)

Arguments

dict

A dictionary. Can be either a data.frame or a quanteda dictionary. If a data.frame is given, it has to have a column named "string" (or use the string_col argument) that contains the dictionary terms, and a column named "code" (or use the code_col argument) that contains the label/code represented by this string. Each row has a single string, which can be a single word or a sequence of words separated by whitespace (e.g., "not bad"), and can contain the common ? and * wildcards. If a quanteda dictionary is given, it is automatically converted to this type of data.frame with the melt_quanteda_dict function. This can be done manually for more control over labels (a brief sketch follows the Arguments section). Finally, you can also just pass a character vector. All multi-word strings (like emoticons) will then be collapsed into single tokens.

token_col

The feature in tc that contains the token text.

string_col

If dict is a data.frame, the name of the column in dict with the dictionary lookup string. Default is "string"

code_col

The name of the column in dict with the dictionary code/label. Default is "code". If dict is a quanteda dictionary with multiple levels, "code_l2", "code_l3", etc. can be used to select levels.

replace_cols

The names of the columns in tc$tokens that will be replaced by the dictionary code. Default is the column on which the dictionary is applied, but in some cases it might make sense to replace multiple columns (like token and lemma)

sep

A regular expression for separating multi-word lookup strings (default is " ", which is what quanteda dictionaries use). For example, if the dictionary contains "Barack Obama", sep should be " " so that it matches the consecutive tokens "Barack" and "Obama". Some dictionaries, however, write this as "Barack+Obama", in which case sep = '\\+' should be used (a brief sketch follows the Arguments section).

code_from_features

If TRUE, instead of replacing features with the matched code column, use the most frequently occurring string among the matched features (see the sketch at the end of the Examples).

code_sep

If code_from_features is TRUE, the separator for pasting features together. Default is an underscore, which is recommended because it is treated specially in corpustools. Most importantly, if a query or dictionary search is performed, multi-word tokens concatenated with an underscore are treated as separate consecutive words. So, "Bob_Smith" would still match a lookup for the two consecutive words "bob smith".

decrement_ids

If TRUE (default), decrement token ids after concatenating multi-token matches. So, if the tokens c(":", ")", "yay") have token_id c(1,2,3), then after concatenating ASCII emoticons, the tokens will be c(":)", "yay") with token_id c(1,2)

case_sensitive

logical, should lookup be case sensitive?

use_wildcards

Use the wildcards * (any number of characters, including none) and ? (zero or one character). If FALSE, exact string matching is used.

ascii

If TRUE, convert the text to ASCII before matching.

verbose

If TRUE, report progress.
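
Two brief sketches of the options above; the quanteda dictionary contents here are made up for illustration, and the expected results are assumptions based on the descriptions of dict and sep:

## Converting a quanteda dictionary to the data.frame format expected by dict
qd = quanteda::dictionary(list(happy_emo = c(':)', '8-)'), sad_emo = ':('))
melt_quanteda_dict(qd)   ## a data.frame with "string" and "code" columns

## Using sep when multi-word lookup strings join words with "+" instead of a space
plus_dict = data.frame(string = 'Barack+Obama', code = 'Obama')
tc = create_tcorpus('Barack Obama was president')
tc$replace_dictionary(plus_dict, sep = '\\+')   ## '\\+' escapes the regex special character "+"
tc$tokens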

Value

A vector with the id value (taken from dict$id) for each row in tc$tokens

Examples

tc = create_tcorpus('happy :) sad :( happy 8-)')
tc$tokens   ## tokenization has broken up emoticons (as it should)

# corpustools dictionary lookup automatically normalizes tokenization of 
# tokens and dictionary strings. The dictionary string ":)" would match both
# the single token ":)" and two consecutive tokens c(":", ")"). This 
# makes it easy and foolproof to look for emoticons like this:
emoticon_dict = data.frame(
   code   = c('happy_emo','happy_emo', 'sad_emo'), 
   string = c(':)',             '8-)',      ':(')) 
   
tc$replace_dictionary(emoticon_dict)
tc$tokens

# If a character vector is passed to replace_dictionary, it will collapse
# multi-word strings into single tokens.
tc = create_tcorpus('happy :) sad :( Barack Obama')
tc$tokens
tc$replace_dictionary(c(':)', '8-)', 'Barack Obama'))
tc$tokens
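
# A hedged sketch of code_from_features and code_sep: instead of replacing
# matched tokens with the "code" value, the matched tokens themselves are
# pasted together with code_sep. The exact output is an assumption based on
# the argument descriptions above.
obama_dict = data.frame(string = c('Barack Obama', 'president Obama'),
                        code   = c('Obama',        'Obama'))
tc = create_tcorpus('Barack Obama met president Obama')
tc$replace_dictionary(obama_dict, code_from_features = TRUE)
tc$tokens   # multi-token matches become tokens like "Barack_Obama"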

