tokens_replace: Replace tokens in a tokens object

Description Usage Arguments See Also Examples

Description

Substitute token types based on vectorized one-to-one matching. Since this function is created for lemmatization or user-defined stemming. It support substitution of multi-word features by multi-word features, but substitution is fastest when pattern and replacement are character vectors and valuetype = "fixed" as the function only substitute types of tokens. Please use tokens_lookup with exclusive = FALSE to replace dictionary values.

Usage

1
2
tokens_replace(x, pattern, replacement, valuetype = "glob",
  case_insensitive = TRUE, verbose = quanteda_options("verbose"))

Arguments

x

tokens object whose token elements will be replaced

pattern

a character vector or list of character vectors. See pattern for more details.

replacement

a character vector or (if pattern is a list) list of character vectors of the same length as pattern

valuetype

the type of pattern matching: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. See valuetype for details.

case_insensitive

ignore case when matching, if TRUE

verbose

print status messages if TRUE

See Also

tokens_lookup

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
toks1 <- tokens(data_corpus_irishbudget2010, remove_punct = TRUE)

# lemmatization
infle <- c("foci", "focus", "focused", "focuses", "focusing", "focussed", "focusses")
lemma <- rep("focus", length(infle))
toks2 <- tokens_replace(toks1, infle, lemma, valuetype = "fixed")
kwic(toks2, "focus*")

# stemming
type <- types(toks1)
stem <- char_wordstem(type, "porter")
toks3 <- tokens_replace(toks1, type, stem, valuetype = "fixed", case_insensitive = FALSE)
identical(toks3, tokens_wordstem(toks1, "porter"))

# multi-multi substitution
toks4 <- tokens_replace(toks1, phrase(c("Minister Deputy Lenihan")), 
                              phrase(c("Minister Deputy Conor Lenihan")))
kwic(toks4, phrase(c("Minister Deputy Conor Lenihan")))

quanteda/quanteda documentation built on June 15, 2019, 8:36 a.m.