tokens_recompile: recompile a serialized tokens object

View source: R/tokens.R

Description

Recompiles a serialized tokens object after its vocabulary has changed in a way that either makes some types identical (for example, lowercasing when a lowercased version of the type already exists in the type table) or introduces gaps in the integer map of the types. Recompiling re-indexes the types attribute so that types that have become duplicates, through a procedure such as stemming or lowercasing, share a single ID, and so that new tokens added through compounding are accounted for.
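To make the two problems concrete, the following base R sketch mimics a serialized tokens object as a list of integer ID vectors plus a shared character vector of types. This layout and the variable names (`ids`, `types`, `ids_subset`) are assumptions made for illustration only; they are not quanteda's internal class, and the sketch does not call quanteda at all.

```r
# Toy stand-in for a serialized tokens object (assumed layout, not quanteda's
# actual class): integer IDs per document plus a shared table of types.
ids   <- list(one = 1:8, two = c(5L, 6L, 7L, 4L))
types <- c("a", "b", "c", "d", "A", "B", "C", "D")

# Lowercasing only the type table leaves the integer IDs untouched.
types_lower <- tolower(types)

# Duplicates: several distinct IDs now decode to the same string.
has_dups <- anyDuplicated(types_lower) > 0

# Gaps: once some tokens are gone (here, a whole document), part of the
# type table is no longer referenced by any ID.
ids_subset <- ids["two"]
unused <- setdiff(seq_along(types), unlist(ids_subset))
```

Here `has_dups` flags the redundant type table and `unused` lists the type IDs that no remaining token points to; these are the two conditions that recompiling is meant to repair.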

Usage

tokens_recompile(x, method = c("C++", "R"), gap = TRUE, dup = TRUE)

Arguments

x

the tokens object to be recompiled

method

"C++" for C++ implementation or "R" for an older R-based method

gap

if TRUE, remove gaps between token IDs

dup

if TRUE, merge duplicated token types into the same ID
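The effect of `gap` and `dup` can be approximated in base R. The function below, `recompile_sketch()`, is a hypothetical helper written for this illustration and is not the implementation behind either `method`; it assumes the toy layout of a list of integer ID vectors plus a character vector of types, and it assumes there are no padding IDs (0) in the input.

```r
# Hypothetical helper, not quanteda's implementation: a base R sketch of what
# the `gap` and `dup` arguments could do. Assumes no padding IDs (0).
recompile_sketch <- function(ids, types, gap = TRUE, dup = TRUE) {
  if (gap) {
    # Keep only the types that some token refers to, then renumber the
    # IDs so they run 1, 2, ..., length(used) without holes.
    used  <- sort(unique(unlist(ids, use.names = FALSE)))
    ids   <- lapply(ids, match, table = used)
    types <- types[used]
  }
  if (dup) {
    # Point every ID at the first occurrence of its type string, so that
    # identical types share a single ID.
    uniq  <- unique(types)
    map   <- match(types, uniq)
    ids   <- lapply(ids, function(v) map[v])
    types <- uniq
  }
  list(ids = ids, types = types)
}

# Duplicates after lowercasing the type table
out_dup <- recompile_sketch(
  ids   = list(one = 1:8, two = c(5L, 6L, 7L, 4L)),
  types = c("a", "b", "c", "d", "a", "b", "c", "d")
)

# Gaps after tokens were dropped: types 2 and 3 are no longer referenced
out_gap <- recompile_sketch(
  ids   = list(doc = c(1L, 4L)),
  types = c("one", "two", "three", "four")
)
```

In the first call the eight IDs collapse onto four types; in the second, the two remaining tokens are renumbered 1 and 2 and the type table shrinks to `"one"` and `"four"`. Setting `gap = FALSE` or `dup = FALSE` in the sketch skips the corresponding step.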

Author(s)

Kenneth Benoit and Kohei Watanabe

Examples

library(quanteda)

# lowercasing
toks1 <- tokens(c(one = "a b c d A B C D",
                 two = "A B C d"))
attr(toks1, "types") <- char_tolower(attr(toks1, "types"))
unclass(toks1)
unclass(quanteda:::tokens_recompile(toks1))

# stemming
toks2 <- tokens("Stemming stemmed many word stems.")
unclass(toks2)
unclass(quanteda:::tokens_recompile(tokens_wordstem(toks2)))

# compounding
toks3 <- tokens("One two three four.")
unclass(toks3)
unclass(tokens_compound(toks3, "two three"))

# lookup
dict <- dictionary(list(test = c("one", "three")))
unclass(tokens_lookup(toks3, dict))

# empty pads
unclass(tokens_select(toks3, dict))
unclass(tokens_select(toks3, dict, pad = TRUE))

# ngrams
unclass(tokens_ngrams(toks3, n = 2:3))

quanteda documentation built on Nov. 20, 2018, 1:04 a.m.