tokens_recompile | R Documentation |
This function recompiles a serialized tokens object after its vocabulary has changed in a way that either makes some types identical (for example, lowercasing when the lowercased form already exists in the type table) or introduces gaps in the integer map of the types. It re-indexes the types attribute to account for types that have become duplicates through a procedure such as stemming or lowercasing, and for new tokens added through compounding.
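The re-indexing described above can be sketched in base R. This is only an illustration of the idea, not quanteda's implementation: `recompile_sketch()` is a hypothetical helper, it represents a tokens object as a plain list of integer vectors plus a character vector of types, and it assumes there are no padding (zero) indices.

```r
# Hypothetical helper, not part of quanteda: a base-R sketch of re-indexing.
# Assumes every index is >= 1 (no padding entries).
recompile_sketch <- function(ids, types) {
  # ids:   list of integer vectors indexing into `types`
  # types: character vector that may contain duplicates or unused entries
  used <- sort(unique(unlist(ids)))   # indices that some document refers to
  new_types <- unique(types[used])    # drop unused types, collapse duplicates
  map <- match(types, new_types)      # old index -> new index (NA if unused)
  list(ids = lapply(ids, function(i) map[i]), types = new_types)
}

# Duplicates: lowercasing makes "A".."D" identical to "a".."d"
ids <- list(one = 1:8, two = c(5L, 6L, 7L, 4L))
types <- tolower(c("a", "b", "c", "d", "A", "B", "C", "D"))
res <- recompile_sketch(ids, types)
res$types  # "a" "b" "c" "d"
res$ids    # one: 1 2 3 4 1 2 3 4; two: 1 2 3 4

# Gaps: type 2 ("y") is never used, so the integer map is closed up
res2 <- recompile_sketch(list(doc = c(1L, 3L)), c("x", "y", "z"))
res2$types # "x" "z"
res2$ids   # doc: 1 2
```

The sketch keeps types in order of their first surviving index, so indices that were already valid and unique stay unchanged.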
tokens_recompile(x, method = c("C++", "R"))
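x | the tokens object to be recompiled |
method | the implementation to use, either "C++" (the first option in the usage above, so the default) or "R" |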
library("quanteda")
# tokens_recompile() is internal to quanteda, so it is called with `:::` below

# lowercasing
toks1 <- tokens(c(one = "a b c d A B C D",
                  two = "A B C d"))
attr(toks1, "types") <- char_tolower(attr(toks1, "types"))
unclass(toks1)
unclass(quanteda:::tokens_recompile(toks1))
# stemming
toks2 <- tokens("Stemming stemmed many word stems.")
unclass(toks2)
unclass(quanteda:::tokens_recompile(tokens_wordstem(toks2)))
# compounding
toks3 <- tokens("One two three four.")
unclass(toks3)
unclass(tokens_compound(toks3, "two three"))
# lookup
dict <- dictionary(list(test = c("one", "three")))
unclass(tokens_lookup(toks3, dict))
# empty pads
unclass(tokens_select(toks3, dict))
unclass(tokens_select(toks3, dict, padding = TRUE))
# ngrams
unclass(tokens_ngrams(toks3, n = 2:3))