View source: R/tokens_compound.R
Description

Replace multi-token sequences with a multi-word, or "compound" token. The resulting compound tokens will represent a phrase or multi-word expression, concatenated with concatenator (by default, the "_" character) to form a single "token". This ensures that the sequences will be processed subsequently as single tokens, for instance in constructing a dfm.
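For instance, a minimal sketch of this behaviour (the example text here is purely illustrative and not taken from the package documentation):

library(quanteda)

toks <- tokens("natural language processing is fun")
# the matched three-word sequence becomes the single token
# "natural_language_processing"
tokens_compound(toks, phrase("natural language processing"))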
Usage

tokens_compound(
  x,
  pattern,
  concatenator = "_",
  valuetype = c("glob", "regex", "fixed"),
  window = 0,
  case_insensitive = TRUE,
  join = TRUE
)
Arguments

x: an input tokens object

pattern: a character vector, list of character vectors, dictionary, or collocations object. See pattern for details.

concatenator: the concatenation character that will connect the words making up the multi-word sequences. The default "_" is recommended since it will not be removed during normal cleaning and tokenization.

valuetype: the type of pattern matching: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching.

window: integer; a vector of length 1 or 2 that specifies the size of the window of tokens adjacent to pattern that will be compounded with matches to pattern. The window can be asymmetric if two elements are specified, with the first giving the window size before pattern and the second the window size after. The default is 0, meaning that only the matched sequences themselves are compounded.

case_insensitive: logical; if TRUE, ignore case when matching the pattern or dictionary values.

join: logical; if TRUE (the default), join overlapping compounds into a single compound; otherwise, form them as separate compounds. See the examples.
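As a small sketch of the concatenator and valuetype arguments together (the text and glob pattern below are invented for illustration), a wildcard can match any token in a sequence, and the pieces can be joined with a separator other than the default "_":

toks2 <- tokens("The Department of Justice and the Department of State met.",
                remove_punct = TRUE)
# glob "*" matches any third token; matches are joined with "-" rather than "_"
tokens_compound(toks2, phrase("Department of *"), valuetype = "glob",
                concatenator = "-")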
Value

A tokens object in which the token sequences matching pattern have been replaced by new compounded "tokens" joined by the concatenator.
Note

Patterns to be compounded (naturally) consist of multi-word sequences, and how these are expected in pattern is very specific. If the elements to be compounded are supplied as space-delimited elements of a character vector, wrap the vector in phrase(). If the elements to be compounded are separate elements of a character vector, supply it as a list where each list element is the sequence of character elements. See the examples below.
Examples

txt <- "The United Kingdom is leaving the European Union."
toks <- tokens(txt, remove_punct = TRUE)
# character vector - not compounded
tokens_compound(toks, c("United", "Kingdom", "European", "Union"))
# elements separated by spaces - not compounded
tokens_compound(toks, c("United Kingdom", "European Union"))
# list of characters - is compounded
tokens_compound(toks, list(c("United", "Kingdom"), c("European", "Union")))
# elements separated by spaces, wrapped in phrase() - is compounded
tokens_compound(toks, phrase(c("United Kingdom", "European Union")))
# supplied as values in a dictionary (same as list) - is compounded
# (keys do not matter)
tokens_compound(toks, dictionary(list(key1 = "United Kingdom",
key2 = "European Union")))
# pattern as dictionaries with glob matches
tokens_compound(toks, dictionary(list(key1 = c("U* K*"))), valuetype = "glob")
# supplied as collocations - is compounded
colls <- tokens("The new European Union is not the old European Union.") %>%
textstat_collocations(size = 2, min_count = 1, tolower = FALSE)
tokens_compound(toks, colls, case_insensitive = FALSE)
# note the differences caused by join = FALSE
compounds <- list(c("the", "European"), c("European", "Union"))
tokens_compound(toks, pattern = compounds, join = TRUE)
tokens_compound(toks, pattern = compounds, join = FALSE)
# use window to form ngrams
tokens_remove(toks, pattern = stopwords("en")) %>%
tokens_compound(pattern = "leav*", join = FALSE, window = c(0, 3))
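As a follow-on sketch (not part of the original examples), the compounded tokens can be passed straight to dfm(), where each compound then counts as a single feature:

toks_comp <- tokens_compound(toks, phrase(c("United Kingdom", "European Union")))
# each compound (lowercased by dfm's default tolower = TRUE) is a single feature
dfm(toks_comp)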