Segment text into tokens, each of which is an instance of a particular ‘type’.
x: object to be tokenized.
...: additional properties to set on the text filter.
text_tokens splits texts into token sequences. Each token is an
instance of a particular type. This operation proceeds in a series
of stages, controlled by the properties of the text filter:
First, we segment the text into words and spaces using the boundaries defined by Unicode Standard Annex #29, Section 4, with special handling for @mentions, #hashtags, and URLs.
Next, we normalize the words by applying the character mappings
indicated by the map_case, map_quote, and
remove_ignorable properties. We replace sequences of spaces
by a space (U+0020). At the end of the second stage,
we have segmented the text into a sequence of normalized words and
spaces, in Unicode composed normal form (NFC).
In the third stage, if the combine property is non-NULL,
we scan the word sequence from left to right, searching for the longest
possible match in the combine list. If a match exists, we
replace the word sequence with a single token for that term;
otherwise, we leave the word as-is. We drop spaces at this point, unless
they are part of a multi-word term. See the ‘Combining words’
section below for more details.
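The left-to-right, longest-match scan can be sketched in base R. This is only an illustration, not the implementation used by the package: combine_longest is a hypothetical helper, it assumes the words are already normalized and that spaces have already been removed, and it joins the matched words with the connector immediately rather than in a later stage.

```r
# Hypothetical helper, not part of corpus: greedy left-to-right longest match.
combine_longest <- function(words, terms, connector = "_") {
  # Split each multi-word term into its component words.
  term_words <- strsplit(terms, " ", fixed = TRUE)
  out <- character(0)
  i <- 1
  n <- length(words)
  while (i <= n) {
    best <- 0  # length of the longest term that matches at position i
    for (tw in term_words) {
      k <- length(tw)
      if (k > best && i + k - 1 <= n &&
          identical(words[i:(i + k - 1)], tw)) {
        best <- k
      }
    }
    if (best > 0) {
      # Replace the matched words with a single token for that term.
      out <- c(out, paste(words[i:(i + best - 1)], collapse = connector))
      i <- i + best
    } else {
      # No match: leave the word as-is.
      out <- c(out, words[i])
      i <- i + 1
    }
  }
  out
}

words <- c("from", "new", "york", "city", "to", "new", "york")
combined <- combine_longest(words, c("new york", "new york city"))
print(combined)
```

Because the scan prefers the longest term, the first occurrence becomes one token for "new york city" rather than a token for "new york" followed by "city".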
Next, if the stemmer property is non-NULL, we apply
the indicated stemming algorithm to each word that does not match
one of the elements of the stem_except character vector. Terms
that stem to NA get dropped from the sequence.
After stemming, we categorize each remaining token as "letter", "number", "punct", or "symbol"
according to the first character in the word. For words that start with
extenders like underscore (_), we use the first non-extender to
classify the word.
If any of drop_letter, drop_number, drop_punct, or drop_symbol are
TRUE, then we drop the tokens in the
corresponding categories. We also drop any terms that match an element
of the drop character vector. We can add exceptions to the
drop rules by specifying a non-NULL value for the
drop_except property: if drop_except is a character
vector, then we restore tokens that match elements of that vector to
their values prior to dropping.
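The interaction between the drop list and its exceptions can be sketched in base R. This is an illustration under stated assumptions, not the package's code: apply_drop is a hypothetical helper, it handles only the drop character vector (not the category rules), and it assumes that a dropped token is marked NA rather than removed so that positions are preserved.

```r
# Hypothetical helper, not part of corpus. Assumption: a dropped token is
# marked NA (rather than removed) so that token positions are preserved.
apply_drop <- function(tokens, drop, drop_except = NULL) {
  # A token is dropped if it is in `drop` and is not listed as an exception.
  is_dropped <- tokens %in% drop & !(tokens %in% drop_except)
  tokens[is_dropped] <- NA_character_
  tokens
}

kept <- apply_drop(c("0", ",", "1", ",", "2"),
                   drop = c("0", "1", "2"), drop_except = "2")
print(kept)
```

Here "2" matches the drop list but survives because it is also listed as an exception.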
Finally, we replace sequences of white-space in the terms with the
connector, which defaults to a low line character (_, U+005F).
Multi-word terms specified by the combine property can be specified as
tokens, prior to normalization. Terms specified by stem_except, drop, and
drop_except need to be normalized and stemmed (if
stemmer is non-NULL). Thus, for example, if
map_case = TRUE, then a token filter with combine = "Mx."
produces the same results as a token filter with combine = "mx.".
However, drop = "Mx." behaves differently from drop = "mx.".
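A short way to check this behavior, assuming the corpus package is installed; the input sentence is made up for illustration, and no output is shown because it has not been verified here:

```r
library(corpus)  # assumes the corpus package is installed

x <- "Mx. Smith met Mx. Jones."

# combine terms are given prior to normalization, so with map_case = TRUE
# the capitalization of the term should not matter.
toks_upper <- text_tokens(x, text_filter(map_case = TRUE, combine = "Mx."))
toks_lower <- text_tokens(x, text_filter(map_case = TRUE, combine = "mx."))
print(identical(toks_upper, toks_lower))

# drop terms must already be normalized, so these two filters are
# expected to give different results.
print(text_tokens(x, text_filter(combine = "mx.", drop = "Mx.")))
print(text_tokens(x, text_filter(combine = "mx.", drop = "mx.")))
```

Both drop filters also set combine = "mx." so that a token for the abbreviation exists at the point where the drop rules are applied.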
text_tokens returns a list of the same length as x, with
the same names. Each list item is a character vector with the tokens
for the corresponding element of x.
text_ntoken returns a numeric vector the same length as x,
with each element giving the number of tokens in the corresponding text.
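The shape of the two return values can be inspected as follows, assuming the corpus package is installed; the named input vector is made up for illustration:

```r
library(corpus)  # assumes the corpus package is installed

x <- c(a = "A rose is a rose.", b = "Hello, world!")

toks <- text_tokens(x)  # list with one character vector per text
n <- text_ntoken(x)     # one token count per text

str(toks)
print(n)
```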
The combine property of a text_filter enables
transformations that combine two or more words into a single token. For
example, specifying combine = "new york" will
cause consecutive instances of the words new and york
to get replaced by a single token, new_york (using the default connector).
text_tokens("The quick ('brown') fox can't jump 32.3 feet, right?")

# count tokens:
text_ntoken("The quick ('brown') fox can't jump 32.3 feet, right?")

# don't change case or quotes:
f <- text_filter(map_case = FALSE, map_quote = FALSE)
text_tokens("The quick ('brown') fox can't jump 32.3 feet, right?", f)

# drop common function words ('stop' words):
text_tokens("Able was I ere I saw Elba.",
            text_filter(drop = stopwords_en))

# drop numbers, with some exceptions:
text_tokens("0, 1, 2, 3, 4, 5",
            text_filter(drop_number = TRUE,
                        drop_except = c("0", "2", "4")))

# apply stemming...
text_tokens("Mary is running", text_filter(stemmer = "english"))

# ...except for certain words
text_tokens("Mary is running",
            text_filter(stemmer = "english", stem_except = "mary"))

# default tokenization
text_tokens("Ms. Jones")

# combine abbreviations
text_tokens("Ms. Jones", text_filter(combine = abbreviations_en))

# add custom combinations
text_tokens("Ms. Jones is from New York City, New York.",
            text_filter(combine = c(abbreviations_en,
                                    "new york", "new york city")))