tokenize_custom | R Documentation |
Allows users to tokenize texts using customized boundary rules. See the ICU website for how to define boundary rules.
Also provides tools to retrieve the custom word and sentence break rules, set them, or reset them to the package defaults.
tokenize_custom(x, rules)
breakrules_get(what = c("word", "sentence"))
breakrules_set(x, what = c("word", "sentence"))
breakrules_reset(what = c("word", "sentence"))
x | character vector for texts to tokenize; for breakrules_set(), the list of rules to assign |
rules | a list of rules for rule-based boundary detection |
what | character; which set of rules to return, one of "word" or "sentence" |
The package contains internal sets of rules for word and sentence breaks, which are lists of rules for word and sentence boundary detection. base is copied from the ICU library. Other rules are created by the package maintainers in system.file("breakrules/breakrules_custom.yml").
This function allows modification of those rules, and applies them as a new tokenizer.
Custom word rules:
base: ICU's rules for detecting word/sentence boundaries
keep_hyphens: quanteda's rule for preserving hyphens
keep_url: quanteda's rule for preserving URLs
keep_email: quanteda's rule for preserving emails
keep_tags: quanteda's rule for preserving tags
split_elisions: quanteda's rule for splitting elisions
split_tags: quanteda's rule for splitting tags
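As a hedged illustration of how the named rules above can be combined, the sketch below drops one rule from the list before tokenizing. It assumes that the list returned by breakrules_get("word") uses the rule names shown above as its element names and that removing an element leaves a valid rule set. The variable names rules_split, toks_default and toks_split are hypothetical and chosen only for this example.

```r
library(quanteda)

txt <- "a well-known http://example.com"

# Current word rules, as a named list (assumed to carry the names listed above)
rules_default <- breakrules_get("word")

# Copy the rules and drop the hyphen-preserving rule; assigning NULL removes
# the element from the copy only, so the global rules are left untouched
rules_split <- rules_default
rules_split$keep_hyphens <- NULL

# Tokenize the same text with both rule sets and compare the results by eye
toks_default <- tokenize_custom(txt, rules = rules_default)
toks_split <- tokenize_custom(txt, rules = rules_split)

toks_default
toks_split
```

Because the modified list is passed directly to tokenize_custom(), this approach does not change the global rules, unlike breakrules_set().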
tokenize_custom() returns a list of character vectors containing tokens.
breakrules_get() returns the existing break rules as a list.
breakrules_set() returns nothing but reassigns the global breakrules to x.
breakrules_reset() returns nothing but reassigns the global breakrules to the system defaults. These rules are defined in system.file("breakrules/").
https://raw.githubusercontent.com/unicode-org/icu/main/icu4c/source/data/brkitr/rules/word.txt
https://raw.githubusercontent.com/unicode-org/icu/main/icu4c/source/data/brkitr/rules/sent.txt
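library(quanteda)

# tokenize with the current word rules, then drop separators
lis <- tokenize_custom("a well-known http://example.com", rules = breakrules_get("word"))
tokens(lis, remove_separators = TRUE)

# inspect the current rules
breakrules_get("word")
breakrules_get("sentence")

# modify one word rule and assign the list as the global word rules
brw <- breakrules_get("word")
brw$keep_email <- "@[a-zA-Z0-9_]+"
breakrules_set(brw, what = "word")

# restore the package defaults
breakrules_reset("sentence")
breakrules_reset("word")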