tokenize_custom: Customizable tokenizer

View source: R/tokenizers.R

Customizable tokenizer

Description

Allows users to tokenize texts using customized boundary rules. See the ICU website for how to define boundary rules.

Tools for custom word and sentence break rules: retrieve them, set them, or reset them to the package defaults.

Usage

tokenize_custom(x, rules)

breakrules_get(what = c("word", "sentence"))

breakrules_set(x, what = c("word", "sentence"))

breakrules_reset(what = c("word", "sentence"))

Arguments

x

character vector of texts to tokenize

rules

a list of rules for rule-based boundary detection

what

character; which set of rules to return, set, or reset: one of "word" or "sentence"
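
The rules argument expects the same structure that breakrules_get() returns. A minimal sketch for inspecting it, assuming only what this page documents (the element names and contents depend on the installed quanteda version):

library("quanteda")

## retrieve the current word break rules as a named list
brw <- breakrules_get("word")

## each element is expected to hold rule text in ICU's break-rule syntax
names(brw)   # e.g. "base", "keep_hyphens", "keep_url", ...
str(brw, nchar.max = 60)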

Details

The package contains internal sets of break rules for word and sentence boundary detection, stored as lists of rules. The base rules are copied from the ICU library; the other rules are created by the package maintainers and stored in system.file("breakrules/breakrules_custom.yml").

tokenize_custom() allows modification of those rules and applies them as a new tokenizer.
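
A minimal sketch of that workflow, assuming the rule components behave as named on this page (here dropping keep_url so that URLs are no longer kept whole):

library("quanteda")

## start from the packaged word rules and remove one component
brw <- breakrules_get("word")
brw$keep_url <- NULL   # without this rule, URLs should split into parts

## apply the modified rules as a tokenizer
tokenize_custom("see http://example.com", rules = brw)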

Custom word rules (a combined usage sketch follows this list):

base

ICU's rules for detecting word/sentence boundaries

keep_hyphens

quanteda's rule for preserving hyphens

keep_url

quanteda's rule for preserving URLs

keep_email

quanteda's rule for preserving emails

keep_tags

quanteda's rule for preserving tags

split_elisions

quanteda's rule for splitting elisions

split_tags

quanteda's rule for splitting tags
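
As a sketch of how these components combine (the exact tokenizations are an assumption based on the rule names, not verified output):

library("quanteda")

brw <- breakrules_get("word")
tokenize_custom("a well-known example", rules = brw)   # keep_hyphens should preserve "well-known"

brw$keep_hyphens <- NULL
tokenize_custom("a well-known example", rules = brw)   # without it, the hyphenated word should split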

Value

tokenize_custom() returns a list of character vectors containing the tokens, one element per input text.

breakrules_get() returns the existing break rules as a list.

breakrules_set() returns nothing, but reassigns the global break rules to x.

breakrules_reset() returns nothing, but reassigns the global break rules to the package defaults, which are defined in system.file("breakrules/").
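
Because breakrules_set() and breakrules_reset() work by side effect, a round trip can be checked as sketched below; the identical() comparisons assume the rules are stored as plain lists, which is not documented behaviour:

library("quanteda")

orig <- breakrules_get("word")
brw <- orig
brw$keep_email <- "@[a-zA-Z0-9_]+"      # same illustrative pattern as in the Examples
breakrules_set(brw, what = "word")
identical(breakrules_get("word"), brw)  # expected TRUE
breakrules_reset("word")
identical(breakrules_get("word"), orig) # expected TRUE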

Source

https://raw.githubusercontent.com/unicode-org/icu/main/icu4c/source/data/brkitr/rules/word.txt

https://raw.githubusercontent.com/unicode-org/icu/main/icu4c/source/data/brkitr/rules/sent.txt

Examples

library("quanteda")

## tokenize with the default word break rules, then construct a tokens object
lis <- tokenize_custom("a well-known http://example.com",
                       rules = breakrules_get("word"))
tokens(lis, remove_separators = TRUE)

## inspect the packaged rule sets
breakrules_get("word")
breakrules_get("sentence")

## modify a rule, set it globally, then reset both sets to the package defaults
brw <- breakrules_get("word")
brw$keep_email <- "@[a-zA-Z0-9_]+"
breakrules_set(brw, what = "word")
breakrules_reset("sentence")
breakrules_reset("word")
