tokenize_custom: Customizable tokenizer

View source: R/tokenizers.R


Description

Allows users to tokenize texts using customized boundary rules. See the ICU website for how to define boundary rules.

Tools for custom word and sentence break rules: retrieve the current rules, set new ones, or reset them to the package defaults.

Usage

tokenize_custom(x, rules)

breakrules_get(what = c("word", "sentence"))

breakrules_set(x, what = c("word", "sentence"))

breakrules_reset(what = c("word", "sentence"))

Arguments

x

character vector of texts to tokenize

rules

a list of rules for rule-based boundary detection

what

character; which set of rules to operate on, one of "word" or "sentence"

Details

The package contains internal sets of rules for word and sentence breaks, stored as lists of rules for boundary detection. The base rules are copied from the ICU library; the other rules were created by the package maintainers and are stored in system.file("breakrules/breakrules_custom.yml").

tokenize_custom() allows these rules to be modified and applied, effectively defining a new tokenizer.
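As a minimal sketch of this workflow (the sample text is invented for illustration): retrieve the word rules as a named list, edit an entry, and pass the result back to tokenize_custom().

library("quanteda")

## retrieve the current word break rules as a named list
brw <- breakrules_get("word")

## drop the URL rule so that URLs break into their components
brw$keep_url <- NULL

## tokenize with the modified rule set
tokenize_custom("visit https://example.com today", rules = brw)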

Custom word rules:

base

ICU's rules for detecting word/sentence boundaries

keep_hyphens

quanteda's rule for preserving hyphens

keep_url

quanteda's rule for preserving URLs

keep_email

quanteda's rule for preserving emails

keep_tags

quanteda's rule for preserving tags

split_elisions

quanteda's rule for splitting elisions

split_tags

quanteda's rule for splitting tags
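As a sketch of how individual rules change the output (assuming the rules list can be subset like an ordinary named list, and that the base rules alone form a valid rule set), keeping only base splits a hyphenated word that keep_hyphens would preserve:

library("quanteda")

brw <- breakrules_get("word")

## base ICU rules only: "well-known" is split at the hyphen
tokenize_custom("a well-known example", rules = brw["base"])

## base plus quanteda's hyphen rule: "well-known" stays one token
tokenize_custom("a well-known example", rules = brw[c("base", "keep_hyphens")])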

Value

tokenize_custom() returns a list of character vectors, one vector of tokens per input text.

breakrules_get() returns the existing break rules as a list.

breakrules_set() returns nothing, but reassigns the global break rules to x.

breakrules_reset() returns nothing, but reassigns the global break rules to the system defaults. These rules are defined in system.file("breakrules/").
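A sketch of the set/reset round trip (the edit is hypothetical, and the identical() check assumes breakrules_set() stores the list as given), showing that the setters mutate package-global state that breakrules_get() then reflects:

library("quanteda")

brw <- breakrules_get("word")
brw$keep_hyphens <- NULL          # hypothetical edit: drop the hyphen rule
breakrules_set(brw, what = "word")

## the global word rules now match the modified list
identical(breakrules_get("word"), brw)

## restore the defaults shipped in system.file("breakrules/")
breakrules_reset("word")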

Source

https://raw.githubusercontent.com/unicode-org/icu/main/icu4c/source/data/brkitr/rules/word.txt

https://raw.githubusercontent.com/unicode-org/icu/main/icu4c/source/data/brkitr/rules/sent.txt

Examples

## tokenize using the default word break rules
lis <- tokenize_custom("a well-known http://example.com",
                       rules = breakrules_get("word"))
tokens(lis, remove_separators = TRUE)

## inspect the default rule sets
breakrules_get("word")
breakrules_get("sentence")

## modify the email rule, set it globally, then restore the defaults
brw <- breakrules_get("word")
brw$keep_email <- "@[a-zA-Z0-9_]+"
breakrules_set(brw, what = "word")
breakrules_reset("sentence")
breakrules_reset("word")
