tokenizers: Regexp tokenizers
In NLP: Natural Language Processing Infrastructure

R Documentation

Description

Tokenizers using regular expressions to match either tokens or separators between tokens.

Usage

Regexp_Tokenizer(pattern, invert = FALSE, ..., meta = list())
blankline_tokenizer(s)
whitespace_tokenizer(s)
wordpunct_tokenizer(s)

Arguments

`pattern`	a character string giving the regular expression to use for matching.
`invert`	a logical indicating whether `pattern` matches the separators between tokens (`TRUE`) or the tokens themselves (`FALSE`, the default).
`...`	further arguments to be passed to `gregexpr()`.
`meta`	a named or empty list of tokenizer metadata tag-value pairs.
`s`	a `String` object, or something coercible to this using `as.String()` (e.g., a character string with appropriate encoding information).

Details

`Regexp_Tokenizer()` creates regexp span tokenizers which use the given `pattern` and `...` arguments to match tokens or separators between tokens via `gregexpr()`, and then transform the match results into character spans of the tokens found.
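To make the step from `gregexpr()` matches to character spans concrete, here is a minimal base R sketch. The helper `match_spans()` is hypothetical and not part of NLP; it returns a plain data frame rather than the package's span objects, and it only approximates what a regexp tokenizer does internally.

```r
## Hypothetical helper (not part of NLP): turn gregexpr() matches
## into start/end positions, optionally inverting to get the gaps.
match_spans <- function(x, pattern, invert = FALSE, ...) {
    m <- gregexpr(pattern, x, ...)[[1L]]
    if (m[1L] == -1L) {
        ## No match: with invert = TRUE the whole string is one token.
        if (invert && nchar(x) > 0L)
            return(data.frame(start = 1L, end = nchar(x)))
        return(data.frame(start = integer(), end = integer()))
    }
    start <- as.integer(m)
    end <- start + attr(m, "match.length") - 1L
    if (invert) {
        ## Matches are separators; tokens are the stretches between them.
        tok_start <- c(1L, end + 1L)
        tok_end <- c(start - 1L, nchar(x))
        keep <- tok_start <= tok_end   # drop empty leading/trailing gaps
        start <- tok_start[keep]
        end <- tok_end[keep]
    }
    data.frame(start = start, end = end)
}

x <- "  First sentence.  Second sentence.  "
sp <- match_spans(x, "\\s+", invert = TRUE)
substring(x, sp$start, sp$end)
```

With `invert = FALSE` the match positions are the token spans directly; with `invert = TRUE` the spans are the non-empty stretches left over between matches.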

whitespace_tokenizer() tokenizes by treating any sequence of whitespace characters as a separator.

blankline_tokenizer() tokenizes by treating any sequence of blank lines as a separator.

wordpunct_tokenizer() tokenizes by matching sequences of alphabetic characters and sequences of (non-whitespace) non-alphabetic characters.
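The two matching styles described above can be tried out with `Regexp_Tokenizer()` directly. The patterns below are illustrative assumptions chosen to mimic the described behavior; they are not necessarily the patterns the package uses for its predefined tokenizers, and the `description` tag is just an example metadata entry.

```r
library("NLP")

s <- String("  First sentence.  Second sentence.  ")

## Separator style: the pattern describes what lies between tokens.
ws_tok <- Regexp_Tokenizer("\\s+", invert = TRUE,
                           meta = list(description = "split on whitespace"))
s[ws_tok(s)]

## Token style: the pattern describes the tokens themselves, here runs of
## alphabetic characters or runs of non-alphabetic, non-space characters.
wp_tok <- Regexp_Tokenizer("[[:alpha:]]+|[^[:alpha:][:space:]]+")
s[wp_tok(s)]
```

Separator style tends to be simpler when tokens are "everything except X"; token style is the better fit when the tokens themselves have a recognizable shape.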

Value

Regexp_Tokenizer() returns the created regexp span tokenizer.

`blankline_tokenizer()`, `whitespace_tokenizer()` and `wordpunct_tokenizer()` return the spans of the tokens found in `s`.

Examples

library("NLP")
## A simple text.
s <- String("  First sentence.  Second sentence.  ")
##           ****5****0****5****0****5****0****5**

spans <- whitespace_tokenizer(s)
spans
s[spans]

spans <- wordpunct_tokenizer(s)
spans
s[spans]
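The examples above do not exercise `blankline_tokenizer()`. As a small additional sketch, the following splits a two-paragraph text at the blank line; the text `p` is made up for illustration.

```r
library("NLP")

## Two paragraphs separated by one blank line; the single newline inside
## the first paragraph is not a separator.
p <- String("First paragraph,\nstill first.\n\nSecond paragraph.")

spans <- blankline_tokenizer(p)
spans
p[spans]
```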

NLP documentation built on April 12, 2025, 1:36 a.m.