tokenizers: Regexp Tokenizers

Description

Tokenizers using regular expressions to match either tokens or separators between tokens.
Usage

Regexp_Tokenizer(pattern, invert = FALSE, ..., meta = list())
blankline_tokenizer(s)
whitespace_tokenizer(s)
wordpunct_tokenizer(s)
Arguments

pattern: a character string giving the regular expression to use for matching.

invert: a logical indicating whether to match separators between tokens.

...: further arguments to be passed to gregexpr().

meta: a named or empty list of tokenizer metadata tag-value pairs.

s: a String object, or something coercible to one using as.String() (e.g., a character string).
Details

Regexp_Tokenizer() creates regexp span tokenizers which use the given pattern and ... arguments to match tokens or separators between tokens via gregexpr(), and then transform the results into character spans of the tokens found.
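For instance, a custom span tokenizer for runs of digits can be built directly from Regexp_Tokenizer(). The following is a minimal sketch: digit_tokenizer, csv_tokenizer and the sample strings are illustrative, not part of the package.

library("NLP")

## A span tokenizer matching runs of digits (hypothetical example).
digit_tokenizer <- Regexp_Tokenizer("[[:digit:]]+")
d <- String("Rooms 12 and 305 are free.")
digit_tokenizer(d)            ## spans of the digit runs
d[digit_tokenizer(d)]         ## "12" "305"

## With invert = TRUE the pattern matches separators instead,
## so the tokens are the stretches between the matches.
csv_tokenizer <- Regexp_Tokenizer(",[[:space:]]*", invert = TRUE)
d2 <- String("a, bb, ccc")
d2[csv_tokenizer(d2)]         ## "a" "bb" "ccc"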
whitespace_tokenizer() tokenizes by treating any sequence of whitespace characters as a separator.

blankline_tokenizer() tokenizes by treating any sequence of blank lines as a separator.

wordpunct_tokenizer() tokenizes by matching sequences of alphabetic characters and sequences of (non-whitespace) non-alphabetic characters.
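The package examples below exercise only the whitespace and wordpunct tokenizers; the following minimal sketch (with an illustrative two-paragraph string) shows blankline_tokenizer() as well.

library("NLP")

## Splitting a two-paragraph text on the blank line between the parts.
p <- String("First paragraph.\n\nSecond paragraph.")
blankline_tokenizer(p)        ## spans of the two paragraphs
p[blankline_tokenizer(p)]     ## "First paragraph." "Second paragraph."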
Value

Regexp_Tokenizer() returns the created regexp span tokenizer.

blankline_tokenizer(), whitespace_tokenizer() and wordpunct_tokenizer() return the spans of the tokens found in s.
See Also

Span_Tokenizer() for general information on span tokenizer objects.
Examples

library("NLP")

## A simple text.
s <- String("  First sentence.  Second sentence.  ")
##           ****5****0****5****0****5****0****5**
## Spans of the whitespace-delimited tokens.
spans <- whitespace_tokenizer(s)
spans
s[spans]
## Spans of the word and punctuation tokens.
spans <- wordpunct_tokenizer(s)
spans
s[spans]