Description

Create tokenizer objects.

Usage

Span_Tokenizer(f, meta = list())
as.Span_Tokenizer(x, ...)

Token_Tokenizer(f, meta = list())
as.Token_Tokenizer(x, ...)
Arguments

f: a tokenizer function taking the string to tokenize as its argument, and returning either the tokens obtained (for Token_Tokenizer()) or the corresponding spans (for Span_Tokenizer()).

meta: a named or empty list of tokenizer metadata tag-value pairs.

x: an R object.

...: further arguments passed to or from other methods.
Details

Tokenization is the process of breaking a text string up into words, phrases, symbols, or other meaningful elements called tokens. A tokenizer can accomplish this either by returning the sequence of tokens directly, or by returning the corresponding spans (the character start and end positions of the tokens). We refer to tokenization resources of the respective kinds as "token tokenizers" and "span tokenizers".
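The distinction can be sketched in base R alone (without the NLP classes): a span tokenizer reports where the tokens are, a token tokenizer reports what they are, and the tokens can always be recovered from the spans by substring extraction. The function names here are illustrative, not part of the package.

```r
s <- "First sentence.  Second sentence."

## Span tokenizer sketch: match runs of word characters and
## return their character start/end positions.
span_tokenize <- function(x) {
  m <- gregexpr("\\w+", x)[[1L]]
  data.frame(start = as.integer(m),
             end   = as.integer(m) + attr(m, "match.length") - 1L)
}

## Token tokenizer sketch: return the matched substrings themselves.
token_tokenize <- function(x) regmatches(x, gregexpr("\\w+", x))[[1L]]

spans  <- span_tokenize(s)
tokens <- token_tokenize(s)

## Tokens are recoverable from spans:
identical(substring(s, spans$start, spans$end), tokens)  # TRUE
```

Going the other way (spans from tokens) requires locating each token in the original string, which is what as.Span_Tokenizer() takes care of for tokenizer objects.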
Span_Tokenizer() and Token_Tokenizer() return tokenizer objects, which are functions with metadata and suitable class information; these in turn can be used for converting between the two kinds via as.Span_Tokenizer() and as.Token_Tokenizer().
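The idea of "a function with metadata and class information" can be sketched in base R as follows; the constructor and class names below are hypothetical stand-ins, not the NLP package's implementation.

```r
## Sketch: wrap a plain tokenizer function so it carries metadata
## and a class, while remaining directly callable.
make_token_tokenizer <- function(f, meta = list()) {
  attr(f, "meta") <- meta
  class(f) <- c("Token_Tokenizer_sketch", class(f))
  f
}

tt <- make_token_tokenizer(
  function(x) regmatches(x, gregexpr("\\S+", x))[[1L]],
  meta = list(description = "Whitespace token tokenizer.")
)

## Still usable like an ordinary function ...
tt("Hello tokenizer world")
## ... but now carrying metadata and class information that
## coercion methods could dispatch on.
attr(tt, "meta")$description
class(tt)
```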
It is also possible to coerce annotator (pipeline) objects to tokenizer objects, provided that the annotators provide suitable token annotations. By default, word tokens are used; this can be controlled via the type argument of the coercion methods (e.g., type = "sentence" to extract sentence tokens).
There are also format() methods for tokenizer objects, which use the description element of the metadata if available.
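How such a format() method can make use of a description entry is easy to sketch with a plain S3 method in base R; the class name and metadata layout here are illustrative only, not the NLP package's internals.

```r
## A plain function carrying metadata and a (hypothetical) class:
f <- function(x) strsplit(x, "\\s+")[[1L]]
attr(f, "meta") <- list(description = "Splits on whitespace.")
class(f) <- "tokenizer_sketch"

## An S3 format() method that uses the description metadata if present:
format.tokenizer_sketch <- function(x, ...) {
  d <- attr(x, "meta")$description
  if (is.null(d)) "A tokenizer object." else paste("A tokenizer object:", d)
}

format(f)  # "A tokenizer object: Splits on whitespace."
```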
See Also

Regexp_Tokenizer() for creating regexp span tokenizers.
Examples

## A simple text.
s <- String("  First sentence.  Second sentence.  ")
##           ****5****0****5****0****5****0****5**

## Use a pre-built regexp (span) tokenizer:
wordpunct_tokenizer
wordpunct_tokenizer(s)
## Turn into a token tokenizer:
tt <- as.Token_Tokenizer(wordpunct_tokenizer)
tt
tt(s)
## Of course, in this case we could simply have done
s[wordpunct_tokenizer(s)]
## to obtain the tokens from the spans.

## Conversion also works the other way round: package 'tm' provides
## the following token tokenizer function:
scan_tokenizer <- function(x)
    scan(text = as.character(x), what = "character",
         quote = "", quiet = TRUE)
## Create a token tokenizer from this:
tt <- Token_Tokenizer(scan_tokenizer)
tt(s)
## Turn into a span tokenizer:
st <- as.Span_Tokenizer(tt)
st(s)
## Checking tokens from spans:
s[st(s)]