Simple tokenization functions that perform string splitting

Description

Simple wrappers around base regular expressions. For much faster and more feature-rich tokenizers see the tokenizers package: https://cran.r-project.org/package=tokenizers. Also see the str_split_* functions in the stringi and stringr packages. The reason for not including these packages in the text2vec dependencies is our desire to keep the number of dependencies as small as possible.
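
Under the hood each tokenizer is a thin wrapper around strsplit. A minimal sketch of the idea (my_regexp_tokenizer is a hypothetical illustration, not the package's actual implementation):

my_regexp_tokenizer = function(strings, pattern, ...) {
  # strsplit already returns a list of character vectors,
  # one element per input string
  strsplit(strings, pattern, ...)
}

my_regexp_tokenizer(c("first  second", "bla, bla, blaa"), "\\W+")
# list(c("first", "second"), c("bla", "bla", "blaa"))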

Usage

word_tokenizer(strings, ...)

regexp_tokenizer(strings, pattern, ...)

char_tokenizer(strings, ...)

space_tokenizer(strings, ...)

Arguments

strings

character vector

...

other parameters passed to the strsplit function, which is used under the hood (see the example after this list).

pattern

character string: the pattern (regular expression) on which to split.
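
Because ... is forwarded to strsplit (as stated above), strsplit options can be passed through. A hedged example, assuming that forwarding:

# treat the pattern as a literal string rather than a regular expression
regexp_tokenizer(c("a.b.c", "d.e"), ".", fixed = TRUE)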

Value

A list of character vectors. Each element of the list contains a vector of tokens.

Examples

doc = c("first  second", "bla, bla, blaa")
# split by words
word_tokenizer(doc)
# faster, but far less general - split on a fixed single whitespace symbol
regexp_tokenizer(doc, " ", fixed = TRUE)
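# the remaining tokenizers follow the same calling convention
# split each document into single characters
char_tokenizer(doc)
# split on whitespace only - punctuation stays attached to the tokens
space_tokenizer(doc)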
