tokenizers: A Consistent Interface to Tokenize Natural Language Text

Convert natural language text into tokens. The tokenizers have a consistent interface, and compatibility with Unicode is ensured because they are built on the 'stringi' package. Includes tokenizers for shingled n-grams, skip n-grams, words, word stems, sentences, paragraphs, characters, lines, and tokenization by regular expression.
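
A minimal usage sketch of that consistent interface: every tokenizer takes a character vector and returns a list of character vectors, one element per input document. The tokenize_* function names below follow the package's exports; the sample text is invented, and exact argument defaults may differ between versions.

library(tokenizers)

text <- c(doc1 = "The quick brown fox jumps over the lazy dog.",
          doc2 = "Call me Ishmael. Some years ago, never mind how long.")

tokenize_words(text)          # word tokens, lowercased by default
tokenize_ngrams(text, n = 2)  # shingled word bigrams
tokenize_sentences(text)      # one character vector of sentences per document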

Author
Lincoln Mullen [aut, cre], Dmitriy Selivanov [ctb]
Date of publication
2016-08-29 22:59:29
Maintainer
Lincoln Mullen <lincoln@lincolnmullen.com>
License
MIT + file LICENSE
Version
0.1.4

Man pages

basic-tokenizers
Basic tokenizers
ngram-tokenizers
N-gram tokenizers
shingle-tokenizers
Character shingle tokenizers
stem-tokenizers
Word stem tokenizer
stopwords
Stopword lists
tokenizers
Tokenizers
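
The help topics above correspond roughly to the exported tokenize_* functions. A hedged sketch with one call per topic follows; argument names such as k and language, and the stopwords() signature, are assumptions based on the documentation topics and source files listed below, not verified against this exact version.

library(tokenizers)

ch <- "Call me Ishmael. Some years ago, never mind how long precisely, I thought I would sail about a little."

# basic-tokenizers: words, sentences, paragraphs, lines, characters, regex
tokenize_characters(ch)
tokenize_regex(ch, pattern = "\\s+")

# ngram-tokenizers: shingled n-grams and skip n-grams over words
tokenize_skip_ngrams(ch, n = 3, k = 1)

# shingle-tokenizers: overlapping character shingles
tokenize_character_shingles(ch, n = 3)

# stem-tokenizers: word stems (Snowball stemmer)
tokenize_word_stems(ch, language = "english")

# stopwords: built-in stopword lists, usable with the n-gram tokenizers
tokenize_ngrams(ch, n = 2, stopwords = stopwords("en"))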

Files in this package

tokenizers
tokenizers/inst
tokenizers/inst/doc
tokenizers/inst/doc/introduction-to-tokenizers.Rmd
tokenizers/inst/doc/introduction-to-tokenizers.html
tokenizers/inst/doc/introduction-to-tokenizers.R
tokenizers/tests
tokenizers/tests/testthat.R
tokenizers/tests/testthat
tokenizers/tests/testthat/test-utils.R
tokenizers/tests/testthat/test-shingles.R
tokenizers/tests/testthat/moby-ch2.txt
tokenizers/tests/testthat/test-basic.R
tokenizers/tests/testthat/moby-ch1.txt
tokenizers/tests/testthat/moby-ch3.txt
tokenizers/tests/testthat/test-ngrams.R
tokenizers/tests/testthat/test-stem.R
tokenizers/tests/testthat/helper-data.R
tokenizers/src
tokenizers/src/Makevars
tokenizers/src/skip_ngrams.cpp
tokenizers/src/shingle_ngrams.cpp
tokenizers/src/Makevars.win
tokenizers/src/RcppExports.cpp
tokenizers/NAMESPACE
tokenizers/NEWS.md
tokenizers/R
tokenizers/R/utils.R
tokenizers/R/stem-tokenizers.R
tokenizers/R/stopwords.R
tokenizers/R/sysdata.rda
tokenizers/R/character-shingles-tokenizers.R
tokenizers/R/RcppExports.R
tokenizers/R/tokenizers-package.r
tokenizers/R/ngram-tokenizers.R
tokenizers/R/basic-tokenizers.R
tokenizers/vignettes
tokenizers/vignettes/introduction-to-tokenizers.Rmd
tokenizers/README.md
tokenizers/MD5
tokenizers/build
tokenizers/build/vignette.rds
tokenizers/DESCRIPTION
tokenizers/man
tokenizers/man/stem-tokenizers.Rd
tokenizers/man/tokenizers.Rd
tokenizers/man/shingle-tokenizers.Rd
tokenizers/man/ngram-tokenizers.Rd
tokenizers/man/stopwords.Rd
tokenizers/man/basic-tokenizers.Rd
tokenizers/LICENSE