tokenizers: A Consistent Interface to Tokenize Natural Language Text

Convert natural language text into tokens. The tokenizers have a consistent interface and are compatible with Unicode, thanks to being built on the 'stringi' package. Includes tokenizers for shingled n-grams, skip n-grams, words, word stems, sentences, paragraphs, characters, lines, and regular expressions.
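The consistent interface means every tokenizer takes a character vector (one string per document) and returns a list of character vectors, one list element per input document. A minimal sketch of that pattern, using functions listed under Functions below (defaults may differ slightly between versions):

    library(tokenizers)

    docs <- c("The quick brown fox jumps over the lazy dog.",
              "It barked. Then it slept.")

    tokenize_words(docs)       # word tokens, lowercased by default
    tokenize_sentences(docs)   # one sentence per token
    tokenize_characters(docs)  # individual characters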

Author: Lincoln Mullen [aut, cre], Dmitriy Selivanov [ctb]
Date of publication: 2016-08-29 22:59:29
Maintainer: Lincoln Mullen <lincoln@lincolnmullen.com>
License: MIT + file LICENSE
Version: 0.1.4
URL: https://github.com/ropensci/tokenizers


Functions

basic-tokenizers
ngram-tokenizers
stopwords
tokenize_characters
tokenize_character_shingles
tokenize_lines
tokenize_ngrams
tokenize_paragraphs
tokenize_regex
tokenizers
tokenizers-package
tokenize_sentences
tokenize_skip_ngrams
tokenize_words
tokenize_word_stems
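A hedged sketch of the n-gram tokenizers listed above; the argument names (n, n_min, k, stopwords) and the two-letter language code passed to stopwords() follow the package documentation for this release, but should be checked against your installed version:

    library(tokenizers)

    text <- "The quick brown fox jumps over the lazy dog."

    # Shingled n-grams: every run of n_min to n consecutive words.
    tokenize_ngrams(text, n = 3, n_min = 2)

    # Skip n-grams: n-grams whose words may be separated by up to k
    # skipped words.
    tokenize_skip_ngrams(text, n = 2, k = 1)

    # Remove common function words before forming the n-grams.
    tokenize_ngrams(text, n = 2, stopwords = stopwords("en"))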

Files

tokenizers
tokenizers/inst
tokenizers/inst/doc
tokenizers/inst/doc/introduction-to-tokenizers.Rmd
tokenizers/inst/doc/introduction-to-tokenizers.html
tokenizers/inst/doc/introduction-to-tokenizers.R
tokenizers/tests
tokenizers/tests/testthat.R
tokenizers/tests/testthat
tokenizers/tests/testthat/test-utils.R
tokenizers/tests/testthat/test-shingles.R
tokenizers/tests/testthat/moby-ch2.txt
tokenizers/tests/testthat/test-basic.R
tokenizers/tests/testthat/moby-ch1.txt
tokenizers/tests/testthat/moby-ch3.txt
tokenizers/tests/testthat/test-ngrams.R
tokenizers/tests/testthat/test-stem.R
tokenizers/tests/testthat/helper-data.R
tokenizers/src
tokenizers/src/Makevars
tokenizers/src/skip_ngrams.cpp
tokenizers/src/shingle_ngrams.cpp
tokenizers/src/Makevars.win
tokenizers/src/RcppExports.cpp
tokenizers/NAMESPACE
tokenizers/NEWS.md
tokenizers/R
tokenizers/R/utils.R
tokenizers/R/stem-tokenizers.R
tokenizers/R/stopwords.R
tokenizers/R/sysdata.rda
tokenizers/R/character-shingles-tokenizers.R
tokenizers/R/RcppExports.R
tokenizers/R/tokenizers-package.r
tokenizers/R/ngram-tokenizers.R
tokenizers/R/basic-tokenizers.R
tokenizers/vignettes
tokenizers/vignettes/introduction-to-tokenizers.Rmd
tokenizers/README.md
tokenizers/MD5
tokenizers/build
tokenizers/build/vignette.rds
tokenizers/DESCRIPTION
tokenizers/man
tokenizers/man/stem-tokenizers.Rd
tokenizers/man/tokenizers.Rd
tokenizers/man/shingle-tokenizers.Rd
tokenizers/man/ngram-tokenizers.Rd
tokenizers/man/stopwords.Rd
tokenizers/man/basic-tokenizers.Rd
tokenizers/LICENSE


Please suggest features or report bugs via the GitHub issue tracker.
