tokenizers: A Consistent Interface to Tokenize Natural Language Text
Version 0.1.4

Convert natural language text into tokens. The tokenizers have a consistent interface and are compatible with Unicode, thanks to being built on the 'stringi' package. Includes tokenizers for shingled n-grams, skip n-grams, words, word stems, sentences, paragraphs, characters, lines, and regular expressions.
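As a quick illustration of the consistent interface described above, a minimal sketch (the function names appear in this package's function list below; the exact defaults, such as lowercasing, are assumed from the documentation):

```r
library(tokenizers)

# Every tokenizer takes a character vector and returns a list of
# character vectors, one list element per input document.
text <- "The quick brown fox jumps over the lazy dog."

tokenize_words(text)       # word tokens, lowercased by default
tokenize_characters(text)  # individual characters
tokenize_sentences("First sentence. Second sentence.")
```

Because every tokenizer shares this vector-in, list-out shape, they can be swapped for one another in a text-analysis pipeline without changing surrounding code.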


Author: Lincoln Mullen [aut, cre], Dmitriy Selivanov [ctb]
Date of publication: 2016-08-29 22:59:29
Maintainer: Lincoln Mullen <lincoln@lincolnmullen.com>
License: MIT + file LICENSE
Version: 0.1.4
URL: https://github.com/ropensci/tokenizers
Package repository: CRAN
Installation: Install the latest version of this package by entering the following in R:
install.packages("tokenizers")

Man pages

basic-tokenizers: Basic tokenizers
ngram-tokenizers: N-gram tokenizers
shingle-tokenizers: Character shingle tokenizers
stem-tokenizers: Word stem tokenizer
stopwords: Stopword lists
tokenizers: Tokenizers
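The stopwords help page documents the built-in stopword lists. A hedged sketch of how a list might be combined with a tokenizer (assuming, per this package version, that `stopwords()` takes a language code and that the tokenizers accept a `stopwords` argument):

```r
library(tokenizers)

# Retrieve the built-in English stopword list, then pass it to a
# tokenizer so common function words are dropped from the output.
sw <- stopwords("en")
tokenize_words("He went to the store", stopwords = sw)
```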

Functions

basic-tokenizers (man page)
check_input (source)
generate_ngrams_batch (source)
ngram-tokenizers (man page)
remove_stopwords (source)
simplify_list (source)
skip_ngrams (source)
stopwords (man page, source)
tokenize_character_shingles (man page, source)
tokenize_characters (man page, source)
tokenize_lines (man page, source)
tokenize_ngrams (man page, source)
tokenize_paragraphs (man page, source)
tokenize_regex (man page, source)
tokenize_sentences (man page, source)
tokenize_skip_ngrams (man page, source)
tokenize_word_stems (man page, source)
tokenize_words (man page, source)
tokenizers (man page)
tokenizers-package (man page)
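The n-gram tokenizers above differ in whether the words of each n-gram must be contiguous. A sketch of both, assuming the documented arguments `n` (n-gram size) and `k` (maximum number of skipped words):

```r
library(tokenizers)

text <- "one two three four"

# Shingled n-grams: contiguous sequences of n words.
tokenize_ngrams(text, n = 2)

# Skip n-grams: like n-grams, but up to k words may be
# skipped between the chosen tokens, so "one three" also
# counts as a valid 2-gram when k = 1.
tokenize_skip_ngrams(text, n = 2, k = 1)
```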

Files

inst
inst/doc
inst/doc/introduction-to-tokenizers.Rmd
inst/doc/introduction-to-tokenizers.html
inst/doc/introduction-to-tokenizers.R
tests
tests/testthat.R
tests/testthat
tests/testthat/test-utils.R
tests/testthat/test-shingles.R
tests/testthat/moby-ch2.txt
tests/testthat/test-basic.R
tests/testthat/moby-ch1.txt
tests/testthat/moby-ch3.txt
tests/testthat/test-ngrams.R
tests/testthat/test-stem.R
tests/testthat/helper-data.R
src
src/Makevars
src/skip_ngrams.cpp
src/shingle_ngrams.cpp
src/Makevars.win
src/RcppExports.cpp
NAMESPACE
NEWS.md
R
R/utils.R
R/stem-tokenizers.R
R/stopwords.R
R/sysdata.rda
R/character-shingles-tokenizers.R
R/RcppExports.R
R/tokenizers-package.r
R/ngram-tokenizers.R
R/basic-tokenizers.R
vignettes
vignettes/introduction-to-tokenizers.Rmd
README.md
MD5
build
build/vignette.rds
DESCRIPTION
man
man/stem-tokenizers.Rd
man/tokenizers.Rd
man/shingle-tokenizers.Rd
man/ngram-tokenizers.Rd
man/stopwords.Rd
man/basic-tokenizers.Rd
LICENSE
tokenizers documentation built on May 20, 2017, 4:13 a.m.