- Removed the tokenize_tweets() function, which is no longer supported.
- Added the tokenize_ptb() function for Penn Treebank tokenizations (@jrnold) (#12); see the example below.
- Added chunk_text() to split long documents into pieces (#30); see the example below.
- New function tokenize_tweets() preserves usernames, hashtags, and URLs (@kbenoit) (#44).
- The stopwords() function has been removed in favor of using the stopwords package (#46).
- The package now follows the basic recommendations of the Text Interchange Format (tif package). (#49)
- tokenize_skip_ngrams has been improved to generate unigrams and bigrams, according to the skip definition (#24); see the example below.
- Documented the languages that tokenizers supports (@ironholds) (#26).
- tokenize_skip_ngrams now supports stopwords (#31).
- Tokenizers now handle NA consistently (#33).
- tokenize_words() gains arguments to preserve or strip punctuation and numbers (#48); see the example below.
- Fixed tokenize_skip_ngrams() and tokenize_ngrams() to return properly marked UTF8 strings on Windows (@patperry) (#58).
- tokenize_tweets() now removes stopwords prior to stripping punctuation, making its behavior more consistent with tokenize_words() (#76).
- Added the tokenize_character_shingles() tokenizer; see the example below.
- … tokenize_words() and tokenize_word_stems().
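For the tokenize_ptb() item, a minimal sketch of Penn Treebank-style tokenization. The exact token splits depend on the installed version, so the output shown in the comment is indicative rather than authoritative.

```r
library(tokenizers)

# Penn Treebank tokenization keeps punctuation as tokens and splits
# contractions, unlike the default tokenize_words().
tokenize_ptb("They aren't sure it works.")
# Indicative output: "They" "are" "n't" "sure" "it" "works" "."
```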
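For chunk_text(), a short usage sketch. It assumes the chunk_size argument counts words and that chunk names are derived from the input document names, as in the package documentation; check these against your installed version.

```r
library(tokenizers)

# A document long enough to be worth splitting.
doc <- paste(rep("The quick brown fox jumps over the lazy dog.", 25),
             collapse = " ")

# Split into pieces of roughly 50 words each; the result is a list of
# character chunks whose names are built from the document name.
chunks <- chunk_text(c(essay = doc), chunk_size = 50)
length(chunks)  # number of pieces
names(chunks)   # chunk identifiers
```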
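The two skip n-gram items (#24, #31) combine as below. The n, n_min, k, and stopwords argument names are taken from the documented signature of tokenize_skip_ngrams(); the stopwords package is the replacement named in the stopwords() removal note above.

```r
library(tokenizers)

x <- "the quick brown fox jumps over the lazy dog"

# With n_min = 1 the skip n-gram tokenizer also emits unigrams and
# bigrams, per the improved skip definition; k is the maximum number
# of words that may be skipped between n-gram members.
tokenize_skip_ngrams(x, n = 3, n_min = 1, k = 1,
                     stopwords = stopwords::stopwords("en"))
```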
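The tokenize_words() item refers to the strip_punct and strip_numbers arguments (names taken from the function's documentation):

```r
library(tokenizers)

x <- "In 2016, version 0.2.0 was released!"

# Defaults: punctuation is stripped, numbers are kept.
tokenize_words(x)

# Keep punctuation as tokens and strip the numbers instead.
tokenize_words(x, strip_punct = FALSE, strip_numbers = TRUE)
```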
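Finally, character shingles are overlapping character sequences, i.e. n-grams over characters rather than words. A small sketch of tokenize_character_shingles(); the output comment is indicative.

```r
library(tokenizers)

# Overlapping character trigrams; non-alphanumeric characters are
# stripped by default.
tokenize_character_shingles("tokenizers", n = 3)
# Indicative output: "tok" "oke" "ken" "eni" "niz" "ize" "zer" "ers"
```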