token: Tokenize (or split) text and emit n-word combinations from a...

Description Usage Arguments Value

Description

When n = 1, the text is simply tokenized and words are emitted with counts. When n > 1, tokenized words are combined into permutations of length n within each document.

Usage

token(n, tokenSep = "+", ignoreCase = FALSE,
  delimiter = "[ \\t\\b\\f\\r]+", punctuation = NULL,
  stemming = FALSE, stopWords = FALSE, sep = " ", minLength = 1)
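
For example, parsers for single words and for 2-word combinations could be created as follows (the argument values here are illustrative choices, not additional defaults):

library(toaster)

# single-word parser: lowercased, Porter2-stemmed tokens of at least 2 characters
unigrams <- token(1, ignoreCase = TRUE, stemming = TRUE, minLength = 2)

# 2-word parser: combinations joined with "+", basic punctuation stripped first
bigrams <- token(2, tokenSep = "+", punctuation = "[.,!?]")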

Arguments

n

number of words to combine into each emitted token

tokenSep

a character string to separate the tokens when n > 1

ignoreCase

logical: treat text as-is (FALSE) or convert it to all lowercase (TRUE). Default is FALSE. Note that if stemming is set to TRUE, tokens are always converted to lowercase, so this option is ignored.

delimiter

character or string that divides one word from the next. You can use a regular expression as the delimiter value.

punctuation

a regular expression that specifies the punctuation characters the parser removes before it evaluates the input text.

stemming

logical: if TRUE, apply Porter2 stemming to each token to reduce it to its root form. Default is FALSE.

stopWords

logical, or a character string naming a file that contains stop words, i.e. words to be ignored when parsing the text. Each stop word is specified on a separate line of the file.

sep

a character string to separate multiple text columns.

minLength

exclude tokens shorter than minLength characters.

Value

pluggable token parser
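
The resulting parser is meant to be passed on to a toaster text-parsing computation rather than used by itself. A minimal sketch, assuming a function such as computeTfIdf accepts a parser created by token (the connection, table, column, and argument names below are hypothetical, for illustration only):

library(toaster)
library(RODBC)

# hypothetical Aster ODBC connection and table; adjust names to your environment
conn <- odbcConnect("AsterDSN")
tfidf <- computeTfIdf(conn, "public.reviews",
                      docId = "review_id", textColumns = "review_text",
                      parser = token(2, tokenSep = "+"))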


