Tokenize (or split) text and emit n-word combinations from a document.


Description

When n = 1, the parser simply tokenizes the text and emits words with their counts. When n > 1, the tokenized words are combined into permutations of length n within each document.
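The n = 1 case described above can be approximated locally in base R. This is only a sketch of the idea (split on a delimiter, optionally lowercase, count the words). The helper name `sketch_token_counts` and the simplified delimiter `"[ \t]+"` are assumptions made for illustration; they are not part of the package and do not reproduce how the parser actually runs.

```r
# Hypothetical helper for illustration only; not part of the package.
# Splits one document into words and counts them (the n = 1 case).
sketch_token_counts <- function(text, delimiter = "[ \t]+",
                                ignoreCase = FALSE, minLength = 1) {
  # Optionally fold case before splitting
  if (ignoreCase) text <- tolower(text)
  # Split on the delimiter regular expression
  words <- strsplit(text, delimiter)[[1]]
  # Drop empty strings and tokens shorter than minLength
  words <- words[nchar(words) >= minLength]
  # Count occurrences and return a named integer vector
  tab <- table(words)
  setNames(as.integer(tab), names(tab))
}

counts <- sketch_token_counts("The cat saw the other cat", ignoreCase = TRUE)
print(counts)
```

With `ignoreCase = TRUE`, "The" and "the" are counted as the same word.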

Usage

token(n, tokenSep = "+", ignoreCase = FALSE,
  delimiter = "[ \\t\\b\\f\\r]+", punctuation = NULL,
  stemming = FALSE, stopWords = FALSE, sep = " ", minLength = 1)

Arguments

n

number of words in each emitted combination (n = 1 emits single words).

tokenSep

a character string to separate the tokens when n > 1

ignoreCase

logical: treat text as-is (FALSE) or convert to all lowercase (TRUE). Default is FALSE. Note that if stemming is set to TRUE, tokens will always be converted to lowercase, so this option will be ignored.

delimiter

character or string that divides one word from the next. You can use a regular expression as the delimiter value.

punctuation

a regular expression that specifies the punctuation characters the parser will remove before it evaluates the input text.

stemming

logical: if TRUE, apply Porter2 stemming to each token to reduce it to its root form. Default is FALSE.

stopWords

logical, or a string with the name of the file that contains stop words. If TRUE, stop words are ignored when parsing text. In the file, each stop word is specified on a separate line.

sep

a character string to separate multiple text columns.

minLength

exclude tokens shorter than minLength characters.
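To make the stop-word file format and the minLength filter concrete, here is a minimal base R sketch. The helper `sketch_filter_tokens` and the temporary file are hypothetical and exist only for this illustration. Where the package expects the stop-word file to be located, and whether it matches stop words before or after case conversion, is not stated here and is not assumed.

```r
# Hypothetical helper for illustration only; not part of the package.
# Removes stop words (read one per line from a file) and short tokens.
sketch_filter_tokens <- function(words, stopWordsFile, minLength = 1) {
  # Each line of the file holds exactly one stop word
  stop_words <- readLines(stopWordsFile)
  keep <- !(words %in% stop_words) & nchar(words) >= minLength
  words[keep]
}

# Write a small stop-word file, one word per line
stop_file <- tempfile(fileext = ".txt")
writeLines(c("the", "a", "of"), stop_file)

kept <- sketch_filter_tokens(
  c("the", "history", "of", "a", "tokenizer", "in", "r"),
  stop_file,
  minLength = 2
)
print(kept)
```

Here "the", "of", and "a" are dropped as stop words, and "r" is dropped because it is shorter than two characters.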

Value

pluggable token parser
