nGram: Tokenize (or split) text and emit multi-grams.


Description

Tokenize (or split) text and emit multi-grams.

Usage

nGram(n, ignoreCase = FALSE, delimiter = "[ \\t\\b\\f\\r]+",
  punctuation = NULL, overlapping = TRUE, reset = NULL, sep = " ",
  minLength = 1)

Arguments

n

length, in words, of each n-gram

ignoreCase

logical: if FALSE, n-gram matching is case sensitive; if TRUE, case is ignored during matching.

delimiter

character or string that divides one word from the next. You can use a regular expression as the delimiter value.

punctuation

a regular expression that specifies the punctuation characters the parser removes before it evaluates the input text.
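
To make the roles of ignoreCase, delimiter and punctuation concrete, the following base-R sketch mimics the preprocessing described above on a single string. It is only an illustration under stated assumptions and does not call the toaster or Aster parser: the sample text and the punctuation pattern are made up, and lower-casing is used as a stand-in for ignoreCase = TRUE.

```r
# Illustrative base-R sketch; not the toaster/Aster implementation.
text <- "Hello, world!\tHello again."

punctuation <- "[,.!]"             # assumed example value (the default is NULL)
delimiter   <- "[ \\t\\b\\f\\r]+"  # the documented default delimiter

# 1. Remove punctuation before the text is evaluated.
cleaned <- gsub(punctuation, "", text)

# 2. Fold case, standing in for ignoreCase = TRUE.
folded <- tolower(cleaned)

# 3. Split into words on the delimiter. perl = TRUE is used so that
#    \t, \b, \f and \r are interpreted inside the character class.
words <- strsplit(folded, delimiter, perl = TRUE)[[1]]

print(words)
```

With these inputs the two occurrences of "Hello" become the same token once case is folded and the punctuation is stripped.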

overlapping

logical: if TRUE (the default), overlapping n-grams are allowed.
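
The difference between overlapping and non-overlapping n-grams can be sketched in base R as follows. The helper make_ngrams is hypothetical and is not part of toaster; it assumes that with overlapping = FALSE each n-gram starts at the word after the end of the previous one, which is one plausible reading of the option rather than a confirmed description of the parser.

```r
# Hypothetical helper for illustration; not part of toaster.
make_ngrams <- function(words, n, overlapping = TRUE) {
  if (length(words) < n) return(character(0))
  # Overlapping: every word may start an n-gram (step of 1).
  # Non-overlapping (assumed): jump n words ahead after each n-gram.
  step <- if (overlapping) 1 else n
  starts <- seq(1, length(words) - n + 1, by = step)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

words <- c("the", "quick", "brown", "fox")

overlap    <- make_ngrams(words, 2)                       # 3 bigrams
no_overlap <- make_ngrams(words, 2, overlapping = FALSE)  # 2 bigrams

print(overlap)
print(no_overlap)
```

In the overlapping case "quick" appears in two bigrams; in the non-overlapping case every word is used at most once.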

reset

a regular expression listing one or more punctuation characters or strings, any of which the nGram parser will recognize as the end of a sentence of text. The end of each sentence resets the search for n-grams, meaning that nGram discards any partial n-grams and proceeds to the next sentence to search for the next n-gram. In other words, no n-gram can span two sentences.
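
The effect of reset can be sketched in base R by splitting the text into sentences first and building n-grams inside each sentence only. The helper sentence_ngrams, the sample text and the reset pattern "[.?!]" are assumptions made for illustration; the real parser runs in the database and may differ in detail.

```r
# Hypothetical helper for illustration; not part of toaster.
sentence_ngrams <- function(text, n, reset = "[.?!]") {
  # Each match of the reset pattern ends a sentence.
  sentences <- strsplit(text, reset)[[1]]
  unlist(lapply(sentences, function(s) {
    words <- strsplit(trimws(s), "[ \t]+")[[1]]
    words <- words[nzchar(words)]
    # A sentence with fewer than n words yields no n-gram at all.
    if (length(words) < n) return(character(0))
    vapply(seq_len(length(words) - n + 1),
           function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
}

bigrams <- sentence_ngrams("I like tea. You like coffee", 2)
print(bigrams)
```

Note that "tea You" is never produced: the period resets the search, so no bigram spans the two sentences.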

sep

a character string to separate multiple text columns.

minLength

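minimum length, in characters, of words in an n-gram. N-grams that contain words shorter than this limit are omitted. The current implementation is incomplete: it only filters out n-grams whose total length is below n*minLength + (n-1), that is, n words of minLength characters each plus n-1 separators. An n-gram that combines a word shorter than minLength with a sufficiently long word can therefore still be emitted.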
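
The arithmetic behind the minLength filter can be checked with a short base-R sketch. The sample bigrams are made up, and the comparison with a strict per-word filter is added only to show where the two criteria diverge; neither line is taken from the toaster source.

```r
# Illustrative values; not taken from the toaster source.
n         <- 2
minLength <- 3

# Total-length threshold: n words of minLength characters plus
# (n - 1) single-character separators.
threshold <- n * minLength + (n - 1)

ngrams <- c("a an", "to be", "a elephant", "big cat")

# Documented behaviour: drop n-grams whose total length is below the threshold.
kept_by_total <- ngrams[nchar(ngrams) >= threshold]

# Strict behaviour, for comparison: every word must reach minLength.
kept_strict <- ngrams[vapply(strsplit(ngrams, " "),
                             function(w) all(nchar(w) >= minLength),
                             logical(1))]

print(threshold)
print(kept_by_total)
print(kept_strict)
```

Here the threshold is 7 characters. "a elephant" passes the total-length check even though "a" is shorter than minLength, while the strict per-word check keeps only "big cat".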

Value

pluggable n-gram parser


teradata-aster-field/toaster documentation built on May 31, 2019, 8:36 a.m.