nGram: Tokenize (or split) text and emit multi-grams.
In teradata-aster-field/toaster: Big Data in-Database Analytics that Scales with Teradata Aster Distributed Platform

Tokenize (or split) text and emit multi-grams.


nGram(n, ignoreCase = FALSE, delimiter = "[ \\t\\b\\f\\r]+",
  punctuation = NULL, overlapping = TRUE, reset = NULL, sep = " ",
  minLength = 1)

`n`	length, in words, of each n-gram
`ignoreCase`	logical: if FALSE, n-gram matching is case sensitive; if TRUE, case is ignored during matching.
`delimiter`	character or string that divides one word from the next. You can use a regular expression as the `delimiter` value.
`punctuation`	a regular expression that specifies the punctuation characters the parser removes before it evaluates the input text.
`overlapping`	logical: if TRUE, overlapping n-grams are allowed.
`reset`	a regular expression listing one or more punctuation characters or strings, any of which the `nGram` parser will recognize as the end of a sentence of text. The end of each sentence resets the search for n-grams, meaning that `nGram` discards any partial n-grams and proceeds to the next sentence to search for the next n-gram. In other words, no n-gram can span two sentences.
`sep`	a character string to separate multiple text columns.
`minLength`	minimum length of the words in an n-gram. N-grams that contain words shorter than this limit are omitted. The current implementation is incomplete: it filters on the total length of the n-gram, omitting those shorter than n*minLength + (n-1), so it reliably removes only n-grams in which every word is below the minimum length.
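
The argument semantics above can be tried out locally with a small base-R sketch. It is not the toaster implementation, which runs in the database. `sketch_ngrams` is a hypothetical helper, and it makes several assumptions: `ignoreCase` is modeled by lower-casing, non-overlapping n-grams are taken in steps of n words, and `minLength` is applied as the total-length threshold described for the current implementation.

```r
# Hypothetical helper, not part of toaster: emulates the documented
# argument semantics of nGram on a single string, in local R.
sketch_ngrams <- function(text, n, ignoreCase = FALSE,
                          delimiter = "[ \t\b\f\r]+", punctuation = NULL,
                          overlapping = TRUE, reset = NULL, minLength = 1) {
  # Assumption: ignoring case is modeled by lower-casing the input.
  if (ignoreCase) text <- tolower(text)
  # `reset` splits the text into sentences; no n-gram spans two of them.
  sentences <- if (is.null(reset)) text else strsplit(text, reset)[[1]]
  out <- character(0)
  for (s in sentences) {
    # `punctuation` is removed before the words are evaluated.
    if (!is.null(punctuation)) s <- gsub(punctuation, "", s)
    words <- strsplit(trimws(s), delimiter)[[1]]
    words <- words[nzchar(words)]
    if (length(words) < n) next  # partial n-grams are discarded
    # Overlapping: every word starts an n-gram if enough words follow.
    # Assumption: non-overlapping n-grams advance n words at a time.
    step <- if (overlapping) 1 else n
    starts <- seq(1, length(words) - n + 1, by = step)
    grams <- vapply(starts, function(i) {
      paste(words[i:(i + n - 1)], collapse = " ")
    }, character(1))
    out <- c(out, grams)
  }
  # Total-length filter as documented for `minLength`.
  out[nchar(out) >= n * minLength + (n - 1)]
}

# With `reset`, no bigram crosses the sentence boundary.
sketch_ngrams("the cat sat. the dog ran", 2, reset = "[.?!]+")
# "the cat" "cat sat" "the dog" "dog ran"

# The total-length filter keeps "a sentence" although "a" is shorter than 3.
sketch_ngrams("is it a sentence", 2, minLength = 3)
# "a sentence"
```

The last call shows why the `minLength` filter is described as incomplete: the threshold for n = 2 and minLength = 3 is 2*3 + 1 = 7 characters, which "a sentence" exceeds even though one of its words is below the minimum.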

pluggable n-gram parser

teradata-aster-field/toaster documentation built on May 31, 2019, 8:36 a.m.

