tokenize-asweka: Weka-like n-gram Tokenization

Tokenize-AsWeka    R Documentation

Description

An n-gram tokenizer with identical output to the NGramTokenizer function from the RWeka package.

Usage

ngram_asweka(str, min = 2, max = 2, sep = " ")

Arguments

str

The input text.

min, max

The minimum and maximum values of 'n' (as in 'n-gram') to generate.

sep

A set of separator characters for the "words". See Details for information about how this works; it behaves a little differently from the sep arguments of other R functions.

Details

This n-gram tokenizer behaves similarly to the tokenizer in RWeka, in both its input and its return value. Unlike the tokenizer ngram(), the return value is not a special class of external pointers; it is a vector, and can therefore be serialized via save() or saveRDS().
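
Because the result is an ordinary vector, it should survive a round trip through saveRDS() and readRDS(). The following is a minimal sketch of that round trip; the temporary file path and the variable names are chosen here for illustration only.

```r
library(ngram)

tokens <- ngram_asweka("A B A C A B B", min = 2, max = 4)

# Write the vector to a temporary file and read it back.
path <- tempfile(fileext = ".rds")
saveRDS(tokens, path)
restored <- readRDS(path)

# The restored vector should be identical to the original.
identical(tokens, restored)

# Clean up the temporary file.
unlink(path)
```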

Value

A vector of n-grams, grouped into blocks of decreasing n (from max down to min); within each block, the n-grams appear in the order in which they occur in the input. The output matches that of RWeka's n-gram tokenizer.
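
The ordering can be illustrated with a small base R sketch. This is not the package's implementation: weka_order_sketch() is a hypothetical helper written for this illustration, and it assumes that min is no larger than max and that words are separated by exactly one space.

```r
# Illustrative only: a base R sketch of the block ordering,
# not the implementation used by the ngram package.
weka_order_sketch <- function(str, min = 2, max = 2) {
  # Assumes words are separated by exactly one space.
  words <- strsplit(str, " ", fixed = TRUE)[[1]]
  out <- character(0)
  # Largest n first, so the blocks appear in decreasing n.
  for (n in seq(max, min, by = -1)) {
    # Skip any n that is longer than the input.
    if (n > length(words)) next
    starts <- seq_len(length(words) - n + 1)
    # Within a block, n-grams follow their order in the input.
    block <- vapply(
      starts,
      function(i) paste(words[i:(i + n - 1)], collapse = " "),
      character(1)
    )
    out <- c(out, block)
  }
  out
}

# Yields 15 n-grams: four 4-grams, then five 3-grams, then six 2-grams.
weka_order_sketch("A B A C A B B", min = 2, max = 4)
```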

See Also

ngram

Examples

library(ngram)

str = "A B A C A B B"
ngram_asweka(str, min=2, max=4)


ngram documentation built on Nov. 1, 2022, 1:06 a.m.