ngram: n-gram Tokenization

View source: R/01-constructor.r

ngram R Documentation

n-gram Tokenization

Description

The ngram() function is the main workhorse of this package. It takes an input string and converts it into the internal n-gram representation.

Usage

ngram(str, n = 2, sep = " ")

Arguments

str

The input text.

n

The 'n' as in 'n-gram', i.e. the number of words in each n-gram. Defaults to 2.

sep

A set of separator characters for the "words". See Details for how this works; it behaves a little differently from the sep arguments of other R functions.

Details

On evaluation, a copy of the input string is produced and stored as an external pointer. This is necessary because the internal list representation just points to the first char of each word in the input string. So if you (or R's gc) delete the input string, those pointers are left dangling and basically all hell breaks loose.

The input is split at any of the characters in sep, not at sep taken as a whole string. So sep=", " splits at a comma or a space.
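The splitting rule for sep can be sketched in base R. This is only an illustration of the behaviour described above, not the package's implementation: sketch_ngrams() is a hypothetical helper, it returns a plain character vector rather than an ngram class object, and it assumes that runs of adjacent separators produce no empty words and that the words of an n-gram are joined by a single space.

```r
# Hypothetical helper for illustration only; not part of the ngram package.
sketch_ngrams <- function(str, n = 2, sep = " ") {
  # Each character of `sep` is a separator in its own right.
  sep_chars <- strsplit(sep, "", fixed = TRUE)[[1]]
  chars <- strsplit(str, "", fixed = TRUE)[[1]]
  is_sep <- chars %in% sep_chars

  # Characters between separators share a group id and form one word.
  # Assumption: runs of adjacent separators yield no empty words.
  word_id <- cumsum(is_sep)
  keep <- !is_sep
  if (!any(keep)) return(character(0))
  words <- as.vector(tapply(chars[keep], word_id[keep], paste, collapse = ""))

  # Slide a window of n words over the word sequence.
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

str <- "A,B,A,C A B B"
sketch_ngrams(str, n = 2, sep = " ")   # split at a space only
sketch_ngrams(str, n = 2, sep = ",")   # split at a comma only
sketch_ngrams(str, n = 2, sep = ", ")  # split at a comma or a space
```

With sep=",", the trailing "C A B B" stays together as one word, because the space is no longer a separator. With sep=", ", every letter becomes its own word.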

Value

An ngram class object.

See Also

ngram-class, getters, phrasetable, babble

Examples

library(ngram)

str = "A B A C A B B"
ngram(str, n=2)

str = "A,B,A,C A B B"
### Split at a space
print(ngram(str), output="full")
### Split at a comma
print(ngram(str, sep=","), output="full")
### Split at a space or a comma
print(ngram(str, sep=", "), output="full")


wrathematics/ngram documentation built on Jan. 28, 2024, 12:14 p.m.