ngramrr: General purpose n-gram tokenizer


Description

A non-Java-based n-gram tokenizer for use with the tm package. Supports both character and word n-grams.

Usage

ngramrr(x, char = FALSE, ngmin = 1, ngmax = 2, rmEOL = TRUE)

Arguments

x

input string.

char

logical; if TRUE, generate character n-grams. char = FALSE (the default) generates word n-grams.

ngmin

integer, minimum order of n-gram

ngmax

integer, maximum order of n-gram

rmEOL

logical, remove n-grams with EOL character

Value

vector of n-grams
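
To illustrate what "n-grams of order ngmin to ngmax" means, here is a minimal base-R sketch. The helper name sketch_ngrams is hypothetical and is not part of ngramrr; it only mirrors the char, ngmin and ngmax arguments, ignores rmEOL, and makes no claim about how ngramrr is implemented or how it orders or formats its output.

```r
# Hypothetical helper for illustration only; NOT part of ngramrr.
# Splits a single string into words (or characters) and joins every run of
# n consecutive tokens, for each n from ngmin to ngmax.
sketch_ngrams <- function(x, char = FALSE, ngmin = 1, ngmax = 2) {
  stopifnot(length(x) == 1, ngmin >= 1, ngmin <= ngmax)
  if (char) {
    tokens <- strsplit(x, "")[[1]]       # one token per character
    sep <- ""
  } else {
    tokens <- strsplit(x, "\\s+")[[1]]   # split on runs of whitespace
    tokens <- tokens[nzchar(tokens)]     # drop empty leading token, if any
    sep <- " "
  }
  out <- character(0)
  for (n in ngmin:ngmax) {
    if (n > length(tokens)) break        # not enough tokens for this order
    starts <- seq_len(length(tokens) - n + 1)
    grams <- vapply(
      starts,
      function(i) paste(tokens[i:(i + n - 1)], collapse = sep),
      character(1)
    )
    out <- c(out, grams)
  }
  out
}

sketch_ngrams("how low", ngmax = 2)
# "how" "low" "how low"
```

The sketch returns the unigrams first, then the bigrams, and so on. The actual ngramrr output may differ in ordering and in how whitespace or EOL characters are represented.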

Examples

require(tm)

nirvana <- c("hello hello hello how low", "hello hello hello how low",
"hello hello hello how low", "hello hello hello",
"with the lights out", "it's less dangerous", "here we are now", "entertain us",
"i feel stupid", "and contagious", "here we are now", "entertain us",
"a mulatto", "an albino", "a mosquito", "my libido", "yeah", "hey yay")

ngramrr(nirvana[1], ngmax = 3)
ngramrr(nirvana[1], ngmax = 3, char = TRUE)
nirvanacor <- Corpus(VectorSource(nirvana))
TermDocumentMatrix(nirvanacor, control = list(tokenize = function(x) ngramrr(x, ngmax = 3)))

# Character n-gram

TermDocumentMatrix(nirvanacor, control = list(
  tokenize = function(x) ngramrr(x, char = TRUE, ngmax = 3),
  wordLengths = c(1, Inf)))

chainsawriot/ngramrr documentation built on May 13, 2019, 3:11 p.m.