ngramrr: General purpose n-gram tokenizer
In ngramrr: A Simple General Purpose N-Gram Tokenizer


A non-Java-based n-gram tokenizer for use with the tm package. Supports both character and word n-grams.

ngramrr(x, char = FALSE, ngmin = 1, ngmax = 2, rmEOL = TRUE)

`x`	input string.
`char`	logical; if TRUE, produce character n-grams. char = FALSE (the default) produces word n-grams.
`ngmin`	integer, minimum order of n-gram.
`ngmax`	integer, maximum order of n-gram.
`rmEOL`	logical; remove n-grams containing an EOL character.

A character vector of n-grams.
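
To make the roles of `ngmin` and `ngmax` concrete, here is a minimal base-R sketch of word n-gram generation. The function `word_ngrams_sketch` is a hypothetical helper written for illustration; it is not part of ngramrr and is not claimed to match the package's internals. It assumes tokens are separated by whitespace.

```r
# Hypothetical helper, not part of ngramrr: word n-grams in base R.
word_ngrams_sketch <- function(x, ngmin = 1, ngmax = 2) {
  stopifnot(ngmin >= 1, ngmin <= ngmax)
  # Split on runs of whitespace and drop empty tokens.
  words <- strsplit(x, "\\s+")[[1]]
  words <- words[nzchar(words)]
  out <- character(0)
  for (n in seq(ngmin, ngmax)) {
    # No n-gram of order n exists if the input has fewer than n words.
    if (n > length(words)) break
    # Each n-gram starts at position i and spans n consecutive words.
    starts <- seq_len(length(words) - n + 1)
    grams <- vapply(
      starts,
      function(i) paste(words[i:(i + n - 1)], collapse = " "),
      character(1)
    )
    out <- c(out, grams)
  }
  out
}

word_ngrams_sketch("hello hello hello how low", ngmax = 3)
```

For this input the sketch returns 5 unigrams, 4 bigrams and 3 trigrams, in that order.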

library(ngramrr)
require(tm)

nirvana <- c("hello hello hello how low", "hello hello hello how low",
"hello hello hello how low", "hello hello hello",
"with the lights out", "it's less dangerous", "here we are now", "entertain us",
"i feel stupid", "and contagious", "here we are now", "entertain us",
"a mulatto", "an albino", "a mosquito", "my libido", "yeah", "hey yay")

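# Word n-grams of order 1 to 3 from the first line
ngramrr(nirvana[1], ngmax = 3)
# Character n-grams of order 1 to 3 from the same line
ngramrr(nirvana[1], ngmax = 3, char = TRUE)

# Use ngramrr as the tokenizer for a tm term-document matrix (word n-grams)
nirvanacor <- Corpus(VectorSource(nirvana))
TermDocumentMatrix(nirvanacor, control = list(tokenize = function(x) ngramrr(x, ngmax = 3)))

# Character n-grams; wordLengths = c(1, Inf) keeps terms shorter than
# tm's default minimum length of 3 characters
TermDocumentMatrix(nirvanacor, control = list(
  tokenize = function(x) ngramrr(x, char = TRUE, ngmax = 3),
  wordLengths = c(1, Inf)))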

Loading required package: tm
Loading required package: NLP
 [1] "hello"             "hello"             "hello"            
 [4] "how"               "low"               "hello hello"      
 [7] "hello hello"       "hello how"         "how low"          
[10] "hello hello hello" "hello hello how"   "hello how low"    
 [1] "e"   "e"   "e"   "el"  "el"  "el"  "ell" "ell" "ell" "h"   "h"   "h"  
[13] "h"   "he"  "he"  "he"  "hel" "hel" "hel" "ho"  "how" "l"   "l"   "l"  
[25] "l"   "l"   "l"   "l"   "ll"  "ll"  "ll"  "llo" "llo" "llo" "lo"  "lo" 
[37] "lo"  "lo"  "low" "o"   "o"   "o"   "o"   "o"   "ow"  "ow"  "w"   "w"  
<<TermDocumentMatrix (terms: 24, documents: 18)>>
Non-/sparse entries: 35/397
Sparsity           : 92%
Maximal term length: 10
Weighting          : term frequency (tf)
<<TermDocumentMatrix (terms: 32, documents: 18)>>
Non-/sparse entries: 46/530
Sparsity           : 92%
Maximal term length: 10
Weighting          : term frequency (tf)
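
The character-level output above contains no n-gram that includes a space. The base-R sketch below reproduces the same set of n-grams under that assumption. `char_ngrams_sketch` is a hypothetical helper for illustration only; dropping whitespace-containing n-grams is inferred from the example output rather than from the package source, and the order of the returned vector may differ from what ngramrr returns.

```r
# Hypothetical helper, not part of ngramrr: character n-grams in base R.
char_ngrams_sketch <- function(x, ngmin = 1, ngmax = 2) {
  stopifnot(ngmin >= 1, ngmin <= ngmax)
  # Split the string into single characters.
  chars <- strsplit(x, "")[[1]]
  out <- character(0)
  for (n in seq(ngmin, ngmax)) {
    if (n > length(chars)) break
    # Slide a window of n characters across the string.
    starts <- seq_len(length(chars) - n + 1)
    grams <- vapply(
      starts,
      function(i) paste(chars[i:(i + n - 1)], collapse = ""),
      character(1)
    )
    out <- c(out, grams)
  }
  # Assumption: n-grams containing whitespace are discarded.
  out[!grepl("\\s", out)]
}

# Tabulate the n-grams to see how often each one occurs.
table(char_ngrams_sketch("hello hello hello how low", ngmax = 3))
```

Tabulating the result gives the same counts as the printed vector above, for example seven occurrences of "l" and four of "lo".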