ngram_tokenize: N-gram tokenizer

View source: R/ngrams.R

ngram_tokenize {SentimentAnalysis}  R Documentation

N-gram tokenizer

Description

A tokenizer for use with a document-term matrix from the tm package. Supports both character and word n-grams, and includes its own wrapper to handle non-Latin encodings.

Usage

ngram_tokenize(x, char = FALSE, ngmin = 1, ngmax = 3)

Arguments

x

input string

char

logical value specifying whether to use character n-grams (char = TRUE) or word n-grams (char = FALSE, default)

ngmin

integer giving the minimum order of the n-grams (default: 1)

ngmax

integer giving the maximum order of the n-grams (default: 3)

Examples

library(tm)
en <- c("Romeo loves Juliet", "Romeo loves a girl")
en.corpus <- VCorpus(VectorSource(en))
# Character trigrams only (ngmin = ngmax = 3)
tdm <- TermDocumentMatrix(en.corpus,
                          control = list(wordLengths = c(1, Inf),
                                         tokenize = function(x) ngram_tokenize(x, char = TRUE,
                                                                               ngmin = 3, ngmax = 3)))
inspect(tdm)

ch <- c("abab", "aabb")
ch.corpus <- VCorpus(VectorSource(ch))
# Character unigrams and bigrams (ngmin = 1, ngmax = 2)
tdm <- TermDocumentMatrix(ch.corpus,
                          control = list(wordLengths = c(1, Inf),
                                         tokenize = function(x) ngram_tokenize(x, char = TRUE,
                                                                               ngmin = 1, ngmax = 2)))
inspect(tdm)
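Both examples above use character n-grams. A sketch of the word n-gram case (the default, char = FALSE), assuming the SentimentAnalysis package is loaded so that ngram_tokenize is available:

```r
library(tm)
en <- c("Romeo loves Juliet", "Romeo loves a girl")
en.corpus <- VCorpus(VectorSource(en))
# Word unigrams and bigrams, e.g. "romeo" and "romeo loves"
tdm <- TermDocumentMatrix(en.corpus,
                          control = list(wordLengths = c(1, Inf),
                                         tokenize = function(x) ngram_tokenize(x, char = FALSE,
                                                                               ngmin = 1, ngmax = 2)))
inspect(tdm)
```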

SentimentAnalysis documentation built on Aug. 24, 2023, 1:07 a.m.