tokenize: Tokenize a character vector


View source: R/tokenize.R

Description

Parse the elements of a character vector into a list of cleaned tokens.

Usage

tokenize(text, removePunc = TRUE, removeNum = TRUE, toLower = TRUE,
  stemWords = TRUE, rmStopWords = TRUE)

Arguments

text

The character vector to be tokenized

removePunc

TRUE or FALSE indicating whether or not to remove punctuation from text. If TRUE, punctuation will be removed. Defaults to TRUE.

removeNum

TRUE or FALSE indicating whether or not to remove numbers from text. If TRUE, numbers will be removed. Defaults to TRUE.

toLower

TRUE or FALSE indicating whether or not to coerce all of text to lowercase. If TRUE, text will be coerced to lowercase. Defaults to TRUE.

stemWords

TRUE or FALSE indicating whether or not to stem the resulting tokens. If TRUE, the output tokens will be stemmed using SnowballC::wordStem(). Defaults to TRUE.

rmStopWords

TRUE, FALSE, or character vector of stopwords to remove. If TRUE, words in lexRankr::smart_stopwords will be removed prior to stemming. If FALSE, no stopword removal will occur. If a character vector is passed, this vector will be used as the list of stopwords to be removed. Defaults to TRUE.
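As a sketch of the custom-stopword behavior described above, a hypothetical call passing a character vector to rmStopWords might look like the following (the specific stopwords and input sentence are illustrative, not from the package's documented examples):

```r
library(lexRankr)

# Pass a custom stopword vector instead of TRUE/FALSE;
# only the supplied words are removed before any stemming.
tokenize("The test is on Saturday",
         rmStopWords = c("the", "is", "on"),
         stemWords   = FALSE)
```

With rmStopWords = FALSE the same call would retain every token, while rmStopWords = TRUE would instead filter against lexRankr::smart_stopwords.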

Examples

tokenize("Mr. Feeny said the test would be on Sat. At least I'm 99.9% sure that's what he said.")
tokenize("Bill is trying to earn a Ph.D. in his field.", rmStopWords=FALSE)

AdamSpannbauer/lexRankr documentation built on Feb. 4, 2018, 12:12 p.m.