tokenizers: Split texts into tokens

tokenizers {textreuse}	R Documentation

Split texts into tokens

Description

These functions each turn a text into tokens. The tokenize_ngrams function returns shingled (overlapping) n-grams.
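"Shingled" here means the n-grams overlap: each n-gram begins one word after the previous one, so a five-word text yields three trigrams. A minimal sketch of the expected behaviour (the input string below is invented for illustration):

tokenize_ngrams("one two three four five", n = 3)
# roughly: "one two three" "two three four" "three four five"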

Usage

tokenize_words(string, lowercase = TRUE)

tokenize_sentences(string, lowercase = TRUE)

tokenize_ngrams(string, lowercase = TRUE, n = 3)

tokenize_skip_ngrams(string, lowercase = TRUE, n = 3, k = 1)

Arguments

string

A character vector of length 1 to be tokenized.

lowercase

Should the tokens be made lower case?

n

For n-gram tokenizers, the number of words in each n-gram.

k

For the skip n-gram tokenizer, the maximum skip distance between words. The function computes all skip n-grams with skip distances from 0 to k.
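For instance, with n = 2 and k = 1 the tokenizer produces the adjacent bigrams (skip distance 0) together with bigrams that jump over one intervening word (skip distance 1). A minimal sketch (input invented; the exact ordering of the output may vary):

tokenize_skip_ngrams("one two three", n = 2, k = 1)
# expected to include "one two" and "two three" (skip 0) plus "one three" (skip 1)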

Details

These functions will strip all punctuation.
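Tokens therefore never contain punctuation marks, and with lowercase = TRUE they are also folded to lower case. A minimal sketch (input invented for illustration):

tokenize_words("Stop! Look. Listen?")
# expected: "stop" "look" "listen"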

Value

A character vector containing the tokens.

Examples

dylan <- "How many roads must a man walk down? The answer is blowin' in the wind."
tokenize_words(dylan)                       # individual lowercased words
tokenize_sentences(dylan)                   # one token per sentence
tokenize_ngrams(dylan, n = 2)               # overlapping two-word shingles
tokenize_skip_ngrams(dylan, n = 3, k = 2)   # trigrams with skip distances 0 to 2
