tokenizers: Split texts into tokens


Description

These functions each turn a text into tokens. The tokenize_ngrams function returns shingled n-grams.
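
Shingled n-grams overlap so that each n-gram shares n - 1 words with the next. A minimal sketch of what this looks like (the printed tokens are illustrative, using the default settings):

tokenize_ngrams("How many roads must a man", n = 2)
## "how many" "many roads" "roads must" "must a" "a man"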

Usage

tokenize_words(string, lowercase = TRUE)

tokenize_sentences(string, lowercase = TRUE)

tokenize_ngrams(string, lowercase = TRUE, n = 3)

tokenize_skip_ngrams(string, lowercase = TRUE, n = 3, k = 1)

Arguments

string

A character vector of length 1 to be tokenized.

lowercase

Should the tokens be made lower case?

n

For n-gram tokenizers, the number of words in each n-gram.

k

For the skip n-gram tokenizer, the maximum skip distance between words. The function computes all skip n-grams with skip distances from 0 to k; see the sketch below.
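
To make k concrete, here is a hedged sketch of skip bigrams (the exact order of the returned tokens may differ):

tokenize_skip_ngrams("one two three four", n = 2, k = 1)
## skip distance 0 (adjacent words): "one two" "two three" "three four"
## skip distance 1 (one word apart): "one three" "two four"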

Details

These functions will strip all punctuation.
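
For instance, both sentence-ending and internal punctuation disappear from the tokens; the output shown here is illustrative:

tokenize_words("Stop! Look, listen.")
## "stop" "look" "listen"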

Value

A character vector containing the tokens.

Examples

library(textreuse)

dylan <- "How many roads must a man walk down? The answer is blowin' in the wind."
tokenize_words(dylan)                      # one token per word
tokenize_sentences(dylan)                  # one token per sentence
tokenize_ngrams(dylan, n = 2)              # shingled two-word n-grams
tokenize_skip_ngrams(dylan, n = 3, k = 2)  # trigrams with skip distances 0 to 2
