split_text: Split Text into Byte-Limited Segments

View source: R/split_texts.R

split_text {deeplr}    R Documentation

Split Text into Byte-Limited Segments

Description

split_text divides input text into smaller segments that do not exceed a specified maximum size in bytes. Segmentation is based on sentence or word boundaries.

Usage

split_text(text, max_size_bytes = 29000, tokenize = "sentences")

Arguments

text

A character vector containing the text(s) to be split.

max_size_bytes

An integer specifying the maximum size (in bytes) of each segment. Defaults to 29000.

tokenize

A string indicating the level of tokenization. Must be either "sentences" (the default) or "words".

Details

This function uses tokenizers::tokenize_sentences (or tokenizers::tokenize_words when tokenize = "words") to split the text into natural-language units before assembling byte-limited blocks.
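The assembly step can be pictured as a greedy loop: append tokens to the current segment until adding the next one would exceed the byte limit, then start a new segment. The sketch below is an illustration of that idea, not the package's actual implementation; the helper name assemble_segments is invented.

```r
# Minimal sketch (assumed, not deeplr's real code) of greedy byte-limited
# assembly: tokens are concatenated until the byte limit would be exceeded.
assemble_segments <- function(tokens, max_size_bytes) {
  segments <- character(0)
  current <- ""
  for (tok in tokens) {
    candidate <- if (nchar(current) == 0) tok else paste(current, tok)
    if (nchar(candidate, type = "bytes") <= max_size_bytes) {
      current <- candidate
    } else {
      if (nchar(current) > 0) segments <- c(segments, current)
      # A single token larger than the limit becomes its own segment here.
      current <- tok
    }
  }
  if (nchar(current) > 0) segments <- c(segments, current)
  segments
}
```

Note the use of nchar(x, type = "bytes"): for multi-byte encodings such as UTF-8, the byte length of a string can exceed its character count, which is why the limit is enforced in bytes.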

Value

A tibble with one row per text segment, containing the following columns:

  • text_id: The index of the original text in the input vector.

  • segment_id: A sequential ID identifying the segment number.

  • segment_text: The resulting text segment, each within the specified byte limit.
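Because text_id and segment_id together identify each segment's origin and order, the original texts can be reassembled from the returned tibble. A minimal base-R illustration on a mock of the structure described above (the data values are invented):

```r
# Mock of the returned structure (column names from the Value section;
# the row data are invented for illustration)
result <- data.frame(
  text_id      = c(1L, 1L, 2L),
  segment_id   = c(1L, 2L, 1L),
  segment_text = c("First part.", "Second part.", "Other text.")
)

# Reassemble each original text by concatenating its segments in order
ord <- order(result$text_id, result$segment_id)
reassembled <- tapply(result$segment_text[ord], result$text_id[ord],
                      paste, collapse = " ")
```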

Examples

## Not run: 
long_text <- paste0(rep("This is a very long text. ", 10000), collapse = "")
split_text(long_text, max_size_bytes = 1000, tokenize = "sentences")

## End(Not run)


deeplr documentation built on June 8, 2025, 12:47 p.m.