split_text

Description

Divides input text into smaller segments that do not exceed a specified maximum size in bytes. Segmentation is based on sentence or word boundaries.
Usage

split_text(text, max_size_bytes = 29000, tokenize = "sentences")
Arguments

text
    A character vector containing the text(s) to be split.

max_size_bytes
    An integer specifying the maximum size (in bytes) for each segment.

tokenize
    A string indicating the level of tokenization. Must be either "sentences" or "words".
Details

This function uses tokenizers::tokenize_sentences (or tokenizers::tokenize_words if tokenize = "words") to split the text into natural language segments before assembling byte-limited blocks.
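The sketch below illustrates that general approach, not the package's actual implementation: sentences are packed greedily into blocks whose byte size stays under the limit (a single sentence longer than the limit is left as its own block). The helper name pack_sentences is hypothetical.

library(tokenizers)

# Hypothetical helper (not the package's code): greedily pack sentences
# into blocks that stay under max_size_bytes.
pack_sentences <- function(text, max_size_bytes = 29000) {
  sentences <- tokenize_sentences(text)[[1]]
  blocks <- character(0)
  current <- ""
  for (s in sentences) {
    candidate <- if (nzchar(current)) paste(current, s) else s
    if (nchar(candidate, type = "bytes") > max_size_bytes && nzchar(current)) {
      # The current block is full; close it and start a new one.
      blocks <- c(blocks, current)
      current <- s
    } else {
      current <- candidate
    }
  }
  if (nzchar(current)) blocks <- c(blocks, current)
  blocks
}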
Value

A tibble with one row per text segment, containing the following columns:

text_id
    The index of the original text in the input vector.

segment_id
    A sequential ID identifying the segment number.

segment_text
    The resulting text segment, each within the specified byte limit.
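As an illustration of working with this structure (hypothetical input and output, not taken from the package documentation), the segments belonging to each original text can be regrouped by text_id:

res <- split_text(c("First sentence. Second sentence.", "Another text."),
                  max_size_bytes = 20)
# One character vector of segments per original input text, keyed by text_id
split(res$segment_text, res$text_id)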
Examples

## Not run:
long_text <- paste0(rep("This is a very long text. ", 10000), collapse = "")
split_text(long_text, max_size_bytes = 1000, tokenize = "sentences")
## End(Not run)
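A further illustrative call (not part of the original examples) uses word-level tokenization and checks that every segment respects the byte limit:

## Not run:
segments <- split_text(long_text, max_size_bytes = 1000, tokenize = "words")
# Verify that no segment exceeds the requested byte limit
all(nchar(segments$segment_text, type = "bytes") <= 1000)
## End(Not run)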