split_text: Split texts into segments

View source: R/split_texts.R

split_text    R Documentation

Split texts into segments

Description

split_text splits texts into segments of at most a specified number of bytes.

Usage

split_text(text, max_size_bytes = 29000, tokenize = "sentences")

Arguments

text

character vector to be split.

max_size_bytes

maximum size of a single text segment in bytes.

tokenize

level of tokenization. Either "sentences" or "words".

Details

The function uses tokenizers::tokenize_sentences to split texts.
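
As a rough illustration of this approach, the helper below is a hypothetical sketch (not deeplr's internal code): it accumulates the sentences produced by tokenizers::tokenize_sentences into a segment until adding the next sentence would exceed the byte limit, using nchar(type = "bytes") to measure segment size.

# Hypothetical sketch of sentence-based splitting, not deeplr's implementation.
# Sentences are accumulated until the next one would push the segment past the
# byte limit; a single sentence longer than the limit is kept as-is here.
library(tokenizers)

split_by_sentences <- function(text, max_size_bytes = 29000) {
  sentences <- tokenize_sentences(text, simplify = TRUE)
  segments <- character(0)
  current <- ""
  for (s in sentences) {
    candidate <- if (nchar(current) == 0) s else paste(current, s)
    if (nchar(candidate, type = "bytes") > max_size_bytes && nchar(current) > 0) {
      segments <- c(segments, current)
      current <- s
    } else {
      current <- candidate
    }
  }
  c(segments, current)
}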

Value

Returns a tibble with the following columns (a short reassembly sketch follows the list):

  • text_id: position of the text in the character vector.

  • segment_id: ID of a text segment.

  • segment_text: text segment that is smaller than max_size_bytes.
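
These columns make it straightforward to reassemble processed segments per input text. The sketch below is a hypothetical post-processing step (it assumes a dplyr workflow and is not part of deeplr) that glues the segments of each text back together in order.

# Hypothetical post-processing sketch, not part of deeplr: rebuild each
# original text from its segments, preserving segment order.
library(dplyr)

reassemble <- function(segments_tbl) {
  segments_tbl %>%
    arrange(text_id, segment_id) %>%
    group_by(text_id) %>%
    summarise(text = paste(segment_text, collapse = " "), .groups = "drop")
}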

Examples

## Not run: 
# Split long text
text <- paste0(rep("This is a very long text.", 10000), collapse = " ")
split_text(text)
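
# Split on word boundaries with a smaller segment size (arguments as
# documented under Usage; the value 1000 here is only illustrative)
split_text(text, max_size_bytes = 1000, tokenize = "words")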

## End(Not run)

