corpus_chunk (R Documentation)

Description

Segment a corpus into new documents of roughly equal-sized text chunks, with the possibility of overlapping the chunks.

Usage
corpus_chunk(
  x,
  size,
  truncate = FALSE,
  use_docvars = TRUE,
  verbose = quanteda_options("verbose")
)
Arguments

x: corpus object whose texts will be segmented into chunks

size: integer; the (approximate) token length of the chunks. See Details.

truncate: logical; if TRUE, truncate each text after approximately size tokens rather than splitting the remainder into further chunks

use_docvars: if TRUE, repeat the docvar values for each chunk; if FALSE, drop the docvars from the chunked documents

verbose: logical; if TRUE, print messages during processing
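For instance, the effect of use_docvars can be seen with the built-in inaugural corpus: with the default use_docvars = TRUE, each chunk carries the document variables (Year, President, and so on) of the document it came from. A minimal sketch, assuming default settings:

library(quanteda)

# chunk the first two inaugural addresses into pieces of roughly 500 tokens
chunks <- corpus_chunk(data_corpus_inaugural[1:2], size = 500)

# the original docvars (Year, President, ...) are repeated for each chunk
docvars(chunks)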
Details

The token length is estimated using stringi::stri_length(txt) / stringi::stri_count_boundaries(txt), to avoid having to tokenize the corpus and then rejoin the texts from the tokens.
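As an illustration, the quantities in this estimate can be computed directly with stringi; the "word" boundary type below is an assumption for illustration (see stringi::stri_opts_brkiter() for the available types):

txt <- "We hold these truths to be self-evident."

# total number of characters in the text
stringi::stri_length(txt)

# number of boundary segments (words, spaces, and punctuation)
stringi::stri_count_boundaries(txt, type = "word")

# average characters per segment, i.e. the estimated token length
stringi::stri_length(txt) / stringi::stri_count_boundaries(txt, type = "word")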
Note that when chunking texts before sending them to large language models (LLMs) with limited input token lengths, size should typically be set to approximately 0.75-0.80 of the LLM's token limit. This is because LLM tokenizers (such as LLaMA's SentencePiece Byte-Pair Encoding tokenizer) generally produce more tokens than the linguistically defined, grammar-based tokenizer that is the quanteda default. Note also that because stringi::stri_count_boundaries(txt) is used to approximate token length (efficiently), the token length used for chunking will itself be approximate.
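Following that guideline, a chunk size for a given model can be derived before chunking; the 4096-token context window here is purely illustrative:

library(quanteda)

llm_token_limit <- 4096                      # hypothetical context window of the target LLM
chunk_size <- floor(0.75 * llm_token_limit)  # leave headroom for the denser LLM tokenizer

data_corpus_inaugural[1] |>
    corpus_chunk(size = chunk_size)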
See Also

tokens_chunk()
Examples

data_corpus_inaugural[1] |>
    corpus_chunk(size = 10)
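If only the beginning of each document is wanted, truncate (described under Arguments) can be combined with size; a sketch assuming the truncation behaviour described above:

data_corpus_inaugural[1] |>
    corpus_chunk(size = 10, truncate = TRUE)   # keep only roughly the first 10 tokens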