corpus_chunk: Segment a corpus into chunks of a given size

View source: R/corpus_chunk.R

corpus_chunk  R Documentation

Segment a corpus into chunks of a given size

Description

Segment a corpus into new documents of roughly equal-sized text chunks, with the possibility of overlapping the chunks.

Usage

corpus_chunk(
  x,
  size,
  truncate = FALSE,
  use_docvars = TRUE,
  verbose = quanteda_options("verbose")
)

Arguments

x

corpus object whose texts will be segmented into chunks

size

integer; the (approximate) token length of the chunks. See Details.

truncate

logical; if TRUE, truncate the text after the first size tokens

use_docvars

if TRUE, repeat the docvar values for each chunk; if FALSE, drop the docvars from the chunked corpus

verbose

if TRUE, print the number of tokens and documents before and after the function is applied. The token count does not include padding.

Details

The token length is estimated using stringi::stri_length(txt) / stringi::stri_count_boundaries(txt), to avoid having to tokenize the texts and then reassemble the corpus from the tokens.
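
As a minimal sketch of the quantities behind this estimate (the example sentence is invented, and the package's internal code may differ in detail):

library(stringi)

txt <- "Fourscore and seven years ago our fathers brought forth a new nation."
stri_length(txt)                              # number of characters
stri_count_boundaries(txt)                    # number of detected text boundaries
stri_length(txt) / stri_count_boundaries(txt) # the ratio used in the estimate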

Note that when used for chunking texts prior to sending them to large language models (LLMs) with limited input token lengths, size should typically be set to approximately 0.75-0.80 of the LLM's token limit. This is because LLM tokenizers (such as LLaMA's SentencePiece byte-pair encoding tokenizer) typically require more tokens than the linguistically defined, grammar-based tokenizer that quanteda uses by default. Note also that because stringi::stri_count_boundaries(txt) is used to approximate token length (for efficiency), the token length used for chunking is only approximate.
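
As a rough illustration of that rule of thumb, a chunk size could be derived from a hypothetical model with a 4096-token input limit (the limit here is an assumed value for illustration, not taken from any particular model):

llm_token_limit <- 4096                       # assumed input limit of the target LLM
chunk_size <- floor(llm_token_limit * 0.75)   # 0.75 factor as described above

chunks <- corpus_chunk(data_corpus_inaugural, size = chunk_size)
ndoc(chunks)                                  # number of resulting chunk documents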

See Also

tokens_chunk()

Examples

data_corpus_inaugural[1] |>
  corpus_chunk(size = 10)
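
Two further illustrative examples (not from the original page), showing the truncate and use_docvars arguments; exact output will depend on the installed quanteda version:

# keep only the first chunk of (roughly) 10 tokens per document
data_corpus_inaugural[1:2] |>
  corpus_chunk(size = 10, truncate = TRUE)

# drop the document variables from the chunked corpus
data_corpus_inaugural[1:2] |>
  corpus_chunk(size = 10, use_docvars = FALSE) |>
  docvars()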
