corpus_sample: Randomly sample documents from a corpus
In quanteda: Quantitative Analysis of Textual Data

Description

Take a random sample of documents of the specified size from a corpus, with or without replacement, optionally by grouping variables or with probability weights.

Usage

corpus_sample(x, size = ndoc(x), replace = FALSE, prob = NULL, by = NULL)

Arguments

`x`	a corpus object whose documents will be sampled
`size`	a positive number, the number of documents to select. When used with `by`, either a single number to select from each group, or a vector with one element per group giving the number to select from each category of `by`. Setting `size` larger than the number of documents oversamples, which requires `replace = TRUE`.
`replace`	if `TRUE`, sample with replacement
`prob`	a vector of probability weights, one per document, for obtaining the documents being sampled. Cannot be used together with `by`.
`by`	optional grouping variable for sampling. This will be evaluated in the docvars data.frame, so that docvars may be referred to by name without quoting. This evaluation differs from how `by` behaved in earlier versions. See `news(Version >= "2.9", package = "quanteda")` for details.
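
The `prob` argument can be illustrated with a minimal sketch. The toy corpus, its texts, and the docvar name `weight` are assumptions made up for this illustration; only `corpus()`, `corpus_sample()`, `ndoc()` and `docid()` are taken from quanteda.

```r
library(quanteda)

# Toy corpus (hypothetical texts); `weight` is an illustrative docvar name
corp_w <- corpus(c(d1 = "First text.", d2 = "Second text.", d3 = "Third text."),
                 docvars = data.frame(weight = c(0.8, 0.1, 0.1)))

set.seed(42)
# Draw 10 documents with replacement; d1 is favoured by its larger weight
samp_w <- corpus_sample(corp_w, size = 10, replace = TRUE, prob = corp_w$weight)

ndoc(samp_w)          # number of documents drawn
table(docid(samp_w))  # how often each original document was drawn
```

Because the draw is random, the exact counts depend on the seed; the weights only make `d1` more likely to be selected.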

Value

a corpus object (re)sampled on the documents, containing the document variables for the documents sampled.
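
As a rough check of the return value, the sketch below samples from a small made-up corpus and inspects the result. The texts and the docvar name `label` are hypothetical and serve only to show that the document variables travel with the sampled documents.

```r
library(quanteda)

# Toy corpus (hypothetical texts); `label` is an illustrative docvar name
corp_v <- corpus(c(d1 = "Alpha text.", d2 = "Beta text.", d3 = "Gamma text."),
                 docvars = data.frame(label = c("x", "y", "z")))

set.seed(1)
samp_v <- corpus_sample(corp_v, size = 2)

is.corpus(samp_v)  # the result is still a corpus
docvars(samp_v)    # docvars of the sampled documents only
```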

Examples

set.seed(123)
# sampling from a corpus
summary(corpus_sample(data_corpus_inaugural, size = 5))
summary(corpus_sample(data_corpus_inaugural, size = 10, replace = TRUE))

# sampling with by
corp <- data_corpus_inaugural
# compute the century as a number first, so the comparison below is numeric
century <- floor(corp$Year / 100) + 1
corp$century <- paste0(century, ifelse(century < 21, "th", "st"))
corpus_sample(corp, size = 2, by = century) |>
    summary()
# needs drop = TRUE to avoid empty interactions
corpus_sample(corp, size = 1, by = interaction(Party, century, drop = TRUE), replace = TRUE) |>
    summary()

# sampling sentences by document
# (the magrittr pipe %>% is used so that `.` can refer to the reshaped corpus)
corp <- corpus(c(one = "Sentence one.  Sentence two.  Third sentence.",
                 two = "First sentence, doc2.  Second sentence, doc2."),
               docvars = data.frame(var1 = c("a", "a"), var2 = c(1, 2)))
corpus_reshape(corp, to = "sentences") %>%
    corpus_sample(replace = TRUE, by = docid(.))

# oversampling
corpus_sample(corp, size = 5, replace = TRUE)
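
The case where `size` is a vector with one element per group of `by` could look like the sketch below. The toy corpus and the docvar name `grp` are made up for illustration. It is an assumption here that the sizes are matched to the groups in the sorted order of the grouping values, so the example only relies on the total and on the fact that one group contributes two documents and the other one.

```r
library(quanteda)

# Toy corpus with two groups of three documents each; `grp` is a made-up docvar
corp_g <- corpus(c(a1 = "Text a1.", a2 = "Text a2.", a3 = "Text a3.",
                   b1 = "Text b1.", b2 = "Text b2.", b3 = "Text b3."),
                 docvars = data.frame(grp = c("a", "a", "a", "b", "b", "b")))

set.seed(123)
# One size per group: one group contributes 2 documents, the other 1
# (assumed: sizes follow the sorted order of the values of `grp`)
samp_g <- corpus_sample(corp_g, size = c(2, 1), by = grp)

ndoc(samp_g)
table(samp_g$grp)
```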

quanteda documentation built on June 8, 2025, 9:41 p.m.