tokens_sample: Randomly sample documents from a tokens object
In quanteda: Quantitative Analysis of Textual Data

tokens_sample

R Documentation

Description

Take a random sample of documents of the specified size from a tokens object, with or without replacement, optionally by grouping variables or with probability weights.

Usage

tokens_sample(
  x,
  size = NULL,
  replace = FALSE,
  prob = NULL,
  by = NULL,
  env = NULL,
  verbose = quanteda_options("verbose")
)

Arguments

`x`	a tokens object whose documents will be sampled
`size`	a positive number, the number of documents to select; when used with `by`, either the number to select from each group, or a vector equal in length to the number of groups that gives the number to select from each category of `by`. Setting `size` larger than the number of documents makes it possible to oversample when `replace = TRUE`.
`replace`	if `TRUE`, sample with replacement
`prob`	a vector of probability weights for obtaining the elements of the vector being sampled. May not be applied when `by` is used.
`by`	optional grouping variable for sampling. This will be evaluated in the docvars data.frame, so that docvars may be referred to by name without quoting. This also changes previous behaviours for `by`. See `news(Version >= "2.9", package = "quanteda")` for details.
`env`	an environment or a list object in which `x` is searched. Passed to `substitute()` for non-standard evaluation.
`verbose`	if `TRUE`, print the number of tokens and documents before and after the function is applied. The number of tokens does not include padding.
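
The `size`, `replace`, and `prob` arguments described above can be sketched together as follows. This is a minimal illustration, not part of the package's own examples: the toy documents and the weights are made up, and `docid()` is used on the assumption that it reports the original document each sampled document was drawn from.

```r
library(quanteda)

# Toy tokens object with three documents (illustrative data, not from the package)
toks <- tokens(c(d1 = "a b c", d2 = "d e", d3 = "f g h i"))

set.seed(42)

# Oversampling: ask for more documents than exist, which needs replace = TRUE
over <- tokens_sample(toks, size = 5, replace = TRUE)
ndoc(over)

# Weighted sampling: hypothetical weights (one per document) that favour d3
weighted <- tokens_sample(toks, size = 20, replace = TRUE,
                          prob = c(0.1, 0.1, 0.8))
ndoc(weighted)

# Tabulate how often each original document was drawn
table(docid(weighted))
```

Because `prob` may not be combined with `by`, weighted sampling within groups would need a different approach, such as sampling each group separately.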

Value

a tokens object (re)sampled on the documents, containing the document variables for the documents sampled.
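
As a rough check of the return value, the sketch below samples within groups and inspects the result. The corpus and its `group` docvar are hypothetical, constructed only for this illustration.

```r
library(quanteda)

# Illustrative corpus with a made-up grouping variable `group`
corp <- corpus(
  c(d1 = "a b", d2 = "c d", d3 = "e f", d4 = "g h", d5 = "i j", d6 = "k l"),
  docvars = data.frame(group = c("x", "x", "x", "y", "y", "y"))
)
toks <- tokens(corp)

set.seed(1)
# Draw two documents from each level of `group`, without replacement
samp <- tokens_sample(toks, size = 2, by = group)

# The result is still a tokens object
is.tokens(samp)

# Each sampled document carries its docvars along, so groups can be counted
table(docvars(samp, "group"))
```

With two groups and `size = 2`, the sample should contain four documents in total, two per group.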

Examples

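library(quanteda)  # provides tokens(), tokens_sample() and data_corpus_inaugural
set.seed(123)      # fix the RNG seed so the samples are reproducible
toks <- tokens(data_corpus_inaugural[1:6])
toks
tokens_sample(toks)

# sampling with replacement, showing only the sampled document names
tokens_sample(toks, replace = TRUE) |> docnames()
tokens_sample(toks, size = 3, replace = TRUE) |> docnames()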

# sampling using by
docvars(toks)
tokens_sample(toks, size = 2, replace = TRUE, by = Party) |> docnames()
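
The examples above call `set.seed()` once at the top. A small sketch of why that matters: resetting the seed before each call should reproduce the same draw. The variable names `first` and `second` are introduced here for illustration only.

```r
library(quanteda)

toks <- tokens(data_corpus_inaugural[1:6])

# Same seed before each call, so the random draw is repeated
set.seed(123)
first <- docnames(tokens_sample(toks, size = 3))

set.seed(123)
second <- docnames(tokens_sample(toks, size = 3))

# TRUE when both calls selected the same documents in the same order
identical(first, second)
```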

quanteda documentation built on June 8, 2025, 9:41 p.m.