dfm_sample: Randomly sample documents from a dfm
In quanteda: Quantitative Analysis of Textual Data

Description

Take a random sample of documents of the specified size from a dfm, with or without replacement, optionally by grouping variables or with probability weights.

Usage

dfm_sample(
  x,
  size = NULL,
  replace = FALSE,
  prob = NULL,
  by = NULL,
  verbose = quanteda_options("verbose")
)

Arguments

`x`	the dfm object whose documents will be sampled
`size`	a positive number, the number of documents to select; when used with `by`, either the number to select from each group or a vector, equal in length to the number of groups, giving the number to select from each category of `by`. Setting `size` larger than the number of documents oversamples, which requires `replace = TRUE`.
`replace`	if `TRUE`, sample with replacement
`prob`	a vector of probability weights, one per document, used when drawing the documents being sampled. Cannot be used together with `by`.
`by`	optional grouping variable for sampling. It is evaluated in the docvars data.frame, so docvars may be referred to by name without quoting. This changes the earlier behaviour of `by`; see `news(Version >= "2.9", package = "quanteda")` for details.
`verbose`	if `TRUE`, print the number of tokens and documents before and after the function is applied. The token count excludes padding.
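The interaction between `size`, `replace`, and `prob` can be sketched as follows. This is a minimal illustration rather than part of the official examples; the document names `d1` to `d3` and the weights are made up for the purpose, and the weights are assumed to be matched to documents in document order.

```r
library(quanteda)

set.seed(42)
dfmat <- dfm(tokens(c(d1 = "a b c c d", d2 = "a a c c d d d", d3 = "a b b c")))

# Oversampling: draw 6 documents from 3, which is only possible with replacement
dfmat_over <- dfm_sample(dfmat, size = 6, replace = TRUE)
ndoc(dfmat_over)

# Weighted sampling without replacement: one (illustrative) weight per document
dfmat_wt <- dfm_sample(dfmat, size = 2, prob = c(0.8, 0.1, 0.1))
docnames(dfmat_wt)
```

Only documents are sampled; the feature set of the result is the same as in the input dfm.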

Value

a dfm object containing the (re)sampled documents, together with the document variables of those documents.
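A short sketch of what "containing the document variables" means in practice. The corpus, the document names, and the docvar `grp` are hypothetical and exist only for this illustration.

```r
library(quanteda)

set.seed(1)
corp <- corpus(
  c(d1 = "a b c", d2 = "a a d", d3 = "b c d", d4 = "c d d"),
  docvars = data.frame(grp = c("x", "x", "y", "y"))  # hypothetical docvar
)
dfmat <- dfm(tokens(corp))

# Sample two documents; their docvars travel with them
dfmat_s <- dfm_sample(dfmat, size = 2)
docnames(dfmat_s)
docvars(dfmat_s)
```

Each sampled document keeps the `grp` value it had in the original dfm, so the result can be grouped or subset afterwards in the usual way.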

Examples

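set.seed(10)  # make the random draws reproducible
dfmat <- dfm(tokens(c("a b c c d", "a a c c d d d", "a b b c")))
dfmat
# no size given: all documents are drawn without replacement, in random order
dfm_sample(dfmat)
# with replacement: the same document may be drawn more than once
dfm_sample(dfmat, replace = TRUE)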

# by groups: sample 2 documents from each value of the docvar Party
dfmat <- dfm(tokens(data_corpus_inaugural[50:58]))
dfm_sample(dfmat, by = Party, size = 2)
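As described under `size`, a vector with one entry per group can be supplied instead of a single number. The sketch below uses a hypothetical toy corpus with a docvar `grp`. Which entry of `size` goes with which group is not stated on this page, so the example gives both groups three documents and inspects only the per-group counts.

```r
library(quanteda)

set.seed(7)
corp <- corpus(
  c(d1 = "a b", d2 = "a c", d3 = "b c", d4 = "c d", d5 = "d e", d6 = "e f"),
  docvars = data.frame(grp = c("x", "x", "x", "y", "y", "y"))  # hypothetical docvar
)
dfmat <- dfm(tokens(corp))

# One size per group: 1 document from one group and 2 from the other
dfmat_by <- dfm_sample(dfmat, by = grp, size = c(1, 2))
table(docvars(dfmat_by, "grp"))
```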

quanteda documentation built on June 8, 2025, 9:41 p.m.