corpus_sample: Randomly sample documents from a corpus

Description Usage Arguments Value Examples

View source: R/corpus_sample.R

Description

Take a random sample of documents of the specified size from a corpus, with or without replacement. Works just as sample() works for the documents and their associated document-level variables.

Usage

1
corpus_sample(x, size = NULL, replace = FALSE, prob = NULL, by = NULL)

Arguments

x

a corpus object whose documents will be sampled

size

a positive number, the number of documents to select; when used with groups, the number to select from each group or a vector equal in length to the number of groups defining the samples to be chosen in each group category. By defining a size larger than the number of documents, it is possible to oversample groups.

replace

Should sampling be with replacement?

prob

A vector of probability weights for obtaining the elements of the vector being sampled. May not be applied when by is used.

by

a grouping variable for sampling. Useful for resampling sub-document units such as sentences, for instance by specifying by = "document"

Value

A corpus object with number of documents equal to size, drawn from the corpus x. The returned corpus object will contain all of the meta-data of the original corpus, and the same document variables for the documents selected.

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
set.seed(2000)
# sampling from a corpus
summary(corpus_sample(data_corpus_inaugural, 5))
summary(corpus_sample(data_corpus_inaugural, 10, replace = TRUE))

# sampling sentences within document
corp <- corpus(c(one = "Sentence one.  Sentence two.  Third sentence.",
                      two = "First sentence, doc2.  Second sentence, doc2."))
corpsent <- corpus_reshape(corp, to = "sentences")
texts(corpsent)
texts(corpus_sample(corpsent, replace = TRUE, by = "document"))

koheiw/quanteda.core documentation built on Sept. 21, 2020, 3:44 p.m.