View source: R/corpus_sample.R
corpus_sample | R Documentation |
Take a random sample of documents of the specified size from a corpus, with or without replacement, optionally by grouping variables or with probability weights.
corpus_sample(x, size = ndoc(x), replace = FALSE, prob = NULL, by = NULL)
x |
a corpus object whose documents will be sampled |
size |
a positive number, the number of documents to select; when used
with by, the number to select from each group |
replace |
if TRUE, sample with replacement |
prob |
a vector of probability weights for obtaining the elements of the
vector being sampled. May not be applied when by is used. |
by |
optional grouping variable for sampling. This will be evaluated in
the docvars data.frame, so that docvars may be referred to by name without
quoting. This also changes previous behaviours for |
a corpus object (re)sampled on the documents, containing the document variables for the documents sampled.
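The grouped case can be pictured with a small base R sketch. It is only a conceptual illustration, not quanteda's implementation: `sample_by_group` is a hypothetical helper, and it assumes that `size` is applied within each group, as the `size = 2, by = century` example suggests.

```r
# Hypothetical helper, not part of quanteda: draw `size` indices from each group.
sample_by_group <- function(groups, size, replace = FALSE) {
  # positions of the elements belonging to each group
  idx_by_group <- split(seq_along(groups), groups)
  picked <- lapply(idx_by_group, function(idx) {
    # sample positions within idx, so a single-element group is not
    # misread by sample() as the range 1:idx
    idx[sample.int(length(idx), size = size, replace = replace)]
  })
  unlist(picked, use.names = FALSE)
}

set.seed(1)
groups <- c("a", "a", "a", "b", "b", "c", "c", "c", "c")
picked <- sample_by_group(groups, size = 2)
table(groups[picked])  # two elements drawn from each group
```

With `replace = FALSE`, a `size` larger than the smallest group raises an error in this sketch, which is why oversampling needs `replace = TRUE`.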
set.seed(123)
# sampling from a corpus
summary(corpus_sample(data_corpus_inaugural, size = 5))
summary(corpus_sample(data_corpus_inaugural, size = 10, replace = TRUE))
# sampling with by
corp <- data_corpus_inaugural
corp$century <- paste(floor(corp$Year / 100) + 1)
corp$century <- paste0(corp$century, ifelse(corp$century < 21, "th", "st"))
corpus_sample(corp, size = 2, by = century) |>
    summary()
# needs drop = TRUE to avoid empty interactions
corpus_sample(corp, size = 1, by = interaction(Party, century, drop = TRUE), replace = TRUE) |>
    summary()
# sampling sentences by document
corp <- corpus(c(one = "Sentence one.  Sentence two.  Third sentence.",
                 two = "First sentence, doc2.  Second sentence, doc2."),
               docvars = data.frame(var1 = c("a", "a"), var2 = c(1, 2)))
corpus_reshape(corp, to = "sentences") %>%
    corpus_sample(replace = TRUE, by = docid(.))
# oversampling
corpus_sample(corp, size = 5, replace = TRUE)
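The examples above do not exercise `prob`. A minimal sketch of weighted sampling follows, assuming the quanteda package is installed and that the weights are supplied one per document, in document order. The toy corpus and the weight values are made up for illustration.

```r
library(quanteda)

# toy corpus for illustration only (hypothetical documents)
toy <- corpus(c(d1 = "First document.",
                d2 = "Second document.",
                d3 = "Third document."))

set.seed(123)
# one weight per document; d1 is given a larger weight than d2 and d3
samp <- corpus_sample(toy, size = 2, prob = c(0.8, 0.1, 0.1))
docnames(samp)
```

No `by` argument is used here, in line with the restriction noted under `prob`.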