sampling: Sample Texts
In tosca: Tools for Statistical Content Analysis

Description

Sample texts from different subsets to minimize the variance of the recall estimator.

Usage

sampling(id, corporaID, label, m, randomize = FALSE, exact = FALSE)

Arguments

`id`	Character: IDs of all texts in the corpus.
`corporaID`	List of Character: Each list element is a character vector and contains the IDs belonging to one subcorpus. Each ID has to be in `id`.
`label`	Named Logical: Labeling results for texts that are already labeled. May be empty if no labeled data exist. The algorithm sets `p = 0.5` for all intersections. Names must be IDs from `id`.
`m`	Integer: Number of new samples.
`randomize`	Logical: If `TRUE`, the calculated split is used as the parameter for drawing from a multinomial distribution.
`exact`	Logical: If `TRUE`, the exact calculation is used. With the default `FALSE`, an approximation is used.
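
The following is a minimal base R sketch of two ideas behind the arguments above, not tosca's actual implementation. It assumes that "intersections" refers to groups of texts sharing the same subcorpus membership pattern, and it uses a hypothetical proportional `share` vector in place of the split that `sampling` calculates internally. The names `member`, `pattern`, `cells`, `share`, `fixed` and `drawn` are illustrative only.

```r
set.seed(1)

# Toy corpus with two overlapping subcorpora (hypothetical data).
id <- paste0("ID", 1:20)
corporaID <- list(id[1:10], id[6:15])

# Logical membership matrix: one row per text, one column per subcorpus.
member <- vapply(corporaID, function(ids) id %in% ids, logical(length(id)))

# Encode each membership pattern as a string such as "TRUE-FALSE"
# and count the texts that fall into each pattern.
pattern <- apply(member, 1, paste, collapse = "-")
cells <- table(pattern)

# Hypothetical split: share of the m new samples assigned to each cell.
# Proportional to cell size here; this is NOT the split tosca computes.
m <- 8
share <- as.numeric(cells) / sum(cells)

# Deterministic allocation versus a multinomial draw that uses the
# split as its probability vector (the idea behind randomize = TRUE).
fixed <- round(m * share)
drawn <- as.vector(rmultinom(1, size = m, prob = share))
```

In this sketch `fixed` is the same on every run, while `drawn` varies between runs but always sums to `m`.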

Value

Character vector of the IDs that should be labeled next.

Examples

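library(tosca)

# 1000 text IDs and three (possibly overlapping) subcorpora
id <- paste0("ID", 1:1000)
corporaID <- list(sample(id, 300), sample(id, 100), sample(id, 700))
# 150 existing labels, named by text ID
label <- sample(as.logical(0:1), 150, replace=TRUE)
names(label) <- c(sample(id, 100), sample(corporaID[[2]], 50))
# request 100 new texts to label
m <- 100
sampling(id, corporaID, label, m)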
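
The constraints listed under Arguments (every subcorpus ID occurs in `id`, and `label` is either empty or a named logical whose names occur in `id`) can be checked before calling `sampling`. The helper `check_sampling_input` below is hypothetical and not part of tosca. It is a base R sketch of such a check and may be stricter or looser than what the package itself enforces.

```r
# Hypothetical helper, not part of tosca: checks the documented input constraints.
check_sampling_input <- function(id, corporaID, label) {
  stopifnot(
    is.character(id),
    is.list(corporaID),
    all(vapply(corporaID, is.character, logical(1))),
    # every subcorpus ID has to occur in id
    all(unlist(corporaID) %in% id),
    # label may be empty; otherwise it is a named logical whose names occur in id
    is.logical(label),
    length(label) == 0 || (!is.null(names(label)) && all(names(label) %in% id))
  )
  invisible(TRUE)
}

# Same kind of data as in the example above.
id <- paste0("ID", 1:1000)
corporaID <- list(sample(id, 300), sample(id, 100), sample(id, 700))
label <- sample(c(TRUE, FALSE), 150, replace = TRUE)
names(label) <- c(sample(id, 100), sample(corporaID[[2]], 50))

check_sampling_input(id, corporaID, label)
```

The helper stops with an error at the first violated condition and otherwise returns `TRUE` invisibly.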

tosca documentation built on June 8, 2025, 11:21 a.m.