sampling: Sample Texts


View source: R/sampling.R

Description

Samples texts from different subsets so as to minimize the variance of the recall estimator.

Usage

sampling(id, corporaID, label, m, randomize = FALSE, exact = FALSE)

Arguments

id

Character: IDs of all texts in the corpus.

corporaID

List of Character: Each list element is a character vector containing the IDs that belong to one subcorpus. Each ID must be an element of id.

label

Named Logical: Labeling results for texts that have already been labeled. May be empty if no labeled data exist. The algorithm sets p = 0.5 for all intersections. The names must be IDs from id.

m

Integer: Number of new samples.

randomize

Logical: If TRUE, the calculated split is used as the parameter for drawing from a multinomial distribution.

exact

Logical: If TRUE, an exact calculation is used. With the default FALSE, an approximation is used.
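
The randomize argument refers to drawing from a multinomial distribution. The following base-R sketch only illustrates that idea; it is not the package's implementation. The vector split and the rounding step are hypothetical stand-ins for whatever allocation sampling computes internally.

```r
# Illustration only: `split` is a made-up allocation, not output of tosca.
set.seed(1)
m <- 100
split <- c(0.5, 0.3, 0.2)  # hypothetical proportions for three subsets

# Deterministic allocation (assumed here): rounded expected counts per subset
fixed <- round(m * split)

# Randomized allocation: one multinomial draw that distributes m samples
# over the subsets with probabilities `split`
random <- as.vector(rmultinom(1, size = m, prob = split))

sum(random)  # a multinomial draw always allocates exactly m samples
```

A randomized allocation varies from run to run but matches the proportions in expectation.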

Value

Character vector of the IDs that should be labeled next.

Examples

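library(tosca)
# 1000 text IDs and three (possibly overlapping) subcorpora
id <- paste0("ID", 1:1000)
corporaID <- list(sample(id, 300), sample(id, 100), sample(id, 700))
# 150 random labels, named by the IDs of the already labeled texts
label <- sample(as.logical(0:1), 150, replace = TRUE)
names(label) <- c(sample(id, 100), sample(corporaID[[2]], 50))
# request 100 new texts to be labeled
m <- 100
sampling(id, corporaID, label, m)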
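
The constraints stated under Arguments (every ID in corporaID is in id, and the names of label are IDs from id) can be checked before calling sampling. The sketch below uses base R only. The helper check_inputs is hypothetical and not part of tosca, and drawing the label names without replacement is an assumption made here to avoid labeling the same ID twice.

```r
set.seed(1)
id <- paste0("ID", 1:1000)
corporaID <- list(sample(id, 300), sample(id, 100), sample(id, 700))

# Assumption: draw the labeled IDs without replacement so names are unique
label <- sample(c(TRUE, FALSE), 150, replace = TRUE)
names(label) <- sample(id, 150)

# Hypothetical helper (not part of tosca): stops with an error if an
# input violates the documented constraints, otherwise returns TRUE
check_inputs <- function(id, corporaID, label) {
  stopifnot(
    is.character(id),
    is.list(corporaID),
    all(vapply(corporaID, is.character, logical(1))),
    all(unlist(corporaID) %in% id),
    is.logical(label),
    length(label) == 0 || all(names(label) %in% id)
  )
  invisible(TRUE)
}

check_inputs(id, corporaID, label)       # labeled data present
check_inputs(id, corporaID, logical(0))  # no labeled data yet
```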

tosca documentation built on Oct. 28, 2021, 5:07 p.m.