Weighted sampling without replacement

Share:

Description

These functions implement weighted sampling without replacement using various algorithms, i.e., they take a sample of the specified size from the elements of 1:n without replacement, using the weights defined by prob. The call sample_int_*(n, size, prob) is equivalent to sample.int(n, size, replace=F, prob). (The results will most probably be different for the same random seed, but the returned samples are distributed identically for both calls.) Except for sample_int_R (which has quadratic complexity as of this writing), all functions have complexity O(n log n) or better and often run faster than R's implementation, especially when n and size are large.

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
sample_int_R(n, size, prob)

sample_int_ccrank(n, size, prob)

sample_int_crank(n, size, prob)

sample_int_expj(n, size, prob)

sample_int_expjs(n, size, prob)

sample_int_rank(n, size, prob)

sample_int_rej(n, size, prob)

Arguments

n

a positive number, the number of items to choose from. See ‘Details.’

size

a non-negative integer giving the number of items to choose.

prob

A vector of probability weights for obtaining the elements of the vector being sampled.

Details

sample_int_R is a simple wrapper for sample.int.

sample_int_expj and sample_int_expjs implement one-pass random sampling with a reservor with exponential jumps (Efraimidis and Spirakis, 2006, Algorithm A-ExpJ). Both functions are implemented in Rcpp; *_expj uses log-transformed keys, *_expjs implements the algorithm in the paper verbatim (at the cost of numerical stability).

sample_int_rank, sample_int_crank and sample_int_ccrank implement one-pass random sampling (Efraimidis and Spirakis, 2006, Algorithm A). The first function is implemented purely in R, the other two are optimized Rcpp implementations (_crank uses R vectors internally, while *_ccrank uses std::vector; surprisingly, *_crank seems to be faster on most inputs). It can be shown that the order statistic of U^{(1/w_i)} has the same distribution as random sampling without replacement (U=uniform(0,1) distribution). To increase numerical stability, log(U) / w_i is computed instead; the log transform does not change the order statistic.

sample_int_rej uses repeated weighted sampling with replacement and a variant of rejection sampling. It is implemented purely in R. This function simulates weighted sampling without replacement using somewhat more draws with replacement, and then discarding duplicate values (rejection sampling). If too few items are sampled, the routine calls itself recursively on a (hopefully) much smaller problem. See also http://stats.stackexchange.com/q/20590/6432.

Value

An integer vector of length size with elements from 1:n.

Author(s)

Kirill Müller (for _*expj*)

Dinre (for *_rank), Kirill Müller (for *_*crank)

Kirill Müller (for *_int_rej)

References

http://stackoverflow.com/q/15113650/946850

Efraimidis, Pavlos S., and Paul G. Spirakis. "Weighted random sampling with a reservoir." Information Processing Letters 97, no. 5 (2006): 181-185.

Efraimidis, Pavlos S., and Paul G. Spirakis. "Weighted random sampling with a reservoir." Information Processing Letters 97, no. 5 (2006): 181-185.

See Also

sample.int

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
# Base R implementation
s <- sample_int_R(2000, 1000, runif(2000))
stopifnot(unique(s) == s)
p <- c(995, rep(1, 5))
n <- 1000
set.seed(42)
tbl <- table(replicate(sample_int_R(6, 3, p),
                       n = n)) / n
stopifnot(abs(tbl - c(1, rep(0.4, 5))) < 0.04)

## Algorithm A, Rcpp version using std::vector
s <- sample_int_ccrank(20000, 10000, runif(20000))
stopifnot(unique(s) == s)
p <- c(995, rep(1, 5))
n <- 1000
set.seed(42)
tbl <- table(replicate(sample_int_ccrank(6, 3, p),
                       n = n)) / n
stopifnot(abs(tbl - c(1, rep(0.4, 5))) < 0.04)

## Algorithm A, Rcpp version using R vectors
s <- sample_int_crank(20000, 10000, runif(20000))
stopifnot(unique(s) == s)
p <- c(995, rep(1, 5))
n <- 1000
set.seed(42)
tbl <- table(replicate(sample_int_crank(6, 3, p),
                       n = n)) / n
stopifnot(abs(tbl - c(1, rep(0.4, 5))) < 0.04)

## Algorithm A-ExpJ (with log-transformed keys)
s <- sample_int_expj(20000, 10000, runif(20000))
stopifnot(unique(s) == s)
p <- c(995, rep(1, 5))
n <- 1000
set.seed(42)
tbl <- table(replicate(sample_int_expj(6, 3, p),
                       n = n)) / n
stopifnot(abs(tbl - c(1, rep(0.4, 5))) < 0.04)

## Algorithm A-ExpJ (paper version)
s <- sample_int_expjs(20000, 10000, runif(20000))
stopifnot(unique(s) == s)
p <- c(995, rep(1, 5))
n <- 1000
set.seed(42)
tbl <- table(replicate(sample_int_expjs(6, 3, p),
                       n = n)) / n
stopifnot(abs(tbl - c(1, rep(0.4, 5))) < 0.04)

## Algorithm A
s <- sample_int_rank(20000, 10000, runif(20000))
stopifnot(unique(s) == s)
p <- c(995, rep(1, 5))
n <- 1000
set.seed(42)
tbl <- table(replicate(sample_int_rank(6, 3, p),
                       n = n)) / n
stopifnot(abs(tbl - c(1, rep(0.4, 5))) < 0.04)

## Rejection sampling
s <- sample_int_rej(20000, 10000, runif(20000))
stopifnot(unique(s) == s)
p <- c(995, rep(1, 5))
n <- 1000
set.seed(42)
tbl <- table(replicate(sample_int_rej(6, 3, p),
                       n = n)) / n
stopifnot(abs(tbl - c(1, rep(0.4, 5))) < 0.04)

Want to suggest features or report bugs for rdrr.io? Use the GitHub issue tracker.