Some Remarks on Sprop"
In samplingbook: Survey Sampling Procedures

knitr::opts_chunk$set(
  rmarkdown.html_vignette.check_title = FALSE,
  collapse = TRUE,
  comment = "#>"
)
require(samplingbook, quietly = TRUE)

Exact Hypergeometric Confidence Intervals for Proportion Estimates

In survey sampling on a finite population, a simple random sample is typically selected without replacement, in which case a hypergeometric distribution models the observation. A standard construction for the confidence interval is based on a Normal approximation of the proportion with plug-in estimates for proportion and respective variance.

In most scenarios, this strategy results in satisfactory properties. However, if $p$ is close to 0 or 1, it is recommended to use the exact confidence interval based on the hypergeometrical distribution [@kuechenhoff]. The Wald-type interval has a coverage probability as low as $n/N$ for any $\alpha$ [@wang2015]. Therefore, there is no guarantee for the interval to capture the true $M$ with the desired confidence level if the sample is much smaller than the population [@wang2015].

Implementation in `samplingbook`

The function samplingbook::Sprop() estimates the proportion out of samples either with or without consideration of finite population correction.

Parameters are

m an optional non-negative integer for number of positive events,
n an optional positive integer for sample size,
N positive integer for population size. Default is N=Inf, which means calculations are carried out without finite population correction.

In case of finite population of size N is provided, different methods for calculating confidence intervals are provided

approx Wald-type interval based on normal approximation [@agresti1998], and
exact based on hypergeometric distribution as described in more detail in this document.

Sprop(m=3, n = 10, N = 50, level = 0.95)

Exact Hypergeometric Confidence Intervals

We observe $X=m$, the number of sampled units having the characteristic of interest, where $X \sim Hyper(M, N, n)$, with

$N$ is the population size,
$M$ is the number of population units with characteristic of interest, and
$n$ is the given sample size.

The respective density, i.e. the probability of successes in a sample given $M, N, n$, is $$\Pr(X=m) = \frac{{M \choose m} {N-M \choose n-m}}{N \choose n}, \text{ with support }m \in {\max(0,n+M-N), \min(M,n)} $$

We want to estimate population proportion $p = M/N$, which is equivalent to estimating $M$, the total number of population units with some attribute of interest. Then, the boundaries for the exact confidence interval $[L,U]$ can be derived as follows:

$$ \begin{aligned} \Pr(X \leq m) & = \sum_{x=0}^m \frac{{U \choose x} {N-U \choose n-x}}{N \choose n} = \alpha_1 \ \Pr(X \geq m) & = \sum_{x=m}^n \frac{{L \choose x} {N-L \choose n-x}}{N \choose n} = \alpha_2,\ & \text{with coverage constraint } \alpha_1 + \alpha_2 \leq \alpha \end{aligned} $$ For sake of simplicity, we assume symmetric confidence intervals, i.e $\alpha_1 = \alpha_2 = \alpha/2$.

Some Details on the Implementation

The implementation of the exact confidence interval for proportion estimates uses the hypergeometric distribution function phyper(x, M, N-M, n). Note that the parametrization differs slightly from ours.

``` {r, echo=FALSE, fig.width=7, fig.height=4}

par(mfrow=c(1,2))

M <- 50; curve(phyper(x, M, 500-M, 100), from=0,to=30, col="#E495A5", lwd=2, las=1,

ylab="Distribution function, F(x)", main="Hyp(x, M, N-M, n) given N=500, n=100")

M <- 60; curve(phyper(x, M, 500-M, 100), from=0,to=30, add=TRUE, col="#86B875", lwd=2)

M <- 70; curve(phyper(x, M, 500-M, 100), from=0,to=30, add=TRUE, col="#7DB0DD", lwd=2)

abline(h=c(0.975,0.025), lty=2); abline(v=15)

legend("bottomright", legend=paste0("M=",c(50,60,70)), col=c( "#E495A5","#86B875","#7DB0DD"), lty=1,lwd=2)

par(mfrow=c(1,2))

M <- 4; curve(phyper(x, M, 50-M, 10), from=0,to=15, col="#E495A5", lwd=2, las=1, ylim=c(0,1), ylab="Distribution function, F(x)", main="Hyp(x, M, N-M, n) given N=50, n=10")

M <- 8; curve(phyper(x-1, M, 50-M, 10), from=0,to=15, add=TRUE, col="#86B875", lwd=2)

M <- 20; curve(phyper(x, M, 50-M, 10), from=0,to=15, add=TRUE, col="#86B875", lwd=2) M <- 32; curve(phyper(x, M, 50-M, 10), from=0,to=15, add=TRUE, col="#7DB0DD", lwd=2) abline(h=c(0.975,0.025), lty=2); abline(v=3) legend("bottomright", legend=paste0("M=",c(4,20,32)), col=c( "#E495A5","#86B875","#7DB0DD"), lty=1,lwd=2)

We search for the optimal confidence boundaries $[L,U]$ that fulfill the requirements as defined in the equations above.  

* Given known total population $N$, sample size $n$ and number of successes in the sample $m$, we can define some feasibility boundaries for $M$:
    + Naturally, the smallest possible value is the observed number of successes $M_{min} = m$
    + The largest possible value equals the total number $N$ minus negative observations in the sample, i.e. $M_{max} = N - (n-m)$.

* Upper boundary $U$
    + Start with largest possible value for $M$, i.e. $U_{max} = N - (n-m)$
    + Then, decrease incrementally while the $\Pr(X \leq m) < \alpha/2$, so that we find the largest possible value which still fulfills the equation


``` {r, echo=FALSE, eval=FALSE}
N = 50; n = 10; m= 3  
(Ux =  N - (n-m)) # initialize 
phyper(m, Ux, N-Ux, n)
(Ux <- Ux -1)
#Ux <- 32
phyper(m, Ux, N-Ux, n)

Lower boundary $L$
- Start with smallest possible value for $M$, i.e. $L_{min} = m$
- Rewrite $\Pr(X \geq m) = 1 - \Pr(X \leq m) = \alpha/2 \Leftrightarrow \Pr(X \leq m) = 1 - \alpha/2$
- Then, increase incrementally while the $\Pr(X \leq m) \geq 1-\alpha/2$, so that we find the smallest possible value which still fulfills the equation

``` {r, echo=FALSE, eval=FALSE} N = 50; n = 10; m= 3
(Ux = m) # initialize phyper(m, Ux, N-Ux, n) (Ux <- Ux + 1) phyper(m, Ux, N-Ux, n)

Ux <- 32

```

References

Any scripts or data that you put into this service are public.

samplingbook documentation built on April 3, 2021, 1:06 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

samplingbook
Survey Sampling Procedures

Some Remarks on Sprop"
In samplingbook: Survey Sampling Procedures

Exact Hypergeometric Confidence Intervals for Proportion Estimates

Implementation in `samplingbook`

Exact Hypergeometric Confidence Intervals

Some Details on the Implementation

par(mfrow=c(1,2))

M <- 50; curve(phyper(x, M, 500-M, 100), from=0,to=30, col="#E495A5", lwd=2, las=1,

ylab="Distribution function, F(x)", main="Hyp(x, M, N-M, n) given N=500, n=100")

M <- 60; curve(phyper(x, M, 500-M, 100), from=0,to=30, add=TRUE, col="#86B875", lwd=2)

M <- 70; curve(phyper(x, M, 500-M, 100), from=0,to=30, add=TRUE, col="#7DB0DD", lwd=2)

abline(h=c(0.975,0.025), lty=2); abline(v=15)

legend("bottomright", legend=paste0("M=",c(50,60,70)), col=c( "#E495A5","#86B875","#7DB0DD"), lty=1,lwd=2)

par(mfrow=c(1,2))

M <- 8; curve(phyper(x-1, M, 50-M, 10), from=0,to=15, add=TRUE, col="#86B875", lwd=2)

Ux <- 32

References

Try the samplingbook package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

samplingbook Survey Sampling Procedures

Some Remarks on Sprop" In samplingbook: Survey Sampling Procedures

Exact Hypergeometric Confidence Intervals for Proportion Estimates

Implementation in samplingbook

Exact Hypergeometric Confidence Intervals

Some Details on the Implementation

par(mfrow=c(1,2))

M <- 50; curve(phyper(x, M, 500-M, 100), from=0,to=30, col="#E495A5", lwd=2, las=1,

ylab="Distribution function, F(x)", main="Hyp(x, M, N-M, n) given N=500, n=100")

M <- 60; curve(phyper(x, M, 500-M, 100), from=0,to=30, add=TRUE, col="#86B875", lwd=2)

M <- 70; curve(phyper(x, M, 500-M, 100), from=0,to=30, add=TRUE, col="#7DB0DD", lwd=2)

abline(h=c(0.975,0.025), lty=2); abline(v=15)

legend("bottomright", legend=paste0("M=",c(50,60,70)), col=c( "#E495A5","#86B875","#7DB0DD"), lty=1,lwd=2)

par(mfrow=c(1,2))

M <- 8; curve(phyper(x-1, M, 50-M, 10), from=0,to=15, add=TRUE, col="#86B875", lwd=2)

Ux <- 32

References

Try the samplingbook package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

samplingbook
Survey Sampling Procedures

Some Remarks on Sprop"
In samplingbook: Survey Sampling Procedures

Implementation in `samplingbook`