knitr::opts_chunk$set(
  rmarkdown.html_vignette.check_title = FALSE,
  collapse = TRUE,
  comment = "#>"
)
require(samplingbook, quietly = TRUE)

Exact Hypergeometric Confidence Intervals for Proportion Estimates

In survey sampling on a finite population, a simple random sample is typically selected without replacement, in which case a hypergeometric distribution models the observation. A standard construction for the confidence interval is based on a Normal approximation of the proportion with plug-in estimates for proportion and respective variance.

In most scenarios, this strategy results in satisfactory properties. However, if $p$ is close to 0 or 1, it is recommended to use the exact confidence interval based on the hypergeometrical distribution [@kuechenhoff]. The Wald-type interval has a coverage probability as low as $n/N$ for any $\alpha$ [@wang2015]. Therefore, there is no guarantee for the interval to capture the true $M$ with the desired confidence level if the sample is much smaller than the population [@wang2015].

Implementation in samplingbook

The function samplingbook::Sprop() estimates the proportion out of samples either with or without consideration of finite population correction.

Parameters are

In case of finite population of size N is provided, different methods for calculating confidence intervals are provided

Sprop(m=3, n = 10, N = 50, level = 0.95)

Exact Hypergeometric Confidence Intervals

We observe $X=m$, the number of sampled units having the characteristic of interest, where $X \sim Hyper(M, N, n)$, with

The respective density, i.e. the probability of successes in a sample given $M, N, n$, is $$\Pr(X=m) = \frac{{M \choose m} {N-M \choose n-m}}{N \choose n}, \text{ with support }m \in {\max(0,n+M-N), \min(M,n)} $$

We want to estimate population proportion $p = M/N$, which is equivalent to estimating $M$, the total number of population units with some attribute of interest. Then, the boundaries for the exact confidence interval $[L,U]$ can be derived as follows:

$$ \begin{aligned} \Pr(X \leq m) & = \sum_{x=0}^m \frac{{U \choose x} {N-U \choose n-x}}{N \choose n} = \alpha_1 \ \Pr(X \geq m) & = \sum_{x=m}^n \frac{{L \choose x} {N-L \choose n-x}}{N \choose n} = \alpha_2,\ & \text{with coverage constraint } \alpha_1 + \alpha_2 \leq \alpha \end{aligned} $$ For sake of simplicity, we assume symmetric confidence intervals, i.e $\alpha_1 = \alpha_2 = \alpha/2$.

Some Details on the Implementation

The implementation of the exact confidence interval for proportion estimates uses the hypergeometric distribution function phyper(x, M, N-M, n). Note that the parametrization differs slightly from ours.

``` {r, echo=FALSE, fig.width=7, fig.height=4}

par(mfrow=c(1,2))

M <- 50; curve(phyper(x, M, 500-M, 100), from=0,to=30, col="#E495A5", lwd=2, las=1,

ylab="Distribution function, F(x)", main="Hyp(x, M, N-M, n) given N=500, n=100")

M <- 60; curve(phyper(x, M, 500-M, 100), from=0,to=30, add=TRUE, col="#86B875", lwd=2)

M <- 70; curve(phyper(x, M, 500-M, 100), from=0,to=30, add=TRUE, col="#7DB0DD", lwd=2)

abline(h=c(0.975,0.025), lty=2); abline(v=15)

legend("bottomright", legend=paste0("M=",c(50,60,70)), col=c( "#E495A5","#86B875","#7DB0DD"), lty=1,lwd=2)

par(mfrow=c(1,2))

M <- 4; curve(phyper(x, M, 50-M, 10), from=0,to=15, col="#E495A5", lwd=2, las=1, ylim=c(0,1), ylab="Distribution function, F(x)", main="Hyp(x, M, N-M, n) given N=50, n=10")

M <- 8; curve(phyper(x-1, M, 50-M, 10), from=0,to=15, add=TRUE, col="#86B875", lwd=2)

M <- 20; curve(phyper(x, M, 50-M, 10), from=0,to=15, add=TRUE, col="#86B875", lwd=2) M <- 32; curve(phyper(x, M, 50-M, 10), from=0,to=15, add=TRUE, col="#7DB0DD", lwd=2) abline(h=c(0.975,0.025), lty=2); abline(v=3) legend("bottomright", legend=paste0("M=",c(4,20,32)), col=c( "#E495A5","#86B875","#7DB0DD"), lty=1,lwd=2)

We search for the optimal confidence boundaries $[L,U]$ that fulfill the requirements as defined in the equations above.  

* Given known total population $N$, sample size $n$ and number of successes in the sample $m$, we can define some feasibility boundaries for $M$:
    + Naturally, the smallest possible value is the observed number of successes $M_{min} = m$
    + The largest possible value equals the total number $N$ minus negative observations in the sample, i.e. $M_{max} = N - (n-m)$.

* Upper boundary $U$
    + Start with largest possible value for $M$, i.e. $U_{max} = N - (n-m)$
    + Then, decrease incrementally while the $\Pr(X \leq m) < \alpha/2$, so that we find the largest possible value which still fulfills the equation


``` {r, echo=FALSE, eval=FALSE}
N = 50; n = 10; m= 3  
(Ux =  N - (n-m)) # initialize 
phyper(m, Ux, N-Ux, n)
(Ux <- Ux -1)
#Ux <- 32
phyper(m, Ux, N-Ux, n)

``` {r, echo=FALSE, eval=FALSE} N = 50; n = 10; m= 3
(Ux = m) # initialize phyper(m, Ux, N-Ux, n) (Ux <- Ux + 1) phyper(m, Ux, N-Ux, n)

Ux <- 32

```

References



jmanitz/samplingbook documentation built on April 10, 2021, 4:44 p.m.