knitr::opts_chunk$set( rmarkdown.html_vignette.check_title = FALSE, collapse = TRUE, comment = "#>" ) require(samplingbook, quietly = TRUE)
In survey sampling on a finite population, a simple random sample is typically selected without replacement, in which case a hypergeometric distribution models the observation. A standard construction for the confidence interval is based on a Normal approximation of the proportion with plug-in estimates for proportion and respective variance.
In most scenarios, this strategy results in satisfactory properties. However, if $p$ is close to 0 or 1, it is recommended to use the exact confidence interval based on the hypergeometrical distribution [@kuechenhoff]. The Wald-type interval has a coverage probability as low as $n/N$ for any $\alpha$ [@wang2015]. Therefore, there is no guarantee for the interval to capture the true $M$ with the desired confidence level if the sample is much smaller than the population [@wang2015].
samplingbook
The function samplingbook::Sprop()
estimates the proportion out of samples either with or without consideration of finite population correction.
Parameters are
m
an optional non-negative integer for number of positive events, n
an optional positive integer for sample size,N
positive integer for population size. Default is N=Inf
, which means calculations are carried out without finite population correction.In case of finite population of size N
is provided, different methods for calculating confidence intervals are provided
approx
Wald-type interval based on normal approximation [@agresti1998], andexact
based on hypergeometric distribution as described in more detail in this document.Sprop(m=3, n = 10, N = 50, level = 0.95)
We observe $X=m$, the number of sampled units having the characteristic of interest, where $X \sim Hyper(M, N, n)$, with
The respective density, i.e. the probability of successes in a sample given $M, N, n$, is $$\Pr(X=m) = \frac{{M \choose m} {N-M \choose n-m}}{N \choose n}, \text{ with support }m \in {\max(0,n+M-N), \min(M,n)} $$
We want to estimate population proportion $p = M/N$, which is equivalent to estimating $M$, the total number of population units with some attribute of interest. Then, the boundaries for the exact confidence interval $[L,U]$ can be derived as follows:
$$ \begin{aligned} \Pr(X \leq m) & = \sum_{x=0}^m \frac{{U \choose x} {N-U \choose n-x}}{N \choose n} = \alpha_1 \ \Pr(X \geq m) & = \sum_{x=m}^n \frac{{L \choose x} {N-L \choose n-x}}{N \choose n} = \alpha_2,\ & \text{with coverage constraint } \alpha_1 + \alpha_2 \leq \alpha \end{aligned} $$ For sake of simplicity, we assume symmetric confidence intervals, i.e $\alpha_1 = \alpha_2 = \alpha/2$.
The implementation of the exact confidence interval for proportion estimates uses the hypergeometric distribution function phyper(x, M, N-M, n)
. Note that the parametrization differs slightly from ours.
``` {r, echo=FALSE, fig.width=7, fig.height=4}
M <- 4; curve(phyper(x, M, 50-M, 10), from=0,to=15, col="#E495A5", lwd=2, las=1, ylim=c(0,1), ylab="Distribution function, F(x)", main="Hyp(x, M, N-M, n) given N=50, n=10")
M <- 20; curve(phyper(x, M, 50-M, 10), from=0,to=15, add=TRUE, col="#86B875", lwd=2) M <- 32; curve(phyper(x, M, 50-M, 10), from=0,to=15, add=TRUE, col="#7DB0DD", lwd=2) abline(h=c(0.975,0.025), lty=2); abline(v=3) legend("bottomright", legend=paste0("M=",c(4,20,32)), col=c( "#E495A5","#86B875","#7DB0DD"), lty=1,lwd=2)
We search for the optimal confidence boundaries $[L,U]$ that fulfill the requirements as defined in the equations above. * Given known total population $N$, sample size $n$ and number of successes in the sample $m$, we can define some feasibility boundaries for $M$: + Naturally, the smallest possible value is the observed number of successes $M_{min} = m$ + The largest possible value equals the total number $N$ minus negative observations in the sample, i.e. $M_{max} = N - (n-m)$. * Upper boundary $U$ + Start with largest possible value for $M$, i.e. $U_{max} = N - (n-m)$ + Then, decrease incrementally while the $\Pr(X \leq m) < \alpha/2$, so that we find the largest possible value which still fulfills the equation ``` {r, echo=FALSE, eval=FALSE} N = 50; n = 10; m= 3 (Ux = N - (n-m)) # initialize phyper(m, Ux, N-Ux, n) (Ux <- Ux -1) #Ux <- 32 phyper(m, Ux, N-Ux, n)
``` {r, echo=FALSE, eval=FALSE}
N = 50; n = 10; m= 3
(Ux = m) # initialize
phyper(m, Ux, N-Ux, n)
(Ux <- Ux + 1)
phyper(m, Ux, N-Ux, n)
```
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.