Estimation {#cha-estimation}

#    IPSUR: Introduction to Probability and Statistics Using R
#    Copyright (C) 2018  G. Jay Kerns
#
#    Chapter: Estimation
#
#    This file is part of IPSUR.
#
#    IPSUR is free software: you can redistribute it and/or modify
#    it under the terms of the GNU General Public License as published by
#    the Free Software Foundation, either version 3 of the License, or
#    (at your option) any later version.
#
#    IPSUR is distributed in the hope that it will be useful,
#    but WITHOUT ANY WARRANTY; without even the implied warranty of
#    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#    GNU General Public License for more details.
#
#    You should have received a copy of the GNU General Public License
#    along with IPSUR.  If not, see <http://www.gnu.org/licenses/>.
library(Hmisc)
library(RcmdrPlugin.IPSUR)
library(reshape)
library(stats4)
library(TeachingDemos)

We will discuss two branches of estimation procedures: point estimation and interval estimation. We briefly discuss point estimation first and then spend the rest of the chapter on interval estimation.

We find an estimator with the methods of Section \@ref(sec-point-estimation). We make some assumptions about the underlying population distribution and use what we know from Chapter \@ref(cha-sampling-distributions) about sampling distributions both to study how the estimator will perform, and to find intervals of confidence for underlying parameters associated with the population distribution. Once we have confidence intervals we can do inference in the form of hypothesis tests in the next chapter.

What do I want them to know?

Point Estimation {#sec-point-estimation}

The following example is how I was introduced to maximum likelihood.

\bigskip

```{example, label="how-many-fish", name="Fishing, part one"} Imagine there is a small pond in our backyard, and in the pond there live some fish. We would like to know how many fish live in the pond. How can we estimate this? One procedure developed by researchers is the capture-recapture method. Here is how it works.

We go fishing, and on each fish that we catch we attach 
an unobtrusive tag to its tail, and release it back into the water. 
Suppose on this day that we capture \(M=7\) fish in total.

Next, we wait a few days for the fish to remix and become accustomed
to their new tag. Then we go fishing again. On the second trip some of
the fish we catch may be tagged; some may not be. Let \(X\) denote the
number of caught fish which are tagged[^estimate-01], and suppose
for the sake of argument that we catch \(K=4\) fish and we find that 3
of them are tagged.

[^estimate-01]: It is theoretically possible that we could catch the
same tagged fish more than once, which would inflate our count of
tagged fish. To avoid this difficulty, suppose that on the second trip
we use a tank on the boat to hold the caught fish until data
collection is completed.

Now let \(F\) denote the (unknown) total number of fish in the
pond. We know that \(F\geq7\), because we tagged that many on the
first trip. In fact, if we let \(N\) denote the number of untagged
fish in the pond, then \(F=M+N\). Using our earlier language, we have
sampled \(K=4\) times, without replacement, from an urn (pond) which
has \(M=7\) white balls (tagged fish) and \(N=F-M\) black balls
(untagged fish), and we have observed \(x=3\) of them to be white
(tagged). What is the probability of this?

Looking back to Section \@ref(sec-other-discrete-distributions), we see that the
random variable \(X\) has a
\(\mathsf{hyper}(\mathtt{m}=M,\,\mathtt{n}=F-M,\,\mathtt{k}=K)\)
distribution. Therefore, for an observed value \(X=x\) the probability
would be \[ \mathbb{P}(X=x)=\frac{{M \choose x}{F-M \choose K-x}}{{F
\choose K}}.  \] First we notice that \(F\) must be at least 7. Could
\(F\) be equal to seven? If \(F=7\) then all of the fish would have
been tagged on the first run, and there would be no untagged fish in
the pond, thus, \(\mathbb{P}(\mbox{3 successes in 4 trials})=0\).
What about \(F=8\); what would be the probability of observing \(X=3\)
tagged fish?  \[ \mathbb{P}(\mbox{3 successes in 4 trials})=\frac{{7
\choose 3}{1 \choose 1}}{{8 \choose 4}}=\frac{35}{70}=0.5.  \]
Similarly, if \(F=9\) then the probability of observing \(X=3\) tagged
fish would be \[ \mathbb{P}(\mbox{3 successes in 4 trials})=\frac{{7
\choose 3}{2 \choose 1}}{{9 \choose 4}}=\frac{70}{126}\approx0.556.
\] We can see already that the observed data \(X=3\) is more likely
when \(F=9\) than it is when \(F=8\). And here lies the genius of Sir
Ronald Aylmer Fisher: he asks, "What is the value of \(F\) which has
the highest likelihood?" In other words, for all of the different
possible values of \(F\), which one makes the above probability the
biggest? We can answer this question with a plot of
\(\mathbb{P}(X=x)\) versus \(F\). See Figure
\@ref(fig:capture-recapture).

The graph shows that \(\hat{F}=9\) is the number of fish with the maximum likelihood, and that is what we will take for our estimate of the number of fish in the the pond.



```r
heights = rep(0, 16)
for (j in 7:15) heights[j] <- dhyper(3, m = 7, n = j - 7, k = 4)
plot(6:15, heights[6:15], pch = 16, cex = 1.5, xlab = "number of fish in pond", ylab = "Likelihood")
abline(h = 0)
lines(6:15, heights[6:15], type = "h", lwd = 2, lty = 3)
text(9, heights[9]/6, bquote(hat(F)==.(9)), cex = 2, pos = 4)
lines(9, heights[9], type = "h", lwd = 2)
points(9, 0, pch = 4, lwd = 3, cex = 2)

(ref:cap-capture-recapture) \small A plot of maximum likelihood for the capture-recapture experiment. We see that (\hat{F}=9) is the number of fish that results in the maximum likelihood for the observed data.

\bigskip

```{example, label="bass-bluegill", name="Fishing, part two"} In the last example we were only concerned with how many fish were in the pond, but now, we will ask a different question. Suppose it is known that there are only two species of fish in the pond: smallmouth bass (Micropterus dolomieu) and bluegill (Lepomis macrochirus); perhaps we built the pond some years ago and stocked it with only these two species. We would like to estimate the proportion of fish in the pond which are bass.

Let \(p=\mbox{the proportion of bass}\). Without any other
information, it is conceivable for \(p\) to be any value in the
interval \([0,1]\), but for the sake of argument we will suppose that
\(p\) falls strictly between zero and one. How can we learn about the
true value of \(p\)? Go fishing! As before, we will use
catch-and-release, but unlike before, we will not tag the fish. We
will simply note the species of any caught fish before returning it to
the pond.

Suppose we catch \(n\) fish. Let \[ X_{i} = \begin{cases} 1, &
\mbox{if the \(i^{\text{th}}\) fish is a bass,}\\ 0, & \mbox{if the
\(i^{\text{th}}\) fish is a bluegill.} \end{cases} \] Since we are
returning the fish to the pond once caught, we may think of this as a
sampling scheme with replacement where the proportion of bass \(p\)
does not change. Given that we allow the fish sufficient time to "mix"
once returned, it is not completely unreasonable to model our fishing
experiment as a sequence of Bernoulli trials, so that the \(X_{i}\)'s
would be IID
\(\mathsf{binom(\mathtt{size}}=1,\,\mathtt{prob}=p)\). Under those
assumptions we would have
\begin{eqnarray*}
\mathbb{P}(X_{1}=x_{1},\, X_{2}=x_{2},\,\ldots,\, X_{n}=x_{n}) & = & \mathbb{P}(X_{1}=x_{1})\,\mathbb{P}(X_{2}=x_{2})\,\cdots\mathbb{P}(X_{n}=x_{n}),\\
 & = & p^{x_{1}}(1-p)^{x_{1}}\, p^{x_{2}}(1-p)^{x_{2}}\cdots\, p^{x_{n}}(1-p)^{x_{n}},\\
 & = & p^{\sum x_{i}}(1-p)^{n-\sum x_{i}}.
\end{eqnarray*}
That is, \[ \mathbb{P}(X_{1}=x_{1},\, X_{2}=x_{2},\,\ldots,\,
X_{n}=x_{n})=p^{\sum x_{i}}(1-p)^{n-\sum x_{i}}.  \] This last
quantity is a function of \(p\), called the *likelihood function*
\(L(p)\): \[ L(p)=p^{\sum x_{i}}(1-p)^{n-\sum x_{i}}.  \] A graph of
\(L\) for values of \(\sum x_{i}=3,\ 4\), and 5 when \(n=7\) is shown
in Figure \@ref(fig:fishing-part-two).


```r
curve(x^5*(1-x)^2, 0, 1, xlab = "p", ylab = "L(p)")
curve(x^4*(1-x)^3, 0, 1, add = TRUE)
curve(x^3*(1-x)^4, 0, 1, add = TRUE)

(ref:cap-fishing-part-two) \small Assorted likelihood functions for Fishing, part two. Three graphs are shown of (L) when (\sum x_{i}) equals 3, 4, and 5, respectively, from left to right. We pick an (L) that matches the observed data and then maximize (L) as a function of (p). If (\sum x_{i}=4), then the maximum appears to occur somewhere around (p \approx 0.6).

We want the value of (p) which has the highest likelihood, that is, we again wish to maximize the likelihood. We know from calculus (see Appendix \@ref(sec-differential-and-integral)) to differentiate (L) and set (L'=0) to find a maximum. [ L'(p)=\left(\sum x_{i}\right)p^{\sum x_{i}-1}(1-p)^{n-\sum x_{i}}+p^{\sum x_{i}}\left(n-\sum x_{i}\right)(1-p)^{n-\sum x_{i}-1}(-1). ] The derivative vanishes ((L'=0)) when \begin{eqnarray} \left(\sum x_{i}\right)p^{\sum x_{i}-1}(1-p)^{n-\sum x_{i}} & = & p^{\sum x_{i}}\left(n-\sum x_{i}\right)(1-p)^{n-\sum x_{i}-1},\ \sum x_{i}(1-p) & = & \left(n-\sum x_{i}\right)p,\ \sum x_{i}-p\sum x_{i} & = & np-p\sum x_{i},\ \frac{1}{n}\sum_{i=1}^{n}x_{i} & = & p. \end{eqnarray} This "best" (p), the one which maximizes the likelihood, is called the maximum likelihood estimator (MLE) of (p) and is denoted (\hat{p}). That is, \begin{equation} \hat{p}=\frac{\sum_{i=1}^{n}x_{i}}{n}=\overline{x}. \end{equation}

\bigskip

```{block, type="remark"} Strictly speaking we have only shown that the derivative equals zero at (\hat{p}), so it is theoretically possible that the critical value (\hat{p}=\overline{x}) is located at a minimum[^est-2] instead of a maximum! We should be thorough and check that (L'>0) when (p<\overline{x}) and (L'<0) when (p>\overline{x}). Then by the First Derivative Test (Theorem \@ref(thm:first-derivative-test)) we could be certain that (\hat{p}=\overline{x}) is indeed a maximum likelihood estimator, and not a minimum likelihood estimator.

The result is shown in Figure \@ref(fig:species-mle).


[^est-2]: We can tell from the graph that our value of \(\hat{p}\) is a maximum instead of a minimum so we do not really need to worry for this example. Other examples are not so easy, however, and we should be careful to be cognizant of this extra step.


```r
dat <- rbinom(27, size = 1, prob = 0.3)
like <- function(x){
r <- 1
for (k in 1:27){ r <- r*dbinom(dat[k], size = 1, prob = x)}
return(r)
}
curve(like, from = 0, to = 1, xlab = "parameter space", ylab = "Likelihood", lwd = 3, col = "blue")
abline(h = 0, lwd = 1, lty = 3, col = "grey")
mle <- mean(dat)
mleobj <- like(mle)
lines(mle, mleobj, type = "h", lwd = 2, lty = 3, col = "red")
points(mle, 0, pch = 4, lwd = 2, cex = 2, col = "red")
text(mle, mleobj/6, substitute(hat(theta)==a, list(a=round(mle, 4))), cex = 2, pos = 4)

(ref:cap-species-mle) \small Species maximum likelihood. Here we see that (\hat{theta} \approx 0.44) is the proportion of bass with the maximum likelihood.

In general, we have a family of PDFs (f(x|\theta)) indexed by a parameter (\theta) in some parameter space (\Theta). We want to learn about (\theta). We take a (SRS(n)): \begin{equation} X_{1},\, X_{2},\,\ldots,X_{n}\mbox{ which are IID (f(x| \theta )).} \end{equation}

\bigskip

Given the observed data \(x_{1}\), \(x_{2}\),..., \(x_{n}\), the
*likelihood function* \(L\) is defined by \[
L(\theta)=\prod_{i=1}^{n}f(x_{i}|\theta),\quad \theta\in\Theta.  \]

The next step is to maximize (L). The method we will use in this book is to find the derivative (L') and solve the equation (L'(\theta)=0). Call a solution (\hat{\theta}). We will check that (L) is maximized at (\hat{\theta}) using the First Derivative Test or the Second Derivative Test (\left(L''(\hat{\theta})<0\right)).

\bigskip

A value \(\theta\) that maximizes \(L\) is called a *maximum
likelihood estimator* (MLE) and is denoted \(\hat{\theta}\). It is a
function of the sample, \(\hat{\theta}=\hat{\theta}\left(X_{1},\,
X_{2},\,\ldots,X_{n}\right)\), and is called a *point estimator* of
\(\theta\).

\bigskip

```{block, type="remark"} Some comments about maximum likelihood estimators:

MLEs are just one of _many_ possible estimators. One of the more
popular alternatives are the *method of moments estimators*; see Casella and Berger [@Casella2002] for more.

Notice, in Example \@ref(ex:bass-bluegill) we had \(X_{i}\) IID
\(\mathsf{binom}(\mathtt{size}=1,\,\mathtt{prob}=p)\), and we saw that
the MLE was \(\hat{p}=\overline{X}\). But further
\begin{eqnarray*}
\mathbb{E}\overline{X} & = & \mathbb{E}\left(\frac{X_{1}+X_{2}+\cdots+X_{n}}{n}\right),\\
 & = & \frac{1}{n}\left(\mathbb{E} X_{1}+\mathbb{E} X_{2}+\cdots+\mathbb{E} X_{n}\right),\\
 & = & \frac{1}{n}\left(np\right),\\
 & = & p,
\end{eqnarray*}
which is exactly the same as the parameter which we estimated. More
concisely, \(\mathbb{E}\hat{p}=p\), that is, on the average, the
estimator is exactly right.

\bigskip

```{definition}
Let \(s(X_{1},X_{2},\ldots,X_{n})\) be a statistic which estimates
\(\theta\). If \[ \mathbb{E} s(X_{1},X_{2},\ldots,X_{n})=\theta, \]
then the statistic \(s(X_{1},X_{2},\ldots,X_{n})\) is said to be an
*unbiased estimator* of \(\theta\). Otherwise, it is *biased*.

\bigskip

```{example, label="normal-mle-both"} Let (X_{1}), (X_{2}), ... , (X_{n}) be an (SRS(n)) from a (\mathsf{norm}(\mathtt{mean}=\mu,\,\mathtt{sd}=\sigma)) distribution. It can be shown (in Exercise \@ref(xca-norm-mu-sig-mle)) that if (\mbox{$\theta$}=(\mu,\sigma^{2})) then the MLE of (\theta) is \begin{equation} \hat{\theta}=(\hat{\mu},\hat{\sigma}^{2}), \end{equation} where (\hat{\mu}=\overline{X}) and \begin{equation} \hat{\sigma^{2}}=\frac{1}{n}\sum_{i=1}^{n}\left(X_{i}-\overline{X}\right)^{2}=\frac{n-1}{n}S^{2}. \end{equation}

We of course know from \@ref{pro:mean-sd-xbar} that \(\hat{\mu}\) is
unbiased. What about \(\hat{\sigma^{2}}\)? Let us check:
\begin{eqnarray*}
\mathbb{E}\,\hat{\sigma^{2}} & = & \mathbb{E}\,\frac{n-1}{n}S^{2}\\
 & = & \mathbb{E}\left(\frac{\sigma^{2}}{n}\frac{(n-1)S^{2}}{\sigma^{2}}\right)\\
 & = & \frac{\sigma^{2}}{n}\mathbb{E}\ \mathsf{chisq}(\mathtt{df}=n-1)\\
 & = & \frac{\sigma^{2}}{n}(n-1),
\end{eqnarray*}
from which we may conclude two things:

* \(\hat{\sigma^{2}}\) is a biased estimator of \(\sigma^{2}\), and
* \(S^{2}=n\hat{\sigma^{2}}/(n-1)\) is an unbiased estimator of
  \(\sigma^{2}\).



One of the most common questions in an introductory statistics class
is, "Why do we divide by \(n-1\) when we compute the sample variance?
It's an average, right? So why do we not divide by \(n\)?" We see now that division by \(n\)
amounts to the use of a *biased* estimator for \(\sigma^{2}\), that
is, if we divided by \(n\) then on the average we would
*underestimate* the true value of \(\sigma^{2}\). We use \(n-1\) so
that, on the average, our estimator of \(\sigma^{2}\) will be exactly
right.

### How to do it with R

R can be used to find maximum likelihood estimators in a
lot of diverse settings. We will discuss only the most basic here and
will leave the rest to more sophisticated texts.

For one parameter estimation problems we may use the `optimize`
function to find MLEs. The arguments are the function to be maximized
(the likelihood function), the range over which the optimization is to
take place, and optionally any other arguments to be passed to the
likelihood if needed.

Let us see how to do Example \@ref(ex:bass-bluegill). Recall that our
likelihood function was given by
\begin{equation}
L(p)=p^{\sum x_{i}}(1-p)^{n-\sum x_{i}}.
\end{equation}
Notice that the likelihood is just a product of
\(\mathsf{binom}(\mathtt{size}=1,\,\mathtt{prob}=p)\) PMFs. We first
give some sample data (in the vector `datavals`), next we define the
likelihood function `L`, and finally we `optimize` `L` over the range
`c(0,1)`.

```r 
x <- mtcars$am
L <- function(p,x) prod(dbinom(x, size = 1, prob = p))
optimize(L, interval = c(0,1), x = x, maximum = TRUE)
A <- optimize(L, interval = c(0,1), x = x, maximum = TRUE)
amax <- A$maximum; aobj <- A$objective

Note that the optimize function by default minimizes the function L, so we have to set maximum = TRUE to get an MLE. The returned value of $maximum gives an approximate value of the MLE to be r round(amax, 3) and objective gives L evaluated at the MLE which is approximately r round(aobj, 3).

We previously remarked that it is usually more numerically convenient to maximize the log-likelihood (or minimize the negative log-likelihood), and we can just as easily do this with R. We just need to calculate the log-likelihood beforehand which (for this example) is [ -l(p)=-\sum x_{i}\ln\, p-\left(n-\sum x_{i}\right)\ln(1-p). ]

It is done in R with

minuslogL <- function(p,x){
                -sum(dbinom(x, size = 1, prob = p, log = TRUE))
             }
optimize(minuslogL, interval = c(0,1), x = x)

Note that we did not need maximum = TRUE because we minimized the negative log-likelihood. The answer for the MLE is essentially the same as before, but the $objective value was different, of course.

For multiparameter problems we may use a similar approach by way of the mle function in the stats4 package [@stats4].

\bigskip

``{example, name="Plant Growth"} We will investigate theweightvariable of thePlantGrowth` data. We will suppose that the weights constitute a random observations (X_{1}), (X_{2}), ... , (X_{n}) that are IID (\mathsf{norm}(\mathtt{mean}=\mu,\,\mathtt{sd}=\sigma)) which is not unreasonable based on a histogram and other exploratory measures. We will find the MLE of (\theta=(\mu,\sigma^{2})). We claimed in Example \@ref(ex:normal-mle-both) that (\hat{\theta}=(\hat{\mu},\hat{\sigma}^{2})) had the form given above. Let us check whether this is plausible numerically. The negative log-likelihood function is

```r
minuslogL <- function(mu, sigma2){
  -sum(dnorm(x, mean = mu, sd = sqrt(sigma2), log = TRUE))
}

Note that we omitted the data as an argument to the log-likelihood function; the only arguments were the parameters over which the maximization is to take place. Now we will simulate some data and find the MLE. The optimization algorithm requires starting values (intelligent guesses) for the parameters. We choose values close to the sample mean and variance (which turn out to be approximately 5 and 0.5, respectively) to illustrate the procedure.

library(stats4)
x <- PlantGrowth$weight
MaxLikeEst <- mle(minuslogL, start = list(mu = 5, sigma2 = 0.5))
summary(MaxLikeEst)

The outputted MLEs are shown above, and mle even gives us estimates for the standard errors of (\hat{\mu}) and (\hat{\sigma}^{2}) (which were obtained by inverting the numerical Hessian matrix at the optima; see Appendix \@ref(sec-multivariable-calculus)). Let us check how close the numerical MLEs came to the theoretical MLEs:

mean(x); var(x)*29/30; sd(x)/sqrt(30)

The numerical MLEs were very close to the theoretical MLEs. We already knew that the standard error of (\hat{\mu}=\overline{X}) is (\sigma/\sqrt{n}), and the numerical estimate of this was very close too.

There is functionality in the distrTEst package to calculate theoretical MLEs; we will skip examples of these for the time being.

Confidence Intervals for Means {#sec-confidence-intervals-for-means}

We are given (X_{1}), (X_{2}), ..., (X_{n}) that are an (SRS(n)) from a (\mathsf{norm}(\mathtt{mean}=\mu,\,\mathtt{sd}=\sigma)) distribution, where (\mu) is unknown. We know that we may estimate (\mu) with (\overline{X}), and we have seen that this estimator is the MLE. But how good is our estimate? We know that \begin{equation} \frac{\overline{X}-\mu}{\sigma/\sqrt{n}}\sim\mathsf{norm}(\mathtt{mean}=0,\,\mathtt{sd}=1). \end{equation} For a big probability (1-\alpha), for instance, 95%, we can calculate the quantile (z_{\alpha/2}). Then \begin{equation} \mathbb{P}\left(-z_{\alpha/2}\leq\frac{\overline{X}-\mu}{\sigma/\sqrt{n}}\leq z_{\alpha/2}\right)=1-\alpha. \end{equation} But now consider the following string of equivalent inequalities: [ -z_{\alpha/2}\leq\frac{\overline{X}-\mu}{\sigma/\sqrt{n}}\leq z_{\alpha/2}, ] [ -z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right)\leq\overline{X}-\mu\leq z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right), ] [ -\overline{X} - z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right)\leq - \mu \leq - \overline{X} + z_{\alpha/2} \left( \frac{\sigma}{\sqrt{n}} \right), ] [ \overline{X} - z_{\alpha/2} \left( \frac{\sigma}{\sqrt{n}} \right) \leq \mu \leq \overline{X} + z_{\alpha/2} \left( \frac{\sigma}{\sqrt{n}} \right). ] That is, \begin{equation} \mathbb{P}\left(\overline{X}-z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\leq\mu\leq\overline{X}+z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)=1-\alpha. \end{equation}

\bigskip

The interval
\begin{equation}
\left[\overline{X}-z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \overline{X}+z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right]
\end{equation}
is a \(100(1-\alpha)\%\) *confidence interval for* \(\mu\). The quantity \(1-\alpha\) is called the *confidence coefficient*.

\bigskip

```{block, type="remark"} The interval is also sometimes written more compactly as \begin{equation} \label{eq-z-interval} \overline{X}\pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}. \end{equation}

The interpretation of confidence intervals is tricky and often
mistaken by novices. When I am teaching the concept "live" during
class, I usually ask the students to imagine that my piece of chalk
represents the "unknown" parameter, and I lay it down on the desk in
front of me. Once the chalk has been lain, it is *fixed*; it does not
move. Our goal is to estimate the parameter. For the estimator I pick
up a sheet of loose paper lying nearby. The estimation procedure is to
randomly drop the piece of paper from above, and observe where it
lands. If the piece of paper covers the piece of chalk, then we are
successful -- our estimator covers the parameter. If it falls off to
one side or the other, then we are unsuccessful; our interval fails to
cover the parameter.

Then I ask them: suppose we were to repeat this procedure hundreds,
thousands, millions of times. Suppose we kept track of how many times
we covered and how many times we did not. What percentage of the time
would we be successful?

In the demonstration, the parameter corresponds to the chalk, the
sheet of paper corresponds to the confidence interval, and the random
experiment corresponds to dropping the sheet of paper. The percentage
of the time that we are successful *exactly* corresponds to the
*confidence coefficient*. That is, if we use a 95% confidence
interval, then we can say that, in the long run, approximately 95% of
our intervals will cover the true parameter (which is fixed, but
unknown).

See Figure \@ref(fig:ci-examp), which is a graphical display of these ideas.


```r
ci.examp()

(ref:cap-ci-examp) \small The graph was generated by the ci.examp function from the TeachingDemos package. Fifty (50) samples of size twenty five (25) were generated from a (\mathsf{norm}(\mathtt{mean}=100,\,\mathtt{sd}=10)) distribution, and each sample was used to find a 95% confidence interval for the population mean using Equation \eqref{eq-z-interval}. The 50 confidence intervals are represented above by horizontal lines, and the respective sample means are denoted by vertical slashes. Confidence intervals that "cover" the true mean value of 100 are plotted in black; those that fail to cover are plotted in a lighter color. In the plot we see that only one (1) of the simulated intervals out of the 50 failed to cover (\mu=100), which is a success rate of 98%. If the number of generated samples were to increase from 50 to 500 to 50000, ..., then we would expect our success rate to approach the exact value of 95%.

Under the above framework, we can reason that an "interval" with a larger confidence coefficient corresponds to a wider sheet of paper. Furthermore, the width of the confidence interval (sheet of paper) should be somehow related to the amount of information contained in the random sample, (X_{1}), (X_{2}), ..., (X_{n}). The following remarks makes these notions precise.

\bigskip

```{block, type="remark"} For a fixed confidence coefficient (1-\alpha), if (n) increases, then the confidence interval gets SHORTER.

For a fixed sample size (n), if (1-\alpha) increases, then the confidence interval gets WIDER.

\bigskip

```{example, label="plant-one-samp-z-int", name="Results from an Experiment on Plant Growth"} 
The `PlantGrowth` data frame gives the results of an experiment to
measure plant yield (as measured by the weight of the plant). We would
like to a 95% confidence interval for the mean weight of the
plants. Suppose that we know from prior research that the true
population standard deviation of the plant weights is \(0.7\) g.

The parameter of interest is \(\mu\), which represents the true mean
weight of the population of all plants of the particular species in
the study. We will first take a look at a stem-and-leaf display of the
data:
with(PlantGrowth, stem.leaf(weight))

The data appear to be approximately normal with no extreme values. The data come from a designed experiment, so it is reasonable to suppose that the observations constitute a simple random sample of weights[^estimate-03]. We know the population standard deviation (\sigma=0.70) from prior research. We are going to use the one-sample (z)-interval.

[^estimate-03]: Actually we will see later that there is reason to believe that the observations are simple random samples from three distinct populations. See Section \ref{sec-Analysis-of-Variance}.

dim(PlantGrowth)   # sample size is first entry
with(PlantGrowth, mean(weight))
qnorm(0.975)

We find the sample mean of the data to be (\overline{x}=5.073) and (z_{\alpha/2}=z_{0.025}\approx1.96). Our interval is therefore [ \overline{x}\pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}=5.073\pm1.96\cdot\frac{0.70}{\sqrt{30}}, ] which comes out to approximately ([4.823,\,5.323]). In conclusion, we are 95% confident that the true mean weight (\mu) of all plants of this species lies somewhere between 4.823 g and 5.323 g, that is, we are 95% confident that the interval ([4.823,\,5.323]) covers (\mu).

BLANK

\bigskip

Give some data with \(X_{1}\), \(X_{2}\), ..., \(X_{n}\) an \(SRS(n)\)
from a \(\mathsf{norm}(\mathtt{mean}=\mu,\,\mathtt{sd}=\sigma)\)
distribution. Maybe small sample?
  1. What is the parameter of interest? in the context of the problem.
  2. Give a point estimate for (\mu).
  3. What are the assumptions being made in the problem? Do they meet the conditions of the interval?
  4. Calculate the interval.
  5. Draw the conclusion.

\bigskip

```{block, type="remark"} What if (\sigma) is unknown? We instead use the interval \begin{equation} \overline{X}\pm z_{\alpha/2}\frac{S}{\sqrt{n}}, \end{equation} where (S) is the sample standard deviation.

The author learned of a handy acronym from AP Statistics Exam graders
that summarizes the important parts of confidence interval estimation,
which is PANIC: *P*-arameter, *A*-ssumptions, *N*-ame, *I*-nterval,
and *C*-onclusion.

* **Parameter:** identify the parameter of interest with the proper
                symbols. Write down what the parameter means in the
                context of the problem.
* **Assumptions:**  list any assumptions made in the experiment. If
                  there are any other assumptions needed or that were
                  not checked, state what they are and why they are
                  important.
* **Name:**  choose a statistical procedure from your bag of tricks
           based on the answers to the previous two parts. The
           assumptions of the procedure you choose should match those
           of the problem; if they do not match then either pick a
           different procedure or openly admit that the results may
           not be reliable. Write down any underlying formulas used.
* **Interval:**  calculate the interval from the sample data. This can
               be done by hand but will more often be done with the
               aid of a computer. Regardless of the method, all
               calculations or code should be shown so that the entire
               process is repeatable by a subsequent reader.
* **Conclusion:**  state the final results, using language in the
                 context of the problem. Include the appropriate
                 interpretation of the interval, making reference to
                 the confidence coefficient.

\bigskip

```{block, type="remark"}
All of the above intervals for \(\mu\) were two-sided, but there are
also one-sided intervals for \(\mu\). They look like
\begin{equation}
\left[\overline{X}-z_{\alpha}\frac{\sigma}{\sqrt{n}},\ \infty\right)\quad \mbox{or}\quad \left(-\infty,\ \overline{X}+z_{\alpha}\frac{\sigma}{\sqrt{n}}\right]
\end{equation}
and satisfy
\begin{equation}
\mathbb{P}\left(\overline{X}-z_{\alpha}\frac{\sigma}{\sqrt{n}}\leq\mu\right)=1-\alpha\quad \mbox{and}\quad \mathbb{P}\left(\overline{X}+z_{\alpha}\frac{\sigma}{\sqrt{n}}\geq\mu\right)=1-\alpha.
\end{equation}

\bigskip

BLANK

Small sample, some data with \(X_{1}\), \(X_{2}\), ..., \(X_{n}\) an
\(SRS(n)\) from a
\(\mathsf{norm}(\mathtt{mean}=\mu,\,\mathtt{sd}=\sigma)\)
distribution.  PANIC

How to do it with R

We can do Example \@ref(ex:plant-one-samp-z-int) with the following code.

temp <- with(PlantGrowth, z.test(weight, stdev = 0.7))
temp

BLANK

The confidence interval bounds are shown in the sixth line down of the output (please disregard all of the additional output information for now -- we will use it in Chapter \@ref(cha-hypothesis-testing)). We can make the plot for Figure \@ref(fig:plant-z-int-plot) with

plot(temp, "Conf")

Confidence Intervals for Differences of Means {#sec-conf-interv-for-diff-means}

Let (X_{1}), (X_{2}), ..., (X_{n}) be a (SRS(n)) from a (\mathsf{norm}(\mathtt{mean}=\mu_{X},\,\mathtt{sd}=\sigma_{X})) distribution and let (Y_{1}), (Y_{2}), ..., (Y_{m}) be a (SRS(m)) from a (\mathsf{norm}(\mathtt{mean}=\mu_{Y},\,\mathtt{sd}=\sigma_{Y})) distribution. Further, assume that the (X_{1}), (X_{2}), ..., (X_{n}) sample is independent of the (Y_{1}), (Y_{2}), ..., (Y_{m}) sample.

Suppose that (\sigma_{X}) and (\sigma_{Y}) are known. We would like a confidence interval for (\mu_{X}-\mu_{Y}). We know that \begin{equation} \overline{X}-\overline{Y}\sim\mathsf{norm}\left(\mathtt{mean}=\mu_{X}-\mu_{Y},\,\mathtt{sd}=\sqrt{\frac{\sigma_{X}^{2}}{n}+\frac{\sigma_{Y}^{2}}{m}}\right). \end{equation} Therefore, a (100(1-\alpha)\%) confidence interval for (\mu_{X}-\mu_{Y}) is given by \begin{equation} \label{eq-two-samp-mean-CI} \left(\overline{X}-\overline{Y}\right)\pm z_{\alpha/2}\sqrt{\frac{\sigma_{X}^{2}}{n}+\frac{\sigma_{Y}^{2}}{m}}. \end{equation} Unfortunately, most of the time the values of (\sigma_{X}) and (\sigma_{Y}) are unknown. This leads us to the following:

How to do it with R

The basic function is t.test which has a var.equal argument that may be set to TRUE or FALSE. The confidence interval is shown as part of the output, although there is a lot of additional information that is not needed until Chapter \@ref(cha-hypothesis-testing).

There is not any specific functionality to handle the (z)-interval for small samples, but if the samples are large then t.test with var.equal = FALSE will be essentially the same thing. The standard deviations are never (?) known in advance anyway so it does not really matter in practice.

Confidence Intervals for Proportions {#sec-confidence-intervals-proportions}

We would like to know (p) which is the "proportion of successes". For instance, (p) could be:

We are given an (SRS(n)) (X_{1}), (X_{2}), ..., (X_{n}) distributed (\mathsf{binom}(\mathtt{size}=1,\,\mathtt{prob}=p)). Recall from Section \@ref(sec-binom-dist) that the common mean of these variables is (\mathbb{E} X=p) and the variance is (\mathbb{E}(X-p)^{2}=p(1-p)). If we let (Y=\sum X_{i}), then from Section \@ref(sec-binom-dist) we know that (Y\sim\mathsf{binom}(\mathtt{size}=n,\,\mathtt{prob}=p)) and that [ \overline{X}=\frac{Y}{n}\mbox{ has }\mathbb{E}\overline{X}=p\mbox{ and }\mathrm{Var}(\overline{X})=\frac{p(1-p)}{n}. ] Thus if (n) is large (here is the CLT) then an approximate (100(1-\alpha)\%) confidence interval for (p) would be given by \begin{equation} \label{eq-ci-p-no-good} \overline{X}\pm z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}}. \end{equation} OOPS...! Equation \eqref{eq-ci-p-no-good} is of no use to us because the \underbar{unknown} parameter (p) is in the formula! (If we knew what (p) was to plug in the formula then we would not need a confidence interval in the first place.) There are two solutions to this problem.

  1. Replace (p) with (\hat{p}=\overline{X}). Then an approximate (100(1-\alpha)\%) confidence interval for (p) is given by \begin{equation} \hat{p}\pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}. \end{equation} This approach is called the Wald interval and is also known as the asymptotic interval because it appeals to the CLT for large sample sizes.
  2. Go back to first principles. Note that [ -z_{\alpha/2}\leq\frac{Y/n-p}{\sqrt{p(1-p)/n}}\leq z_{\alpha/2} ] exactly when the function (f) defined by [ f(p)=\left(Y/n-p\right)^{2}-z_{\alpha/2}^{2}\frac{p(1-p)}{n} ] satisfies (f(p)\leq0). But (f) is quadratic in (p) so its graph is a parabola; it has two roots, and these roots form the limits of the confidence interval. We can find them with the quadratic formula (see Exercise \@ref(xca-ci-quad-form)): \begin{equation} \left.\left[\left(\hat{p}+\frac{z_{\alpha/2}^{2}}{2n}\right)\pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}+\frac{z_{\alpha/2}^{2}}{(2n)^{2}}}\right]\right/ \left(1+\frac{z_{\alpha/2}^{2}}{n}\right) \end{equation} This approach is called the score interval because it is based on the inversion of the "Score test". See Chapter \@ref(cha-categorical-data-analysis). It is also known as the Wilson interval; see Agresti [@Agresti2002].

For two proportions (p_{1}) and (p_{2}), we may collect independent (\mathsf{binom}(\mathtt{size}=1,\,\mathtt{prob}=p)) samples of size (n_{1}) and (n_{2}), respectively. Let (Y_{1}) and (Y_{2}) denote the number of successes in the respective samples. We know that [ \frac{Y_{1}}{n_{1}}\approx\mathsf{norm}\left(\mathtt{mean}=p_{1},\,\mathtt{sd}=\sqrt{\frac{p_{1}(1-p_{1})}{n_{1}}}\right) ] and [ \frac{Y_{2}}{n_{2}}\approx\mathsf{norm}\left(\mathtt{mean}=p_{2},\,\mathtt{sd}=\sqrt{\frac{p_{2}(1-p_{2})}{n_{2}}}\right) ] so it stands to reason that an approximate (100(1-\alpha)\%) confidence interval for (p_{1}-p_{2}) is given by \begin{equation} \left(\hat{p}{1}-\hat{p}{2}\right)\pm z_{\alpha/2}\sqrt{\frac{\hat{p}{1}(1-\hat{p}{1})}{n_{1}}+\frac{\hat{p}{2}(1-\hat{p}{2})}{n_{2}}}, \end{equation} where (\hat{p}{1}=Y{1}/n_{1}) and (\hat{p}{2}=Y{2}/n_{2}).

\bigskip

```{block, type="remark"} When estimating a single proportion, one-sided intervals are sometimes needed. They take the form \begin{equation} \left[0,\ \hat{p}+z_{\alpha}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right] \end{equation} or \begin{equation} \left[\hat{p}-z_{\alpha}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},\ 1\right] \end{equation} or in other words, we know in advance that the true proportion is restricted to the interval ([0,1]), so we can truncate our confidence interval to those values on either side.

### How to do it with R

```r 
library(Hmisc)
binconf(x = 7, n = 25, method = "asymptotic")
binconf(x = 7, n = 25, method = "wilson")

The default value of the method argument is wilson. An alternate way is

data(RcmdrTestDrive)
tab <- xtabs(~gender, data = RcmdrTestDrive)
prop.test(rbind(tab), conf.level = 0.95, correct = FALSE)
A <- as.data.frame(Titanic)
B <- with(A, untable(A, Freq))

Confidence Intervals for Variances {#sec-confidence-intervals-for-variances}

BLANK

I am thinking one and two sample problems here.

How to do it with R

BLANK

I am thinking about sigma.test in the TeachingDemos package [@TeachingDemos] and var.test in base R [@base] here.

Fitting Distributions {#sec-fitting-distributions}

How to do it with R

BLANK

I am thinking about fitdistr from the MASS package [@MASS].

Sample Size and Margin of Error {#sec-Sample-Size-and-moe}

Sections \@ref(sec-confidence-intervals-for-means) through \@ref(sec-confidence-intervals-for-variances) all began the same way: we were given the sample size (n) and the confidence coefficient (1-\alpha), and our task was to find a margin of error (E) so that [ \hat{\theta}\pm E\mbox{ is a }100(1-\alpha)\%\mbox{ confidence interval for }\theta. ]

Some examples we saw were:

We already know (we can see in the formulas above) that (E) decreases as (n) increases. Now we would like to use this information to our advantage: suppose that we have a fixed margin of error (E,) say (E=3), and we want a (100(1-\alpha)\%) confidence interval for (\mu). The question is: how big does (n) have to be?

For the case of a population mean the answer is easy: we set up an equation and solve for (n).

\bigskip

BLANK

Given a situation, given \(\sigma\), given \(E\), we would like to
know how big \(n\) has to be to ensure that \(\overline{X}\pm5\) is a
95% confidence interval for \(\mu\).

\bigskip

```{block, type="remark"} Always round up any decimal values of (n), no matter how small the decimal is. Another name for (E) is the "maximum error of the estimate".

For proportions, recall that the asymptotic formula to estimate \(p\)
was \[ \hat{p}\pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}.  \]
Reasoning as above we would want
\begin{align}
\label{eq-samp-size-prop-ME}
E & =z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},\mbox{ or}\\
n & =z_{\alpha/2}^{2}\frac{\hat{p}(1-\hat{p})}{E^{2}}.
\end{align}
OOPS! Recall that \(\hat{p}=Y/n\), which would put the variable \(n\)
on both sides of Equation \eqref{eq-samp-size-prop-ME}. Again, there are two
solutions to the problem.

1. If we have a good idea of what \(p\) is, say \(p^{\ast}\) then we
   can plug it in to get
   \begin{equation}
   n=z_{\alpha/2}^{2}\frac{p^{\ast}(1-p^{\ast})}{E^{2}}.
   \end{equation}
2. Even if we have no idea what \(p\) is, we do know from calculus
   that \(p(1-p)\leq1/4\) because the function \(f(x)=x(1-x)\) is
   quadratic (so its graph is a parabola which opens downward) with
   maximum value attained at \(x=1/2\). Therefore, regardless of our
   choice for \(p^{\ast}\) the sample size must satisfy
   \begin{equation}
   n=z_{\alpha/2}^{2}\frac{p^{\ast}(1-p^{\ast})}{E^{2}}\leq\frac{z_{\alpha/2}^{2}}{4E^{2}}.
   \end{equation}
   The quantity \(z_{\alpha/2}^{2}/4E^{2}\) is large enough to
   guarantee \(100(1-\alpha)\%\) confidence.


BLANK

Proportion example.

\bigskip

```{block, type="remark"}
For very small populations sometimes the value of \(n\) obtained from
the formula is too big. In this case we should use the hypergeometric
distribution for a sampling model rather than the binomial model. With
this modification the formulas change to the following: if \(N\)
denotes the population size then let
\begin{equation}
m=z_{\alpha/2}^{2}\frac{p^{\ast}(1-p^{\ast})}{E^{2}}
\end{equation}
and the sample size needed to ensure \(100(1-\alpha)\%\) confidence is
achieved is
\begin{equation}
n=\frac{m}{1+\frac{m-1}{N}}.
\end{equation}
If we do not have a good value for the estimate \(p^{\ast}\) then we may use \(p^{\ast}=1/2\).

How to do it with R

BLANK

I am thinking about power.t.test, power.prop.test, power.anova.test, and I am also thinking about replicate.

Other Topics {#sec-other-topics}

Mention mle from the stats4 package [@stats4].

Exercises

```{block, type="xca", label="xca-norm-mu-sig-MLE"} Let (X_{1}), (X_{2}), ..., (X_{n}) be an (SRS(n)) from a (\mathsf{norm}(\mathtt{mean} = \mu, \, \mathtt{sd} = \sigma)) distribution. Find a two-dimensional MLE for (\theta=(\mu,\sigma)).

\bigskip

```{block, type="xca", label="xca-CI-quad-form"} 
Find the upper and lower limits for the
confidence interval procedure by finding the roots of \(f\) defined by
\[ f(p)=\left(Y/n-p\right)^{2}-z_{\alpha/2}^{2}\frac{p(1-p)}{n}.  \]
You are going to need the quadratic formula.


Try the IPSUR package in your browser

Any scripts or data that you put into this service are public.

IPSUR documentation built on May 2, 2019, 9:15 a.m.