# IPSUR: Introduction to Probability and Statistics Using R # Copyright (C) 2018 G. Jay Kerns # # Chapter: Estimation # # This file is part of IPSUR. # # IPSUR is free software: you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # IPSUR is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with IPSUR. If not, see <http://www.gnu.org/licenses/>.
library(Hmisc) library(RcmdrPlugin.IPSUR) library(reshape) library(stats4) library(TeachingDemos)
We will discuss two branches of estimation procedures: point estimation and interval estimation. We briefly discuss point estimation first and then spend the rest of the chapter on interval estimation.
We find an estimator with the methods of Section \@ref(sec-point-estimation). We make some assumptions about the underlying population distribution and use what we know from Chapter \@ref(cha-sampling-distributions) about sampling distributions both to study how the estimator will perform, and to find intervals of confidence for underlying parameters associated with the population distribution. Once we have confidence intervals we can do inference in the form of hypothesis tests in the next chapter.
What do I want them to know?
The following example is how I was introduced to maximum likelihood.
\bigskip
```{example, label="how-many-fish", name="Fishing, part one"} Imagine there is a small pond in our backyard, and in the pond there live some fish. We would like to know how many fish live in the pond. How can we estimate this? One procedure developed by researchers is the capture-recapture method. Here is how it works.
We go fishing, and on each fish that we catch we attach an unobtrusive tag to its tail, and release it back into the water. Suppose on this day that we capture \(M=7\) fish in total. Next, we wait a few days for the fish to remix and become accustomed to their new tag. Then we go fishing again. On the second trip some of the fish we catch may be tagged; some may not be. Let \(X\) denote the number of caught fish which are tagged[^estimate-01], and suppose for the sake of argument that we catch \(K=4\) fish and we find that 3 of them are tagged. [^estimate-01]: It is theoretically possible that we could catch the same tagged fish more than once, which would inflate our count of tagged fish. To avoid this difficulty, suppose that on the second trip we use a tank on the boat to hold the caught fish until data collection is completed. Now let \(F\) denote the (unknown) total number of fish in the pond. We know that \(F\geq7\), because we tagged that many on the first trip. In fact, if we let \(N\) denote the number of untagged fish in the pond, then \(F=M+N\). Using our earlier language, we have sampled \(K=4\) times, without replacement, from an urn (pond) which has \(M=7\) white balls (tagged fish) and \(N=F-M\) black balls (untagged fish), and we have observed \(x=3\) of them to be white (tagged). What is the probability of this? Looking back to Section \@ref(sec-other-discrete-distributions), we see that the random variable \(X\) has a \(\mathsf{hyper}(\mathtt{m}=M,\,\mathtt{n}=F-M,\,\mathtt{k}=K)\) distribution. Therefore, for an observed value \(X=x\) the probability would be \[ \mathbb{P}(X=x)=\frac{{M \choose x}{F-M \choose K-x}}{{F \choose K}}. \] First we notice that \(F\) must be at least 7. Could \(F\) be equal to seven? If \(F=7\) then all of the fish would have been tagged on the first run, and there would be no untagged fish in the pond, thus, \(\mathbb{P}(\mbox{3 successes in 4 trials})=0\). What about \(F=8\); what would be the probability of observing \(X=3\) tagged fish? \[ \mathbb{P}(\mbox{3 successes in 4 trials})=\frac{{7 \choose 3}{1 \choose 1}}{{8 \choose 4}}=\frac{35}{70}=0.5. \] Similarly, if \(F=9\) then the probability of observing \(X=3\) tagged fish would be \[ \mathbb{P}(\mbox{3 successes in 4 trials})=\frac{{7 \choose 3}{2 \choose 1}}{{9 \choose 4}}=\frac{70}{126}\approx0.556. \] We can see already that the observed data \(X=3\) is more likely when \(F=9\) than it is when \(F=8\). And here lies the genius of Sir Ronald Aylmer Fisher: he asks, "What is the value of \(F\) which has the highest likelihood?" In other words, for all of the different possible values of \(F\), which one makes the above probability the biggest? We can answer this question with a plot of \(\mathbb{P}(X=x)\) versus \(F\). See Figure \@ref(fig:capture-recapture). The graph shows that \(\hat{F}=9\) is the number of fish with the maximum likelihood, and that is what we will take for our estimate of the number of fish in the the pond. ```r heights = rep(0, 16) for (j in 7:15) heights[j] <- dhyper(3, m = 7, n = j - 7, k = 4) plot(6:15, heights[6:15], pch = 16, cex = 1.5, xlab = "number of fish in pond", ylab = "Likelihood") abline(h = 0) lines(6:15, heights[6:15], type = "h", lwd = 2, lty = 3) text(9, heights[9]/6, bquote(hat(F)==.(9)), cex = 2, pos = 4) lines(9, heights[9], type = "h", lwd = 2) points(9, 0, pch = 4, lwd = 3, cex = 2)
(ref:cap-capture-recapture) \small A plot of maximum likelihood for the capture-recapture experiment. We see that (\hat{F}=9) is the number of fish that results in the maximum likelihood for the observed data.
\bigskip
```{example, label="bass-bluegill", name="Fishing, part two"} In the last example we were only concerned with how many fish were in the pond, but now, we will ask a different question. Suppose it is known that there are only two species of fish in the pond: smallmouth bass (Micropterus dolomieu) and bluegill (Lepomis macrochirus); perhaps we built the pond some years ago and stocked it with only these two species. We would like to estimate the proportion of fish in the pond which are bass.
Let \(p=\mbox{the proportion of bass}\). Without any other information, it is conceivable for \(p\) to be any value in the interval \([0,1]\), but for the sake of argument we will suppose that \(p\) falls strictly between zero and one. How can we learn about the true value of \(p\)? Go fishing! As before, we will use catch-and-release, but unlike before, we will not tag the fish. We will simply note the species of any caught fish before returning it to the pond. Suppose we catch \(n\) fish. Let \[ X_{i} = \begin{cases} 1, & \mbox{if the \(i^{\text{th}}\) fish is a bass,}\\ 0, & \mbox{if the \(i^{\text{th}}\) fish is a bluegill.} \end{cases} \] Since we are returning the fish to the pond once caught, we may think of this as a sampling scheme with replacement where the proportion of bass \(p\) does not change. Given that we allow the fish sufficient time to "mix" once returned, it is not completely unreasonable to model our fishing experiment as a sequence of Bernoulli trials, so that the \(X_{i}\)'s would be IID \(\mathsf{binom(\mathtt{size}}=1,\,\mathtt{prob}=p)\). Under those assumptions we would have \begin{eqnarray*} \mathbb{P}(X_{1}=x_{1},\, X_{2}=x_{2},\,\ldots,\, X_{n}=x_{n}) & = & \mathbb{P}(X_{1}=x_{1})\,\mathbb{P}(X_{2}=x_{2})\,\cdots\mathbb{P}(X_{n}=x_{n}),\\ & = & p^{x_{1}}(1-p)^{x_{1}}\, p^{x_{2}}(1-p)^{x_{2}}\cdots\, p^{x_{n}}(1-p)^{x_{n}},\\ & = & p^{\sum x_{i}}(1-p)^{n-\sum x_{i}}. \end{eqnarray*} That is, \[ \mathbb{P}(X_{1}=x_{1},\, X_{2}=x_{2},\,\ldots,\, X_{n}=x_{n})=p^{\sum x_{i}}(1-p)^{n-\sum x_{i}}. \] This last quantity is a function of \(p\), called the *likelihood function* \(L(p)\): \[ L(p)=p^{\sum x_{i}}(1-p)^{n-\sum x_{i}}. \] A graph of \(L\) for values of \(\sum x_{i}=3,\ 4\), and 5 when \(n=7\) is shown in Figure \@ref(fig:fishing-part-two). ```r curve(x^5*(1-x)^2, 0, 1, xlab = "p", ylab = "L(p)") curve(x^4*(1-x)^3, 0, 1, add = TRUE) curve(x^3*(1-x)^4, 0, 1, add = TRUE)
(ref:cap-fishing-part-two) \small Assorted likelihood functions for Fishing, part two. Three graphs are shown of (L) when (\sum x_{i}) equals 3, 4, and 5, respectively, from left to right. We pick an (L) that matches the observed data and then maximize (L) as a function of (p). If (\sum x_{i}=4), then the maximum appears to occur somewhere around (p \approx 0.6).
We want the value of (p) which has the highest likelihood, that is, we again wish to maximize the likelihood. We know from calculus (see Appendix \@ref(sec-differential-and-integral)) to differentiate (L) and set (L'=0) to find a maximum. [ L'(p)=\left(\sum x_{i}\right)p^{\sum x_{i}-1}(1-p)^{n-\sum x_{i}}+p^{\sum x_{i}}\left(n-\sum x_{i}\right)(1-p)^{n-\sum x_{i}-1}(-1). ] The derivative vanishes ((L'=0)) when \begin{eqnarray} \left(\sum x_{i}\right)p^{\sum x_{i}-1}(1-p)^{n-\sum x_{i}} & = & p^{\sum x_{i}}\left(n-\sum x_{i}\right)(1-p)^{n-\sum x_{i}-1},\ \sum x_{i}(1-p) & = & \left(n-\sum x_{i}\right)p,\ \sum x_{i}-p\sum x_{i} & = & np-p\sum x_{i},\ \frac{1}{n}\sum_{i=1}^{n}x_{i} & = & p. \end{eqnarray} This "best" (p), the one which maximizes the likelihood, is called the maximum likelihood estimator (MLE) of (p) and is denoted (\hat{p}). That is, \begin{equation} \hat{p}=\frac{\sum_{i=1}^{n}x_{i}}{n}=\overline{x}. \end{equation}
\bigskip
```{block, type="remark"} Strictly speaking we have only shown that the derivative equals zero at (\hat{p}), so it is theoretically possible that the critical value (\hat{p}=\overline{x}) is located at a minimum[^est-2] instead of a maximum! We should be thorough and check that (L'>0) when (p<\overline{x}) and (L'<0) when (p>\overline{x}). Then by the First Derivative Test (Theorem \@ref(thm:first-derivative-test)) we could be certain that (\hat{p}=\overline{x}) is indeed a maximum likelihood estimator, and not a minimum likelihood estimator.
The result is shown in Figure \@ref(fig:species-mle). [^est-2]: We can tell from the graph that our value of \(\hat{p}\) is a maximum instead of a minimum so we do not really need to worry for this example. Other examples are not so easy, however, and we should be careful to be cognizant of this extra step. ```r dat <- rbinom(27, size = 1, prob = 0.3) like <- function(x){ r <- 1 for (k in 1:27){ r <- r*dbinom(dat[k], size = 1, prob = x)} return(r) } curve(like, from = 0, to = 1, xlab = "parameter space", ylab = "Likelihood", lwd = 3, col = "blue") abline(h = 0, lwd = 1, lty = 3, col = "grey") mle <- mean(dat) mleobj <- like(mle) lines(mle, mleobj, type = "h", lwd = 2, lty = 3, col = "red") points(mle, 0, pch = 4, lwd = 2, cex = 2, col = "red") text(mle, mleobj/6, substitute(hat(theta)==a, list(a=round(mle, 4))), cex = 2, pos = 4)
(ref:cap-species-mle) \small Species maximum likelihood. Here we see that (\hat{theta} \approx 0.44) is the proportion of bass with the maximum likelihood.
In general, we have a family of PDFs (f(x|\theta)) indexed by a parameter (\theta) in some parameter space (\Theta). We want to learn about (\theta). We take a (SRS(n)): \begin{equation} X_{1},\, X_{2},\,\ldots,X_{n}\mbox{ which are IID (f(x| \theta )).} \end{equation}
\bigskip
Given the observed data \(x_{1}\), \(x_{2}\),..., \(x_{n}\), the *likelihood function* \(L\) is defined by \[ L(\theta)=\prod_{i=1}^{n}f(x_{i}|\theta),\quad \theta\in\Theta. \]
The next step is to maximize (L). The method we will use in this book is to find the derivative (L') and solve the equation (L'(\theta)=0). Call a solution (\hat{\theta}). We will check that (L) is maximized at (\hat{\theta}) using the First Derivative Test or the Second Derivative Test (\left(L''(\hat{\theta})<0\right)).
\bigskip
A value \(\theta\) that maximizes \(L\) is called a *maximum likelihood estimator* (MLE) and is denoted \(\hat{\theta}\). It is a function of the sample, \(\hat{\theta}=\hat{\theta}\left(X_{1},\, X_{2},\,\ldots,X_{n}\right)\), and is called a *point estimator* of \(\theta\).
\bigskip
```{block, type="remark"} Some comments about maximum likelihood estimators:
MLEs are just one of _many_ possible estimators. One of the more popular alternatives are the *method of moments estimators*; see Casella and Berger [@Casella2002] for more. Notice, in Example \@ref(ex:bass-bluegill) we had \(X_{i}\) IID \(\mathsf{binom}(\mathtt{size}=1,\,\mathtt{prob}=p)\), and we saw that the MLE was \(\hat{p}=\overline{X}\). But further \begin{eqnarray*} \mathbb{E}\overline{X} & = & \mathbb{E}\left(\frac{X_{1}+X_{2}+\cdots+X_{n}}{n}\right),\\ & = & \frac{1}{n}\left(\mathbb{E} X_{1}+\mathbb{E} X_{2}+\cdots+\mathbb{E} X_{n}\right),\\ & = & \frac{1}{n}\left(np\right),\\ & = & p, \end{eqnarray*} which is exactly the same as the parameter which we estimated. More concisely, \(\mathbb{E}\hat{p}=p\), that is, on the average, the estimator is exactly right. \bigskip ```{definition} Let \(s(X_{1},X_{2},\ldots,X_{n})\) be a statistic which estimates \(\theta\). If \[ \mathbb{E} s(X_{1},X_{2},\ldots,X_{n})=\theta, \] then the statistic \(s(X_{1},X_{2},\ldots,X_{n})\) is said to be an *unbiased estimator* of \(\theta\). Otherwise, it is *biased*.
\bigskip
```{example, label="normal-mle-both"} Let (X_{1}), (X_{2}), ... , (X_{n}) be an (SRS(n)) from a (\mathsf{norm}(\mathtt{mean}=\mu,\,\mathtt{sd}=\sigma)) distribution. It can be shown (in Exercise \@ref(xca-norm-mu-sig-mle)) that if (\mbox{$\theta$}=(\mu,\sigma^{2})) then the MLE of (\theta) is \begin{equation} \hat{\theta}=(\hat{\mu},\hat{\sigma}^{2}), \end{equation} where (\hat{\mu}=\overline{X}) and \begin{equation} \hat{\sigma^{2}}=\frac{1}{n}\sum_{i=1}^{n}\left(X_{i}-\overline{X}\right)^{2}=\frac{n-1}{n}S^{2}. \end{equation}
We of course know from \@ref{pro:mean-sd-xbar} that \(\hat{\mu}\) is unbiased. What about \(\hat{\sigma^{2}}\)? Let us check: \begin{eqnarray*} \mathbb{E}\,\hat{\sigma^{2}} & = & \mathbb{E}\,\frac{n-1}{n}S^{2}\\ & = & \mathbb{E}\left(\frac{\sigma^{2}}{n}\frac{(n-1)S^{2}}{\sigma^{2}}\right)\\ & = & \frac{\sigma^{2}}{n}\mathbb{E}\ \mathsf{chisq}(\mathtt{df}=n-1)\\ & = & \frac{\sigma^{2}}{n}(n-1), \end{eqnarray*} from which we may conclude two things: * \(\hat{\sigma^{2}}\) is a biased estimator of \(\sigma^{2}\), and * \(S^{2}=n\hat{\sigma^{2}}/(n-1)\) is an unbiased estimator of \(\sigma^{2}\). One of the most common questions in an introductory statistics class is, "Why do we divide by \(n-1\) when we compute the sample variance? It's an average, right? So why do we not divide by \(n\)?" We see now that division by \(n\) amounts to the use of a *biased* estimator for \(\sigma^{2}\), that is, if we divided by \(n\) then on the average we would *underestimate* the true value of \(\sigma^{2}\). We use \(n-1\) so that, on the average, our estimator of \(\sigma^{2}\) will be exactly right. ### How to do it with R R can be used to find maximum likelihood estimators in a lot of diverse settings. We will discuss only the most basic here and will leave the rest to more sophisticated texts. For one parameter estimation problems we may use the `optimize` function to find MLEs. The arguments are the function to be maximized (the likelihood function), the range over which the optimization is to take place, and optionally any other arguments to be passed to the likelihood if needed. Let us see how to do Example \@ref(ex:bass-bluegill). Recall that our likelihood function was given by \begin{equation} L(p)=p^{\sum x_{i}}(1-p)^{n-\sum x_{i}}. \end{equation} Notice that the likelihood is just a product of \(\mathsf{binom}(\mathtt{size}=1,\,\mathtt{prob}=p)\) PMFs. We first give some sample data (in the vector `datavals`), next we define the likelihood function `L`, and finally we `optimize` `L` over the range `c(0,1)`. ```r x <- mtcars$am L <- function(p,x) prod(dbinom(x, size = 1, prob = p)) optimize(L, interval = c(0,1), x = x, maximum = TRUE)
A <- optimize(L, interval = c(0,1), x = x, maximum = TRUE) amax <- A$maximum; aobj <- A$objective
Note that the optimize
function by default minimizes the function
L
, so we have to set maximum = TRUE
to get an MLE. The returned
value of $maximum
gives an approximate value of the MLE to be
r round(amax, 3)
and objective
gives L
evaluated at the
MLE which is approximately r round(aobj, 3)
.
We previously remarked that it is usually more numerically convenient to maximize the log-likelihood (or minimize the negative log-likelihood), and we can just as easily do this with R. We just need to calculate the log-likelihood beforehand which (for this example) is [ -l(p)=-\sum x_{i}\ln\, p-\left(n-\sum x_{i}\right)\ln(1-p). ]
It is done in R with
minuslogL <- function(p,x){ -sum(dbinom(x, size = 1, prob = p, log = TRUE)) } optimize(minuslogL, interval = c(0,1), x = x)
Note that we did not need maximum = TRUE
because we minimized the
negative log-likelihood. The answer for the MLE is essentially the
same as before, but the $objective
value was different, of course.
For multiparameter problems we may use a similar approach by way of
the mle
function in the stats4
package [@stats4].
\bigskip
``{example, name="Plant Growth"}
We will investigate the
weightvariable of the
PlantGrowth` data. We will suppose that the weights constitute a
random observations (X_{1}), (X_{2}), ... , (X_{n}) that are IID
(\mathsf{norm}(\mathtt{mean}=\mu,\,\mathtt{sd}=\sigma)) which is not
unreasonable based on a histogram and other exploratory measures. We
will find the MLE of (\theta=(\mu,\sigma^{2})). We claimed in
Example \@ref(ex:normal-mle-both) that
(\hat{\theta}=(\hat{\mu},\hat{\sigma}^{2})) had the form given
above. Let us check whether this is plausible numerically. The
negative log-likelihood function is
```r minuslogL <- function(mu, sigma2){ -sum(dnorm(x, mean = mu, sd = sqrt(sigma2), log = TRUE)) }
Note that we omitted the data as an argument to the log-likelihood function; the only arguments were the parameters over which the maximization is to take place. Now we will simulate some data and find the MLE. The optimization algorithm requires starting values (intelligent guesses) for the parameters. We choose values close to the sample mean and variance (which turn out to be approximately 5 and 0.5, respectively) to illustrate the procedure.
library(stats4) x <- PlantGrowth$weight MaxLikeEst <- mle(minuslogL, start = list(mu = 5, sigma2 = 0.5)) summary(MaxLikeEst)
The outputted MLEs are shown above, and mle
even gives us estimates
for the standard errors of (\hat{\mu}) and (\hat{\sigma}^{2})
(which were obtained by inverting the numerical Hessian matrix at the
optima; see Appendix \@ref(sec-multivariable-calculus)). Let us check how close
the numerical MLEs came to the theoretical MLEs:
mean(x); var(x)*29/30; sd(x)/sqrt(30)
The numerical MLEs were very close to the theoretical MLEs. We already knew that the standard error of (\hat{\mu}=\overline{X}) is (\sigma/\sqrt{n}), and the numerical estimate of this was very close too.
There is functionality in the distrTEst
package to
calculate theoretical MLEs; we will skip examples of these for the
time being.
We are given (X_{1}), (X_{2}), ..., (X_{n}) that are an (SRS(n)) from a (\mathsf{norm}(\mathtt{mean}=\mu,\,\mathtt{sd}=\sigma)) distribution, where (\mu) is unknown. We know that we may estimate (\mu) with (\overline{X}), and we have seen that this estimator is the MLE. But how good is our estimate? We know that \begin{equation} \frac{\overline{X}-\mu}{\sigma/\sqrt{n}}\sim\mathsf{norm}(\mathtt{mean}=0,\,\mathtt{sd}=1). \end{equation} For a big probability (1-\alpha), for instance, 95%, we can calculate the quantile (z_{\alpha/2}). Then \begin{equation} \mathbb{P}\left(-z_{\alpha/2}\leq\frac{\overline{X}-\mu}{\sigma/\sqrt{n}}\leq z_{\alpha/2}\right)=1-\alpha. \end{equation} But now consider the following string of equivalent inequalities: [ -z_{\alpha/2}\leq\frac{\overline{X}-\mu}{\sigma/\sqrt{n}}\leq z_{\alpha/2}, ] [ -z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right)\leq\overline{X}-\mu\leq z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right), ] [ -\overline{X} - z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right)\leq - \mu \leq - \overline{X} + z_{\alpha/2} \left( \frac{\sigma}{\sqrt{n}} \right), ] [ \overline{X} - z_{\alpha/2} \left( \frac{\sigma}{\sqrt{n}} \right) \leq \mu \leq \overline{X} + z_{\alpha/2} \left( \frac{\sigma}{\sqrt{n}} \right). ] That is, \begin{equation} \mathbb{P}\left(\overline{X}-z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\leq\mu\leq\overline{X}+z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)=1-\alpha. \end{equation}
\bigskip
The interval \begin{equation} \left[\overline{X}-z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \overline{X}+z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right] \end{equation} is a \(100(1-\alpha)\%\) *confidence interval for* \(\mu\). The quantity \(1-\alpha\) is called the *confidence coefficient*.
\bigskip
```{block, type="remark"} The interval is also sometimes written more compactly as \begin{equation} \label{eq-z-interval} \overline{X}\pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}. \end{equation}
The interpretation of confidence intervals is tricky and often mistaken by novices. When I am teaching the concept "live" during class, I usually ask the students to imagine that my piece of chalk represents the "unknown" parameter, and I lay it down on the desk in front of me. Once the chalk has been lain, it is *fixed*; it does not move. Our goal is to estimate the parameter. For the estimator I pick up a sheet of loose paper lying nearby. The estimation procedure is to randomly drop the piece of paper from above, and observe where it lands. If the piece of paper covers the piece of chalk, then we are successful -- our estimator covers the parameter. If it falls off to one side or the other, then we are unsuccessful; our interval fails to cover the parameter. Then I ask them: suppose we were to repeat this procedure hundreds, thousands, millions of times. Suppose we kept track of how many times we covered and how many times we did not. What percentage of the time would we be successful? In the demonstration, the parameter corresponds to the chalk, the sheet of paper corresponds to the confidence interval, and the random experiment corresponds to dropping the sheet of paper. The percentage of the time that we are successful *exactly* corresponds to the *confidence coefficient*. That is, if we use a 95% confidence interval, then we can say that, in the long run, approximately 95% of our intervals will cover the true parameter (which is fixed, but unknown). See Figure \@ref(fig:ci-examp), which is a graphical display of these ideas. ```r ci.examp()
(ref:cap-ci-examp) \small The graph was generated by the ci.examp
function from the TeachingDemos
package. Fifty (50) samples of size twenty five (25) were generated from a (\mathsf{norm}(\mathtt{mean}=100,\,\mathtt{sd}=10)) distribution, and each sample was used to find a 95% confidence interval for the population mean using Equation \eqref{eq-z-interval}. The 50 confidence intervals are represented above by horizontal lines, and the respective sample means are denoted by vertical slashes. Confidence intervals that "cover" the true mean value of 100 are plotted in black; those that fail to cover are plotted in a lighter color. In the plot we see that only one (1) of the simulated intervals out of the 50 failed to cover (\mu=100), which is a success rate of 98%. If the number of generated samples were to increase from 50 to 500 to 50000, ..., then we would expect our success rate to approach the exact value of 95%.
Under the above framework, we can reason that an "interval" with a larger confidence coefficient corresponds to a wider sheet of paper. Furthermore, the width of the confidence interval (sheet of paper) should be somehow related to the amount of information contained in the random sample, (X_{1}), (X_{2}), ..., (X_{n}). The following remarks makes these notions precise.
\bigskip
```{block, type="remark"} For a fixed confidence coefficient (1-\alpha), if (n) increases, then the confidence interval gets SHORTER.
For a fixed sample size (n), if (1-\alpha) increases, then the confidence interval gets WIDER.
\bigskip ```{example, label="plant-one-samp-z-int", name="Results from an Experiment on Plant Growth"} The `PlantGrowth` data frame gives the results of an experiment to measure plant yield (as measured by the weight of the plant). We would like to a 95% confidence interval for the mean weight of the plants. Suppose that we know from prior research that the true population standard deviation of the plant weights is \(0.7\) g. The parameter of interest is \(\mu\), which represents the true mean weight of the population of all plants of the particular species in the study. We will first take a look at a stem-and-leaf display of the data:
with(PlantGrowth, stem.leaf(weight))
The data appear to be approximately normal with no extreme values. The data come from a designed experiment, so it is reasonable to suppose that the observations constitute a simple random sample of weights[^estimate-03]. We know the population standard deviation (\sigma=0.70) from prior research. We are going to use the one-sample (z)-interval.
[^estimate-03]: Actually we will see later that there is reason to believe that the observations are simple random samples from three distinct populations. See Section \ref{sec-Analysis-of-Variance}.
dim(PlantGrowth) # sample size is first entry
with(PlantGrowth, mean(weight))
qnorm(0.975)
We find the sample mean of the data to be (\overline{x}=5.073) and (z_{\alpha/2}=z_{0.025}\approx1.96). Our interval is therefore [ \overline{x}\pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}=5.073\pm1.96\cdot\frac{0.70}{\sqrt{30}}, ] which comes out to approximately ([4.823,\,5.323]). In conclusion, we are 95% confident that the true mean weight (\mu) of all plants of this species lies somewhere between 4.823 g and 5.323 g, that is, we are 95% confident that the interval ([4.823,\,5.323]) covers (\mu).
BLANK
\bigskip
Give some data with \(X_{1}\), \(X_{2}\), ..., \(X_{n}\) an \(SRS(n)\) from a \(\mathsf{norm}(\mathtt{mean}=\mu,\,\mathtt{sd}=\sigma)\) distribution. Maybe small sample?
\bigskip
```{block, type="remark"} What if (\sigma) is unknown? We instead use the interval \begin{equation} \overline{X}\pm z_{\alpha/2}\frac{S}{\sqrt{n}}, \end{equation} where (S) is the sample standard deviation.
The author learned of a handy acronym from AP Statistics Exam graders that summarizes the important parts of confidence interval estimation, which is PANIC: *P*-arameter, *A*-ssumptions, *N*-ame, *I*-nterval, and *C*-onclusion. * **Parameter:** identify the parameter of interest with the proper symbols. Write down what the parameter means in the context of the problem. * **Assumptions:** list any assumptions made in the experiment. If there are any other assumptions needed or that were not checked, state what they are and why they are important. * **Name:** choose a statistical procedure from your bag of tricks based on the answers to the previous two parts. The assumptions of the procedure you choose should match those of the problem; if they do not match then either pick a different procedure or openly admit that the results may not be reliable. Write down any underlying formulas used. * **Interval:** calculate the interval from the sample data. This can be done by hand but will more often be done with the aid of a computer. Regardless of the method, all calculations or code should be shown so that the entire process is repeatable by a subsequent reader. * **Conclusion:** state the final results, using language in the context of the problem. Include the appropriate interpretation of the interval, making reference to the confidence coefficient. \bigskip ```{block, type="remark"} All of the above intervals for \(\mu\) were two-sided, but there are also one-sided intervals for \(\mu\). They look like \begin{equation} \left[\overline{X}-z_{\alpha}\frac{\sigma}{\sqrt{n}},\ \infty\right)\quad \mbox{or}\quad \left(-\infty,\ \overline{X}+z_{\alpha}\frac{\sigma}{\sqrt{n}}\right] \end{equation} and satisfy \begin{equation} \mathbb{P}\left(\overline{X}-z_{\alpha}\frac{\sigma}{\sqrt{n}}\leq\mu\right)=1-\alpha\quad \mbox{and}\quad \mathbb{P}\left(\overline{X}+z_{\alpha}\frac{\sigma}{\sqrt{n}}\geq\mu\right)=1-\alpha. \end{equation}
\bigskip
BLANK
Small sample, some data with \(X_{1}\), \(X_{2}\), ..., \(X_{n}\) an \(SRS(n)\) from a \(\mathsf{norm}(\mathtt{mean}=\mu,\,\mathtt{sd}=\sigma)\) distribution. PANIC
We can do Example \@ref(ex:plant-one-samp-z-int) with the following code.
temp <- with(PlantGrowth, z.test(weight, stdev = 0.7)) temp
BLANK
The confidence interval bounds are shown in the sixth line down of the output (please disregard all of the additional output information for now -- we will use it in Chapter \@ref(cha-hypothesis-testing)). We can make the plot for Figure \@ref(fig:plant-z-int-plot) with
plot(temp, "Conf")
Let (X_{1}), (X_{2}), ..., (X_{n}) be a (SRS(n)) from a (\mathsf{norm}(\mathtt{mean}=\mu_{X},\,\mathtt{sd}=\sigma_{X})) distribution and let (Y_{1}), (Y_{2}), ..., (Y_{m}) be a (SRS(m)) from a (\mathsf{norm}(\mathtt{mean}=\mu_{Y},\,\mathtt{sd}=\sigma_{Y})) distribution. Further, assume that the (X_{1}), (X_{2}), ..., (X_{n}) sample is independent of the (Y_{1}), (Y_{2}), ..., (Y_{m}) sample.
Suppose that (\sigma_{X}) and (\sigma_{Y}) are known. We would like a confidence interval for (\mu_{X}-\mu_{Y}). We know that \begin{equation} \overline{X}-\overline{Y}\sim\mathsf{norm}\left(\mathtt{mean}=\mu_{X}-\mu_{Y},\,\mathtt{sd}=\sqrt{\frac{\sigma_{X}^{2}}{n}+\frac{\sigma_{Y}^{2}}{m}}\right). \end{equation} Therefore, a (100(1-\alpha)\%) confidence interval for (\mu_{X}-\mu_{Y}) is given by \begin{equation} \label{eq-two-samp-mean-CI} \left(\overline{X}-\overline{Y}\right)\pm z_{\alpha/2}\sqrt{\frac{\sigma_{X}^{2}}{n}+\frac{\sigma_{Y}^{2}}{m}}. \end{equation} Unfortunately, most of the time the values of (\sigma_{X}) and (\sigma_{Y}) are unknown. This leads us to the following:
The basic function is t.test
which has a var.equal
argument that
may be set to TRUE
or FALSE
. The confidence interval is shown as
part of the output, although there is a lot of additional information
that is not needed until Chapter \@ref(cha-hypothesis-testing).
There is not any specific functionality to handle the (z)-interval
for small samples, but if the samples are large then t.test
with
var.equal = FALSE
will be essentially the same thing. The standard
deviations are never (?) known in advance anyway so it does not really
matter in practice.
We would like to know (p) which is the "proportion of successes". For instance, (p) could be:
We are given an (SRS(n)) (X_{1}), (X_{2}), ..., (X_{n}) distributed (\mathsf{binom}(\mathtt{size}=1,\,\mathtt{prob}=p)). Recall from Section \@ref(sec-binom-dist) that the common mean of these variables is (\mathbb{E} X=p) and the variance is (\mathbb{E}(X-p)^{2}=p(1-p)). If we let (Y=\sum X_{i}), then from Section \@ref(sec-binom-dist) we know that (Y\sim\mathsf{binom}(\mathtt{size}=n,\,\mathtt{prob}=p)) and that [ \overline{X}=\frac{Y}{n}\mbox{ has }\mathbb{E}\overline{X}=p\mbox{ and }\mathrm{Var}(\overline{X})=\frac{p(1-p)}{n}. ] Thus if (n) is large (here is the CLT) then an approximate (100(1-\alpha)\%) confidence interval for (p) would be given by \begin{equation} \label{eq-ci-p-no-good} \overline{X}\pm z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}}. \end{equation} OOPS...! Equation \eqref{eq-ci-p-no-good} is of no use to us because the \underbar{unknown} parameter (p) is in the formula! (If we knew what (p) was to plug in the formula then we would not need a confidence interval in the first place.) There are two solutions to this problem.
For two proportions (p_{1}) and (p_{2}), we may collect independent (\mathsf{binom}(\mathtt{size}=1,\,\mathtt{prob}=p)) samples of size (n_{1}) and (n_{2}), respectively. Let (Y_{1}) and (Y_{2}) denote the number of successes in the respective samples. We know that [ \frac{Y_{1}}{n_{1}}\approx\mathsf{norm}\left(\mathtt{mean}=p_{1},\,\mathtt{sd}=\sqrt{\frac{p_{1}(1-p_{1})}{n_{1}}}\right) ] and [ \frac{Y_{2}}{n_{2}}\approx\mathsf{norm}\left(\mathtt{mean}=p_{2},\,\mathtt{sd}=\sqrt{\frac{p_{2}(1-p_{2})}{n_{2}}}\right) ] so it stands to reason that an approximate (100(1-\alpha)\%) confidence interval for (p_{1}-p_{2}) is given by \begin{equation} \left(\hat{p}{1}-\hat{p}{2}\right)\pm z_{\alpha/2}\sqrt{\frac{\hat{p}{1}(1-\hat{p}{1})}{n_{1}}+\frac{\hat{p}{2}(1-\hat{p}{2})}{n_{2}}}, \end{equation} where (\hat{p}{1}=Y{1}/n_{1}) and (\hat{p}{2}=Y{2}/n_{2}).
\bigskip
```{block, type="remark"} When estimating a single proportion, one-sided intervals are sometimes needed. They take the form \begin{equation} \left[0,\ \hat{p}+z_{\alpha}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right] \end{equation} or \begin{equation} \left[\hat{p}-z_{\alpha}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},\ 1\right] \end{equation} or in other words, we know in advance that the true proportion is restricted to the interval ([0,1]), so we can truncate our confidence interval to those values on either side.
### How to do it with R ```r library(Hmisc) binconf(x = 7, n = 25, method = "asymptotic")
binconf(x = 7, n = 25, method = "wilson")
The default value of the method
argument is wilson
. An alternate way is
data(RcmdrTestDrive)
tab <- xtabs(~gender, data = RcmdrTestDrive) prop.test(rbind(tab), conf.level = 0.95, correct = FALSE)
A <- as.data.frame(Titanic) B <- with(A, untable(A, Freq))
BLANK
I am thinking one and two sample problems here.
BLANK
I am thinking about sigma.test
in the TeachingDemos
package
[@TeachingDemos] and var.test
in base R [@base]
here.
BLANK
I am thinking about fitdistr
from the MASS
package [@MASS].
Sections \@ref(sec-confidence-intervals-for-means) through \@ref(sec-confidence-intervals-for-variances) all began the same way: we were given the sample size (n) and the confidence coefficient (1-\alpha), and our task was to find a margin of error (E) so that [ \hat{\theta}\pm E\mbox{ is a }100(1-\alpha)\%\mbox{ confidence interval for }\theta. ]
Some examples we saw were:
We already know (we can see in the formulas above) that (E) decreases as (n) increases. Now we would like to use this information to our advantage: suppose that we have a fixed margin of error (E,) say (E=3), and we want a (100(1-\alpha)\%) confidence interval for (\mu). The question is: how big does (n) have to be?
For the case of a population mean the answer is easy: we set up an equation and solve for (n).
\bigskip
BLANK
Given a situation, given \(\sigma\), given \(E\), we would like to know how big \(n\) has to be to ensure that \(\overline{X}\pm5\) is a 95% confidence interval for \(\mu\).
\bigskip
```{block, type="remark"} Always round up any decimal values of (n), no matter how small the decimal is. Another name for (E) is the "maximum error of the estimate".
For proportions, recall that the asymptotic formula to estimate \(p\) was \[ \hat{p}\pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}. \] Reasoning as above we would want \begin{align} \label{eq-samp-size-prop-ME} E & =z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},\mbox{ or}\\ n & =z_{\alpha/2}^{2}\frac{\hat{p}(1-\hat{p})}{E^{2}}. \end{align} OOPS! Recall that \(\hat{p}=Y/n\), which would put the variable \(n\) on both sides of Equation \eqref{eq-samp-size-prop-ME}. Again, there are two solutions to the problem. 1. If we have a good idea of what \(p\) is, say \(p^{\ast}\) then we can plug it in to get \begin{equation} n=z_{\alpha/2}^{2}\frac{p^{\ast}(1-p^{\ast})}{E^{2}}. \end{equation} 2. Even if we have no idea what \(p\) is, we do know from calculus that \(p(1-p)\leq1/4\) because the function \(f(x)=x(1-x)\) is quadratic (so its graph is a parabola which opens downward) with maximum value attained at \(x=1/2\). Therefore, regardless of our choice for \(p^{\ast}\) the sample size must satisfy \begin{equation} n=z_{\alpha/2}^{2}\frac{p^{\ast}(1-p^{\ast})}{E^{2}}\leq\frac{z_{\alpha/2}^{2}}{4E^{2}}. \end{equation} The quantity \(z_{\alpha/2}^{2}/4E^{2}\) is large enough to guarantee \(100(1-\alpha)\%\) confidence. BLANK Proportion example. \bigskip ```{block, type="remark"} For very small populations sometimes the value of \(n\) obtained from the formula is too big. In this case we should use the hypergeometric distribution for a sampling model rather than the binomial model. With this modification the formulas change to the following: if \(N\) denotes the population size then let \begin{equation} m=z_{\alpha/2}^{2}\frac{p^{\ast}(1-p^{\ast})}{E^{2}} \end{equation} and the sample size needed to ensure \(100(1-\alpha)\%\) confidence is achieved is \begin{equation} n=\frac{m}{1+\frac{m-1}{N}}. \end{equation} If we do not have a good value for the estimate \(p^{\ast}\) then we may use \(p^{\ast}=1/2\).
BLANK
I am thinking about power.t.test
, power.prop.test
,
power.anova.test
, and I am also thinking about replicate
.
Mention mle
from the stats4
package [@stats4].
```{block, type="xca", label="xca-norm-mu-sig-MLE"} Let (X_{1}), (X_{2}), ..., (X_{n}) be an (SRS(n)) from a (\mathsf{norm}(\mathtt{mean} = \mu, \, \mathtt{sd} = \sigma)) distribution. Find a two-dimensional MLE for (\theta=(\mu,\sigma)).
\bigskip ```{block, type="xca", label="xca-CI-quad-form"} Find the upper and lower limits for the confidence interval procedure by finding the roots of \(f\) defined by \[ f(p)=\left(Y/n-p\right)^{2}-z_{\alpha/2}^{2}\frac{p(1-p)}{n}. \] You are going to need the quadratic formula.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.