# IPSUR: Introduction to Probability and Statistics Using R
# Copyright (C) 2018 G. Jay Kerns
#
# Chapter: Sampling Distributions
#
# This file is part of IPSUR.
#
# IPSUR is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# IPSUR is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with IPSUR. If not, see <http://www.gnu.org/licenses/>.
This is an important chapter; it is the bridge from probability and descriptive statistics that we studied in Chapters \@ref(cha-describing-data-distributions) through \@ref(cha-multivariable-distributions) to inferential statistics, which forms the latter part of this book.
Here is the link: we are presented with a population about which we would like to learn. And while it would be desirable to examine every single member of the population, we find that it is either impossible or infeasible for us to do so; thus, we resort to collecting a sample instead. We do not lose heart. Our method will suffice, provided the sample is representative of the population. A good way to achieve this is to sample randomly from the population.
Supposing for the sake of argument that we have collected a random sample, the next task is to make some sense out of the data, because the complete list of sample information is usually cumbersome and unwieldy. We summarize the data set with a descriptive statistic, a quantity calculated from the data (we saw many examples of these in Chapter \@ref(cha-describing-data-distributions)). But our sample was random... therefore, it stands to reason that our statistic will be random, too. How is the statistic distributed?
The probability distribution associated with the population (from which we sample) is called the population distribution, and the probability distribution associated with our statistic is called its sampling distribution; clearly, the two are interrelated. To learn about the population distribution, it is imperative to know everything we can about the sampling distribution. Such is the goal of this chapter.
We begin by introducing the notion of simple random samples and cataloguing some of their more convenient mathematical properties. Next we focus on what happens in the special case of sampling from the normal distribution (which, again, has several convenient mathematical properties), and in particular, we meet the sampling distribution of \(\overline{X}\) and \(S^{2}\). Then we explore what happens to \(\overline{X}\)'s sampling distribution when the population is not normal and prove one of the most remarkable theorems in statistics, the Central Limit Theorem (CLT).
With the CLT in hand, we then investigate the sampling distributions of several other popular statistics, taking full advantage of those with a tractable form. We finish the chapter with an exploration of statistics whose sampling distributions are not quite so tractable, and to accomplish this goal we will use simulation methods that are grounded in all of our work in the previous four chapters.
What do I want them to know?
## Simple Random Samples {#sec-simple-random-samples}

If \(X_{1}\), \(X_{2}\), ..., \(X_{n}\) are independent with \(X_{i}\sim f\) for \(i=1,2,\ldots,n\), then we say that \(X_{1}\), \(X_{2}\), ..., \(X_{n}\) are *independent and identically distributed* (IID) from the population \(f\) or alternatively we say that \(X_{1}\), \(X_{2}\), ..., \(X_{n}\) are a *simple random sample of size* \(n\), denoted \(SRS(n)\), from the population \(f\).
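In R, drawing a simple random sample from a named population amounts to a single call to the corresponding `r` function. Here is a minimal illustration; the population parameters are arbitrary choices for demonstration only.

```r
# an SRS(10) from a norm(mean = 3, sd = 2) population: the ten values
# are independent and identically distributed by construction
x <- rnorm(10, mean = 3, sd = 2)
x
```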
\bigskip
```{proposition, label="mean-sd-xbar"}
Let \(X_{1}\), \(X_{2}\), ..., \(X_{n}\) be a \(SRS(n)\) from a population distribution with mean \(\mu\) and finite standard deviation \(\sigma\). Then the mean and standard deviation of \(\overline{X}\) are given by the formulas \(\mu_{\overline{X}}=\mu\) and \(\sigma_{\overline{X}}=\sigma/\sqrt{n}\).
```
\bigskip

```{proof}
Plug in \(a_{1}=a_{2}=\cdots=a_{n}=1/n\) in Proposition \@ref(pro:mean-sd-lin-comb).
```
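We may corroborate the proposition with a quick simulation; the sketch below uses a population and sample size chosen arbitrarily for illustration.

```r
# population: norm(mean = 10, sd = 4), sample size n = 25, so the
# proposition predicts mean 10 and standard deviation 4/sqrt(25) = 0.8
xbars <- replicate(10000, mean(rnorm(25, mean = 10, sd = 4)))
mean(xbars)  # should be close to 10
sd(xbars)    # should be close to 0.8
```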
The next fact will be useful to us when it comes time to prove the Central Limit Theorem in Section \@ref(sec-central-limit-theorem).
\bigskip
```{proposition, label="mgf-xbar"}
Let \(X_{1}\), \(X_{2}\), ..., \(X_{n}\) be a \(SRS(n)\) from a population distribution with MGF \(M(t)\). Then the MGF of \(\overline{X}\) is given by
\begin{equation}
M_{\overline{X}}(t)=\left[M\left(\frac{t}{n}\right)\right]^{n}.
\end{equation}
```
\bigskip

```{proof}
Go from the definition:
\begin{eqnarray*}
M_{\overline{X}}(t) & = & \mathbb{E}\,\mathrm{e}^{t\overline{X}},\\
 & = & \mathbb{E}\,\mathrm{e}^{t(X_{1}+\cdots+X_{n})/n},\\
 & = & \mathbb{E}\,\mathrm{e}^{tX_{1}/n}\mathrm{e}^{tX_{2}/n}\cdots\mathrm{e}^{tX_{n}/n}.
\end{eqnarray*}
And because \(X_{1}\), \(X_{2}\), ..., \(X_{n}\) are independent, Proposition \@ref(pro:indep-implies-prodexpect) allows us to distribute the expectation among each term in the product, which is
\[
\mathbb{E}\mathrm{e}^{tX_{1}/n}\,\mathbb{E}\mathrm{e}^{tX_{2}/n}\cdots\mathbb{E}\mathrm{e}^{tX_{n}/n}.
\]
The last step is to recognize that each term in the last product above is exactly \(M(t/n)\).
```
## Sampling from the Normal Distribution {#sec-sampling-from-normal-dist}

### The Distribution of the Sample Mean {#sub-samp-mean-dist-of}

Let \(X_{1}\), \(X_{2}\), ..., \(X_{n}\) be a \(SRS(n)\) from a \(\mathsf{norm}(\mathtt{mean}=\mu,\,\mathtt{sd}=\sigma)\) distribution. Then the sample mean \(\overline{X}\) has a \(\mathsf{norm}(\mathtt{mean}=\mu,\,\mathtt{sd}=\sigma/\sqrt{n})\) sampling distribution.
The mean and standard deviation of \(\overline{X}\) follow directly from Proposition \@ref(pro:mean-sd-xbar). To address the shape, first remember from Section \@ref(sec-the-normal-distribution) that the \(\mathsf{norm}(\mathtt{mean}=\mu,\,\mathtt{sd}=\sigma)\) MGF is of the form \[ M(t)=\exp\left[ \mu t+\sigma^{2}t^{2}/2\right] . \] Now use Proposition \@ref(pro:mgf-xbar) to find \begin{eqnarray*} M_{\overline{X}}(t) & = & \left[M\left(\frac{t}{n}\right)\right]^{n},\\ & = & \left[\exp\left( \mu(t/n)+\sigma^{2}(t/n)^{2}/2\right) \right]^{n},\\ & = & \exp\left( \, n\cdot\left[\mu(t/n)+\sigma^{2}(t/n)^{2}/2\right]\right) ,\\ & = & \exp\left( \mu t+(\sigma/\sqrt{n})^{2}t^{2}/2\right), \end{eqnarray*} and we recognize this last quantity as the MGF of a \(\mathsf{norm}(\mathtt{mean}=\mu,\,\mathtt{sd}=\sigma/\sqrt{n})\) distribution.
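Since this result is exact for every \(n\), a histogram of simulated sample means should match the theoretical density even for small samples. Here is a minimal sketch; the parameter values are arbitrary.

```r
mu <- 10; sigma <- 4; n <- 5
xbars <- replicate(10000, mean(rnorm(n, mean = mu, sd = sigma)))
hist(xbars, breaks = 40, freq = FALSE)
# overlay the norm(mean = mu, sd = sigma/sqrt(n)) density from above
curve(dnorm(x, mean = mu, sd = sigma/sqrt(n)), add = TRUE, lwd = 2)
```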
### The Distribution of the Sample Variance {#sub-samp-var-dist}

```{theorem, label="xbar-ands"}
Let \(X_{1}\), \(X_{2}\), ..., \(X_{n}\) be a \(SRS(n)\) from a \(\mathsf{norm}(\mathtt{mean}=\mu,\,\mathtt{sd}=\sigma)\) distribution, and let
\begin{equation}
\overline{X}=\frac{1}{n}\sum_{i=1}^{n}X_{i}\quad \mbox{and}\quad S^{2}=\frac{1}{n-1}\sum_{i=1}^{n}(X_{i}-\overline{X})^{2}.
\end{equation}
Then

1. \(\overline{X}\) and \(S^{2}\) are independent, and
2. the rescaled sample variance \((n-1)S^{2}/\sigma^{2}\) has a \(\mathsf{chisq}(\mathtt{df}=n-1)\) sampling distribution.
```
\bigskip

```{proof}
The proof is beyond the scope of the present book, but the theorem is simply too important to be omitted. The interested reader may consult Casella and Berger [@Casella2002], or Hogg *et al.* [@Hogg2005].
```
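Even without the proof we can check part 2 of the theorem empirically; the following sketch (with \(n\) and \(\sigma\) chosen arbitrarily) compares simulated values of \((n-1)S^{2}/\sigma^{2}\) to the claimed chi-square density.

```r
n <- 10; sigma <- 3
# simulate the rescaled sample variance (n-1)S^2/sigma^2 many times
v <- replicate(10000, (n - 1) * var(rnorm(n, sd = sigma)) / sigma^2)
hist(v, breaks = 40, freq = FALSE)
# overlay the chisq(df = n - 1) density claimed by the theorem
curve(dchisq(x, df = n - 1), add = TRUE, lwd = 2)
```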
### The Distribution of Student's \(t\) Statistic {#sub-samp-t-dist}

Let \(X_{1}\), \(X_{2}\), ..., \(X_{n}\) be a \(SRS(n)\) from a \(\mathsf{norm}(\mathtt{mean}=\mu,\,\mathtt{sd}=\sigma)\) distribution. Then the quantity
\begin{equation}
T=\frac{\overline{X}-\mu}{S/\sqrt{n}}
\end{equation}
has a \(\mathsf{t}(\mathtt{df}=n-1)\) sampling distribution.
\bigskip
Divide the numerator and denominator by \(\sigma\) and rewrite \[ T=\frac{\frac{\overline{X}-\mu}{\sigma/\sqrt{n}}}{S/\sigma}=\frac{\frac{\overline{X}-\mu}{\sigma/\sqrt{n}}}{\sqrt{\left.\frac{(n-1)S^{2}}{\sigma^{2}}\right/ (n-1)}}. \] Now let \[ Z=\frac{\overline{X}-\mu}{\sigma/\sqrt{n}}\quad \mbox{and}\quad V=\frac{(n-1)S^{2}}{\sigma^{2}}, \] so that \begin{equation} T=\frac{Z}{\sqrt{V/r}}, \end{equation} where \(r=n-1\). We know from Section \@ref(sub-samp-mean-dist-of) that \(Z\sim\mathsf{norm}(\mathtt{mean}=0,\,\mathtt{sd}=1)\) and we know from Section \@ref(sub-samp-var-dist) that \(V\sim\mathsf{chisq}(\mathtt{df}=n-1)\). Further, since we are sampling from a normal distribution, Theorem \@ref(thm:xbar-ands) gives that \(\overline{X}\) and \(S^{2}\) are independent and by Fact \@ref(fac-indep-then-function-indep) so are \(Z\) and \(V\). In summary, the distribution of \(T\) is the same as the distribution of the quantity \(Z/\sqrt{V/r}\), where \(Z\sim\mathsf{norm}(\mathtt{mean}=0,\,\mathtt{sd}=1)\) and \(V\sim\mathsf{chisq}(\mathtt{df}=r)\) are independent. This is in fact the definition of Student's \(t\) distribution.
This distribution was first published by W. S. Gosset [@Student1908] under the pseudonym Student, and the distribution has consequently come to be known as Student's \(t\) distribution. The PDF of \(T\) can be derived explicitly using the techniques of Section \@ref(sec-functions-of-continuous); it takes the form
\begin{equation}
f_{X}(x)=\frac{\Gamma[(r+1)/2]}{\sqrt{r\pi}\ \Gamma(r/2)}\left(1+\frac{x^{2}}{r}\right)^{-(r+1)/2},\quad -\infty < x < \infty.
\end{equation}
Any random variable \(X\) with the preceding PDF is said to have Student's \(t\) distribution with \(r\) degrees of freedom, and we write \(X\sim\mathsf{t}(\mathtt{df}=r)\). The shape of the PDF is similar to the normal, but the tails are considerably heavier. See Figure \@ref(fig:students-t-dist-vary-df). As with the normal distribution, there are four functions in R associated with the \(t\) distribution, namely `dt`, `pt`, and `qt`, which compute the PDF, CDF, and quantile function, respectively, and `rt`, which generates random variates.
The code to produce Figure \@ref(fig:students-t-dist-vary-df) is

```r
curve(dt(x, df = 30), from = -3, to = 3, lwd = 3, ylab = "y")
ind <- c(1, 2, 3, 5, 10)
for (i in ind) curve(dt(x, df = i), -3, 3, add = TRUE)
```
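As a quick check that the quantity \(T\) really does follow the \(\mathsf{t}(\mathtt{df}=n-1)\) curve drawn by `dt`, we may simulate many values of \(T\) and overlay the density. This is a sketch only; \(n\), \(\mu\), and \(\sigma\) are arbitrary choices.

```r
n <- 8; mu <- 5; sigma <- 2
tstat <- replicate(10000, {
  x <- rnorm(n, mean = mu, sd = sigma)
  (mean(x) - mu) / (sd(x) / sqrt(n))   # the T statistic
})
hist(tstat, breaks = 50, freq = FALSE, xlim = c(-5, 5))
curve(dt(x, df = n - 1), add = TRUE, lwd = 2)
```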
(ref:cap-students-t-dist-vary-df) \small A plot of Student's \(t\) distribution for various degrees of freedom.
Similar to what we did for the normal distribution, we may define \(\mathsf{t}_{\alpha}(\mathtt{df}=n-1)\) as the number on the \(x\)-axis such that there is exactly \(\alpha\) area under the \(\mathsf{t}(\mathtt{df}=n-1)\) curve to its right.
\bigskip
Find \(\mathsf{t}_{0.01}(\mathtt{df}=23)\) with the quantile function.

```r
qt(0.01, df = 23, lower.tail = FALSE)
```
\bigskip
```{block, type="remark"}
There are a few things to note about the \(\mathsf{t}(\mathtt{df}=r)\) distribution.

1. The \(\mathsf{t}(\mathtt{df}=1)\) distribution is the same as the Cauchy distribution, whose tails are so heavy that its mean (and variance) do not exist.
2. As \(r\to\infty\), the \(\mathsf{t}(\mathtt{df}=r)\) distribution approaches the \(\mathsf{norm}(\mathtt{mean}=0,\,\mathtt{sd}=1)\) distribution, so for large degrees of freedom the two curves are practically indistinguishable.
```
## The Central Limit Theorem {#sec-central-limit-theorem}

In this section we study the distribution of the sample mean when the underlying distribution is *not* normal. We saw in Section \@ref(sec-sampling-from-normal-dist) that when \(X_{1}\), \(X_{2}\), ..., \(X_{n}\) is a \(SRS(n)\) from a \(\mathsf{norm}(\mathtt{mean}=\mu,\,\mathtt{sd}=\sigma)\) distribution then \(\overline{X} \sim \mathsf{norm}(\mathtt{mean} = \mu,\,\mathtt{sd} = \sigma/\sqrt{n})\). In other words, we may say (owing to Fact \@ref(fac-lin-trans-norm-is-norm)) that when the underlying population is normal, the sampling distribution of \(Z\) defined by
\begin{equation}
Z=\frac{\overline{X}-\mu}{\sigma/\sqrt{n}}
\end{equation}
is \(\mathsf{norm}(\mathtt{mean}=0,\,\mathtt{sd}=1)\). However, there are many populations that are *not* normal ... and the statistician often finds herself sampling from such populations. What can be said in this case? The surprising answer is contained in the following theorem.

\bigskip

```{theorem, label="central-limit-theorem", name="Central Limit Theorem"}
Let \(X_{1}\), \(X_{2}\), ..., \(X_{n}\) be a \(SRS(n)\) from a population distribution with mean \(\mu\) and finite standard deviation \(\sigma\). Then the sampling distribution of
\begin{equation}
Z=\frac{\overline{X}-\mu}{\sigma/\sqrt{n}}
\end{equation}
approaches a \(\mathsf{norm}(\mathtt{mean}=0,\,\mathtt{sd}=1)\) distribution as \(n\to\infty\).
```
\bigskip
```{block, type="remark"}
We suppose that \(X_{1}\), \(X_{2}\), ..., \(X_{n}\) are IID, and we learned in Section \@ref(sec-simple-random-samples) that \(\overline{X}\) has mean \(\mu\) and standard deviation \(\sigma/\sqrt{n}\), so we already knew that \(Z\) has mean zero and standard deviation one. The beauty of the CLT is that it addresses the shape of \(Z\)'s distribution when the sample size is large.
```
\bigskip

```{block, type="remark"}
Notice that the shape of the underlying population's distribution is not mentioned in Theorem \@ref(thm:central-limit-theorem); indeed, the result is true for any population that is well-behaved enough to have a finite standard deviation. In particular, if the population is normally distributed then we know from Section \@ref(sub-samp-mean-dist-of) that the distribution of \(\overline{X}\) (and \(Z\) by extension) is *exactly* normal, for *every* \(n\).
```
\bigskip
```{block, type="remark"}
How large is "sufficiently large"? It is here that the shape of the underlying population distribution plays a role. For populations with distributions that are approximately symmetric and mound-shaped, samples of size four or five may suffice, while for highly skewed or heavy-tailed populations the samples may need to be much larger before the distribution of the sample means begins to show a bell shape. Regardless, for a given population distribution (with finite standard deviation) the approximation tends to be better for larger sample sizes.
```
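A package-free way to watch the sample size take effect is to simulate sample means from a skewed population for several values of \(n\). The sketch below uses an exponential population, which is strongly right-skewed; the choice of population and sample sizes is ours, for illustration only.

```r
par(mfrow = c(1, 3))  # three histograms side by side
for (n in c(2, 10, 50)) {
  xbars <- replicate(10000, mean(rexp(n, rate = 1)))
  hist(xbars, breaks = 40, main = paste("n =", n))
}
par(mfrow = c(1, 1))
```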
### How to do it with R

The `TeachingDemos` package [@TeachingDemos] has `clt.examp` and the `distrTeach` package [@distrTeach] has `illustrateCLT`. Try the following at the command line (output omitted):

```r
example(clt.examp)
```

and

```r
example(illustrateCLT)
```
The `IPSUR` package [@IPSUR] has the functions `clt1`, `clt2`, and `clt3` (see Exercise \@ref(xca-clt123) at the end of this chapter). Their purpose is to investigate what happens to the sampling distribution of \(\overline{X}\) when the population distribution is mound-shaped, of finite support, and skewed, namely \(\mathsf{t}(\mathtt{df}=3)\), \(\mathsf{unif}(\mathtt{a}=0,\,\mathtt{b}=10)\), and \(\mathsf{gamma}(\mathtt{shape}=1.21,\,\mathtt{scale}=1/2.37)\), respectively.

For example, when the command `clt1()` is issued a plot window opens to show a graph of the PDF of a \(\mathsf{t}(\mathtt{df}=3)\) distribution. On the display are shown numerical values of the population mean and variance. While the students examine the graph the computer is simulating random samples of size `sample.size = 2` from the population distribution `rt` a total of `N.iter = 100000` times, and the sample mean is calculated for each sample. Next follows a histogram of the simulated sample means, which closely approximates the sampling distribution of \(\overline{X}\); see Section \@ref(sec-simulated-sampling-distributions). Also shown are the sample mean and sample variance of all of the simulated \(\overline{X}\) values. As a final step, when the student clicks the second plot, a normal curve with the same mean and variance as the simulated \(\overline{X}\) values is superimposed over the histogram. Students should compare the population's theoretical mean and variance to the simulated mean and variance of the sampling distribution. They should also compare the shape of the simulated sampling distribution to the shape of the normal distribution.

The three separate `clt1`, `clt2`, and `clt3` functions were written so that students could compare what happens overall when the shape of the population distribution changes.
## Sampling Distributions of Two-Sample Statistics

There are often two populations under consideration, and it is sometimes of interest to compare properties between groups. To do so we take independent samples from each population and calculate respective sample statistics for comparison. In some simple cases the sampling distribution of the comparison is known and easy to derive; such cases are the subject of the present section.
### Difference of Independent Sample Means

Let \(X_{1}\), \(X_{2}\), ... , \(X_{n_{1}}\) be an \(SRS(n_{1})\) from a \(\mathsf{norm}(\mathtt{mean}=\mu_{X},\,\mathtt{sd}=\sigma_{X})\) distribution and let \(Y_{1}\), \(Y_{2}\), ... , \(Y_{n_{2}}\) be an \(SRS(n_{2})\) from a \(\mathsf{norm}(\mathtt{mean}=\mu_{Y},\,\mathtt{sd}=\sigma_{Y})\) distribution. Suppose that \(X_{1}\), \(X_{2}\), ... , \(X_{n_{1}}\) and \(Y_{1}\), \(Y_{2}\), ... , \(Y_{n_{2}}\) are independent samples. Then the quantity \begin{equation} \label{eq-diff-indep-sample-means} \frac{\overline{X}-\overline{Y}-(\mu_{X}-\mu_{Y})}{\sqrt{\left.\sigma_{X}^{2}\right/ n_{1}+\left.\sigma_{Y}^{2}\right/ n_{2}}} \end{equation} has a \(\mathsf{norm}(\mathtt{mean}=0,\,\mathtt{sd}=1)\) sampling distribution. Equivalently, \(\overline{X}-\overline{Y}\) has a \(\mathsf{norm}(\mathtt{mean}=\mu_{X}-\mu_{Y},\,\mathtt{sd}=\sqrt{\left.\sigma_{X}^{2}\right/ n_{1}+\left.\sigma_{Y}^{2}\right/ n_{2}})\) sampling distribution.
\bigskip
We know that \(\overline{X}\) is \(\mathsf{norm}(\mathtt{mean}=\mu_{X},\,\mathtt{sd}=\sigma_{X}/\sqrt{n_{1}})\) and we also know that \(\overline{Y}\) is \(\mathsf{norm}(\mathtt{mean}=\mu_{Y},\,\mathtt{sd}=\sigma_{Y}/\sqrt{n_{2}})\). And since the samples \(X_{1}\), \(X_{2}\), ..., \(X_{n_{1}}\) and \(Y_{1}\), \(Y_{2}\), ..., \(Y_{n_{2}}\) are independent, so too are \(\overline{X}\) and \(\overline{Y}\). The distribution of their difference is thus normal as well, and the mean and standard deviation are given by Proposition \@ref(pro:mean-sd-lin-comb-two).
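A short simulation corroborates the claim; the parameter values below are arbitrary choices for illustration.

```r
n1 <- 20; n2 <- 30
# difference of sample means from two independent normal populations
d <- replicate(10000, mean(rnorm(n1, mean = 5, sd = 2)) -
                      mean(rnorm(n2, mean = 3, sd = 4)))
mean(d)                  # should be close to 5 - 3 = 2
sd(d)                    # should be close to the theoretical value below
sqrt(2^2/n1 + 4^2/n2)
```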
\bigskip
```{block, type="remark"}
Even if the distribution of one or both of the samples is not normal, the quantity in Equation \eqref{eq-diff-indep-sample-means} will be approximately normal provided both sample sizes are large.
```
\bigskip

```{block, type="remark"}
For the special case of \(\mu_{X}=\mu_{Y}\) we have shown that
\begin{equation}
\frac{\overline{X} - \overline{Y}}{\sqrt{ \sigma_{X}^{2}/ n_{1} + \sigma_{Y}^{2}/n_{2}}}
\end{equation}
has a \(\mathsf{norm}(\mathtt{mean}=0,\,\mathtt{sd}=1)\) sampling distribution, or in other words, \(\overline{X} - \overline{Y}\) has a \(\mathsf{norm}(\mathtt{mean} = 0,\,\mathtt{sd} = \sqrt{\sigma_{X}^{2} / n_{1} + \sigma_{Y}^{2} / n_{2}})\) sampling distribution. This will be important when it comes time to do hypothesis tests; see Section \@ref(sec-conf-interv-for-diff-means).
```
### Difference of Independent Sample Proportions

Let \(X_{1}\), \(X_{2}\), ..., \(X_{n_{1}}\) be an \(SRS(n_{1})\) from a \(\mathsf{binom}(\mathtt{size}=1,\,\mathtt{prob}=p_{1})\) distribution and let \(Y_{1}\), \(Y_{2}\), ..., \(Y_{n_{2}}\) be an \(SRS(n_{2})\) from a \(\mathsf{binom}(\mathtt{size}=1,\,\mathtt{prob}=p_{2})\) distribution. Suppose that \(X_{1}\), \(X_{2}\), ... , \(X_{n_{1}}\) and \(Y_{1}\), \(Y_{2}\), ... , \(Y_{n_{2}}\) are independent samples. Define \begin{equation} \hat{p}_{1}=\frac{1}{n_{1}}\sum_{i=1}^{n_{1}}X_{i}\quad \mbox{and}\quad \hat{p}_{2}=\frac{1}{n_{2}}\sum_{j=1}^{n_{2}}Y_{j}. \end{equation} Then the sampling distribution of \begin{equation} \frac{\hat{p}_{1}-\hat{p}_{2}-(p_{1}-p_{2})}{\sqrt{\frac{p_{1}(1-p_{1})}{n_{1}}+\frac{p_{2}(1-p_{2})}{n_{2}}}} \end{equation} approaches a \(\mathsf{norm}(\mathtt{mean}=0,\,\mathtt{sd}=1)\) distribution as both \(n_{1},\, n_{2}\to\infty\). In other words, the sampling distribution of \(\hat{p}_{1}-\hat{p}_{2}\) is approximately \begin{equation} \mathsf{norm}\left(\mathtt{mean}=p_{1}-p_{2},\,\mathtt{sd}=\sqrt{\frac{p_{1}(1-p_{1})}{n_{1}}+\frac{p_{2}(1-p_{2})}{n_{2}}}\right), \end{equation} provided both \(n_{1}\) and \(n_{2}\) are sufficiently large.
\bigskip
We know that \(\hat{p}_{1}\) is approximately normal for \(n_{1}\) sufficiently large by the CLT, and we know that \(\hat{p}_{2}\) is approximately normal for \(n_{2}\) sufficiently large, also by the CLT. Further, \(\hat{p}_{1}\) and \(\hat{p}_{2}\) are independent since they are derived from independent samples. And a difference of independent (approximately) normal distributions is (approximately) normal[^sampdist1], by Exercise \@ref(xca-diff-indep-norm). The expressions for the mean and standard deviation follow immediately from Proposition \@ref(pro:mean-sd-lin-comb-two) combined with the formulas for the \(\mathsf{binom}(\mathtt{size}=1,\,\mathtt{prob}=p)\) distribution from Chapter \@ref(cha-discrete-distributions).
[^sampdist1]: This does not explicitly follow because of our cavalier use of "approximately" in too many places. To be more thorough, however, would require more concepts than we can afford at the moment. The interested reader may consult a more advanced text, specifically the topic of weak convergence, that is, convergence in distribution.
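Again a quick simulation agrees with the theory; this is a sketch with arbitrary choices for \(n_{1}\), \(n_{2}\), \(p_{1}\), and \(p_{2}\).

```r
n1 <- 200; n2 <- 150; p1 <- 0.4; p2 <- 0.25
# each sample mean of binom(size = 1) variates is a sample proportion
d <- replicate(10000, mean(rbinom(n1, size = 1, prob = p1)) -
                      mean(rbinom(n2, size = 1, prob = p2)))
mean(d)                                # close to p1 - p2 = 0.15
sd(d)                                  # close to the theoretical value below
sqrt(p1*(1 - p1)/n1 + p2*(1 - p2)/n2)
```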
### Ratio of Independent Sample Variances

Let \(X_{1}\), \(X_{2}\), ..., \(X_{n_{1}}\) be an \(SRS(n_{1})\) from a \(\mathsf{norm}(\mathtt{mean}=\mu_{X},\,\mathtt{sd}=\sigma_{X})\) distribution and let \(Y_{1}\), \(Y_{2}\), ... , \(Y_{n_{2}}\) be an \(SRS(n_{2})\) from a \(\mathsf{norm}(\mathtt{mean}=\mu_{Y},\,\mathtt{sd}=\sigma_{Y})\) distribution. Suppose that \(X_{1}\), \(X_{2}\), ... , \(X_{n_{1}}\) and \(Y_{1}\), \(Y_{2}\), ... , \(Y_{n_{2}}\) are independent samples. Then the ratio \begin{equation} F=\frac{\sigma_{Y}^{2}S_{X}^{2}}{\sigma_{X}^{2}S_{Y}^{2}} \end{equation} has an \(\mathsf{f}(\mathtt{df1}=n_{1}-1,\,\mathtt{df2}=n_{2}-1)\) sampling distribution.
We know from Theorem \@ref(thm:xbar-ands) that \((n_{1}-1)S_{X}^{2}/\sigma_{X}^{2}\) is distributed \(\mathsf{chisq}(\mathtt{df}=n_{1}-1)\) and \((n_{2}-1)S_{Y}^{2}/\sigma_{Y}^{2}\) is distributed \(\mathsf{chisq}(\mathtt{df}=n_{2}-1)\). Now write \[ F=\frac{\sigma_{Y}^{2}S_{X}^{2}}{\sigma_{X}^{2}S_{Y}^{2}}=\frac{\left.(n_{1}-1)S_{X}^{2}\right/ (n_{1}-1)}{\left.(n_{2}-1)S_{Y}^{2}\right/ (n_{2}-1)}\cdot\frac{\left.1\right/ \sigma_{X}^{2}}{\left.1\right/ \sigma_{Y}^{2}}, \] by multiplying and dividing the numerator by \(n_{1}-1\) and doing likewise for the denominator with \(n_{2}-1\). Now we may regroup the terms into \[ F=\frac{\left.\frac{(n_{1}-1)S_{X}^{2}}{\sigma_{X}^{2}}\right/ (n_{1}-1)}{\left.\frac{(n_{2}-1)S_{Y}^{2}}{\sigma_{Y}^{2}}\right/ (n_{2}-1)}, \] and we recognize \(F\) to be the ratio of independent \(\mathsf{chisq}\) distributions, each divided by its respective degrees of freedom: \(\mathtt{df}=n_{1}-1\) in the numerator and \(\mathtt{df}=n_{2}-1\) in the denominator. This is, indeed, the definition of Snedecor's \(F\) distribution.
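For the special case \(\sigma_{X}=\sigma_{Y}\) (see the remark below), the ratio reduces to \(S_{X}^{2}/S_{Y}^{2}\), and a simulation sketch with arbitrary parameter values confirms the \(\mathsf{f}\) shape:

```r
n1 <- 10; n2 <- 15; sigma <- 2   # common standard deviation
fs <- replicate(10000, var(rnorm(n1, sd = sigma)) / var(rnorm(n2, sd = sigma)))
hist(fs, breaks = 60, freq = FALSE, xlim = c(0, 6))
# overlay the f(df1 = n1 - 1, df2 = n2 - 1) density from the theorem
curve(df(x, df1 = n1 - 1, df2 = n2 - 1), add = TRUE, lwd = 2)
```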
\bigskip
```{block, type="remark"}
For the special case of \(\sigma_{X}=\sigma_{Y}\) we have shown that
\begin{equation}
F=\frac{S_{X}^{2}}{S_{Y}^{2}}
\end{equation}
has an \(\mathsf{f}(\mathtt{df1}=n_{1}-1,\,\mathtt{df2}=n_{2}-1)\) sampling distribution. This will be important in Chapter \@ref(cha-estimation) and onward.
```
## Simulated Sampling Distributions {#sec-simulated-sampling-distributions}

Some comparisons are meaningful, but their sampling distribution is not quite so tidy to describe analytically. What do we do then? As it turns out, we do not need to know the exact analytical form of the sampling distribution; sometimes it is enough to approximate it with a simulated distribution. In this section we will show you how. Note that R is particularly well suited to compute simulated sampling distributions, much more so than, say, SPSS or SAS.

### The Interquartile Range

```r
iqrs <- replicate(100, IQR(rnorm(100)))
```
We can look at the mean of the simulated values

```r
mean(iqrs)   # close to 1.349, the IQR of a standard normal distribution
```
and we can see the standard deviation

```r
sd(iqrs)
```
Now let's take a look at a plot of the simulated values.

```r
hist(iqrs, breaks = 20)
```
(ref:cap-simulated-iqr) \small A plot of simulated IQRs.
### The Median Absolute Deviation

```r
mads <- replicate(100, mad(rnorm(100)))
```
We can look at the mean of the simulated values

```r
mean(mads)   # close to 1, since R's mad() is rescaled to estimate sd for normal data
```
and we can see the standard deviation

```r
sd(mads)
```
Now let's take a look at a plot of the simulated values.

```r
hist(mads, breaks = 20)
```
(ref:cap-simulated-mads) \small A plot of simulated MADs.
```r
k <- 1
n <- sample(10:30, size = 10, replace = TRUE)
mu <- round(rnorm(10, mean = 20))
```

```{block2, type="xca"}
Suppose that we observe a random sample \(X_{1}\), \(X_{2}\), ... , \(X_{n}\) of size \(n = \) `r n` from a \(\mathsf{norm}(\mathtt{mean} = \) `r mu`\()\) distribution.
```
\bigskip

```{block, type="xca", label="xca-clt123"}
In this exercise we will investigate how the shape of the population distribution affects the time until the distribution of \(\overline{X}\) is acceptably normal.
```
\bigskip
```{block, type="xca"}
Let \(X_{1}\), ..., \(X_{25}\) be a random sample from a \(\mathsf{norm}(\mathtt{mean}=37,\,\mathtt{sd}=45)\) distribution, and let \(\overline{X}\) be the sample mean of these \(n=25\) observations.
```