# Sampling Distributions {#cha-sampling-distributions}

#    IPSUR: Introduction to Probability and Statistics Using R
#    Copyright (C) 2018  G. Jay Kerns
#
#    Chapter: Sampling Distributions
#
#    This file is part of IPSUR.
#
#    IPSUR is free software: you can redistribute it and/or modify
#    it under the terms of the GNU General Public License as published by
#    the Free Software Foundation, either version 3 of the License, or
#    (at your option) any later version.
#
#    IPSUR is distributed in the hope that it will be useful,
#    but WITHOUT ANY WARRANTY; without even the implied warranty of
#    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#    GNU General Public License for more details.
#
#    You should have received a copy of the GNU General Public License
#    along with IPSUR.  If not, see <http://www.gnu.org/licenses/>.

This is an important chapter; it is the bridge from probability and descriptive statistics that we studied in Chapters \@ref(cha-describing-data-distributions) through \@ref(cha-multivariable-distributions) to inferential statistics which forms the latter part of this book.

Here is the link: we are presented with a population about which we would like to learn. And while it would be desirable to examine every single member of the population, we find that it is either impossible or infeasible for us to do so, so we resort to collecting a sample instead. We do not lose heart. Our method will suffice, provided the sample is representative of the population. A good way to achieve this is to sample randomly from the population.

Supposing for the sake of argument that we have collected a random sample, the next task is to make some sense out of the data because the complete list of sample information is usually cumbersome, unwieldy. We summarize the data set with a descriptive statistic, a quantity calculated from the data (we saw many examples of these in Chapter \@ref(cha-describing-data-distributions)). But our sample was random... therefore, it stands to reason that our statistic will be random, too. How is the statistic distributed?

The probability distribution associated with the population (from which we sample) is called the population distribution, and the probability distribution associated with our statistic is called its sampling distribution; clearly, the two are interrelated. To learn about the population distribution, it is imperative to know everything we can about the sampling distribution. Such is the goal of this chapter.

We begin by introducing the notion of simple random samples and cataloguing some of their more convenient mathematical properties. Next we focus on what happens in the special case of sampling from the normal distribution (which, again, has several convenient mathematical properties), and in particular, we meet the sampling distribution of \(\overline{X}\) and \(S^{2}\). Then we explore what happens to \(\overline{X}\)'s sampling distribution when the population is not normal and prove one of the most remarkable theorems in statistics, the Central Limit Theorem (CLT).

With the CLT in hand, we then investigate the sampling distributions of several other popular statistics, taking full advantage of those with a tractable form. We finish the chapter with an exploration of statistics whose sampling distributions are not quite so tractable, and to accomplish this goal we will use simulation methods that are grounded in all of our work in the previous four chapters.

**What do I want them to know?**

## Simple Random Samples {#sec-simple-random-samples}

If \(X_{1}\), \(X_{2}\), ..., \(X_{n}\) are independent with
\(X_{i}\sim f\) for \(i=1,2,\ldots,n\), then we say that \(X_{1}\),
\(X_{2}\), ..., \(X_{n}\) are *independent and identically
distributed* (IID) from the population \(f\) or alternatively we say
that \(X_{1}\), \(X_{2}\), ..., \(X_{n}\) are a *simple random sample
of size* \(n\), denoted \(SRS(n)\), from the population \(f\).

\bigskip

```{proposition, label="mean-sd-xbar"} Let (X_{1}), (X_{2}), ..., (X_{n}) be a (SRS(n)) from a population distribution with mean (\mu) and finite standard deviation (\sigma). Then the mean and standard deviation of (\overline{X}) are given by the formulas (\mu_{\overline{X}}=\mu) and (\sigma_{\overline{X}}=\sigma/\sqrt{n}).

\bigskip

```{proof}
Plug in \(a_{1}=a_{2}=\cdots=a_{n}=1/n\) in Proposition
\@ref(pro:mean-sd-lin-comb).
```
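
As a quick sanity check of the proposition, we can simulate many sample means and compare their observed mean and standard deviation with \(\mu\) and \(\sigma/\sqrt{n}\); the exponential population below (for which \(\mu=\sigma=1/2\)) is an arbitrary illustrative choice, not part of the text.

```r
# Simulated check of mu_Xbar = mu and sigma_Xbar = sigma/sqrt(n).
# The exponential(rate = 2) population is an arbitrary illustrative choice.
set.seed(42)
n <- 25
xbars <- replicate(10000, mean(rexp(n, rate = 2)))
mean(xbars)   # should be close to mu = 0.5
sd(xbars)     # should be close to sigma/sqrt(n) = 0.5/sqrt(25) = 0.1
```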

The next fact will be useful to us when it comes time to prove the Central Limit Theorem in Section \@ref(sec-central-limit-theorem).

\bigskip

```{proposition, label="pro-mgf-xbar"} Let (X_{1}), (X_{2}), ..., (X_{n}) be a (SRS(n)) from a population distribution with MGF (M(t)). Then the MGF of (\overline{X}) is given by \begin{equation} M_{\overline{X}}(t)=\left[M\left(\frac{t}{n}\right)\right]^{n}. \end{equation}

\bigskip

```{proof}
Go from the definition:
\begin{eqnarray*}
M_{\overline{X}}(t) & = & \mathbb{E}\,\mathrm{e}^{t\overline{X}},\\
 & = & \mathbb{E}\,\mathrm{e}^{t(X_{1}+\cdots+X_{n})/n},\\
 & = & \mathbb{E}\,\mathrm{e}^{tX_{1}/n}\mathrm{e}^{tX_{2}/n}\cdots\mathrm{e}^{tX_{n}/n}.
\end{eqnarray*}
And because \(X_{1}\), \(X_{2}\), ..., \(X_{n}\) are independent,
Proposition \@ref(pro:indep-implies-prodexpect) allows us to distribute the
expectation among each term in the product, which is \[
\mathbb{E}\mathrm{e}^{tX_{1}/n}\,\mathbb{E}\mathrm{e}^{tX_{2}/n}\cdots\mathbb{E}\mathrm{e}^{tX_{n}/n}.
\] The last step is to recognize that each term in the last product above
is exactly \(M(t/n)\).
```
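
To see the proposition in action we can pick a population whose MGF we know and compare \(\left[M(t/n)\right]^{n}\) with a Monte Carlo estimate of \(\mathbb{E}\,\mathrm{e}^{t\overline{X}}\); the exponential population and the particular values of \(n\) and \(t\) below are illustrative choices, not part of the text.

```r
# Population: exponential with rate 2, whose MGF is M(t) = 2/(2 - t) for t < 2.
set.seed(42)
n <- 10
t <- 0.5
M <- function(t) 2 / (2 - t)
M(t / n)^n                            # [M(t/n)]^n, the MGF of Xbar evaluated at t
xbars <- replicate(50000, mean(rexp(n, rate = 2)))
mean(exp(t * xbars))                  # Monte Carlo estimate of E exp(t * Xbar)
```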

## Sampling from a Normal Distribution {#sec-sampling-from-normal-dist}

### The Distribution of the Sample Mean {#sub-samp-mean-dist-of}

Let \(X_{1}\), \(X_{2}\), ..., \(X_{n}\) be a \(SRS(n)\) from a
\(\mathsf{norm}(\mathtt{mean}=\mu,\,\mathtt{sd}=\sigma)\)
distribution. Then the sample mean \(\overline{X}\) has a
\(\mathsf{norm}(\mathtt{mean}=\mu,\,\mathtt{sd}=\sigma/\sqrt{n})\)
sampling distribution.

The mean and standard deviation of \(\overline{X}\) follow directly
from Proposition \@ref(pro:mean-sd-xbar). To address the shape, first
remember from Section \@ref(sec-the-normal-distribution) that the
\(\mathsf{norm}(\mathtt{mean}=\mu,\,\mathtt{sd}=\sigma)\) MGF is of
the form \[ M(t)=\exp\left[ \mu t+\sigma^{2}t^{2}/2\right] .  \] Now
use Proposition \@ref(pro:mgf-xbar) to find
\begin{eqnarray*}
M_{\overline{X}}(t) & = & \left[M\left(\frac{t}{n}\right)\right]^{n},\\
 & = & \left[\exp\left( \mu(t/n)+\sigma^{2}(t/n)^{2}/2\right) \right]^{n},\\
 & = & \exp\left( \, n\cdot\left[\mu(t/n)+\sigma^{2}(t/n)^{2}/2\right]\right) ,\\
 & = & \exp\left( \mu t+(\sigma/\sqrt{n})^{2}t^{2}/2\right),
\end{eqnarray*}
and we recognize this last quantity as the MGF of a
\(\mathsf{norm}(\mathtt{mean}=\mu,\,\mathtt{sd}=\sigma/\sqrt{n})\)
distribution.
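
A short simulation makes the result concrete; the values of \(\mu\), \(\sigma\), and \(n\) below are arbitrary illustrative choices.

```r
# Simulate sample means from a norm(mean = 10, sd = 3) population with n = 9;
# the theory says Xbar ~ norm(mean = 10, sd = 3/sqrt(9)) = norm(mean = 10, sd = 1).
set.seed(42)
xbars <- replicate(10000, mean(rnorm(9, mean = 10, sd = 3)))
c(mean(xbars), sd(xbars))    # should be close to 10 and 1
hist(xbars, freq = FALSE, breaks = 30)
curve(dnorm(x, mean = 10, sd = 1), add = TRUE, lwd = 2)
```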

### The Distribution of the Sample Variance {#sub-samp-var-dist}

```{theorem, label="thm-xbar-ands"} Let (X_{1}), (X_{2}), ..., (X_{n}) be a (SRS(n)) from a (\mathsf{norm}(\mathtt{mean}=\mu,\,\mathtt{sd}=\sigma)) distribution, and let \begin{equation} \overline{X}=\sum_{i=1}^{n}X_{i}\quad \mbox{and}\quad S^{2}=\frac{1}{n-1}\sum_{i=1}^{n}(X_{i}-\overline{X})^{2}. \end{equation} Then

  1. (\overline{X}) and (S^{2}) are independent, and
  2. The rescaled sample variance \begin{equation} \frac{(n-1)}{\sigma^{2}}S^{2}=\frac{\sum_{i=1}^{n}(X_{i}-\overline{X})^{2}}{\sigma^{2}} \end{equation} has a (\mathsf{chisq}(\mathtt{df}=n-1)) sampling distribution.
\bigskip

```{proof}
The proof is beyond the scope of the present book, but the theorem is
simply too important to be omitted. The interested reader could
consult Casella and Berger [@Casella2002], or Hogg *et al*
[@Hogg2005].
```
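
Although the proof is omitted, the second part of the theorem is easy to check by simulation; the values of \(n\), \(\mu\), and \(\sigma\) below are arbitrary illustrative choices.

```r
# Simulate (n - 1)S^2/sigma^2 from normal samples and compare with the
# chisq(df = n - 1) density.
set.seed(42)
n <- 8
sigma <- 2
v <- replicate(10000, (n - 1) * var(rnorm(n, mean = 5, sd = sigma)) / sigma^2)
hist(v, freq = FALSE, breaks = 40)
curve(dchisq(x, df = n - 1), add = TRUE, lwd = 2)
```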

### The Distribution of Student's t Statistic {#sub-students-t-distribution}

Let \(X_{1}\), \(X_{2}\), ..., \(X_{n}\) be a \(SRS(n)\) from a
\(\mathsf{norm}(\mathtt{mean}=\mu,\,\mathtt{sd}=\sigma)\)
distribution. Then the quantity
\begin{equation}
T=\frac{\overline{X}-\mu}{S/\sqrt{n}}
\end{equation}
has a \(\mathsf{t}(\mathtt{df}=n-1)\) sampling distribution.

\bigskip

Divide the numerator and denominator by \(\sigma\) and rewrite \[
T=\frac{\frac{\overline{X}-\mu}{\sigma/\sqrt{n}}}{S/\sigma}=\frac{\frac{\overline{X}-\mu}{\sigma/\sqrt{n}}}{\sqrt{\left.\frac{(n-1)S^{2}}{\sigma^{2}}\right/
(n-1)}}.  \] Now let \[
Z=\frac{\overline{X}-\mu}{\sigma/\sqrt{n}}\quad \mbox{and}\quad
V=\frac{(n-1)S^{2}}{\sigma^{2}}, \] so that
\begin{equation}
T=\frac{Z}{\sqrt{V/r}},
\end{equation}
where \(r=n-1\).

We know from Section \@ref(sub-samp-mean-dist-of) that
\(Z\sim\mathsf{norm}(\mathtt{mean}=0,\,\mathtt{sd}=1)\) and we know
from Section \@ref(sub-samp-var-dist) that
\(V\sim\mathsf{chisq}(\mathtt{df}=n-1)\). Further, since we are
sampling from a normal distribution, Theorem \@ref(thm:xbar-ands) gives
that \(\overline{X}\) and \(S^{2}\) are independent and by Fact
\@ref(fac-indep-then-function-indep) so are \(Z\) and \(V\). In summary,
the distribution of \(T\) is the same as the distribution of the
quantity \(Z/\sqrt{V/r}\), where
\(Z\sim\mathsf{norm}(\mathtt{mean}=0,\,\mathtt{sd}=1)\) and
\(V\sim\mathsf{chisq}(\mathtt{df}=r)\) are independent. This is in
fact the definition of Student's \(t\) distribution.

This distribution was first published by W. S. Gosset [@Student1908] under the pseudonym Student, and the distribution has consequently come to be known as Student's \(t\) distribution. The PDF of \(T\) can be derived explicitly using the techniques of Section \@ref(sec-functions-of-continuous); it takes the form
\begin{equation}
f_{X}(x)=\frac{\Gamma[(r+1)/2]}{\sqrt{r\pi}\ \Gamma(r/2)}\left(1+\frac{x^{2}}{r}\right)^{-(r+1)/2},\quad -\infty < x < \infty.
\end{equation}
Any random variable \(X\) with the preceding PDF is said to have Student's \(t\) distribution with \(r\) degrees of freedom, and we write \(X\sim\mathsf{t}(\mathtt{df}=r)\). The shape of the PDF is similar to the normal, but the tails are considerably heavier. See Figure \@ref(fig:students-t-dist-vary-df). As with the normal distribution, there are four functions in R associated with the \(t\) distribution, namely `dt`, `pt`, `qt`, and `rt`, which compute the PDF, CDF, quantile function, and generate random variates, respectively.
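
A quick simulation (the values of \(\mu\), \(\sigma\), and \(n\) below are arbitrary illustrative choices) shows the \(T\) statistic from the preceding argument landing on the \(\mathsf{t}(\mathtt{df}=n-1)\) density.

```r
# Simulate T = (Xbar - mu)/(S/sqrt(n)) from normal samples and compare with
# the t(df = n - 1) density.
set.seed(42)
n <- 6; mu <- 3; sigma <- 7
Tstat <- replicate(10000, {
    x <- rnorm(n, mean = mu, sd = sigma)
    (mean(x) - mu) / (sd(x) / sqrt(n))
})
hist(Tstat, freq = FALSE, breaks = 50, xlim = c(-5, 5))
curve(dt(x, df = n - 1), add = TRUE, lwd = 2)
```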

The code to produce Figure \@ref(fig:students-t-dist-vary-df) is

```r
curve(dt(x, df = 30), from = -3, to = 3, lwd = 3, ylab = "y")
ind <- c(1, 2, 3, 5, 10)
for (i in ind) curve(dt(x, df = i), -3, 3, add = TRUE)
```

(ref:cap-students-t-dist-vary-df) \small A plot of Student's \(t\) distribution for various degrees of freedom.

As we did for the normal distribution, we may define \(\mathsf{t}_{\alpha}(\mathtt{df}=n-1)\) to be the number on the \(x\)-axis such that there is exactly \(\alpha\) area under the \(\mathsf{t}(\mathtt{df}=n-1)\) curve to its right.

\bigskip

Find \(\mathsf{t}_{0.01}(\mathtt{df}=23)\) with the quantile function.

```r
qt(0.01, df = 23, lower.tail = FALSE)
```

\bigskip

```{block, type="remark"} There are a few things to note about the (\mathtt{t}(\mathtt{df}=r)) distribution.

  1. The (\mathtt{t}(\mathtt{df}=1)) distribution is the same as the (\mathsf{cauchy}(\mathtt{location}=0,\,\mathtt{scale}=1)) distribution. The Cauchy distribution is rather pathological and is a counterexample to many famous results.
  2. The standard deviation of (\mathsf{t}(\mathtt{df}=r)) is undefined (that is, infinite) unless (r>2). When (r) is more than 2, the standard deviation is always bigger than one, but decreases to 1 as (r\to\infty).
  3. As (r\to\infty), the (\mathtt{t}(\mathtt{df}=r)) distribution approaches the (\mathsf{norm}(\mathtt{mean}=0,\,\mathtt{sd}=1)) distribution.
## The Central Limit Theorem {#sec-central-limit-theorem}


In this section we study the distribution of the sample mean when the
underlying distribution is *not* normal. We saw in Section \@ref(sec-sampling-from-normal-dist) that when \(X_{1}\), \(X_{2}\), ... , \(X_{n}\) is a
\(SRS(n)\) from a
\(\mathsf{norm}(\mathtt{mean}=\mu,\,\mathtt{sd}=\sigma)\) distribution
then \(\overline{X} \sim \mathsf{norm}(\mathtt{mean} =
\mu,\,\mathtt{sd} = \sigma/\sqrt{n})\). In other words, we may say
(owing to Fact \@ref(fac-lin-trans-norm-is-norm)) when the underlying
population is normal that the sampling distribution of \(Z\) defined
by
\begin{equation}
Z=\frac{\overline{X}-\mu}{\sigma/\sqrt{n}}
\end{equation}
is \(\mathsf{norm}(\mathtt{mean}=0,\,\mathtt{sd}=1)\). 

However, there are many populations that are *not* normal ... and the
statistician often finds herself sampling from such populations. What
can be said in this case? The surprising answer is contained in the
following theorem.

\bigskip

```{theorem, label="central-limit-theorem", name="Central Limit Theorem"}
Let \(X_{1}\), \(X_{2}\), ..., \(X_{n}\) be
a \(SRS(n)\) from a population distribution with mean \(\mu\) and
finite standard deviation \(\sigma\). Then the sampling distribution
of
\begin{equation}
Z=\frac{\overline{X}-\mu}{\sigma/\sqrt{n}}
\end{equation}
approaches a \(\mathsf{norm}(\mathtt{mean}=0,\,\mathtt{sd}=1)\) distribution as \(n\to\infty\).
```

\bigskip

```{block, type="remark"} We suppose that (X_{1}), (X_{2}), ... , (X_{n}) are IID, and we learned in Section \@ref(sec-simple-random-samples) that (\overline{X}) has mean (\mu) and standard deviation (\sigma/\sqrt{n}), so we already knew that (Z) has mean zero and standard deviation one. The beauty of the CLT is that it addresses the shape of (Z)'s distribution when the sample size is large.

\bigskip

```{block, type="remark"}
Notice that the shape of the underlying population's distribution is
not mentioned in Theorem \@ref(thm:central-limit-theorem); indeed, the result is true for any
population that is well-behaved enough to have a finite standard
deviation. In particular, if the population is normally distributed
then we know from Section \@ref(sub-samp-mean-dist-of) that the distribution
of \(\overline{X}\) (and \(Z\) by extension) is *exactly* normal, for
*every* \(n\).
```

\bigskip

```{block, type="remark"} How large is "sufficiently large"? It is here that the shape of the underlying population distribution plays a role. For populations with distributions that are approximately symmetric and mound-shaped, the samples may need to be only of size four or five, while for highly skewed or heavy-tailed populations the samples may need to be much larger for the distribution of the sample means to begin to show a bell-shape. Regardless, for a given population distribution (with finite standard deviation) the approximation tends to be better for larger sample sizes.

### How to do it with R

The `TeachingDemos` package [@TeachingDemos] has `clt.examp` and
the `distrTeach` [@distrTeach] package has `illustrateCLT`. Try
the following at the command line (output omitted):
```r
example(clt.examp)
```

and

```r
example(illustrateCLT)
```

The `IPSUR` package [@IPSUR] has the functions `clt1`, `clt2`, and `clt3` (see Exercise \@ref(xca-clt123) at the end of this chapter). Their purpose is to investigate what happens to the sampling distribution of \(\overline{X}\) when the population distribution is mound shaped, has finite support, or is skewed, namely \(\mathsf{t}(\mathtt{df}=3)\), \(\mathsf{unif}(\mathtt{a}=0,\,\mathtt{b}=10)\), and \(\mathsf{gamma}(\mathtt{shape}=1.21,\,\mathtt{scale}=1/2.37)\), respectively.

For example, when the command `clt1()` is issued a plot window opens to show a graph of the PDF of a \(\mathsf{t}(\mathtt{df}=3)\) distribution. On the display are shown numerical values of the population mean and variance. While the students examine the graph the computer is simulating random samples of size `sample.size = 2` from the population distribution `rt` a total of `N.iter = 100000` times, and the sample mean is calculated for each sample. Next follows a histogram of the simulated sample means, which closely approximates the sampling distribution of \(\overline{X}\); see Section \@ref(sec-simulated-sampling-distributions). Also shown are the sample mean and sample variance of all of the simulated \(\overline{X}\) values. As a final step, when the student clicks the second plot, a normal curve with the same mean and variance as the simulated \(\overline{X}\) values is superimposed over the histogram. Students should compare the theoretical population mean and variance to the simulated mean and variance of the sampling distribution. They should also compare the shape of the simulated sampling distribution to the shape of the normal distribution.

The three separate `clt1`, `clt2`, and `clt3` functions were written so that students could compare what happens overall when the shape of the population distribution changes.
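
For readers who want to see the moving parts, here is a minimal base-R sketch in the same spirit, using the skewed gamma population from `clt3`; the smaller number of iterations and the plotting details are illustrative choices, not the package defaults.

```r
# Simulate N.iter sample means of size sample.size from a skewed gamma population,
# then overlay a normal curve with the same mean and sd as the simulated values.
set.seed(42)
sample.size <- 2
N.iter <- 10000
xbars <- replicate(N.iter, mean(rgamma(sample.size, shape = 1.21, scale = 1/2.37)))
hist(xbars, freq = FALSE, breaks = 40)
curve(dnorm(x, mean = mean(xbars), sd = sd(xbars)), add = TRUE, lwd = 2)
```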

## Sampling Distributions of Two-Sample Statistics {#sec-samp-dist-two-samp}

There are often two populations under consideration, and it is sometimes of interest to compare properties between groups. To do so we take independent samples from each population and calculate respective sample statistics for comparison. In some simple cases the sampling distribution of the comparison is known and easy to derive; such cases are the subject of the present section.

### Difference of Independent Sample Means

Let \(X_{1}\), \(X_{2}\), ... , \(X_{n_{1}}\) be an \(SRS(n_{1})\)
from a
\(\mathsf{norm}(\mathtt{mean}=\mu_{X},\,\mathtt{sd}=\sigma_{X})\)
distribution and let \(Y_{1}\), \(Y_{2}\), ... , \(Y_{n_{2}}\) be an
\(SRS(n_{2})\) from a
\(\mathsf{norm}(\mathtt{mean}=\mu_{Y},\,\mathtt{sd}=\sigma_{Y})\)
distribution. Suppose that \(X_{1}\), \(X_{2}\), ... , \(X_{n_{1}}\)
and \(Y_{1}\), \(Y_{2}\), ... , \(Y_{n_{2}}\) are independent
samples. Then the quantity
\begin{equation}
\label{eq-diff-indep-sample-means}
\frac{\overline{X}-\overline{Y}-(\mu_{X}-\mu_{Y})}{\sqrt{\left.\sigma_{X}^{2}\right/ n_{1}+\left.\sigma_{Y}^{2}\right/ n_{2}}}
\end{equation}
has a \(\mathsf{norm}(\mathtt{mean}=0,\,\mathtt{sd}=1)\) sampling
distribution. Equivalently, \(\overline{X}-\overline{Y}\) has a
\(\mathsf{norm}(\mathtt{mean}=\mu_{X}-\mu_{Y},\,\mathtt{sd}=\sqrt{\left.\sigma_{X}^{2}\right/
n_{1}+\left.\sigma_{Y}^{2}\right/ n_{2}})\) sampling distribution.

\bigskip

We know that \(\overline{X}\) is
\(\mathsf{norm}(\mathtt{mean}=\mu_{X},\,\mathtt{sd}=\sigma_{X}/\sqrt{n_{1}})\)
and we also know that \(\overline{Y}\) is
\(\mathsf{norm}(\mathtt{mean}=\mu_{Y},\,\mathtt{sd}=\sigma_{Y}/\sqrt{n_{2}})\). And
since the samples \(X_{1}\), \(X_{2}\), ..., \(X_{n_{1}}\) and
\(Y_{1}\), \(Y_{2}\), ..., \(Y_{n_{2}}\) are independent, so too are
\(\overline{X}\) and \(\overline{Y}\). The distribution of their
difference is thus normal as well, and the mean and standard deviation
are given by Proposition \@ref(pro:mean-sd-lin-comb-two).

\bigskip

```{block, type="remark"} Even if the distribution of one or both of the samples is not normal, the quantity in Equation \eqref{eq-diff-indep-sample-means} will be approximately normal provided both sample sizes are large.

\bigskip

```{block, type="remark"}
For the special case of \(\mu_{X}=\mu_{Y}\) we have shown
 that
\begin{equation}
\frac{\overline{X} - \overline{Y}}{\sqrt{ \sigma_{X}^{2}/ n_{1} + \sigma_{Y}^{2}/n_{2}}}
\end{equation}
has a \(\mathsf{norm}(\mathtt{mean}=0,\,\mathtt{sd}=1)\) sampling
distribution, or in other words, \(\overline{X} - \overline{Y}\) has a
\(\mathsf{norm}(\mathtt{mean} = 0,\,\mathtt{sd} = \sqrt{\sigma_{X}^{2}
/ n_{1} + \sigma_{Y}^{2} / n_{2}})\) sampling distribution. This will
be important when it comes time to do hypothesis tests; see Section
\@ref(sec-conf-interv-for-diff-means).
```
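
As a worked illustration (all of the numbers below are made up), suppose \(\mu_{X}=\mu_{Y}\), \(\sigma_{X}=4\), \(\sigma_{Y}=3\), \(n_{1}=20\), and \(n_{2}=25\); then \(\mathbb{P}(\overline{X}-\overline{Y}>1)\) can be computed directly from the normal sampling distribution.

```r
# P(Xbar - Ybar > 1) when mu_X = mu_Y, sigma_X = 4, sigma_Y = 3, n1 = 20, n2 = 25
# (hypothetical numbers for illustration only).
se <- sqrt(4^2/20 + 3^2/25)
pnorm(1, mean = 0, sd = se, lower.tail = FALSE)
```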

### Difference of Independent Sample Proportions

Let \(X_{1}\), \(X_{2}\), ..., \(X_{n_{1}}\) be an \(SRS(n_{1})\) from
a \(\mathsf{binom}(\mathtt{size}=1,\,\mathtt{prob}=p_{1})\)
distribution and let \(Y_{1}\), \(Y_{2}\), ..., \(Y_{n_{2}}\) be an
\(SRS(n_{2})\) from a
\(\mathsf{binom}(\mathtt{size}=1,\,\mathtt{prob}=p_{2})\)
distribution. Suppose that \(X_{1}\), \(X_{2}\), ... , \(X_{n_{1}}\)
and \(Y_{1}\), \(Y_{2}\), ... , \(Y_{n_{2}}\) are independent
samples. Define
\begin{equation}
\hat{p}_{1}=\frac{1}{n_{1}}\sum_{i=1}^{n_{1}}X_{i}\quad \mbox{and}\quad \hat{p}_{2}=\frac{1}{n_{2}}\sum_{j=1}^{n_{2}}Y_{j}.
\end{equation}
Then the sampling distribution of
\begin{equation}
\frac{\hat{p}_{1}-\hat{p}_{2}-(p_{1}-p_{2})}{\sqrt{\frac{p_{1}(1-p_{1})}{n_{1}}+\frac{p_{2}(1-p_{2})}{n_{2}}}}
\end{equation}
approaches a \(\mathsf{norm}(\mathtt{mean}=0,\,\mathtt{sd}=1)\) distribution as both \(n_{1},\, n_{2}\to\infty\). In other words, the sampling distribution of \(\hat{p}_{1}-\hat{p}_{2}\) is approximately
\begin{equation}
\mathsf{norm}\left(\mathtt{mean}=p_{1}-p_{2},\,\mathtt{sd}=\sqrt{\frac{p_{1}(1-p_{1})}{n_{1}}+\frac{p_{2}(1-p_{2})}{n_{2}}}\right),
\end{equation}
provided both \(n_{1}\) and \(n_{2}\) are sufficiently large.

\bigskip

We know that \(\hat{p}_{1}\) is approximately normal for \(n_{1}\)
sufficiently large by the CLT, and we know that \(\hat{p}_{2}\) is
approximately normal for \(n_{2}\) sufficiently large, also by the
CLT. Further, \(\hat{p}_{1}\) and \(\hat{p}_{2}\) are independent
since they are derived from independent samples. And a difference of
independent (approximately) normal distributions is (approximately)
normal[^sampdist1], by Exercise \@ref(xca-diff-indep-norm).

The expressions for the mean and standard deviation follow immediately
from Proposition \@ref(pro:mean-sd-lin-comb-two) combined with the formulas
for the \(\mathsf{binom}(\mathtt{size}=1,\,\mathtt{prob}=p)\)
distribution from Chapter \@ref(cha-discrete-distributions).

[^sampdist1]: This does not explicitly follow because of our cavalier use of "approximately" in too many places. To be more thorough, however, would require more concepts than we can afford at the moment. The interested reader may consult a more advanced text, specifically the topic of weak convergence, that is, convergence in distribution.
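
For instance (with hypothetical numbers chosen only to illustrate the formula), if \(p_{1}=0.55\), \(p_{2}=0.50\), \(n_{1}=300\), and \(n_{2}=350\), the approximate probability that \(\hat{p}_{1}-\hat{p}_{2}\) exceeds 0.10 can be computed with `pnorm`.

```r
# Approximate P(phat1 - phat2 > 0.10) with p1 = 0.55, p2 = 0.50, n1 = 300, n2 = 350
# (hypothetical numbers for illustration only).
p1 <- 0.55; p2 <- 0.50; n1 <- 300; n2 <- 350
se <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
pnorm(0.10, mean = p1 - p2, sd = se, lower.tail = FALSE)
```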

### Ratio of Independent Sample Variances

Let \(X_{1}\), \(X_{2}\), ..., \(X_{n_{1}}\) be an \(SRS(n_{1})\) from
a \(\mathsf{norm}(\mathtt{mean}=\mu_{X},\,\mathtt{sd}=\sigma_{X})\)
distribution and let \(Y_{1}\), \(Y_{2}\), ... , \(Y_{n_{2}}\) be an
\(SRS(n_{2})\) from a
\(\mathsf{norm}(\mathtt{mean}=\mu_{Y},\,\mathtt{sd}=\sigma_{Y})\)
distribution. Suppose that \(X_{1}\), \(X_{2}\), ... , \(X_{n_{1}}\)
and \(Y_{1}\), \(Y_{2}\), ... , \(Y_{n_{2}}\) are independent
samples. Then the ratio
\begin{equation}
F=\frac{\sigma_{Y}^{2}S_{X}^{2}}{\sigma_{X}^{2}S_{Y}^{2}}
\end{equation}
has an \(\mathsf{f}(\mathtt{df1}=n_{1}-1,\,\mathtt{df2}=n_{2}-1)\)
sampling distribution.

We know from Theorem \@ref(thm:xbar-ands) that
\((n_{1}-1)S_{X}^{2}/\sigma_{X}^{2}\) is distributed
\(\mathsf{chisq}(\mathtt{df}=n_{1}-1)\) and
\((n_{2}-1)S_{Y}^{2}/\sigma_{Y}^{2}\) is distributed
\(\mathsf{chisq}(\mathtt{df}=n_{2}-1)\). Now write \[
F=\frac{\sigma_{Y}^{2}S_{X}^{2}}{\sigma_{X}^{2}S_{Y}^{2}}=\frac{\left.(n_{1}-1)S_{X}^{2}\right/
(n_{1}-1)}{\left.(n_{2}-1)S_{Y}^{2}\right/
(n_{2}-1)}\cdot\frac{\left.1\right/ \sigma_{X}^{2}}{\left.1\right/
\sigma_{Y}^{2}}, \] by multiplying and dividing the numerator with
\(n_{1}-1\) and doing likewise for the denominator with
\(n_{2}-1\). Now we may regroup the terms into \[
F=\frac{\left.\frac{(n_{1}-1)S_{X}^{2}}{\sigma_{X}^{2}}\right/
(n_{1}-1)}{\left.\frac{(n_{2}-1)S_{Y}^{2}}{\sigma_{Y}^{2}}\right/
(n_{2}-1)}, \] and we recognize \(F\) to be the ratio of independent
\(\mathsf{chisq}\) distributions, each divided by its respective
numerator \(\mathtt{df}=n_{1}-1\) and denominator
\(\mathtt{df}=n_{2}-1\) degrees of freedom. This is, indeed, the
definition of Snedecor's \(F\) distribution.

\bigskip

```{block, type="remark"} For the special case of (\sigma_{X}=\sigma_{Y}) we have shown that \begin{equation} F=\frac{S_{X}^{2}}{S_{Y}^{2}} \end{equation} has an (\mathsf{f}(\mathtt{df1}=n_{1}-1,\,\mathtt{df2}=n_{2}-1)) sampling distribution. This will be important in Chapters \@ref(cha-estimation) onward.

## Simulated Sampling Distributions {#sec-simulated-sampling-distributions}

Some comparisons are meaningful, but their sampling distribution is
not quite so tidy to describe analytically. What do we do then?

As it turns out, we do not need to know the exact analytical form of
the sampling distribution; sometimes it is enough to approximate it
with a simulated distribution. In this section we will show you
how. Note that R is particularly well suited to compute simulated
sampling distributions, much more so than, say, SPSS or SAS.

### The Interquartile Range

We simulate 100 values of the IQR, each computed from a sample of size 100 from a \(\mathsf{norm}(\mathtt{mean}=0,\,\mathtt{sd}=1)\) population.

```r
iqrs <- replicate(100, IQR(rnorm(100)))
```

We can look at the mean of the simulated values

```r
mean(iqrs)    # close to 1.349
```

and we can see the standard deviation

```r
sd(iqrs)
```

Now let's take a look at a plot of the simulated values

```r
hist(iqrs, breaks = 20)
```

(ref:cap-simulated-iqr) \small A plot of simulated IQRs.

### The Median Absolute Deviation

We simulate 100 values of the MAD, each computed from a sample of size 100 from a \(\mathsf{norm}(\mathtt{mean}=0,\,\mathtt{sd}=1)\) population.

```r
mads <- replicate(100, mad(rnorm(100)))
```

We can look at the mean of the simulated values

```r
mean(mads)    # close to 1
```

and we can see the standard deviation

```r
sd(mads)
```

Now let's take a look at a plot of the simulated values

```r
hist(mads, breaks = 20)
```

(ref:cap-simulated-mads) \small A plot of simulated MADs.

## Exercises

```r
k <- 1
n <- sample(10:30, size = 10, replace = TRUE)
mu <- round(rnorm(10, mean = 20))
```

```{block2, type="xca"}
Suppose that we observe a random sample \(X_{1}\), \(X_{2}\), ... , \(X_{n}\) of size \(n = \) `r n[k]` from a \(\mathsf{norm}(\mathtt{mean} = \) `r mu[k]`\()\) distribution.

  1. What is the mean of \(\overline{X}\)?
  2. What is the standard deviation of \(\overline{X}\)?
  3. What is the distribution of \(\overline{X}\)? (approximately)
  4. Find \(\mathbb{P}(a < \overline{X} \leq b)\).
  5. Find \(\mathbb{P}(\overline{X} > c)\).
```
\bigskip

```{block, type="xca", label="xca-clt123"}
In this exercise we will investigate how the shape of
the population distribution affects the time until the distribution of
\(\overline{X}\) is acceptably normal.

\bigskip

```{block, type="xca"} Let (X_{1}),..., (X_{25}) be a random sample from a (\mathsf{norm}(\mathtt{mean}=37,\,\mathtt{sd}=45)) distribution, and let (\overline{X}) be the sample mean of these (n=25) observations.

  1. How is (\overline{X}) distributed? (\mathsf{norm}(\mathtt{mean}=37,\,\mathtt{sd}=45/\sqrt{25}))
  2. Find (\mathbb{P}(\overline{X} > 43.1)). And that's all she wrote. ```

