library(learnr) library(testwhat) library(magrittr) options(repos = "https://cloud.r-project.org") tutorial_options( exercise.timelimit = 60, exercise.checker = testwhat::testwhat_learnr ) knitr::opts_chunk$set(comment = NA)
Warning. It is straightforward from this setup to understand that the problem is not symmetric in the hypotheses. Indeed, the procedure relies on what happens when $H_0$ is true but does not depend on $H_1$.
You design a test statistic $T(X_1, \dots, X_n)$ for the purpose of performing this test which must satisfy at least the first three of the following four properties:
Example: Test on the mean. Suppose you have a sample of $n$ i.i.d. random variables $X_1, \dots, X_n \sim \mathcal{N}(\mu, \sigma^2)$ and you know the value of $\sigma^2$. We want to test whether the mean $\mu$ of the distribution is equal to some pre-specified value $\mu_0$. We can therefore use $H_0: \mu = \mu_0$ vs. $H_1: \mu \ne \mu_0$. At this point, a good candidate test statistic to look at for performing this test is:
$$ T(X_1, \dots, X_n) := \sqrt{n} \frac{\overline X - \mu_0}{\sigma}, \quad \mbox{with} \quad \overline X := \frac{1}{n} \sum_{i=1}^n X_i. $$
Indeed,
If you designed your test statistic carefully, you might have access to its theoretical distribution when $H_0$ is true under strong distributional assumptions about the data. This is called parametric hypothesis testing.
If it is not the case, you can often derive the theoretical distribution of the statistic under the null hypothesis asymptotically, i.e. assuming that you have a large sample ($n \gg 1$); this is called asymptotic hypothesis testing.
If you are in a large sample size regime but still cannot have access to the theoretical distribution of your test statistic, you can approach this distribution using bootstrapping; this is called bootstrap hypothesis testing.
If you are in a low sample size regime, then you can approach the distribution of the test statistic using permutations; this is called permutation hypothesis testing.
Depending on what you put into the alternative hypothesis $H_1$, larger values of the test statistic that raise suspicion regarding the validity of $H_0$ might mean:
In the first two cases, we say that the test is one-sided. In the latter case, we say that the test is two-sided because we interpret large values in both tails as suspicious.
Example: Test on the mean. Suppose you have a sample of $n$ i.i.d. random variables $X_1, \dots, X_n \sim \mathcal{N}(\mu, \sigma^2)$ and you know the value of $\sigma^2$. We want to test whether the mean $\mu$ of the distribution is equal to some pre-specified value $\mu_0$. As we have since, a good candidate test statistic to look at for performing this test is:
$$ T(X_1, \dots, X_n) := \sqrt{n} \frac{\overline X - \mu_0}{\sigma}, \quad \mbox{with} \quad \overline X := \frac{1}{n} \sum_{i=1}^n X_i. $$
Now, using this test statistic, we might be interested in performing three different tests:
Remember that we must look at large values of $T$ that raise suspicion in favor of $H_1$.
knitr::include_graphics("images/one-two-sided.jpeg")
So we are now at a point where we know which tail(s) which we should look at and we now need to make a decision as to what large means. In other words, above which threshold on all possible values of my test statistic should I consider that I can reject $H_0$.
When you decide to reject or not, you might make an error:
The only error rate you can control is the type I error rate because it is a probability computed assuming that the null hypothesis is true, which is exactly the situation we put ourselves in for making the decision (i.e. looking at the tails of the null distribution of $T$).
At this point, you can decide that you do not want to make more than a certain amount of type I errors. So you want to force that $\mathbb{P}(\mbox{reject } H_0 | H_0 \mbox{ is true}) \le \alpha$, for some upper bound threshold $\alpha$ on the probability of type I errors. This threshold is called significance level of the test and is often denoted by the greek letter $\alpha$.
Let us now translate what this rule implies for the one-sided right-tail alternative case. The event "$\mbox{reject } H_0$" translates into $T > x$ for some $x$ value of the test statistic $T$ while the conditional event "$H_0 \mbox{ is true}$" specifies to evaluate the probability of this event under the assumption that $H_0$ is true. At this point, the rule becomes:
$$ \mathbb{P}{H_0}(T > x) \le \alpha, $$ $$ \Leftrightarrow 1 - \mathbb{P}{H_0}(T \le x) \le \alpha, $$ $$ \Leftrightarrow 1 - F_{T|H_0}(x) \le \alpha, $$ $$ \Leftrightarrow F_{T|H_0}(x) \ge 1 - \alpha, $$
which is verified for all $x \ge q_{1-\alpha}$, where $q_{1-\alpha}$ is the quantile of order $1-\alpha$ of the null distribution of $T$. So in the end:
This decision-making rule guarantees that the probability of making a type I error is upper-bounded by the significance level $\alpha$.
Another approach pertains to basing our decision on the remaining area under the curve in the relevant tail(s) of the null distribution of $T$ for values more extreme than the one calculated from the observed sample.
The area under the curve in the relevant tail(s) of the null distribution where we find values more extreme than the one calculated from the observed sample defined by the value of the test statistic calculated from the observed sample is called the p-value. It measures what was the probability of observing the data we did observe, or data even more in favor of the alternative hypothesis, assuming that the null hypothesis is true. Formally, if $t_0$ is the value of the test statistic calculated from the observed sample, then:
Hence, if the $p$-value is very small, it might means that
In practice, we can show that rejecting the null hypothesis when $p \le \alpha$ also procudes a decision-making rule that guarantees a probability of type I error at most $\alpha$.
Summary of decision-making procedure for the one-sided right-tail scenario:
knitr::include_graphics("images/pvalue-alpha.png")
The statistical power of a test is an important aspect of the test because
It is the probability of being able to correctly reject the null hypothesis, i.e. to reject it when it is indeed wrong. It is often denoted by $\mu$. In terms of events, it is defined by:
$$ \mu := \mathbb{P}(\mbox{Reject } H_0 | H_1 \mbox{ is true}). $$
Note. The statistical power $\mu$ is equal to $1 - \beta$, where $\beta$ is the greek letter often used to designate the probability of type II errors, $\beta := \mathbb{P}(\mbox{Do not reject } H_0 | H_1 \mbox{ is true})$.
Power calculations are difficult because it requires to put ourselve under the alternative hypothesis, which is often of the form $\mu > \mu_0$ or $\mu < \mu_0$ or $\mu \ne \mu_0$. In other words, you often lack information to compute probabilities assuming that the alternative hypothesis is true. You need to specify several alternative hypothesis and see how the power evolves when exploring several alternatives. You need to have an idea of the order of the difference you might be expecting between, say, a true mean and a nominal value, or of the magnitude of the difference between two population means, etc.
Compliance testing aims at determining whether a process, product, or service complies with the requirements of a specification, technical standard, contract, or regulation.
Model: Let $(X_1, \dots, X_n)$ be $n$ i.i.d. random variables that follow a normal distribution $\mathcal{N}(\mu, \sigma^2)$.
Hypotheses:
$$ H_0: \mu = \mu_0 \quad \mbox{vs.} \quad H_1: \mu \ne \mu_0 \quad \mbox{or} \quad H_1: \mu > \mu_0 \quad \mbox{or} \quad H_1: \mu < \mu_0. $$
Test Statistic:
$$ T = \sqrt{n} \frac{\overline X - \mu_0}{\sigma}. $$
Null distribution:
$$ T \stackrel{H_0}{\sim} \mathcal{N}(0, 1). $$
R function: None.
Test Statistic:
$$ T = \frac{\overline X - \mu_0}{\sqrt{\frac{S^2}{n}}}, \quad \mbox{with} \quad S^2 := \frac{1}{n-1} \sum_{i = 1}^n (X_i - \overline X)^2. $$
Null distribution:
$$ T \stackrel{H_0}{\sim} \mathcal{S}tudent(n-1). $$
R function: t.test()
Assumptions: Let $(X_1, \dots, X_n)$ be $n$ i.i.d. random variables that follow a normal distribution $\mathcal{N}(\mu, \sigma^2)$.
Hypotheses:
$$ H_0: \sigma^2 = \sigma_0^2 \quad \mbox{vs.} \quad H_1: \sigma^2 \ne \sigma_0^2 \quad \mbox{or} \quad H_1: \sigma^2 > \sigma_0^2 \quad \mbox{or} \quad H_1: \sigma^2 < \sigma_0^2. $$
Test Statistic:
$$ T = \sum_{i = 1}^n \frac{(X_i - \mu)^2}{\sigma_0^2}. $$
Null distribution:
$$ T \stackrel{H_0}{\sim} \chi^2(n). $$
R function: None.
Test Statistic:
$$ T = \sum_{i = 1}^n \frac{(X_i - \overline X)^2}{\sigma_0^2}. $$
Null distribution:
$$ T \stackrel{H_0}{\sim} \chi^2(n-1). $$
R function: None.
Here we want to test compliance of the proporition $p$ of a given population to a pre-specified rate.
Assumptions: Let $(X_1, \dots, X_n)$ be $n$ i.i.d. random variables that follow a Bernoulli distribution $\mathcal{B}e(p)$. The interpretation is that $X_i$ measures if individual $i$ possesses the characteristic of which we want to know the proportion or not.
Hypotheses:
$$ H_0: p = p_0 \quad \mbox{vs.} \quad H_1: p \ne p_0 \quad \mbox{or} \quad H_1: p > p_0 \quad \mbox{or} \quad H_1: p < p_0. $$
Test Statistic:
$$ T = \sum_{i = 1}^n \frac{(X_i - \mu)^2}{\sigma_0^2}. $$
Null distribution: Under the null hypothesis,
$$ T = \sqrt{n}\frac{\overline X_n - p_0}{\sqrt{p_0 (1 - p_0)}} \xrightarrow{\mathcal{L}} \mathcal{N}(0,1). $$
R function: prop.test()
Validity: asymptotic.
The goal is to compare the mean or the variance of two normal populations from an observed finite sample of each.
Model: Let $(X_1, \dots, X_{n_x})$ be $n_X$ i.i.d. random variables that follow a normal distribution $\mathcal{N}(\mu_X, \sigma_X^2)$ and $(Y_1, \dots, Y_{n_Y})$ be $n_Y$ i.i.d. random variables that follow a normal distribution $\mathcal{N}(\mu_Y, \sigma_Y^2)$. Assume furthermore that the two sample are statistically independent.
Hypotheses:
$$ H_0: \sigma_X^2 = \sigma_Y^2 \quad \mbox{vs.} \quad H_1: \sigma_X^2 \ne \sigma_Y^2 \quad \mbox{or} \quad H_1: \sigma_X^2 > \sigma_Y^2 \quad \mbox{or} \quad H_1: \sigma_X^2 < \sigma_Y^2. $$
Test Statistic:
$$ T = \frac{S_X^2}{S_Y^2}, \quad \mbox{with} \quad S_X^2 = \frac{1}{n_X - 1} \sum_{i = 1}^{n_X} (X_i - \overline X)^2 \quad \mbox{and} \quad S_Y^2 = \frac{1}{n_Y - 1} \sum_{i = 1}^{n_Y} (Y_i - \overline Y)^2. $$
Null distribution:
$$ T \stackrel{H_0}{\sim} \mathcal{F}isher(n_X - 1, n_Y - 1). $$
R function: var.test()
Validity: Relies on the normality assumption and independence within and between samples.
Hypotheses:
$$ H_0: \mu_X = \mu_Y \quad \mbox{vs.} \quad H_1: \mu_X \ne \mu_Y \quad \mbox{or} \quad H_1: \mu_X > \mu_Y \quad \mbox{or} \quad H_1: \mu_X < \mu_Y. $$
Test Statistic:
$$ T = \frac{\left( \overline X - \overline Y \right) - \left( \mu_X - \mu_Y \right)}{\sqrt{S_\mathrm{pooled}^2 \left( \frac{1}{n_X} + \frac{1}{n_Y} \right)}}. $$
Null distribution:
$$ T \stackrel{H_0}{\sim} \mathcal{S}tudent(n_X + n_Y - 2), $$
R function: t.test()
Validity: Relies on the normality assumption and independence within and between samples.
Test Statistic:
$$ T = \frac{\left( \overline X - \overline Y \right) - \left( \mu_X - \mu_Y \right)}{\sqrt{\frac{S_X^2}{n_X} + \frac{S_Y^2}{n_Y}}}. $$
Null distribution:
$$ T \stackrel{H_0}{\sim} \mathcal{S}tudent(m), $$
where the degree of freedom $m$ of the Student's law is obtained either via:
$$ m_\mathrm{Welch} = \frac{\left( \frac{S_X^2}{n_X} + \frac{S_Y^2}{n_Y} \right)^2}{\frac{\left( \frac{S_X^2}{n_X} \right)^2}{n_X-1} + \frac{\left( \frac{S_Y^2}{n_Y} \right)^2}{n_Y-1}}; \quad \mbox{or}, $$
$$ m_\mathrm{Satterthwaite} = \frac{(n_X-1)(n_Y-1)}{(n_X-1) C_Y^2 + (n_Y-1) C_X^2}, \quad \mbox{with} \quad C_X = \frac{\frac{S_X^2}{n_X}}{\frac{S_X^2}{n_X} + \frac{S_Y^2}{n_Y}} \quad \mbox{and} \quad C_Y = 1 - C_X. $$
R function: t.test()
Validity: Relies on the normality assumption and independence within and between samples.
Adequacy tests aim at determining if the distribution of the observations is coherent with a given probability distribution.
This test is dedicated to test adequacy to the normal distribution and relies on order statistics, which are nothing but a ranking of the observations from the smallest to the largest.
Hypotheses:
$$ H_0: (X_1, \dots, X_n) \stackrel{i.i.d.}{\sim} \mathcal{N}(\mu, \sigma^2) \quad \mbox{vs.} \quad H_1: (X_1, \dots, X_n) \not\sim \mathcal{N}(\mu, \sigma^2). $$
Test Statistic:
$$ T = {\left(\sum_{i=1}^n a_i X_{(i)}\right)^2 \over \sum_{i=1}^n (X_i-\overline{X})^2} $$ where $X_{(i)}$ is the order statistic for observation $i$, $\overline X$ the sample mean and
$$ (a_1, \dots, a_n) = {\mathbf{m}^\top V^{-1} \over (\mathbf{m}^\top V^{-1}V^{-1}\mathbf{m})^{1/2}}, $$
where $m_i = \mathbb{E}[Z_{(i)}]$ with $(Z_1, \dots, Z_n) \stackrel{i.i.d.}{\sim} \mathcal{N}(0,1)$ and $V$ is the variance-covariance matrix of $(Z_{(1)}, \dots, Z_{(n)})$.
Null distribution:
$$ T \stackrel{H_0}{\sim} \mathcal{W}ilks(n). $$
R function: shapiro.test()
Validity: The sample size should meet $3 \le n \le 5000$. The Wilks distribution is approximated except for $n=3$.
This test aims at testing adequacy to any probability distribution but has generally less statistical power than the Shapiro-Wilk test when testing adequacy to the normal distribution.
Hypotheses:
$$ H_0: (X_1, \dots, X_n) \stackrel{i.i.d.}{\sim} F_0 \quad \mbox{vs.} \quad H_1: (X_1, \dots, X_n) \not\sim F_0. $$
Test Statistic:
$$ T = \sup_{x \in \mathbb R} | F_n(x) - F(x)| $$ where $F_n$ is the cumulative distribution function and $F$ is the cumulative distribution function of the law under testing.
Null distribution: Under the null hypothesis
$$ \sqrt{n} T \xrightarrow{\mathcal{L}} \sup_{x \in \mathbb R} | B(F(x))|, $$
where $B$ is the Brownian bridge.
R function: ks.test()
Validity: Asymptotic.
Model: We group observations in classes and we compare the observed frequencies of these classes to the corresponding theoretical frequencies as given by the hypothesized law $F_0$.
Hypotheses:
$$ H_0: (X_1, \dots, X_n) \stackrel{i.i.d.}{\sim} F_0 \quad \mbox{vs.} \quad H_1: (X_1, \dots, X_n) \not\sim F_0. $$ Test Statistic:
$$ T = n\sum_{i=1}^k\frac{(f_i-f_{0i})^2}{f_{0i}}, $$
where $n$ is the total number of observations, $k$ the number of classes, $f_i = n_i / n$ and $f_{i0}$ the theoretical frequency of class $i$, i.e. the probability that the random variable ends up in class $i$.
Null distribution: Under the null hypothesis
$$ T \xrightarrow{\mathcal{L}} \chi^2_{k - 1 - \ell}, $$
where $\ell$ is the number of estimated parameters for $F_0$.
R function: chisq.test()
Validity: Requires large class frequencies. Typically, we group observations in order to get classes with $n_i \ge 5$.
The independence $\chi ^2$ test is used to test whether two categorical random variables are independent.
Model: Let $X = (Y, Z) = ((Y_1, Z_1), \dots, (Y_n, Z_n))$ be a bivariate sample of $n$ i.i.d. pairs of categorical random variables. Let $\nu$ be the law of $(Y_1, Z_1)$, $\mu$ the law of $Y_1$ and $\lambda$ the law of $Z_1$. Let ${y_1, \dots, y_s}$ be the set of possible values for $Y_1$ and ${z_1, \dots, z_r}$ the set of possible values for $Z_1$. For $\ell \in {1, \dots, s}$ et $h \in {1, \dots, r}$, let
$$ N_{\ell,\cdot} = \left| \left{ i \in {1, \dots, n}; Y_i = y_\ell \right} \right|,\quad N_{\cdot, h} = \left| \left{ i \in {1, \dots, n}; Z_i = z_h \right} \right|, $$
$$ N_{\ell,h} = \left| \left{(i \in {1, \dots, n}; Y_i = y_\ell, Z_i = z_h \right} \right|.$$
Hypotheses:
$$ H_0: \nu = \mu \otimes \lambda \quad \mbox{vs.} \quad H_1: \nu \ne \mu \otimes \lambda. $$
Test statistic:
$$ T = n \sum_{\ell = 1}^s \sum_{h = 1}^r \frac{ \left( \frac{N_{\ell, h}}{n} - \frac{N_{\ell, \cdot}}{n} \frac{N_{\cdot, h}}{n} \right)^2 }{ \frac{N_{\ell, \cdot}}{n} \frac{N_{\cdot, h}}{n} }. $$
Null distribution: Under the null hypothesis,
$$ T \xrightarrow{\mathcal{L}} \chi^2 \left( (s-1)(r-1) \right). $$
R function: chisq.test()
Validity: Asymptotic. In practice, we often use the criteria $n \gg 30$ and $N_{\ell, h} \gg 5$, for each pair $(\ell, h)$.
We want to study the energy efficiency of a chemical reaction that is documented having a nominal energy efficiency of $90\%$. Based on previous experiments on the same reaction, we know that the energy efficiency is a Gaussian random variable with unknown mean $\mu$ and variance equal to $2$. In the last $5$ days, the plant has given the following energy efficiencies (in percentage):
$$ 91.6, \quad 88.75, \quad 90.8, \quad 89.95, \quad 91.3 $$
A study about air pollution done by a research station measured, on $8$ different air samples, the following values of a polluant (in $\mu$g/m$^2$):
$$ 2.2 \quad 1.8 \quad 3.1 \quad 2.0 \quad 2.4 \quad 2.0 \quad 2.1 \quad 1.2 $$
Assuming that the sampled population is normal,
A medical inspection in an elementary school during a measles epidemic led to the examination of $30$ children to assess whether they were affected. The results are in a tibble exam
which contains the following:
set.seed(1234) p <- 0.1 n <- 30 exam <- tibble::tibble( Id = 1:30, Status = sample(c("Healthy", "Sick"), 30, replace = TRUE, prob = c(1 - p, p)) )
Let $p$ be the probability that a child from the same school is sick. Determine:
The capacities (in ampere-hours) of $10$ batteries were recorded as follows:
$$ 140, \quad 136, \quad 150, \quad 144, \quad 148, \quad 152, \quad 138, \quad 141, \quad 143, \quad 151 $$
A company produces barbed wire in skeins of $100$m each, nominally. The real length of the skeins is a random variable $X$ distributed as a $\mathcal{N}(\mu, 4)$. Measuring $10$ skeins, we get the following lengths:
$$ 98.683, 96.599, 99.617, 102.544, 100.110, 102.000, 98.394, 100.324, 98.743, 103.247 $$
In an atmospheric study the researchers registered, over $8$ different samples of air, the following concentration of COG (in micrograms over cubic meter):
$$ 2.3;\; 1.7;\; 3.2;\; 2.1;\; 2.3;\; 2.0;\; 2.2;\; 1.2 $$
Assume now that the COG concentration is normally distributed.
On a total of $2350$ interviewed citizens, $1890$ approve the construction of a new movie theater.
A computer chip manufacturer claims that no more than $1\%$ of the chips it sends out are defective. An electronics company, impressed with this claim, has purchased a large quantity of such chips. To determine if the manufacturer’s claim can be taken literally, the company has decided to test a sample of $300$ of these chips. If $5$ of these $300$ chips are found to be defective, should the manufacturer’s claim be rejected?
To determine the impurity level in alloys of steel, two different tests can be used. $8$ specimens are tested, with both procedures, and the results are written in the following table:
specimen n. | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 ----------- | --- | --- | --- | --- | --- | --- | --- | --- Test 1 | 1.2 | 1.3 | 1.7 | 1.8 | 1.5 | 1.4 | 1.4 | 1.3 Test 2 | 1.4 | 1.7 | 2.0 | 2.1 | 1.5 | 1.3 | 1.7 | 1.6
Assume that the data are normal.
A sample of $300$ voters from region A and $200$ voters from region B showed that the $56\%$ and the $48\%$, respectively, prefer a certain candidate. Can we say that at a significance level of $5\%$ there is a difference between the two regions?
In a sample of $100$ measures of the boiling temperature of a certain liquid, we obtain a sample mean $\overline{x} = 100^{o}C$ with a sample variance $s^2 = 0.0098^{o}C^2$. Assuming that the observation comes from a normal population:
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.