In alexpghayes/distributions: Probability Distributions as S3 Objects

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  message = FALSE,
  warning = FALSE
)

In this vignette we'll calculate an 88 percent confidence interval for the mean of a single sample. We'll use the same data we use for a one-sample T-test, which was:

[ 3, 7, 11, 0, 7, 0, 4, 5, 6, 2 ]

Recall that a confidence interval for the mean based off the T distribution is valid when:

The data comes from a normal distribution.
We have lots of data. How much? Many textbooks use 30 data points as a rule of thumb.

In the T-test example we verified that the sample seems to come from a normal distribution using a quantile-quantile plot (QQ-plot).

The formula for a confidence interval with confidence coefficient $1 - \alpha$ (in our case this is 0.88) is then:

[ \left( \bar x + t_{n-1, \alpha / 2} \cdot \frac{s}{\sqrt{n}}, \bar x + t_{n-1, 1 - \alpha / 2} \cdot \frac{s}{\sqrt{n}} \right) ]

where $\bar x = \frac 1n \sum_{i=1}^n x_i$ is the sample mean, $t_{n-1, \alpha / 2}$ is the $\alpha / 2$ lower quantile of a $t_{n-1}$ distribution, $s = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i - \bar x)^2}$ is the sample standard deviation, and $n$ is the sample size. Since the T distribution is symmetric around zero, this is exactly equivalent to

[ \left( \bar x - t_{n-1, 1 - \alpha / 2} \cdot \frac{s}{\sqrt{n}}, \bar x + t_{n-1, 1 - \alpha / 2} \cdot \frac{s}{\sqrt{n}} \right) ]

which may look slightly more familiar. Having trouble seeing what the difference is? Look at the subscript of the quantiles $t$. Let's go ahead and calculate this out in R. Since our confidence coefficient is 0.88 (corresponding to an 88 percent confidence interval) we have:

[ 0.88 = 1 - \alpha ]

so that $\alpha = 0.12$. Now we can get started.

library(distributions3)

# read in the data
x <- c(3, 7, 11, 0, 7, 0, 4, 5, 6, 2)
n <- length(x)

# make T_9 random variable
T_9 <- StudentsT(df = 9)

# first approach
mean(x) + quantile(T_9, 0.12 / 2) * sd(x) / sqrt(n)
mean(x) + quantile(T_9, 1 - 0.12 / 2) * sd(x) / sqrt(n)

So our confidence interval using the first set of equations is (2.63, 6.37). Now we use the second set of equations:

# second approach
mean(x) - quantile(T_9, 1 - 0.12 / 2) * sd(x) / sqrt(n)
mean(x) + quantile(T_9, 1 - 0.12 / 2) * sd(x) / sqrt(n)

We get the same thing! Just like we expected.

There's one last thing we need to address. You may not have seen either of the formulas for a T-confidence interval that I wrote. You may have seen the formula:

[ \left( \bar x - t_{n-1, \alpha / 2} \cdot \frac{s}{\sqrt{n}}, \bar x + t_{n-1, \alpha / 2} \cdot \frac{s}{\sqrt{n}} \right) ]

If this is you, you have my condolences, and your instructor probably hates you. Ask that they use lower quantile notation rather than upper quantile notation.

This looks almost like the second approach, except using $t_{n-1, \alpha / 2}$ instead of $t_{n-1, 1 - \alpha / 2}$. What this comes down to is whether or not $t_{n-1, \alpha / 2}$ represents a lower quantile or an upper quantile. For a lower quantile, you look at the p.d.f. and start integrating from negative infinity, stop when the integral equals $\alpha$, and that take value to be the quantile. This is the only sane way to do things, although it requires being slightly more verbose so it can be inconvenient at times. Thus the upper quantile, in which case you do the same integration but start from positive infinity.

The difference is easiest to understand with a picture:

got_ggplot2 <- requireNamespace("ggplot2", quietly = TRUE)
got_cowplot <- requireNamespace("cowplot", quietly = TRUE)
got_ggplot2_and_cowplot <- got_ggplot2 * got_cowplot

library(ggplot2)
library(cowplot)

grid <- seq(-4, 4, length.out = 300)
density <- pdf(T_9, grid)
lower_trunc <- ifelse(grid <= 1.83, density, 0)
upper_trunc <- ifelse(grid >= 1.83, density, 0)

lower_quantile_plot <- ggplot(data = NULL) +
  geom_area(aes(grid, lower_trunc, alpha = 0.2), fill = "steelblue") +
  geom_line(aes(grid, density), size = 1, color = "grey") +
  geom_vline(xintercept = 1.83, size = 1, color = "darkgrey") +
  geom_text(
    aes(x = 3.2, y = 0.3, label = "t[0.975] == 1.83"),
    parse = TRUE,
    size = 4
  ) +
  labs(
    title = "Lower tail quantile of a T (df = 9)",
    subtitle = "Integral of shaded region is 0.975",
    y = "Density",
    x = "Support"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

upper_quantile_plot <- ggplot(data = NULL) +
  geom_area(aes(grid, upper_trunc, alpha = 0.2), fill = "steelblue") +
  geom_line(aes(grid, density), size = 1, color = "grey") +
  geom_vline(xintercept = 1.83, size = 1, color = "darkgrey") +
  geom_text(
    aes(x = 3.2, y = 0.3, label = "t[0.025] == 1.83"),
    parse = TRUE,
    size = 4
  ) +
  labs(
    title = "Upper tail quantile of a T (df = 9)",
    subtitle = "Integral of shaded region is 0.025",
    y = "Density",
    x = "Support"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

cowplot::plot_grid(lower_quantile_plot, upper_quantile_plot)

Note that distributions always returns a lower quantile from the quantile() function.

If you are truly unfortunate, your instructor may use $t_{n-1, \alpha / 2}$ to mean lower tail quantiles at times and upper tail quantiles at other times. If this is the case, only god can help you.

Using the t.test() function

If you want to verify that your calculation is correct, R has a function t.test() that performs T-tests and calculates T confidence intervals for means.

To get a confidence interval for a single sample, we pass t.test() a vector of data, and tell it the confidence coefficient (recall ours was 0.88) via the conf.level argument. conf.level defaults to 0.95, which means if we don't specify a confidence interval we get a 95 percent confidence interval.

t.test(x, conf.level = 0.88)

t.test() returns the interval (2.63, 6.37), exactly what we calculated by hand, which is reassuring. It also does a one-sample T-test for us as well.

If we don't specify conf.level, we get a 95 percent confidence interval: