In jrminter/statshelpR: Convenience functions for statistical analysis

knitr::opts_chunk$set(echo = TRUE, warning=FALSE)
suppressWarnings(suppressMessages(suppressPackageStartupMessages(library(ggplot2))))
suppressWarnings(suppressMessages(suppressPackageStartupMessages(library(mosaic))))
suppressWarnings(suppressMessages(suppressPackageStartupMessages(library(plotly))))
suppressWarnings(suppressMessages(suppressPackageStartupMessages(library(openintro))))
library(png)
library(grid)
library(statshelpR)
library(pander)
data(meanTime25)

Introduction

This material is adapted from a YouTube video from OpenIntro.org, with some additional information from davidmlane.com.

A point estimate provides a single plausible value for a parameter. However, a point estimate is rarely perfect. Usually there is some error in the estimate. To account for this error, we provide a plausible range of values for the parameter.

An approximate 95% confidence level

A single point estimate, such as the sample mean, is unlikely to equal the corresponding population parameter, such as the true population mean, exactly. A plausible range of values for the parameter is called a confidence interval.

The standard error

The standard error is a measure of the uncertainty associated with the point estimate and provides a guide for how wide we should make the confidence interval. The standard error represents the standard deviation associated with the estimate.

The standard error of the mean is designated as: $\sigma_M$. It is the standard deviation of the sampling distribution of the mean. The formula for the standard error of the mean is:

$$\sigma_M = \frac{\sigma}{\sqrt{N}}$$

where $\sigma$ is the standard deviation of the original distribution and N is the sample size (the number of scores each mean is based upon). This formula does not assume a normal distribution. However, many of the uses of the formula do assume a normal distribution.

The formula shows that the larger the sample size, the smaller the standard error of the mean. More specifically, the size of the standard error of the mean is inversely proportional to the square root of the sample size.
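As a quick numerical check of the inverse-square-root relationship, the sketch below evaluates the formula at two sample sizes. The helper name `se_mean` is hypothetical and is not part of statshelpR; the value of 10 for the standard deviation is chosen only for illustration.

```r
# Standard error of the mean for a known population standard deviation.
# `se_mean` is a hypothetical helper name used only for this illustration.
se_mean <- function(sigma, n) {
  sigma / sqrt(n)
}

se_25  <- se_mean(10, 25)    # 10 / sqrt(25)  = 2
se_100 <- se_mean(10, 100)   # 10 / sqrt(100) = 1

# Quadrupling the sample size halves the standard error.
se_25 / se_100
```

Because the sample size enters through a square root, halving the standard error requires four times as many observations, not twice as many.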

A plot of the effect of sample size on the standard error, for a standard deviation of 10, is shown below.

sigma <- 10
n <- seq(1, 30, 1)
func <- function(n){
  return(sigma/sqrt(n))
}

se <- sapply(n,func)
df <- data.frame(n=n, se=se)

plt <-  ggplot(data=df, aes(x=n,y=se)) + geom_point() +
        xlab(label="sample size (n)") + ylab("SE") +
        scale_y_continuous(breaks = seq(from = 0, to = 10, by = 1),
                           limits = c(0,10)) +
        ggtitle(expression(paste(sigma[M], " as a function of sample size"))) +
        theme(axis.text=element_text(size=12),
              axis.title=element_text(size=14),
              plot.title = element_text(hjust = 0.5)) # center the title
print(plt)

If the interval spreads out 2 standard errors from the point estimate, we can be roughly 95% confident that we have captured the true parameter.

$$point\ estimate \pm 2 \times SE$$
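A minimal sketch of this calculation in base R follows. The data are simulated purely for illustration (the mean of 95 and standard deviation of 15 are assumptions, not values from the source), and the population standard deviation is replaced by the sample standard deviation, which is the usual practice when $\sigma$ is unknown.

```r
set.seed(42)                               # arbitrary seed, for reproducibility
x_sim <- rnorm(100, mean = 95, sd = 15)    # simulated data, illustration only

x_bar  <- mean(x_sim)                      # point estimate
se_hat <- sd(x_sim) / sqrt(length(x_sim))  # estimated standard error

# Approximate 95% confidence interval: point estimate +/- 2 * SE
ci_approx <- c(lower = x_bar - 2 * se_hat, upper = x_bar + 2 * se_hat)
ci_approx
```

The interval is centered on the point estimate and is four standard errors wide.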

A sampling distribution for the mean

What does 95% confidence mean? Suppose we took many samples and built a 95% confidence interval from each sample. Then about 95% of those intervals would contain the actual mean.

manySampImg <- readPNG('inc/many-samples.png')
grid.raster(manySampImg)

Note that only one of these 25 intervals did not capture the true mean of the population. That is, 96% of the intervals captured the true mean.
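The long-run interpretation can be checked with a small simulation. This is a sketch under assumed values: the population mean, standard deviation, sample size, and number of repetitions below are all chosen for illustration and are not taken from the source data.

```r
set.seed(1)          # arbitrary seed, for reproducibility
mu_true    <- 95     # assumed population mean
sigma_true <- 15     # assumed population standard deviation
n_obs      <- 100    # observations per sample
reps       <- 10000  # number of simulated samples

# For each simulated sample, build a 95% interval and record
# whether it contains the true mean.
covered <- replicate(reps, {
  x_rep  <- rnorm(n_obs, mean = mu_true, sd = sigma_true)
  se_rep <- sd(x_rep) / sqrt(n_obs)
  abs(mean(x_rep) - mu_true) <= 1.96 * se_rep
})

coverage <- mean(covered)  # fraction of intervals that captured mu_true
coverage
```

The resulting fraction should be close to 0.95, although any single run of 25 intervals, like the figure above, can easily show 24 or 23 hits instead.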

Let's take 100,000 samples, calculate the mean of each, and plot them in a histogram to get an especially accurate depiction of the sampling distribution.

df <- data.frame(obs=meanTime25)

plt <- ggplot(data=df, aes(obs)) + 
       geom_histogram(breaks=seq(85, 105, by = .5), col="black",
                      fill="blue", alpha = .5) +
        ylab(label="frequency") + xlab("sample mean") +
        ggtitle("Sample Mean Distribution") +
        theme(axis.text=element_text(size=12),
              axis.title=element_text(size=14),
              plot.title = element_text(hjust = 0.5)) # center the title
print(plt)

We can also draw a normal probability plot to check that the distribution is approximately normal.

qqPlt <- ggplot(data=df, aes(sample=obs))+ stat_qq() +
         ylab(label="sample means") + xlab("theoretical quantiles") +
         ggtitle("Normal Probability Plot") +
         theme(axis.text=element_text(size=12),
               axis.title=element_text(size=14),
               plot.title = element_text(hjust = 0.5)) # center the title
print(qqPlt)

Both plots show that the collection of means follows a normal model. This result can be explained by the Central Limit Theorem.

An informal description is:

If a sample consists of at least 30 independent observations and the data are not strongly skewed, then the sample mean is well approximated by a normal model.
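To illustrate this informal statement, the sketch below draws repeated samples of 30 observations from a uniform distribution, which is clearly not normal but is not skewed. The choice of distribution, the sample size, and the number of repetitions are assumptions made only for this example.

```r
set.seed(123)        # arbitrary seed, for reproducibility
n_clt    <- 30       # observations per sample
reps_clt <- 10000    # number of simulated samples

# Uniform(0, 1) has mean 0.5 and standard deviation 1 / sqrt(12).
sample_means <- replicate(reps_clt, mean(runif(n_clt, min = 0, max = 1)))

theoretical_se <- (1 / sqrt(12)) / sqrt(n_clt)

c(mean_of_means  = mean(sample_means),
  sd_of_means    = sd(sample_means),
  theoretical_se = theoretical_se)
```

The spread of the simulated means should agree closely with $\sigma/\sqrt{N}$, and passing `sample_means` to `qqnorm()` should give a nearly straight line.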

Changing the confidence level

If we want a higher confidence level, such as 99%, we need to widen our interval. If we want a lower confidence level, we can narrow it.

A confidence interval has three components: the point estimate, the critical value $Z^*$, and the standard error.

The 95% confidence interval structure provides guidance on how to make intervals with new confidence levels: point estimate $\pm 1.96 \times SE$. To create a 99% confidence interval, we keep the point estimate and change the multiplier from 1.96 to 2.58, giving point estimate $\pm 2.58 \times SE$. Our statshelpR package supplies the function calc_z_star to compute this value. An example that computes $Z^*$ for a 99% confidence interval is shown below:

z_star <- calc_z_star(0.99)
names(z_star) <- "z_star"
pander(z_star)
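For comparison, the same critical value can be obtained from base R with `qnorm`. This is a sketch only: whether calc_z_star is implemented this way internally is an assumption, and the helper name `z_star_for` is hypothetical.

```r
# Two-sided critical value of the standard normal distribution.
# `z_star_for` is a hypothetical helper name, not a statshelpR function.
z_star_for <- function(conf_level) {
  alpha <- 1 - conf_level    # total probability left in the two tails
  qnorm(1 - alpha / 2)       # upper-tail quantile
}

z_95 <- z_star_for(0.95)
z_99 <- z_star_for(0.99)

round(c(z_95 = z_95, z_99 = z_99), 2)
```

A 99% interval leaves 0.5% in each tail, so the quantile is taken at 0.995 rather than 0.99.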

The sampling distribution of $\bar{x}$ is nearly normal, and the estimate of the standard error is sufficiently accurate, when the following conditions hold:
The sample observations are independent. This holds under random assignment in an experiment, or when the observations come from a random sample of less than 10% of the population.
The sample size is large. $n \geq 30$ is a good rule of thumb.
The distribution of observations is not strongly skewed.

Proper interpretation of confidence levels

We need to be careful with our language when expressing confidence levels. An example of correct wording is "We are x% confident that the population parameter is between the lower and upper bound of our confidence interval."

Incorrect language might describe the confidence interval as capturing the population parameter with a certain probability. This is one of the most common errors. The confidence level only quantifies how plausible it is that the parameter lies within the interval.
Confidence intervals only try to capture the population parameter. A confidence interval says nothing about the confidence of capturing individual observations, a proportion of observations, or about capturing point estimates.