```r
knitr::opts_chunk$set(echo = TRUE, warning = FALSE)
suppressWarnings(suppressMessages(suppressPackageStartupMessages(library(ggplot2))))
suppressWarnings(suppressMessages(suppressPackageStartupMessages(library(mosaic))))
suppressWarnings(suppressMessages(suppressPackageStartupMessages(library(plotly))))
suppressWarnings(suppressMessages(suppressPackageStartupMessages(library(openintro))))
library(png)
library(grid)
library(statshelpR)
library(pander)
data(meanTime25)
```
This is adapted from a YouTube video from OpenIntro.org, with some information from davidmlane.com.
A point estimate provides a single plausible value for a parameter. However, a point estimate is rarely perfect. Usually there is some error in the estimate. To account for this error, we provide a plausible range of values for the parameter.
A single point estimate such as the sample mean is unlikely to equal the exact population parameter, such as the true population mean. A plausible range of values for the parameter is called a confidence interval.
The standard error is a measure of the uncertainty associated with the point estimate and provides a guide for how large we should make the confidence interval. The standard error represents the standard deviation associated with the estimate.
The standard error of the mean is designated as: $\sigma_M$. It is the standard deviation of the sampling distribution of the mean. The formula for the standard error of the mean is:
$$\sigma_M = \frac{\sigma}{\sqrt{N}}$$
where $\sigma$ is the standard deviation of the original distribution and N is the sample size (the number of scores each mean is based upon). This formula does not assume a normal distribution. However, many of the uses of the formula do assume a normal distribution.
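As a quick numeric check of the formula, here is a minimal sketch using hypothetical values $\sigma = 10$ and $N = 25$:

```r
sigma <- 10       # population standard deviation (hypothetical)
N <- 25           # sample size (hypothetical)
sigma / sqrt(N)   # standard error of the mean: 10 / 5 = 2
```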
The formula shows that the larger the sample size, the smaller the standard error of the mean. More specifically, the size of the standard error of the mean is inversely proportional to the square root of the sample size.
A plot of the effect of sample size on the standard error, for a standard deviation of 10, is shown below.
```r
sigma <- 10
n <- seq(1, 30, 1)
func <- function(n) {
  return(sigma / sqrt(n))
}
se <- sapply(n, func)
df <- data.frame(n = n, se = se)
plt <- ggplot(data = df, aes(x = n, y = se)) +
  geom_point() +
  xlab(label = "sample size") +
  ylab("SE") +
  scale_y_continuous(breaks = seq(from = 0, to = 10, by = 1), limits = c(0, 10)) +
  ggtitle(expression(paste(sigma[M], " as a function of sample size"))) +
  theme(axis.text = element_text(size = 12),
        axis.title = element_text(size = 14),
        plot.title = element_text(hjust = 0.5))  # center the title
print(plt)
```
If the interval spreads out 2 standard errors from the point estimate, we can be roughly 95% confident that we have captured the true parameter.
$$point\ estimate \pm 2 \times SE$$
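As a minimal sketch of this interval in R (the point estimate and standard error below are hypothetical values, chosen only for illustration):

```r
x_bar <- 94.5  # hypothetical sample mean (point estimate)
se <- 2        # hypothetical standard error of the mean
# Rough 95% interval: point estimate plus or minus 2 standard errors
c(lower = x_bar - 2 * se, upper = x_bar + 2 * se)
```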
What does 95% confidence mean? Suppose we took many samples and built a 95% confidence interval from each sample. Then 95% of those intervals would contain the actual mean.
```r
manySampImg <- readPNG('inc/many-samples.png')
grid.raster(manySampImg)
```
Note that only one of these 25 intervals did not capture the true mean of the population; 96% of the samples produced intervals that captured it.
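We can also check this interpretation by simulation. The sketch below assumes a normal population with mean 95 and standard deviation 10 (hypothetical values chosen for illustration), draws many samples, and reports the proportion of rough 95% intervals that capture the true mean:

```r
set.seed(42)      # for reproducibility
true_mean <- 95   # hypothetical population mean
true_sd <- 10     # hypothetical population standard deviation
n <- 50           # observations per sample
reps <- 10000     # number of simulated samples

captured <- replicate(reps, {
  x <- rnorm(n, mean = true_mean, sd = true_sd)
  se <- sd(x) / sqrt(n)                # estimated standard error
  abs(mean(x) - true_mean) < 2 * se    # does the interval capture the truth?
})
mean(captured)    # proportion of intervals capturing the mean; close to 0.95
```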
Let's take 100,000 samples, calculate the mean of each, and plot them in a histogram to get an especially accurate depiction of the sampling distribution.
```r
df <- data.frame(obs = meanTime25)
plt <- ggplot(data = df, aes(obs)) +
  geom_histogram(breaks = seq(85, 105, by = .5), col = "black", fill = "blue", alpha = .5) +
  ylab(label = "frequency") +
  xlab("sample mean") +
  ggtitle("Sample Mean Distribution") +
  theme(axis.text = element_text(size = 12),
        axis.title = element_text(size = 14),
        plot.title = element_text(hjust = 0.5))  # center the title
print(plt)
```
And we can construct a normal probability plot to show that the distribution is indeed approximately normal.
```r
qqPlt <- ggplot(data = df, aes(sample = obs)) +
  stat_qq() +
  ylab(label = "sample means") +
  xlab("theoretical quantiles") +
  ggtitle("Normal Probability Plot") +
  theme(axis.text = element_text(size = 12),
        axis.title = element_text(size = 14),
        plot.title = element_text(hjust = 0.5))  # center the title
print(qqPlt)
```
Both plots show that the collection of means follow a normal model. This result can be explained by the Central Limit Theorem.
An informal description of the theorem is:
If a sample consists of at least 30 independent observations and the data are not strongly skewed, then the sample mean is well approximated by a normal model.
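As an illustrative sketch of the theorem, the code below draws repeated samples from a strongly right-skewed exponential population (an assumption made only for this demonstration) and shows that the sample means still look approximately normal:

```r
set.seed(7)  # for reproducibility
# 10,000 sample means, each from an exponential (right-skewed) population
df_clt <- data.frame(m = replicate(10000, mean(rexp(40, rate = 1))))
ggplot(data = df_clt, aes(m)) +
  geom_histogram(bins = 50, col = "black", fill = "blue", alpha = .5) +
  xlab("sample mean") +
  ylab("frequency") +
  ggtitle("Means of Exponential Samples (n = 40)")
```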
If we want a 99% confidence level, we need to widen our interval; if a lower confidence level suffices, we can narrow it.
A confidence interval has three components: the point estimate, the critical value $Z^*$ (which is set by the desired confidence level), and the standard error:

$$point\ estimate \pm Z^* \times SE$$
The statshelpR package supplies the function `calc_z_star` to compute this value. An example computing $Z^*$ for a 99% confidence interval is shown below:

```r
z_star <- calc_z_star(0.99)
names(z_star) <- "z_star"
pander(z_star)
```
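For reference, assuming `calc_z_star` returns the standard two-sided critical value (check the statshelpR documentation to confirm), the same number can be computed with base R's `qnorm`:

```r
conf_level <- 0.99
# Quantile leaving (1 - conf_level)/2 in each tail of the standard normal
qnorm(1 - (1 - conf_level) / 2)  # approximately 2.576
```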
The following conditions are required for the sampling distribution of $\bar{x}$ to be nearly normal and for the estimate of the standard error to be sufficiently accurate (a code sketch of these checks appears after the conditions):
The sample observations are independent. Independence is reasonable when the observations come from random assignment in an experiment or from a random sample of less than 10% of the population.
The sample size is large. $n > 30$ is a good rule of thumb.
The distribution of observations is not strongly skewed.
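A minimal sketch of how these checks might look in practice (`x` below is a hypothetical sample, generated only for illustration):

```r
x <- rnorm(50, mean = 95, sd = 10)  # hypothetical sample; use real data here

length(x) > 30   # rule-of-thumb sample size check

# Visual check that the observations are not strongly skewed
ggplot(data = data.frame(x = x), aes(x)) +
  geom_histogram(bins = 15, col = "black", fill = "blue", alpha = .5) +
  xlab("observation") +
  ylab("frequency")
```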
We need to be careful with our language when expressing confidence levels. Correct language takes the form: "We are x% confident that the population parameter is between the lower and upper bound of our confidence interval."
Incorrect language might try to describe the confidence interval as capturing the population parameter with a certain probability. This is one of the most common errors: once a particular interval has been computed, the parameter either is or is not inside it, so the confidence level describes the reliability of the procedure, not a probability for any single interval.
Confidence intervals only try to capture the population parameter. A confidence interval says nothing about capturing individual observations, a proportion of observations, or other point estimates.