knitr::opts_chunk$set(echo = TRUE)
This is a simple exploration of Analysis of Variance (ANOVA) in R inspired by Swinburne Commons, but using different data structures and filling in some details.
Our goal is to determine whether there is a significant difference in the height of three types of creatures: boys, girls, and aliens.
We use fictitious data, where samples are drawn from three populations that are all normally distributed. All individuals are 14-years old and height is measured in centimeters.
library(dplyr) n <- 50 boy_heights <- tibble( type = rep("boy", n), height = rnorm(n, mean = 163.8, sd = 12.1) ) girl_heights <- tibble( type = rep("girl", n), height = rnorm(n, mean = 161.3, sd = 11.7) ) alien_heights <- tibble( type = rep("alien", n), height = rnorm(n, mean = 127.0, sd = 9.4) ) heights <- dplyr::bind_rows(boy_heights, girl_heights, alien_heights) %>% mutate( type = factor(type), height_units = rep("cm", 3 * n), age = rep(14, 3 * n), age_units = rep("years", 3 * n) ) heights
We have r n
observations each of boys, girls, and aliens for a total of r 3 * n
observations. Let's plot the height
variable by type
for a visual comparison of the types:
library(ggplot2) heights %>% ggplot(aes(x = type, y = height)) + geom_point(alpha = 0.5)
A boxplot of height
by type
gives us a visual summary of the mean values and quartiles for each of the three types:
heights %>% ggplot(aes(x = type, y = height)) + geom_boxplot()
Aliens appear to be quite different from humans with respect to height, but is the difference of the means statistically significant? The mean heights for boys and girls seem to be slightly different, but again we want to know if the difference in the mean height is statistically significant.
We use the aov function in R to fit an analysis of variance model. Note that our balanced dataset with equal numbers of observations for each of the three types satisfies the aov
assumption of a balanced design.
analysis <- aov(height ~ type, data = heights) summary(analysis)
The Pr(>F)
column is the p-value of the F-statistic. In this case, it shows that it is very unlikely that the F-value calculated from the test would have occurred if the null hypothesis of no difference among group means were true. Therefore, we conclude that the null hypothesis is false and the difference among means by type is statistically significant.
Let's compare the types pairwise to see where the differences lie. The Tukey Honest Significant Differences method creates a set of confidence intervals on the differences between the means of each of the levels of a factor:
multi_comparisons <- TukeyHSD(analysis) multi_comparisons
The p adj
value (p-value after adjustment for the multiple comparisons) is 0.0 for the boy-alien
comparison and for the girl-alien
comparison, indicating that the difference of means is significant for these pairs. However for the girl-boy
pair with the p adj
value of r multi_comparisons$type[3,4]
greater than 0.05, we cannot conclude that the difference of means between boys
and girls
is significant.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.