knitr::opts_chunk$set(echo = TRUE)

Source code.

ANOVA in R

This is a simple exploration of Analysis of Variance (ANOVA) in R inspired by Swinburne Commons, but using different data structures and filling in some details.

Our goal is to determine whether there is a significant difference in the height of three types of creatures: boys, girls, and aliens.

Manufacturing some fictitious data

We use fictitious data, where samples are drawn from three populations that are all normally distributed. All individuals are 14-years old and height is measured in centimeters.

library(dplyr)

n <- 50
boy_heights <- tibble(
  type = rep("boy", n),
  height = rnorm(n, mean = 163.8, sd = 12.1)
)
girl_heights <- tibble(
  type = rep("girl", n),
  height = rnorm(n, mean = 161.3, sd = 11.7)
)
alien_heights <- tibble(
  type = rep("alien", n),
  height = rnorm(n, mean = 127.0, sd = 9.4)
)
heights <- dplyr::bind_rows(boy_heights, girl_heights, alien_heights) %>% 
  mutate(
    type = factor(type),
    height_units = rep("cm", 3 * n),
    age = rep(14, 3 * n),
    age_units = rep("years", 3 * n)
  )
heights

Exploratory data analysis

We have r n observations each of boys, girls, and aliens for a total of r 3 * n observations. Let's plot the height variable by type for a visual comparison of the types:

library(ggplot2)

heights %>% ggplot(aes(x = type, y = height)) +
  geom_point(alpha = 0.5)

A boxplot of height by type gives us a visual summary of the mean values and quartiles for each of the three types:

heights %>% ggplot(aes(x = type, y = height)) +
  geom_boxplot()

Aliens appear to be quite different from humans with respect to height, but is the difference of the means statistically significant? The mean heights for boys and girls seem to be slightly different, but again we want to know if the difference in the mean height is statistically significant.

Analysis of variance

We use the aov function in R to fit an analysis of variance model. Note that our balanced dataset with equal numbers of observations for each of the three types satisfies the aov assumption of a balanced design.

analysis <- aov(height ~ type, data = heights)
summary(analysis)

The Pr(>F) column is the p-value of the F-statistic. In this case, it shows that it is very unlikely that the F-value calculated from the test would have occurred if the null hypothesis of no difference among group means were true. Therefore, we conclude that the null hypothesis is false and the difference among means by type is statistically significant.

Let's compare the types pairwise to see where the differences lie. The Tukey Honest Significant Differences method creates a set of confidence intervals on the differences between the means of each of the levels of a factor:

multi_comparisons <- TukeyHSD(analysis)
multi_comparisons

The p adj value (p-value after adjustment for the multiple comparisons) is 0.0 for the boy-alien comparison and for the girl-alien comparison, indicating that the difference of means is significant for these pairs. However for the girl-boy pair with the p adj value of r multi_comparisons$type[3,4] greater than 0.05, we cannot conclude that the difference of means between boys and girls is significant.



jimtyhurst/codesamplerr documentation built on Aug. 13, 2021, 8:45 a.m.