$\$

SDS230::download_data("IPED_salaries_2016.rda")
# install.packages("latex2exp")

library(latex2exp)
library(dplyr)
library(ggplot2)

#options(scipen=999)


knitr::opts_chunk$set(echo = TRUE)

set.seed(123)

$\$

Overview

$\$

Part 1: One-way ANOVA to analyze if the mean salary differs for different faculty ranks

Let's assess an obvious question: do professors of different ranks have different salaries on average?

$\$

Part 1.1: State the null and alternative hypotheses

Let's start as always by stating the null and alternative hypotheses:

$\$

Part 1.2a: Plot the data

library(dplyr)

load("IPED_salaries_2016.rda")

IPED_2 <- IPED_salaries |>
  filter(endowment > 0) |>
  mutate(log_salary = log10(salary_tot)) |>
  filter(CARNEGIE %in% c(15, 31)) |>
  filter(rank_name %in% c("Assistant", "Associate", "Full")) |>
  group_by(school) |>
  mutate(num_ranks = n()) |>
  filter(num_ranks == 3)     # only use schools that have all three ranks

# could look at the log salary instead...

dim(IPED_2)



# create a boxplot of the data using ggplot
# let's get summary statistics of the data







# let's create another visualization of the data

$\$

Part 1.2b: Calculate the observed statistic

Our observed statistic is an F-statistic:

$$F = \frac{\frac{1}{K-1}\sum_{i=1}^K n_i(\bar{x}i - \bar{x}{tot})^2}{\frac{1}{N-K}\sum_{i=1}^K \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2}$$

We will cheat and use the lm() and anova() functions to get the F-statistic. On the homework you will need to calculate this statistic from the data using dplyr!

# Getting the observed F-statistic for the IPED data using built in R functions
# On the homework be sure to use dplyr to actually calculate this statistic!

$\$

Step 3: Create the null distribution

Let's visualize the null distribution.

# calculate the degrees of freedom







# visualize the null distribution 

$\$

Part 1.4: Calculate a p-value

# calculate the p-value

$\$

Step 1.5: Make a decision

$\$

Part 2: Connections between ANOVA and regression

Let's look at connections between the least squares fit we used when fitting linear regression models and our one-way ANOVA.

$\$

Step 2.1: Least squares offsets are the group means

Let's look at the mean salary for each rank and compare it to the least squares offsets that the lm() function finds.

# get the mean and sd of the salary for each faculty rank





# fit a linear model 




# check that the least squares fit offsets are the means of each group

$\$

Step 2.2: Using R to get an ANOVA table and to look at diagnostic plots

We can use the anova() function to create an ANOVA table, and we can use the plot() function to look at diagnostic plots to make sure our ANOVA conditions have been met.

# an easy way to get the ANOVA table using the ANOVA function 


# check that SST = SSG + SSE

         #  SST

         #  SSG + SSE



# we can use regression diagnostic plots to assess if ANOVA conditions have been met



# we should also check that the maximum and minimum standard deviations are not greater 
# than a factor of 2 apart

$\$

Part 2.3 Kruskal–Wallis test to see if any of our groups stochastically dominate another group.

If we are concerned that our one-way ANOVA conditions are not met, we can run a Kruskal–Wallis test which does not rely on the assumptions of normality and homoscedasticity. We could also run a permutation test which does not rely on these assumptions either.

# Kruskal–Wallis test





# compare to the ANOVA 

$\$

Part 3: Pairwise comparisons after running a one-way ANOVA

Part 3.1 Pairwise comparisons in R

If we run a one-way ANOVA and the results are statistically significant, there are a number of tests we can run to see which pairs of results are significantly different.

# test with no multiple comparisons adjustment (not great)



# with the Bonferroni correction



# Note, the Bonferroni p-values are 3 times larger than the p-values with no adjustment



# Tukey's HSD test using the TukeyHSD() function 
# It is giving results similar to the Bonferroni correction

$\$



emeyers/SDS230 documentation built on Jan. 18, 2024, 1:01 a.m.