$\$
SDS230::download_data("IPED_salaries_2016.rda")
# install.packages("latex2exp")
library(latex2exp)
library(dplyr)
library(ggplot2)
# options(scipen=999)
knitr::opts_chunk$set(echo = TRUE)
set.seed(123)
$\$
$\$
Let's assess an obvious question: do professors of different ranks have different salaries on average?
$\$
Let's start as always by stating the null and alternative hypotheses:
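For a one-way ANOVA comparing the three ranks kept in the analysis, the hypotheses can be sketched as follows, where $\mu_{Assistant}$, $\mu_{Associate}$, and $\mu_{Full}$ are notation assumed here for the population mean salary at each rank:

$$H_0: \mu_{Assistant} = \mu_{Associate} = \mu_{Full}$$

$$H_A: \mu_i \ne \mu_j \text{ for at least one pair of ranks } i, j$$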
$\$
library(dplyr)
load("IPED_salaries_2016.rda")
IPED_2 <- IPED_salaries |>
  filter(endowment > 0) |>
  mutate(log_salary = log10(salary_tot)) |>
  filter(CARNEGIE %in% c(15, 31)) |>
  filter(rank_name %in% c("Assistant", "Associate", "Full")) |>
  group_by(school) |>
  mutate(num_ranks = n()) |>
  filter(num_ranks == 3)  # only use schools that have all three ranks
# could look at the log salary instead...
dim(IPED_2)
# create a boxplot of the data using ggplot
# let's get summary statistics of the data
# let's create another visualization of the data
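The plotting and summary steps named in the comments above could look roughly like the sketch below. It runs on simulated stand-in data rather than the IPED file: the data frame `sim_salaries`, its group means and spreads, and the object names `salary_boxplot` and `rank_summary` are all assumptions made for illustration. Only the column names `rank_name` and `salary_tot` come from the notes.

```r
library(dplyr)
library(ggplot2)

# Simulated stand-in data (NOT the IPED data): 40 salaries per rank
set.seed(230)
sim_salaries <- data.frame(
  rank_name  = factor(rep(c("Assistant", "Associate", "Full"), each = 40)),
  salary_tot = c(rnorm(40, 70000, 8000),
                 rnorm(40, 82000, 9000),
                 rnorm(40, 105000, 12000))
)

# Side-by-side boxplots of salary by rank
salary_boxplot <- ggplot(sim_salaries, aes(x = rank_name, y = salary_tot)) +
  geom_boxplot() +
  xlab("Faculty rank") +
  ylab("Salary ($)")
salary_boxplot

# Summary statistics for each rank
rank_summary <- sim_salaries |>
  group_by(rank_name) |>
  summarize(n = n(),
            mean_salary = mean(salary_tot),
            sd_salary = sd(salary_tot))
rank_summary
```

With the real data, `IPED_2` would take the place of `sim_salaries`.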
$\$
Our observed statistic is an F-statistic:
$$F = \frac{\frac{1}{K-1}\sum_{i=1}^K n_i(\bar{x}_i - \bar{x}_{tot})^2}{\frac{1}{N-K}\sum_{i=1}^K \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2}$$
We will cheat and use the lm() and anova() functions to get the F-statistic. On the homework you will need to calculate this statistic from the data using dplyr!
# Getting the observed F-statistic for the IPED data using built in R functions
# On the homework be sure to use dplyr to actually calculate this statistic!
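A minimal sketch of pulling the F-statistic out of `lm()` and `anova()`, again on simulated stand-in data (`sim_salaries` and its parameters are assumptions, not the IPED data). As a sanity check, the sketch also evaluates the formula above directly in base R. This is not the dplyr calculation the homework asks for.

```r
# Simulated stand-in data (NOT the IPED data)
set.seed(230)
sim_salaries <- data.frame(
  rank_name  = factor(rep(c("Assistant", "Associate", "Full"), each = 40)),
  salary_tot = c(rnorm(40, 70000, 8000),
                 rnorm(40, 82000, 9000),
                 rnorm(40, 105000, 12000))
)

# F-statistic from the built-in functions
fit <- lm(salary_tot ~ rank_name, data = sim_salaries)
anova_table <- anova(fit)
obs_F <- anova_table$`F value`[1]

# Direct evaluation of the formula, using base R
x <- sim_salaries$salary_tot
g <- sim_salaries$rank_name
n_i    <- tapply(x, g, length)     # group sizes
xbar_i <- tapply(x, g, mean)       # group means
K <- length(n_i)                   # number of groups
N <- length(x)                     # total sample size
SSG <- sum(n_i * (xbar_i - mean(x))^2)   # between-group sum of squares
SSE <- sum((x - ave(x, g))^2)            # within-group sum of squares
F_by_hand <- (SSG / (K - 1)) / (SSE / (N - K))

c(obs_F = obs_F, F_by_hand = F_by_hand)
```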
$\$
Let's visualize the null distribution.
# calculate the degrees of freedom
# visualize the null distribution
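Under the null hypothesis the F-statistic follows an F distribution with $K - 1$ and $N - K$ degrees of freedom. The sketch below plots that density with base R. The values `K = 3` and `N = 120` are hypothetical placeholders; with the real data, `N` would be the number of rows reported by `dim(IPED_2)`.

```r
# Hypothetical sizes: 3 ranks, 120 observations in total
K <- 3
N <- 120

# Degrees of freedom for the numerator and denominator of F
df_groups <- K - 1
df_error  <- N - K

# Density of the F null distribution
curve(df(x, df1 = df_groups, df2 = df_error),
      from = 0, to = 6,
      xlab = "F", ylab = "Density",
      main = "Null distribution of the F-statistic")
```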
$\$
# calculate the p-value
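The p-value is the area under the null F density to the right of the observed statistic, which `pf()` returns when `lower.tail = FALSE`. A minimal sketch on simulated stand-in data (`sim_salaries` is an assumption, not the IPED data):

```r
# Simulated stand-in data (NOT the IPED data)
set.seed(230)
sim_salaries <- data.frame(
  rank_name  = factor(rep(c("Assistant", "Associate", "Full"), each = 40)),
  salary_tot = c(rnorm(40, 70000, 8000),
                 rnorm(40, 82000, 9000),
                 rnorm(40, 105000, 12000))
)

anova_table <- anova(lm(salary_tot ~ rank_name, data = sim_salaries))
obs_F     <- anova_table$`F value`[1]
df_groups <- anova_table$Df[1]   # K - 1
df_error  <- anova_table$Df[2]   # N - K

# Probability of an F-statistic at least as large as the observed one
p_value <- pf(obs_F, df_groups, df_error, lower.tail = FALSE)
p_value
```

The result should agree with the `Pr(>F)` column of the ANOVA table.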
$\$
$\$
Let's look at connections between the least squares fit we used when fitting linear regression models and our one-way ANOVA.
$\$
Let's look at the mean salary for each rank and compare it to the least squares offsets that the lm() function finds.
# get the mean and sd of the salary for each faculty rank
# fit a linear model
# check that the least squares fit offsets are the means of each group
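A sketch of that comparison on simulated stand-in data (`sim_salaries` is an assumption, not the IPED data). With R's default treatment contrasts, the intercept is the mean of the baseline group (the first factor level, here "Assistant") and each remaining coefficient is the offset of that group's mean from the baseline.

```r
# Simulated stand-in data (NOT the IPED data)
set.seed(230)
sim_salaries <- data.frame(
  rank_name  = factor(rep(c("Assistant", "Associate", "Full"), each = 40)),
  salary_tot = c(rnorm(40, 70000, 8000),
                 rnorm(40, 82000, 9000),
                 rnorm(40, 105000, 12000))
)

# Mean and sd of salary for each rank
group_means <- tapply(sim_salaries$salary_tot, sim_salaries$rank_name, mean)
group_sds   <- tapply(sim_salaries$salary_tot, sim_salaries$rank_name, sd)

# Fit a linear model with rank as the only predictor
fit   <- lm(salary_tot ~ rank_name, data = sim_salaries)
coefs <- coef(fit)

# Intercept = baseline mean; intercept + offset = each other group's mean
means_from_lm <- c(coefs[1], coefs[1] + coefs[-1])

rbind(group_means = as.numeric(group_means),
      means_from_lm = as.numeric(means_from_lm))
```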
$\$
We can use the anova() function to create an ANOVA table, and we can use the plot() function to look at diagnostic plots to make sure our ANOVA conditions have been met.
# an easy way to get the ANOVA table using the ANOVA function
# check that SST = SSG + SSE
# SST
# SSG + SSE
# we can use regression diagnostic plots to assess if ANOVA conditions have been met
# we should also check that the maximum and minimum standard deviations are not greater
# than a factor of 2 apart
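The checks listed in these comments might be sketched as follows, using simulated stand-in data (`sim_salaries` is an assumption, not the IPED data):

```r
# Simulated stand-in data (NOT the IPED data)
set.seed(230)
sim_salaries <- data.frame(
  rank_name  = factor(rep(c("Assistant", "Associate", "Full"), each = 40)),
  salary_tot = c(rnorm(40, 70000, 8000),
                 rnorm(40, 82000, 9000),
                 rnorm(40, 105000, 12000))
)

# ANOVA table from the fitted linear model
fit <- lm(salary_tot ~ rank_name, data = sim_salaries)
anova_table <- anova(fit)
anova_table

# Check the sum of squares decomposition SST = SSG + SSE
SSG <- anova_table$`Sum Sq`[1]   # between groups
SSE <- anova_table$`Sum Sq`[2]   # residual (within groups)
SST <- sum((sim_salaries$salary_tot - mean(sim_salaries$salary_tot))^2)
c(SST = SST, SSG_plus_SSE = SSG + SSE)

# Regression diagnostic plots (residuals vs fitted, normal Q-Q, ...)
par(mfrow = c(2, 2))
plot(fit)
par(mfrow = c(1, 1))

# Rule of thumb: largest group sd should be less than twice the smallest
group_sds <- tapply(sim_salaries$salary_tot, sim_salaries$rank_name, sd)
sd_ratio  <- max(group_sds) / min(group_sds)
sd_ratio
```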
$\$
If we are concerned that our one-way ANOVA conditions are not met, we can run a Kruskal–Wallis test, which does not rely on the assumptions of normality and homoscedasticity. We could also run a permutation test, which does not rely on these assumptions either.
# Kruskal–Wallis test
# compare to the ANOVA
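A minimal sketch of running the rank-based test with `kruskal.test()` and setting its p-value next to the ANOVA p-value, on simulated stand-in data (`sim_salaries` is an assumption, not the IPED data):

```r
# Simulated stand-in data (NOT the IPED data)
set.seed(230)
sim_salaries <- data.frame(
  rank_name  = factor(rep(c("Assistant", "Associate", "Full"), each = 40)),
  salary_tot = c(rnorm(40, 70000, 8000),
                 rnorm(40, 82000, 9000),
                 rnorm(40, 105000, 12000))
)

# Kruskal-Wallis rank sum test
kw_test <- kruskal.test(salary_tot ~ rank_name, data = sim_salaries)
kw_test

# One-way ANOVA on the same data, for comparison
anova_p <- anova(lm(salary_tot ~ rank_name, data = sim_salaries))$`Pr(>F)`[1]
c(kruskal_wallis = kw_test$p.value, anova = anova_p)
```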
$\$
If we run a one-way ANOVA and the results are statistically significant, there are a number of tests we can run to see which pairs of groups have significantly different means.
# test with no multiple comparisons adjustment (not great)
# with the Bonferroni correction
# Note, the Bonferroni p-values are 3 times larger than the p-values with no adjustment
# Tukey's HSD test using the TukeyHSD() function
# It is giving results similar to the Bonferroni correction
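These pairwise comparisons might be sketched with `pairwise.t.test()` and `TukeyHSD()` as below, on simulated stand-in data (`sim_salaries` is an assumption, not the IPED data). With three groups there are three pairwise comparisons, so each Bonferroni p-value is the unadjusted one multiplied by 3 and capped at 1.

```r
# Simulated stand-in data (NOT the IPED data)
set.seed(230)
sim_salaries <- data.frame(
  rank_name  = factor(rep(c("Assistant", "Associate", "Full"), each = 40)),
  salary_tot = c(rnorm(40, 70000, 8000),
                 rnorm(40, 82000, 9000),
                 rnorm(40, 105000, 12000))
)

# Pairwise t-tests with no multiple comparisons adjustment (not great)
pairs_none <- pairwise.t.test(sim_salaries$salary_tot, sim_salaries$rank_name,
                              p.adjust.method = "none")
pairs_none

# Pairwise t-tests with the Bonferroni correction
pairs_bonf <- pairwise.t.test(sim_salaries$salary_tot, sim_salaries$rank_name,
                              p.adjust.method = "bonferroni")
pairs_bonf

# Tukey's HSD needs a model fit with aov() and a factor predictor
tukey <- TukeyHSD(aov(salary_tot ~ rank_name, data = sim_salaries))
tukey
```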
$\$