```{r}
SDS230::download_data("IPED_salaries_2016.rda")
# SDS230::download_data("Interactions_Categorical.csv")

# install.packages("latex2exp")
library(latex2exp)
library(dplyr)
library(ggplot2)

# options(scipen = 999)
knitr::opts_chunk$set(echo = TRUE)
set.seed(123)
```
Let's look at connections between the least squares fit we used when fitting linear regression models and our one-way ANOVA.
```{r}
library(dplyr)
library(ggplot2)

load("IPED_salaries_2016.rda")

IPED_2 <- IPED_salaries |>
  filter(endowment > 0) |>
  mutate(log_salary = log10(salary_tot)) |>
  filter(CARNEGIE %in% c(15, 31)) |>
  filter(rank_name %in% c("Assistant", "Associate", "Full")) |>
  group_by(school) |>
  mutate(num_ranks = n()) |>
  filter(num_ranks == 3)  # only use schools that have all three ranks

# could look at the log salary instead...
# IPED_2$salary_tot <- IPED_2$log_salary

dim(IPED_2)

# visualize the data
IPED_2 |>
  ggplot(aes(rank_name, salary_tot, col = rank_name)) +
  geom_jitter(position = position_jitter(width = .2)) +
  xlab("Faculty rank") +
  ylab("Salary ($)")
```
Let's look at the mean salary for each rank and compare it to the least squares offsets that the `lm()` function finds.
```{r}
# fit a linear model
fit_salary <- lm(salary_tot ~ rank_name, data = IPED_2)

# check that the least squares fit offsets are the means of each group
(summary_salary <- summary(fit_salary))

fit_coefs <- coef(fit_salary)
c(fit_coefs[1], fit_coefs[1] + fit_coefs[2], fit_coefs[1] + fit_coefs[3])

# get the mean and sd of the salary for each faculty rank
IPED_stats <- IPED_2 |>
  group_by(rank_name) |>
  summarize(mean_salary = mean(salary_tot),
            sd_salary = sd(salary_tot))

IPED_stats$mean_salary
```
We can use the `anova()` function to create an ANOVA table, and we can use the `plot()` function to look at diagnostic plots to make sure our ANOVA conditions have been met.
```{r}
# an easy way to get the ANOVA table using the anova() function
(anova_table <- anova(fit_salary))

# we can use regression diagnostic plots to assess if ANOVA conditions have been met
par(mfrow = c(2, 2))
plot(fit_salary)

# we should also check that the maximum and minimum standard deviations
# are not more than a factor of 2 apart
IPED_stats$sd_salary
max(IPED_stats$sd_salary) / min(IPED_stats$sd_salary)
```
If we are concerned that our one-way ANOVA conditions are not met, we can run a Kruskal–Wallis test, which does not rely on the assumptions of normality and homoscedasticity. We could also run a permutation test, which likewise avoids these assumptions.
```{r}
# Kruskal–Wallis test
kruskal.test(salary_tot ~ rank_name, data = IPED_2)

# compare to the ANOVA
# anova(fit_salary)
```
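Here is a minimal sketch of one way to carry out the permutation test mentioned above; the choice of the F-statistic as the test statistic and the number of permutations are assumptions here, not part of the original analysis.

```{r}
# observed F-statistic from the one-way ANOVA
obs_F <- anova(fit_salary)[1, "F value"]

# null distribution: F-statistics after shuffling the rank labels
perm_F <- replicate(1000, {
  anova(lm(salary_tot ~ sample(rank_name), data = IPED_2))[1, "F value"]
})

# permutation p-value: proportion of shuffled F-statistics at least as large
mean(perm_F >= obs_F)
```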
If we run a one-way ANOVA and the results are statistically significant, there are a number of tests we can run to see which pairs of groups are significantly different.
```{r}
# test with no multiple comparisons adjustment (not great)
pairwise.t.test(IPED_2$salary_tot, IPED_2$rank_name, p.adjust.method = "none")

# with the Bonferroni correction
# Note, the Bonferroni p-values are 3 times larger than the p-values with no adjustment
pairwise.t.test(IPED_2$salary_tot, IPED_2$rank_name, p.adjust.method = "bonferroni")

# Tukey's HSD test using the TukeyHSD() function
# It gives results similar to the Bonferroni correction
TukeyHSD(aov(salary_tot ~ rank_name, data = IPED_2))
```
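To see where the factor of 3 comes from: with three ranks there are choose(3, 2) = 3 pairwise comparisons, and the Bonferroni correction multiplies each p-value by the number of comparisons (capped at 1). A quick check, using hypothetical unadjusted p-values:

```{r}
choose(3, 2)                             # number of pairwise comparisons among 3 ranks
p_raw <- c(0.010, 0.004, 0.200)          # hypothetical unadjusted p-values
p.adjust(p_raw, method = "bonferroni")   # same as pmin(1, 3 * p_raw)
```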
Let's use a two-way ANOVA to examine whether ants are more attracted to particular types of sandwiches! This is a balanced design because there are the same number of observations at each factor level combination (in this case, 4 observations at each combination of factor levels).
Let's start by visualizing the data (you can practice stating the null and alternative hypotheses on the homework).
```{r}
# install.packages("Stat2Data")
library(Stat2Data)
data(SandwichAnts)

# visualize the data
SandwichAnts |>
  ggplot(aes(Filling, Ants, col = Bread)) +
  geom_jitter(position = position_jitter(width = .1))

# Two-way interaction plot using base R
interaction.plot(SandwichAnts$Filling, SandwichAnts$Bread, SandwichAnts$Ants)
```
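To confirm that the design is balanced, we can tabulate the number of observations at each combination of the two factors; if the claim above holds, each cell should contain 4 observations.

```{r}
# count observations at each Bread/Filling combination
table(SandwichAnts$Bread, SandwichAnts$Filling)
```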
Let's now use a two-way ANOVA to run a hypothesis test to see if the differences are statistically significant.
```{r}
# Create a main effects only model
anova(lm(Ants ~ Bread + Filling, data = SandwichAnts))

# Order doesn't matter for a balanced design
anova(lm(Ants ~ Filling + Bread, data = SandwichAnts))

# Could look at diagnostic plots
# par(mfrow = c(2, 2)); plot(lm(Ants ~ Bread + Filling, data = SandwichAnts))
```
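We could also fit a model with an interaction term to ask whether the effect of the filling depends on the type of bread; this is a sketch of a follow-up analysis, not part of the main effects model above.

```{r}
# main effects plus interaction model
anova(lm(Ants ~ Bread * Filling, data = SandwichAnts))
```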
In unbalanced data, there are different numbers of measured responses at the different combinations of factor levels. When running an ANOVA on unbalanced data, one needs to be careful because there are different ways to calculate the sums of squares for the different factors, and these can lead to different conclusions about which factors are statistically significant. Let's examine this using the IPED faculty salary data.
```{r}
# Factor A: lecturer, assistant, associate, full professor
# Factor B: liberal arts vs. research university
IPED_3 <- IPED_salaries |>
  filter(rank_name %in% c("Lecturer", "Assistant", "Associate", "Full")) |>
  mutate(rank_name = droplevels(rank_name)) |>
  filter(CARNEGIE %in% c(15, 31)) |>
  # na.omit() |>
  mutate(Inst_type = dplyr::recode(CARNEGIE,
                                   "31" = "Liberal arts",
                                   "15" = "Research extensive"))

# examine properties of the data
table(IPED_3$Inst_type, IPED_3$rank_name)
```
In Type I sums of squares, the sums of squares are calculated sequentially: the sum of squares for factor A is computed first, and then the sum of squares for factor B is computed after accounting for factor A. In particular:

- SS(A)
- SS(B | A) = SS(A, B) - SS(A)
- SS(AB | A, B) = SS(A, B, AB) - SS(A, B)
```{r}
# Create a main effects and interaction model (type I sums of squares)
anova(lm(salary_tot ~ rank_name * Inst_type, data = IPED_3))
```
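Because the salary data are unbalanced, entering the factors in the opposite order changes the Type I sums of squares, and potentially the conclusions:

```{r}
# same model with the factors entered in the opposite order
anova(lm(salary_tot ~ Inst_type * rank_name, data = IPED_3))
```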
In Type III sums of squares, the full model SS(A, B, AB) is fit first, and then the sum of squares for each factor is determined by taking the full model SS(A, B, AB) and subtracting off the sum of squares for the model with that factor removed. For example, the Type III sum of squares for factor A is SS(A | B, AB) = SS(A, B, AB) - SS(B, AB).
```{r}
# type III sums of squares: the order that variables are added does not matter
# (here using the car package's Anova() function)
car::Anova(lm(salary_tot ~ rank_name * Inst_type, data = IPED_3), type = 3)
```
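Using `car::Anova()` here is one common way to get Type III tests (an assumption on my part; the original chunk only named Type III sums of squares). Refitting with the factors entered in the opposite order gives the same Type III results:

```{r}
# opposite order: the type III sums of squares are unchanged
car::Anova(lm(salary_tot ~ Inst_type * rank_name, data = IPED_3), type = 3)
```

Note that for Type III tests in models with interactions, it is standard to set sum-to-zero contrasts, e.g. `options(contrasts = c("contr.sum", "contr.poly"))`, before fitting.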