$\$
SDS230::download_data("IPED_salaries_2016.rda") SDS230::download_data("x_y_join.rda")
# install.packages("latex2exp") library(latex2exp) library(dplyr) library(ggplot2) #options(scipen=999) knitr::opts_chunk$set(echo = TRUE) set.seed(123)
$\$
$\$
In prior classes we have visualized multiple regression models with categorical
predictors using base R graphics. In particular, we created scatter plots for
data in different categories using the plot()
and points()
and used the
col
argument to color the points. We then added on regression lines using the
abline()
function. This method was useful for educational purposes so that we
could see the connection between the model parameters that were estimated using
the lm()
function, and the underlying data. However, if we want to create
better looking visualizations in a more efficient manner, then it is better to
use ggplot.
Let's now use ggplot to visualize a multiple regression model that has one quantitative and one categorical variable. In particular, let's recreate the plot from class 19 part 3.6 where we display faculty salaries as a function of log endowment separately for different faculty ranks.
# load the data into R load("IPED_salaries_2016.rda") # create the IPED data subset used in class 19 part 3 IPED_2 <- IPED_salaries |> filter(endowment > 0) |> mutate(log_endowment = log10(endowment)) |> filter(CARNEGIE %in% c(15, 31)) |> filter(rank_name %in% c("Assistant", "Associate", "Full")) dim(IPED_2) # using ggplot IPED_2 |> ggplot(aes(log_endowment, salary_tot, col = rank_name)) + geom_point() + geom_smooth(method = "lm") + xlab("Log endowment") + ylab("Salary ($)")
$\$
Let us fit a logistic regression model that gives the probability that a professor is a Full professor (compared to an Assistant Professor) based on the professor's salary; i.e., the model is giving P(Full professor | salary).
Note: In the IPED data, the cases are actually individual universities, and the salaries are averages over all professors of a particular rank, so what the model is really doing is giving P(Full professor rank | average salary) which is a little different than the variability that would be seen at the individual professor level.
Let's start by loading the data, doing a little data cleaning, and plotting the data.
# get assistant and full professor salaries assistant_full_data <- IPED_salaries |> filter(endowment > 0, rank_name %in% c("Assistant", "Full")) |> select(school, salary_tot, endowment, enroll_total, rank_name) |> na.omit() |> mutate(log_endowment = log10(endowment)) |> mutate(rank_name = relevel(rank_name, "Assistant")) # visualize the data assistant_full_data |> ggplot(aes(x = salary_tot, y = rank_name)) + geom_jitter(alpha = .1, position = position_jitter(height = .2)) + xlab("Salary ($)") + ylab("Faculty rank")
$\$
Let's now fit a logistic regression model using the glm()
function. To fit a
logistic regression model, we will use the family = "binomial" argument. What
this function is doing is finding the "maximum likelihood" values for
$\hat{\beta}_0$ and $\hat{\beta}_1$.
We can use the coefficients from the model to build a function that can give the probability of that the faculty rank is Full professor based on the salary value:
$$P(\text{Full professor} | \text{salary}) = \frac{e^{\hat{\beta}_0 + \hat{\beta}_1 \cdot \text{salary}}}{1 + e^{\hat{\beta}_0 + \hat{\beta}_1 \cdot \text{salary}}}$$
# build the logistic regression function lr_fit <- glm(rank_name ~ salary_tot, data = assistant_full_data, family = "binomial") # extract the coefficients b0 <- coefficients(lr_fit)[1] b1 <- coefficients(lr_fit)[2] # create the prediction function get_full_prof_probability <- function(salary) { prob <- (exp(b0 + b1 * salary)) / (1 + exp(b0 + b1 * salary)) names(prob) <- NULL return(prob) } # what is the probability that if a school is paying $80,000 on average to a # particular faculty rank, this rank corresponds to the Full professor rank? # (instead of the Assistant Professor rank)? get_full_prof_probability(80000)
$\$
Let's now plot the fitted logistic regression function to see how the probability that the rank corresponds to Full professor changes as a function of the salary amount.
# to plot this function we add 1 to it plot_full_prof_probability <- function(salary) { get_full_prof_probability(salary) + 1 } # plot the logistic regression function assistant_full_data |> ggplot(aes(x = salary_tot, y = rank_name)) + geom_jitter(alpha = .1, position = position_jitter(height = .2)) + xlab("Salary ($)") + ylab ("Pr( Full prof | salary)") + stat_function(fun = plot_full_prof_probability, color = "red") + xlim(-30000, 200000) + geom_vline(xintercept = 80000, col = "blue") + geom_hline(yintercept = plot_full_prof_probability(80000), col = "blue", linetype="dotted")
$\$
As we discussed the logistic regression model fits a model of the form: $log(odds) = \hat{\beta}_0 + \hat{\beta}_1 \cdot x_1$. Thus if this model is appropriate, there should be a linear relationship between the $x$ and the log odds.
We can examine this by estimating probabilities in evenly spaced regions of the range of the x values and the proportion fo points in each neighborhood. We can then convert these proportions (i.e., estimated probabilities) into log(odds) and see if they are a linear function of the x values. Let's try this now with our simple logistic regression model.
# Breaking the data into salary groups at $10,000 increments and calculating the proportion # of full professors in each group (which is an estimate of the probability in each group). # Also adding the log-odds of the probability in each group bin_ends <- seq(30000, 140000, by = 10000) bin_centers <- bin_ends + 10000/2 assistant_full_data_augmented <- assistant_full_data |> mutate(salary_group = cut(salary_tot, bin_ends, labels = FALSE)) |> group_by(salary_group) |> mutate(prob_rank = mean(rank_name == "Full"), n = n()) |> mutate(log_odds_rank = log(prob_rank/(1 - prob_rank))) |> mutate(bin_center = bin_centers[salary_group]) # Plotting the log-odds as a linear function of the salary in each group to see if # the logistic regression model is a good fit to the data plot(assistant_full_data_augmented$bin_center, assistant_full_data_augmented$log_odds_rank, xlab = "Salary bin center", ylab = "Log-odds full professor") abline(b0, b1, col = "red") abline(v = bin_ends, col = 'green', lty = 3) # Can add a direct linear fit of log-odds as a function of salary to the plot # (not exactly the same because we are not using exactly the same data but very close # (i.e., omitting data, not using full range of data, taking the mean in each group to estimate probs etc.)) lm_fit <- lm(log_odds_rank ~ bin_center, data = na.omit(assistant_full_data_augmented)) abline(lm_fit, col = "blue")
$\$
We can use multiple predictors in our logistic regression function as well. Let's create a function that gives the probability that the rank corresponds to Full professor given a salary amount a log10 endowment size.
# fit the model lr_fit2 <- glm(rank_name ~ salary_tot + log_endowment, data = assistant_full_data, family = "binomial") # extract the regression coefficients b0 <- coefficients(lr_fit2)[1] b1 <- coefficients(lr_fit2)[2] b2 <- coefficients(lr_fit2)[3] # build a function that can predict faculty rank based on salary and endowment get_full_prof_probability2 <- function(salary, log_endowment) { (exp(b0 + (b1 * salary) + (b2 * log_endowment))) / (1 + exp(b0 + (b1 * salary) + (b2 * log_endowment))) } # predict probability Full professor if average salary is $80,000 # and the school has a $10 million endowment ($10^7) get_full_prof_probability2(80000, 7) # predict probability Full professor if average salary is $80,000 # and the school has a $10 billion endowment ($10^10) get_full_prof_probability2(80000, 10)
$\$
Let's now visualize the probability function P(Full professor | salary, log_endowment)!
# create a 2D plot of the probability that a car is new as a function of price and mileage salary_intervals <- seq(10000, 120000, by = 100) log_endowment_intervals <- seq(5, 10, by = .1) # there are more efficient ways to do this using apply or mapping functions salary_endowment_df <- data.frame() for (currSalary in salary_intervals) { curr_df <- data.frame(salary = currSalary, log_endowment = log_endowment_intervals, prob_full_prof = get_full_prof_probability2(currSalary, log_endowment_intervals)) salary_endowment_df <- rbind(salary_endowment_df, curr_df) } # create the plot salary_endowment_df |> ggplot(aes(salary, log_endowment)) + geom_raster(aes(fill = prob_full_prof)) + scale_fill_gradient(low = "black", high = "red") + xlab("Salary ($)") + ylab("Log endowment") + theme_classic()
$\$
In Poisson regression, we try to predict count data; i.e., our predictions are non-negative integers: 0, 1, 2, 3, ...
To explore Poisson regression, let's model how many times the Roy Kent said the word f.ck on each episode of the TV show Ted Lasso. In particular, we will assess whether he said f.ck at a higher rate during episodes when:
a. He was coaching soccer.
b. When he was dating Keeley Jones.
This analysis is based on the work of Deepsha Menghani and Julia Silge.
$\$
The R chunk below loads the data we will need for our analyses. In particular, the data is in a package packaged created by Deepsha Menghani which we will install from GitHub and then load the data into R.
The variables that are of interest for our analysis are:
F_count_RK
: How many times Roy Kent said F.ck in an episode. Coaching_flag
: Whether Roy Kent was coaching in an episode. Dating_flag
: Whether Roy Kent was dating Keeley in an episode. # devtools::install_github("deepshamenghani/richmondway") library(richmondway) data(richmondway)
$\$
Let's start by visualizing the data.
richmondway |> ggplot(aes(F_count_RK, col = Dating_flag)) + geom_density() richmondway |> ggplot(aes(F_count_RK, col = Coaching_flag)) + geom_density()
$\$
Let's now fit a Geralized Linear Model to the data, were we try to predict the number of times Roy Kent say f.ck as a function of whether he was coaching and or dating Keeley. To do this we will use the glm()
function and set the family to be poisson.
glm_fit <- glm(F_count_RK ~ Dating_flag + Coaching_flag, family = poisson, data = richmondway) summary(glm_fit)
Does there appear to be a statistically significant relationship between how many times Roy Kent said f.ck and whether he was coaching and/or dating Keeley?
$\$
We can also use the bootstrap to create confidence intervals for the regression parameters.
boot_dating <- NULL boot_coaching <- NULL for (i in 1:1000) { curr_data <- sample_n(richmondway, nrow(richmondway), replace = TRUE) curr_fit <- glm(F_count_RK ~ Dating_flag + Coaching_flag, family = poisson, data = curr_data) the_coefs <- coef(curr_fit) boot_dating[i] <- the_coefs[2] boot_coaching[i] <- the_coefs[3] } hist(boot_dating, breaks = 100) hist(boot_coaching, breaks = 100) quantile(boot_dating, c(.025, .975)) quantile(boot_coaching, c(.025, .975))
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.