$\$
SDS230::download_data("IPED_salaries_2016.rda") SDS230::download_data("players_born_1901_1950.Rda") SDS230::download_data("x_y_join.rda")
# install.packages("latex2exp") library(latex2exp) library(dplyr) library(ggplot2) #options(scipen=999) knitr::opts_chunk$set(echo = TRUE) set.seed(123)
$\$
$\$
Let us fit a logistic regression model that gives the probability that a professor is a Full professor (compared to an Assistant Professor) based on the professor's salary; i.e., the model is giving P(Full professor | salary).
Note: In the IPED data, the cases are actually individual universities, and the salaries are averages over all professors of a particular rank, so what the model is really doing is giving P(Full professor rank | average salary) which is a little different than the variability that would be seen at the individual professor level.
Let's start by loading the data, doing a little data cleaning, and plotting the data.
# load the data into R load("IPED_salaries_2016.rda") # get assistant and full professor salaries assistant_full_data <- IPED_salaries |> filter(endowment > 0, rank_name %in% c("Assistant", "Full")) |> select(school, salary_tot, endowment, enroll_total, rank_name) |> na.omit() |> mutate(log_endowment = log10(endowment)) |> mutate(rank_name = relevel(rank_name, "Assistant")) # visualize the data assistant_full_data |> ggplot(aes(x = salary_tot, y = rank_name)) + geom_jitter(alpha = .1, position = position_jitter(height = .2)) + xlab("Salary ($)") + ylab("Faculty rank")
$\$
Let's now fit a logistic regression model using the glm()
function. To fit a
logistic regression model, we will use the family = "binomial" argument. What
this function is doing is finding the "maximum likelihood" values for
$\hat{\beta}_0$ and $\hat{\beta}_1$.
We can use the coefficients from the model to build a function that can give the probability of that the faculty rank is Full professor based on the salary value:
$$P(\text{Full professor} | \text{salary}) = \frac{e^{\hat{\beta}_0 + \hat{\beta}_1 \cdot \text{salary}}}{1 + e^{\hat{\beta}_0 + \hat{\beta}_1 \cdot \text{salary}}}$$
# build the logistic regression function lr_fit <- glm(rank_name ~ salary_tot, data = assistant_full_data, family = "binomial") # extract the coefficients b0 <- coefficients(lr_fit)[1] b1 <- coefficients(lr_fit)[2] # create the prediction function get_full_prof_probability <- function(salary) { prob <- (exp(b0 + b1 * salary)) / (1 + exp(b0 + b1 * salary)) names(prob) <- NULL return(prob) } # what is the probability that if a school is paying $80,000 on average to a # particular faculty rank, this rank corresponds to the Full professor rank? # (instead of the Assistant Professor rank)? get_full_prof_probability(80000)
$\$
Let's now plot the fitted logistic regression function to see how the probability that the rank corresponds to Full professor changes as a function of the salary amount.
# to plot this function we add 1 to it plot_full_prof_probability <- function(salary) { get_full_prof_probability(salary) + 1 } # plot the logistic regression function assistant_full_data |> ggplot(aes(x = salary_tot, y = rank_name)) + geom_jitter(alpha = .1, position = position_jitter(height = .2)) + xlab("Salary ($)") + ylab ("Pr( Full prof | salary)") + stat_function(fun = plot_full_prof_probability, color = "red") + xlim(-30000, 200000) + geom_vline(xintercept = 80000, col = "blue") + geom_hline(yintercept = plot_full_prof_probability(80000), col = "blue", linetype="dotted")
$\$
As we discussed the logistic regression model fits a model of the form: $log(odds) = \hat{\beta}_0 + \hat{\beta}_1 \cdot x_1$. Thus if this model is appropriate, there should be a linear relationship between the $x$ and the log odds.
We can examine this by estimating probabilities in evenly spaced regions of the range of the x values and the proportion fo points in each neighborhood. We can then convert these proportions (i.e., estimated probabilities) into log(odds) and see if they are a linear function of the x values. Let's try this now with our simple logistic regression model.
# Breaking the data into salary groups at $10,000 increments and calculating the proportion # of full professors in each group (which is an estimate of the probability in each group). # Also adding the log-odds of the probability in each group bin_ends <- seq(30000, 140000, by = 10000) bin_centers <- bin_ends + 10000/2 assistant_full_data_augmented <- assistant_full_data |> mutate(salary_group = cut(salary_tot, bin_ends, labels = FALSE)) |> group_by(salary_group) |> mutate(prob_rank = mean(rank_name == "Full"), n = n()) |> mutate(log_odds_rank = log(prob_rank/(1 - prob_rank))) |> mutate(bin_center = bin_centers[salary_group]) # Plotting the log-odds as a linear function of the salary in each group to see if # the logistic regression model is a good fit to the data plot(assistant_full_data_augmented$bin_center, assistant_full_data_augmented$log_odds_rank, xlab = "Salary bin center", ylab = "Log-odds full professor") abline(b0, b1, col = "red") abline(v = bin_ends, col = 'green', lty = 3) # Can add a direct linear fit of log-odds as a function of salary to the plot # (not exactly the same because we are not using exactly the same data but very close # (i.e., omitting data, not using full range of data, taking the mean in each group to estimate probs etc.)) lm_fit <- lm(log_odds_rank ~ bin_center, data = na.omit(assistant_full_data_augmented)) abline(lm_fit, col = "blue")
$\$
In multiple logistic regression we have several explanatory variables to predict the probability that an outcome is in one of two categories. It is exactly analogous to how we extended simple linear regression to multiple linear regression, except out our response variable is a category variable that has two levels.
$\$
We can use multiple predictors in our logistic regression function as well. Let's create a function that gives the probability that the rank corresponds to Full professor given a salary amount a log10 endowment size.
# fit the model # extract the regression coefficients # build a function that can predict faculty rank based on salary and endowment # predict probability Full professor if average salary is $80,000 # and the school has a $10 million endowment ($10^7) # predict probability Full professor if average salary is $80,000 # and the school has a $10 billion endowment ($10^10)
$\$
Let's now visualize the probability function P(Full professor | salary, log_endowment)!
# create a 2D plot of the probability that a car is new as a function of price and mileage salary_intervals <- seq(10000, 120000, by = 100) log_endowment_intervals <- seq(5, 10, by = .1) # there are more efficient ways to do this using apply or mapping functions # create the plot
$\$
In Poisson regression, we try to predict count data; i.e., our predictions are non-negative integers: 0, 1, 2, 3, ...
To explore Poisson regression, let's model how many times the Roy Kent said the word f.ck on each episode of the TV show Ted Lasso. In particular, we will assess whether he said f.ck at a higher rate during episodes when:
a. He was coaching soccer.
b. When he was dating Keeley Jones.
This analysis is based on the work of Deepsha Menghani and Julia Silge.
$\$
The R chunk below loads the data we will need for our analyses. In particular, the data is in a package packaged created by Deepsha Menghani which we will install from GitHub and then load the data into R.
The variables that are of interest for our analysis are:
F_count_RK
: How many times Roy Kent said F.ck in an episode. Coaching_flag
: Whether Roy Kent was coaching in an episode. Dating_flag
: Whether Roy Kent was dating Keeley in an episode. # devtools::install_github("deepshamenghani/richmondway") library(richmondway) data(richmondway)
$\$
Let's start by visualizing the data.
$\$
Let's now fit a Geralized Linear Model to the data, were we try to predict the number of times Roy Kent say f.ck as a function of whether he was coaching and or dating Keeley. To do this we will use the glm()
function and set the family to be poisson.
Does there appear to be a statistically significant relationship between how many times Roy Kent said f.ck and whether he was coaching and/or dating Keeley?
$\$
We can also use the bootstrap to create confidence intervals for the regression parameters.
$\$
Let's look at some additional tidyverse packages that are useful for data cleaning and wrangling. In particular we can look at:
lubridate
for cleaning/processing datesstringr
for manipulating stringstidyr
for reshaping dataWe might not get through all these functions/packages today but I wanted to make you aware of them so you can learn more on your own, and we can potentially revisit some of these packages during the last week of class.
$\$
Let's try some basic examples by finding the month that major league baseball players played their final game (using players who played from 1901 to 1950).
library(lubridate) # load the baseball data load("players_born_1901_1950.Rda") # get the final game players played which is a string # convert to a date and extract the month # plot the months as a barplot
$\$
The stringr
package has many useful functions for manipulating strings. Let's just focus on the str_replace_all(original_string, "old", "new")
function which will replace all instances of the "old" string with the "new" string.
library(stringr) # replace an occurrence of a substring #str_replace("String", "old", "new") # Example: Let's download an article from the internet #base_name <- "https://www.foxbusiness.com/economy/millions-americans-have-received-pay-cut-since-biden-took-office" base_name <- "https://www.nytimes.com/2023/11/10/us/politics/biden-xi-meeting.html?searchResultPosition=7" article_name <- "politics.html" download.file(base_name, article_name) # viewer <- getOption("viewer") # viewer(article_name) # read the whole article as a single string the_article <- readChar(article_name, file.info(article_name)$size) # replace all occurrences of a string # write the new article #viewer("sleepy_article.html")
$\$
Let's see if we can compare men and women salary on the same plot using ggplot by first pivoting our longer.
library(tidyr) # get salaries for men and women men_women <- IPED_salaries |> filter(rank_name == "Full") |> select(school, endowment, salary_men, salary_women) |> na.omit() # how can plot men and women salaries on the same plot using ggplot? # let's pivot the data longer # visualize as a boxplot # visualize as a density plot
Does it appear that men and women are being paid differently?
$\$
Let's pivot back wider to see if we can come up with more informative plots using ggplot.
# create the data longer again and mutate on salary difference # visualize as a boxplot # visualize as a density
Does it appear that men and women are being paid differently?
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.