$\$

SDS230::download_data("IPED_salaries_2016.rda")

SDS230::download_data("players_born_1901_1950.Rda")

SDS230::download_data("x_y_join.rda")
# install.packages("latex2exp")

library(latex2exp)
library(dplyr)
library(ggplot2)

#options(scipen=999)


knitr::opts_chunk$set(echo = TRUE)

set.seed(123)

$\$

Overview

$\$

Part 1: Logistic regression review

Let us fit a logistic regression model that gives the probability that a professor is a Full professor (compared to an Assistant Professor) based on the professor's salary; i.e., the model is giving P(Full professor | salary).

Note: In the IPED data, the cases are actually individual universities, and the salaries are averages over all professors of a particular rank, so what the model is really doing is giving P(Full professor rank | average salary) which is a little different than the variability that would be seen at the individual professor level.

Let's start by loading the data, doing a little data cleaning, and plotting the data.

# load the data into R
load("IPED_salaries_2016.rda")

# get assistant and full professor salaries 
assistant_full_data <- IPED_salaries |>
  filter(endowment > 0, rank_name %in% c("Assistant", "Full")) |>
  select(school, salary_tot, endowment, enroll_total, rank_name) |>
  na.omit() |>
  mutate(log_endowment = log10(endowment)) |>
  mutate(rank_name = relevel(rank_name, "Assistant"))


# visualize the data
assistant_full_data  |> 
  ggplot(aes(x = salary_tot, y = rank_name)) + 
  geom_jitter(alpha = .1, position = position_jitter(height = .2)) + 
  xlab("Salary ($)") +
  ylab("Faculty rank")

$\$

Part 1.1: Fitting a logistic model to the data

Let's now fit a logistic regression model using the glm() function. To fit a logistic regression model, we will use the family = "binomial" argument. This function finds the "maximum likelihood" values for $\hat{\beta}_0$ and $\hat{\beta}_1$.

We can use the coefficients from the model to build a function that gives the probability that the faculty rank is Full professor based on the salary value:

$$P(\text{Full professor} | \text{salary}) = \frac{e^{\hat{\beta}_0 + \hat{\beta}_1 \cdot \text{salary}}}{1 + e^{\hat{\beta}_0 + \hat{\beta}_1 \cdot \text{salary}}}$$

# build the logistic regression function 
lr_fit <- glm(rank_name ~ salary_tot, data = assistant_full_data, family = "binomial")


# extract the coefficients
b0 <- coefficients(lr_fit)[1]
b1 <- coefficients(lr_fit)[2]


# create the prediction function 
get_full_prof_probability <- function(salary) {
  prob  <- (exp(b0 + b1 * salary)) / (1 + exp(b0 + b1 * salary))
  names(prob) <- NULL
  return(prob)
}


# what is the probability that if a school is paying $80,000 on average to a 
# particular faculty rank, this rank corresponds to the Full professor rank?
# (instead of the Assistant Professor rank)? 
get_full_prof_probability(80000)

$\$

Part 1.2: Plotting the logistic regression function

Let's now plot the fitted logistic regression function to see how the probability that the rank corresponds to Full professor changes as a function of the salary amount.

# since the two ranks are plotted at heights y = 1 and y = 2, we add 1 to the
# probability function so that it overlays the jittered points
plot_full_prof_probability <- function(salary) {
  get_full_prof_probability(salary) + 1
}


# plot the logistic regression function  
assistant_full_data  |> 
  ggplot(aes(x = salary_tot, y = rank_name)) + 
  geom_jitter(alpha = .1, position = position_jitter(height = .2)) + 
  xlab("Salary ($)")  + 
  ylab ("Pr( Full prof | salary)") + 
  stat_function(fun = plot_full_prof_probability, color = "red") +
  xlim(-30000, 200000)  + 
  geom_vline(xintercept = 80000, col = "blue") + 
  geom_hline(yintercept = plot_full_prof_probability(80000), col = "blue", linetype="dotted") 

$\$

Part 1.3: Checking the model fit

As we discussed, the logistic regression model fits a model of the form $\log(\text{odds}) = \hat{\beta}_0 + \hat{\beta}_1 \cdot x_1$. Thus, if this model is appropriate, there should be a linear relationship between $x$ and the log odds.

We can examine this by splitting the range of x values into evenly spaced bins and estimating the probability in each bin as the proportion of points in that bin (here, the proportion of Full professors in each salary bin). We can then convert these proportions (i.e., estimated probabilities) into log odds and see if they are a linear function of the x values. Let's try this now with our simple logistic regression model.

# Breaking the data into salary groups at $10,000 increments and calculating the proportion
#  of full professors in each group (which is an estimate of the probability in each group).
#  Also adding the log-odds of the probability in each group

bin_ends <- seq(30000, 140000, by = 10000) 
bin_centers <- bin_ends + 10000/2

assistant_full_data_augmented <- assistant_full_data |>
  mutate(salary_group = cut(salary_tot, bin_ends, labels = FALSE)) |>
  group_by(salary_group) |>
  mutate(prob_rank = mean(rank_name == "Full"), n = n()) |>
  mutate(log_odds_rank = log(prob_rank/(1 - prob_rank))) |>
  mutate(bin_center = bin_centers[salary_group])



# Plotting the log-odds as a linear function of the salary in each group to see if
# the logistic regression model is a good fit to the data
plot(assistant_full_data_augmented$bin_center, 
     assistant_full_data_augmented$log_odds_rank,
     xlab = "Salary bin center",
     ylab = "Log-odds full professor")
abline(b0, b1, col = "red")
abline(v = bin_ends, col = 'green', lty = 3)


# We can also add a direct linear fit of the log-odds as a function of salary to the plot.
# This line is not exactly the same as the logistic regression line because it is not fit
# to exactly the same data (some data was omitted, the full range of salaries was not used,
# and the probabilities were estimated by taking the mean in each bin), but it is very close.
lm_fit <- lm(log_odds_rank ~ bin_center, data = na.omit(assistant_full_data_augmented))
abline(lm_fit, col = "blue")

$\$

Part 2: Multiple logistic regression

In multiple logistic regression we have several explanatory variables to predict the probability that an outcome is in one of two categories. It is exactly analogous to how we extended simple linear regression to multiple linear regression, except our response variable is a categorical variable that has two levels.

$\$

Part 2.1: Fitting multiple logistic regression models

We can use multiple predictors in our logistic regression function as well. Let's create a function that gives the probability that the rank corresponds to Full professor given a salary amount and a log10 endowment size.

# fit the model



# extract the regression coefficients





# build a function that can predict faculty rank based on salary and endowment






# predict probability Full professor if average salary is $80,000 
# and the school has a $10 million endowment ($10^7)



# predict probability Full professor if average salary is $80,000 
# and the school has a $10 billion endowment ($10^10)
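
One way these steps could be filled in (a sketch; the names lr_fit_2, b0_2, and get_full_prof_probability_2 are just illustrative):

# fit a logistic regression with both salary and log endowment as predictors
lr_fit_2 <- glm(rank_name ~ salary_tot + log_endowment,
                data = assistant_full_data, family = "binomial")

# extract the regression coefficients
b0_2 <- coefficients(lr_fit_2)[1]
b1_2 <- coefficients(lr_fit_2)[2]
b2_2 <- coefficients(lr_fit_2)[3]

# function giving P(Full professor | salary, log_endowment)
get_full_prof_probability_2 <- function(salary, log_endowment) {
  linear_combo <- b0_2 + b1_2 * salary + b2_2 * log_endowment
  prob <- exp(linear_combo) / (1 + exp(linear_combo))
  names(prob) <- NULL
  prob
}

# probability of Full professor rank: $80,000 salary, $10 million endowment (log10 = 7)
get_full_prof_probability_2(80000, 7)

# probability of Full professor rank: $80,000 salary, $10 billion endowment (log10 = 10)
get_full_prof_probability_2(80000, 10)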

$\$

Part 2.2: Visualizing the multiple logistic regression function

Let's now visualize the probability function P(Full professor | salary, log_endowment)!

# create a 2D plot of the probability of Full professor rank as a function of
# salary and log10 endowment
salary_intervals <- seq(10000, 120000, by = 100)
log_endowment_intervals <- seq(5, 10, by = .1)



# there are more efficient ways to do this using apply or mapping functions









# create the plot

$\$

Part 3: Poisson regression

In Poisson regression, we try to predict count data; i.e., our predictions are non-negative integers: 0, 1, 2, 3, ...

To explore Poisson regression, let's model how many times Roy Kent said the word f.ck in each episode of the TV show Ted Lasso. In particular, we will assess whether he said f.ck at a higher rate during episodes when:

  a. he was coaching soccer, and
  b. he was dating Keeley Jones.

This analysis is based on the work of Deepsha Menghani and Julia Silge.

$\$

Part 3.1: Loading the data

The R chunk below loads the data we will need for our analyses. In particular, the data is in a package created by Deepsha Menghani, which we will install from GitHub and then load into R.

The variables that are of interest for our analysis are:

  1. F_count_RK: How many times Roy Kent said F.ck in an episode.
  2. Coaching_flag: Whether Roy Kent was coaching in an episode.
  3. Dating_flag: Whether Roy Kent was dating Keeley in an episode.

# devtools::install_github("deepshamenghani/richmondway")

library(richmondway)

data(richmondway)

$\$

Part 3.2: Visualizing the data

Let's start by visualizing the data.
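One possible set of plots (a sketch; it assumes the flag variables are coded as Yes/No values):

# distribution of f.ck counts per episode
hist(richmondway$F_count_RK, breaks = 15,
     xlab = "F-bombs per episode", main = "")

# counts broken down by coaching and dating status
richmondway |>
  ggplot(aes(x = Coaching_flag, y = F_count_RK, fill = Dating_flag)) +
  geom_boxplot() +
  xlab("Coaching") +
  ylab("F-bombs per episode")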


$\$

Part 3.3: Building a GLM

Let's now fit a Generalized Linear Model to the data, where we try to predict the number of times Roy Kent says f.ck as a function of whether he was coaching and/or dating Keeley. To do this we will use the glm() function and set the family to be poisson.
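
A sketch of the fit (the name fck_fit is just illustrative):

# fit a Poisson regression predicting the counts from the two flags
fck_fit <- glm(F_count_RK ~ Coaching_flag + Dating_flag,
               data = richmondway, family = poisson)

summary(fck_fit)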


Does there appear to be a statistically significant relationship between how many times Roy Kent said f.ck and whether he was coaching and/or dating Keeley?

$\$

Part 3.4: Bootstrap confidence intervals for regression parameters

We can also use the bootstrap to create confidence intervals for the regression parameters.
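
A sketch using a case-resampling bootstrap: resample the episodes with replacement, refit the model on each resample, and take percentile intervals of the coefficients.

n_boot <- 1000
boot_coefs <- matrix(NA, nrow = n_boot, ncol = 3)

for (i in 1:n_boot) {
  # resample episodes with replacement and refit the model
  boot_data <- richmondway[sample(nrow(richmondway), replace = TRUE), ]
  boot_fit <- glm(F_count_RK ~ Coaching_flag + Dating_flag,
                  data = boot_data, family = poisson)
  boot_coefs[i, ] <- coefficients(boot_fit)
}

# 95% percentile confidence intervals for each regression coefficient
colnames(boot_coefs) <- names(coefficients(boot_fit))
apply(boot_coefs, 2, quantile, probs = c(.025, .975))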


$\$

Part 4: Data cleaning and wrangling with the tidyverse

Let's look at some additional tidyverse packages that are useful for data cleaning and wrangling. In particular, we can look at:

  1. lubridate for cleaning/processing dates
  2. stringr for manipulating strings
  3. tidyr for reshaping data

We might not get through all these functions/packages today but I wanted to make you aware of them so you can learn more on your own, and we can potentially revisit some of these packages during the last week of class.

$\$

Part 4.1: lubridate for processing dates

Let's try some basic examples by finding the month in which major league baseball players played their final game (using players who were born from 1901 to 1950).

library(lubridate)

# load the baseball data
load("players_born_1901_1950.Rda")


# get the final game each player played, which is stored as a string




# convert to a date and extract the month




# plot the months as a barplot
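
A sketch of these steps. It assumes the load() call above creates a data frame named players with a character column finalGame holding dates like "1948-09-26"; both names are assumptions about the file's contents:

# the final game each player played, stored as a string
# (`players` and `finalGame` are assumed names)
final_game_strings <- players$finalGame

# convert the strings to dates and extract the month
final_game_months <- month(ymd(final_game_strings), label = TRUE)

# plot how many players ended their careers in each month
barplot(table(final_game_months),
        xlab = "Month of final game", ylab = "Number of players")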

$\$

Part 4.2: stringr for manipulating strings

The stringr package has many useful functions for manipulating strings. Let's just focus on the str_replace_all(original_string, "old", "new") function which will replace all instances of the "old" string with the "new" string.

library(stringr)


# replace an occurrence of a substring
#str_replace("String", "old", "new")











# Example: Let's download an article from the internet

#base_name <- "https://www.foxbusiness.com/economy/millions-americans-have-received-pay-cut-since-biden-took-office"

base_name <- "https://www.nytimes.com/2023/11/10/us/politics/biden-xi-meeting.html?searchResultPosition=7"

article_name <- "politics.html"
download.file(base_name, article_name)

# viewer <- getOption("viewer")
# viewer(article_name)


# read the whole article as a single string
the_article <- readChar(article_name, file.info(article_name)$size)





# replace all occurrences of a string




# write the new article
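
A sketch of these last two steps; the particular substitution is only a guess based on the output file name "sleepy_article.html" below:

# replace all occurrences of "Biden" in the article text (assumed substitution)
sleepy_article <- str_replace_all(the_article, "Biden", "Sleepy Joe Biden")

# write the modified article out as a new html file
writeLines(sleepy_article, "sleepy_article.html")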




#viewer("sleepy_article.html")

$\$

Part 4.3a: tidyr for pivoting data longer

Let's see if we can compare men's and women's salaries on the same plot using ggplot by first pivoting our data into a longer format.

library(tidyr)


# get salaries for men and women
men_women <- IPED_salaries |>
  filter(rank_name == "Full") |>
  select(school, endowment, salary_men, salary_women) |>
  na.omit()


# how can we plot men's and women's salaries on the same plot using ggplot? 

# let's pivot the data longer




# visualize as a boxplot





# visualize as a density plot
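
One way to fill in these steps (a sketch; the name men_women_long is just illustrative):

# pivot so each salary is its own row, with a column indicating men/women
men_women_long <- men_women |>
  pivot_longer(c(salary_men, salary_women),
               names_to = "sex", values_to = "salary",
               names_prefix = "salary_")

# boxplots of the two salary distributions
men_women_long |>
  ggplot(aes(x = sex, y = salary)) +
  geom_boxplot()

# overlaid density plots of the two salary distributions
men_women_long |>
  ggplot(aes(x = salary, color = sex)) +
  geom_density()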

Does it appear that men and women are being paid differently?

$\$

Part 4.3b: tidyr for pivoting data wider

Let's pivot back wider to see if we can come up with more informative plots using ggplot.

# pivot the data back wider and add a column for the salary difference




# visualize as a boxplot






# visualize as a density
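
A sketch (this uses the men_women_long data frame from the Part 4.3a sketch):

# pivot back wider and compute the within-school salary difference
men_women_wide <- men_women_long |>
  pivot_wider(names_from = sex, values_from = salary) |>
  mutate(salary_diff = men - women)

# boxplot of the within-school differences
men_women_wide |>
  ggplot(aes(y = salary_diff)) +
  geom_boxplot() +
  ylab("Salary difference (men - women)")

# density plot of the within-school differences
men_women_wide |>
  ggplot(aes(x = salary_diff)) +
  geom_density() +
  xlab("Salary difference (men - women)")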

Does it appear that men and women are being paid differently?


