$\$

# install.packages('car')

SDS230::download_data("IPED_salaries_2016.rda")


SDS230::download_image("tukey_ladder.png")
library(latex2exp)
library(dplyr)
library(ggplot2)
library(plotly)


#options(scipen=999)


knitr::opts_chunk$set(echo = TRUE)

set.seed(123)

$\$

Overview

$\$

Part 1: Exploring the faculty salary data

To explore linear regression models, we will examine data from the Integrated Postsecondary Education Data System (IPEDS) which has information about colleges and universities. I have created a subset of this data from 2016 that has information about college salaries, endowment sizes and other variables. Full datasets from several years are available in Microsoft Access format and can be downloaded here.

Below is code that loads the data. You can use the View() function from the console to look at the data. The glimpse() function in dplyr is also useful for getting a sense of what is in a data frame.

# load the data into R
load("IPED_salaries_2016.rda")

$\$

Part 1.1: Filtering and selecting to get only Assistant Professor data

To look at a more homogeneous data set, let's just look at the data for Assistant Professors. To get this subset of the data we can use dplyr's filter() function. Let's filter that data as follows:

To make things simpler, let's have our data frame only have the following variables:

assistant_data <- IPED_salaries |>
  filter(endowment > 0,  rank_name == "Assistant") |>
  select(school, salary_tot, endowment, enroll_total) |>
  na.omit()     # get only the cases that do not have missing data

$\$

Part 1.2a: Visualizing relationships

What could be driving the differences in faculty salaries across universities? Let's use linear models to see if we can find variables that predict differences in the salaries professors make at different schools.

One potential factor that could drive differences in faculty salaries is the size of a schools endowment. Let's begin by visualizing the relationship between endowment size and faculty salaries by using a scatter plot.

plot(assistant_data$endowment, assistant_data$salary_tot)

$\$

Part 1.2b: Visualizing relationships

There seem to be some schools with very large values on the x-axis (i.e., points with high leverage). Let's explore what these points are.

# we can see a list of schools with the smallest and largest endowments 
# using dplyr's arrange() function
head(arrange(assistant_data, endowment))
head(arrange(assistant_data, desc(endowment)))



# We can use ggplot and plotly to explore the data more
p <- assistant_data |>
  ggplot(aes(endowment, salary_tot)) + 
  geom_point(alpha = .5, aes(name = school)) # + 
  #geom_smooth(method = "lm", se = FALSE) # + 
  #scale_x_log10() + 
  #scale_y_log10() 

#ggplotly(p)

$\$

Part 1.3: Transforming the varaibles

As we can see from the above plot, the relationship between the explanatory and response variables does not appear to be very linear (this non-linear relationship is largely driven by schools with very large endowments). One way to deal with this non-linear relationship is to apply a transformation to our data to make the relationship more linear. If this transformation makes the relationship linear, we can go ahead fit a linear model to our data. A visualization based on Tukey's bulging rule can help up decide on which transformation to use to make the relationship linear.

From looking at the plot of our data in part 1.2, we can see it matches the shape in the upper left quadrant of the visualization above. Thus this suggests if we apply a log transformation to x, the relationship might be more linear. Let's try this by taking the $log_{10}$ of the variable endowment which will tell us how faculty salaries change with the order of magnitude of a schools endowment. To create this transformed variable we can use the dplyr mutate() function.

# use the mutate() function to create a variable that has the log10 of the endowment
assistant_data  <- assistant_data  |>
  mutate(log_endowment = log10(endowment)) |>
  arrange(log_endowment)


# plot the relationship between salaries and the log10 of a schools endowment
plot(assistant_data$log_endowment, assistant_data$salary_tot)

The relationship now appears to curve upward a bit which matches the pattern seen in the lower left quadrant of Tukey's bulging rule visualization. We could deal with this by also applying a log transformation of y but we will abstain from doing this for now.

$\$

Part 1.4: Fitting a linear model

Let's now fit a linear model to our data to predict faculty salaries from the log10 of the endowment.

# fit a regression model
lm_fit <- lm(salary_tot ~ log_endowment, data = assistant_data)


# plot the data and add the regression line to the plot
plot(assistant_data$log_endowment, assistant_data$salary_tot)
abline(lm_fit, col = "red")


# look at the coefficients (beta hat's)
coef(lm_fit)

Question: How do we interpret these coefficients?

Answer: For every order of magnitude change in endowment, the order of magnitude of faculty salaries are predicted to go up by r coef(lm_fit)[2].

$\$

Part 2: The standard deviation of residuals

Our theoretical model of the data is: $Y = \beta_0 + \beta_1 x + \epsilon$. In this model our errors $\epsilon$ come from a normal distribution with a mean of 0 and a standard deviation that is denoted $\sigma_{\epsilon}$; i.e., $\epsilon \sim N(0, \sigma_{\epsilon})$. The standard deviation of residual $\hat{\sigma}{\epsilon}$ is an estimate of the standard deviation of errors $\sigma{\epsilon}$. This gives us a sense of how far the points are spread from the regression line.

Let's calculate it in R.

# calculate the residual sum of squares
SSRes <- sum(lm_fit$residuals^2)


# to get an unbiased estimate of the standard deviation of errors
# our denominator is given by the regression degrees of freedom
regression_df <- nrow(assistant_data) - 2


# estimate of the standard deviation of errors
(SDE_hat <- sqrt(SSRes/regression_df))

$\$

Part 3: Hypothesis tests for regression coefficients using parametric methods

Let's use the t-statistic and t-distribution to do parametric inference on whether the regression offset $\beta_0$, and the regression slope $\beta_1$, are different than 0. This means we are testing the hypotheses:

$H_0: \beta_0 = 0$ $H_A: \beta_0 \ne 0$

and

$H_0: \beta_1 = 0$ $H_A: \beta_1 \ne 0$

Part 3.1: Assessing the statistical significance of the fit

We can calculate the $\hat{SE}_{\hat{\beta}_1}$ and a t-statistic using these equations:

$\hat{SE}{\hat{\beta}_1} = \frac{\hat{\sigma}\epsilon}{\sqrt{\sum_{i=1}^n(x_i - \bar{x})^2}}$

$t = \frac{\hat{\beta}1}{\hat{SE}{\hat{\beta}_1}}$

Our t-statistic will then come from a t-distribution with n-2 degrees of freedom (if the assumptions roughly met).

We can use R's built in summary() function to see if the regression coefficients are statistically significant.

# Assessing if the regression coefficients are statistically significant
summary(lm_fit)

$\$

Part 4: Calculating confidence intervals for regression coefficients

Part 4.1: t-distribution based confidence intervals on the regression intercept and slope

We can create confidence intervals for the regression coefficients $\beta_0$ and $\beta_1$ using a t-distribution via the confint() function.

confint(lm_fit, level = .95)

Interpretation?

The true intercept $\beta_0$ and slope $\beta_1$ will be captured by intervals like these most of the time.

$\$

Part 4.2: Using the bootstrap to plot confidence intervals for the regression line

We can also use resampling methods (i.e., the bootstrap) to get confidence intervals.

Rather than create confidence intervals here, let's instead use the bootstrap to plot several bootstrap regression lines which will give us a visualization of a range of possible regression lines that could reflect the real underlying relationship.

To understand how variable our estimate of the regression line is, we can sample rows of a data frame using dplyr's sample_n() function. We will use this function (setting the replace = TRUE argument) and fit regression lines for 100 different bootstrap samples. By plotting all these regression lines, we can get a sense of the variability of our estimate of the regression line, and hence an understanding of the precision of approximately where the true regression line lies.

# let's start by plotting the data
plot(assistant_data$log_endowment, 
     assistant_data$salary_tot,
     xlab = "Log endowment size (log($))", 
     ylab = "Salary ($)", 
     main = "Relationship between log endowment and salary")


# get the number of cases
n_cases <- dim(assistant_data)[1]

# use a loop to plot 100 regression lines that are consistent 
for (i in 1:100) {

  # get a bootstrap sample the same size as the original sample
  boot_sample <- sample_n(assistant_data, size = n_cases, replace = TRUE)

  # fit the regression model on the bootstrap sample and plot it
  boot_fit <- lm(salary_tot ~ log_endowment, data = boot_sample)

  abline(boot_fit, col = "red")

}

$\$

Part 4.3: Plotting confidence intervals and prediction intervals for the regression line

We can also use parametric methods to plot confidence and prediction intervals for the regression line.

# confidence intervals for the betas
(CI_betas <- confint(lm_fit))


# confidence interval for the regression line mu_y
CI_regression_line <- predict(lm_fit, interval="confidence", level = 0.95)


# prediction interval for the regression line
prediction_interval <- predict(lm_fit, interval="predict")


# # plot both confidence interval and the prediction interval
# let's start by plotting the data
plot(assistant_data$log_endowment, 
     assistant_data$salary_tot,
     xlab = "Log endowment size (log($))", 
     ylab = "Salary ($)", 
     main = "Relationship between log endowment and salary")


# plot confidence interval
points(assistant_data$log_endowment, CI_regression_line[, 1], col = "red", type = "l")
points(assistant_data$log_endowment, CI_regression_line[, 2], col = "green", type = "l")
points(assistant_data$log_endowment, CI_regression_line[, 3], col = "green", type = "l")


# plot prediction interval
points(assistant_data$log_endowment, prediction_interval[, 2], col = "blue", type = "l")
points(assistant_data$log_endowment, prediction_interval[, 3], col = "blue", type = "l")

$\$



emeyers/SDS230 documentation built on Jan. 18, 2024, 1:01 a.m.