$\$
# install.packages('car') SDS230::download_data("IPED_salaries_2016.rda") SDS230::download_image("tukey_ladder.png")
library(latex2exp) library(dplyr) library(ggplot2) library(plotly) #options(scipen=999) knitr::opts_chunk$set(echo = TRUE) set.seed(123)
$\$
$\$
To explore linear regression models, we will examine data from the Integrated Postsecondary Education Data System (IPEDS) which has information about colleges and universities. I have created a subset of this data from 2016 that has information about college salaries, endowment sizes and other variables. Full datasets from several years are available in Microsoft Access format and can be downloaded here.
Below is code that loads the data. You can use the View()
function from the
console to look at the data. The glimpse()
function in dplyr is also useful
for getting a sense of what is in a data frame.
# load the data into R load("IPED_salaries_2016.rda")
$\$
To look at a more homogeneous data set, let's just look at the data for
Assistant Professors. To get this subset of the data we can use dplyr's
filter()
function. Let's filter that data as follows:
rank_name
of Assistant. na.omit()
function.To make things simpler, let's have our data frame only have the following variables:
school
: The name of the schoolsalary_tot
: The total salary combined for over both men and women.endowment
: The size of the school's endowmentrank_name
: The faculty rank levelassistant_data <- IPED_salaries |> filter(endowment > 0, rank_name == "Assistant") |> select(school, salary_tot, endowment, enroll_total) |> na.omit() # get only the cases that do not have missing data
$\$
What could be driving the differences in faculty salaries across universities? Let's use linear models to see if we can find variables that predict differences in the salaries professors make at different schools.
One potential factor that could drive differences in faculty salaries is the size of a schools endowment. Let's begin by visualizing the relationship between endowment size and faculty salaries by using a scatter plot.
plot(assistant_data$endowment, assistant_data$salary_tot)
$\$
There seem to be some schools with very large values on the x-axis (i.e., points with high leverage). Let's explore what these points are.
# we can see a list of schools with the smallest and largest endowments # using dplyr's arrange() function head(arrange(assistant_data, endowment)) head(arrange(assistant_data, desc(endowment))) # We can use ggplot and plotly to explore the data more p <- assistant_data |> ggplot(aes(endowment, salary_tot)) + geom_point(alpha = .5, aes(name = school)) # + #geom_smooth(method = "lm", se = FALSE) # + #scale_x_log10() + #scale_y_log10() #ggplotly(p)
$\$
As we can see from the above plot, the relationship between the explanatory and response variables does not appear to be very linear (this non-linear relationship is largely driven by schools with very large endowments). One way to deal with this non-linear relationship is to apply a transformation to our data to make the relationship more linear. If this transformation makes the relationship linear, we can go ahead fit a linear model to our data. A visualization based on Tukey's bulging rule can help up decide on which transformation to use to make the relationship linear.
From looking at the plot of our data in part 1.2, we can see it matches the
shape in the upper left quadrant of the visualization above. Thus this suggests
if we apply a log transformation to x, the relationship might be more linear.
Let's try this by taking the $log_{10}$ of the variable endowment which will
tell us how faculty salaries change with the order of magnitude of a schools
endowment. To create this transformed variable we can use the dplyr mutate()
function.
# use the mutate() function to create a variable that has the log10 of the endowment assistant_data <- assistant_data |> mutate(log_endowment = log10(endowment)) |> arrange(log_endowment) # plot the relationship between salaries and the log10 of a schools endowment plot(assistant_data$log_endowment, assistant_data$salary_tot)
The relationship now appears to curve upward a bit which matches the pattern seen in the lower left quadrant of Tukey's bulging rule visualization. We could deal with this by also applying a log transformation of y but we will abstain from doing this for now.
$\$
Let's now fit a linear model to our data to predict faculty salaries from the log10
of the endowment.
# fit a regression model lm_fit <- lm(salary_tot ~ log_endowment, data = assistant_data) # plot the data and add the regression line to the plot plot(assistant_data$log_endowment, assistant_data$salary_tot) abline(lm_fit, col = "red") # look at the coefficients (beta hat's) coef(lm_fit)
Question: How do we interpret these coefficients?
Answer: For every order of magnitude change in endowment, the order of
magnitude of faculty salaries are predicted to go up by r coef(lm_fit)[2]
.
$\$
Our theoretical model of the data is: $Y = \beta_0 + \beta_1 x + \epsilon$. In this model our errors $\epsilon$ come from a normal distribution with a mean of 0 and a standard deviation that is denoted $\sigma_{\epsilon}$; i.e., $\epsilon \sim N(0, \sigma_{\epsilon})$. The standard deviation of residual $\hat{\sigma}{\epsilon}$ is an estimate of the standard deviation of errors $\sigma{\epsilon}$. This gives us a sense of how far the points are spread from the regression line.
Let's calculate it in R.
# calculate the residual sum of squares SSRes <- sum(lm_fit$residuals^2) # to get an unbiased estimate of the standard deviation of errors # our denominator is given by the regression degrees of freedom regression_df <- nrow(assistant_data) - 2 # estimate of the standard deviation of errors (SDE_hat <- sqrt(SSRes/regression_df))
$\$
Let's use the t-statistic and t-distribution to do parametric inference on whether the regression offset $\beta_0$, and the regression slope $\beta_1$, are different than 0. This means we are testing the hypotheses:
$H_0: \beta_0 = 0$ $H_A: \beta_0 \ne 0$
and
$H_0: \beta_1 = 0$ $H_A: \beta_1 \ne 0$
We can calculate the $\hat{SE}_{\hat{\beta}_1}$ and a t-statistic using these equations:
$\hat{SE}{\hat{\beta}_1} = \frac{\hat{\sigma}\epsilon}{\sqrt{\sum_{i=1}^n(x_i - \bar{x})^2}}$
$t = \frac{\hat{\beta}1}{\hat{SE}{\hat{\beta}_1}}$
Our t-statistic will then come from a t-distribution with n-2 degrees of freedom (if the assumptions roughly met).
We can use R's built in summary()
function to see if the regression coefficients are statistically significant.
# Assessing if the regression coefficients are statistically significant summary(lm_fit)
$\$
We can create confidence intervals for the regression coefficients $\beta_0$ and $\beta_1$ using a t-distribution via the confint()
function.
confint(lm_fit, level = .95)
Interpretation?
The true intercept $\beta_0$ and slope $\beta_1$ will be captured by intervals like these most of the time.
$\$
We can also use resampling methods (i.e., the bootstrap) to get confidence intervals.
Rather than create confidence intervals here, let's instead use the bootstrap to plot several bootstrap regression lines which will give us a visualization of a range of possible regression lines that could reflect the real underlying relationship.
To understand how variable our estimate of the regression line is, we can sample rows of a data frame using dplyr's sample_n()
function. We will use this function (setting the replace = TRUE
argument) and fit regression lines for 100 different bootstrap samples. By plotting all these regression lines, we can get a sense of the variability of our estimate of the regression line, and hence an understanding of the precision of approximately where the true regression line lies.
# let's start by plotting the data plot(assistant_data$log_endowment, assistant_data$salary_tot, xlab = "Log endowment size (log($))", ylab = "Salary ($)", main = "Relationship between log endowment and salary") # get the number of cases n_cases <- dim(assistant_data)[1] # use a loop to plot 100 regression lines that are consistent for (i in 1:100) { # get a bootstrap sample the same size as the original sample boot_sample <- sample_n(assistant_data, size = n_cases, replace = TRUE) # fit the regression model on the bootstrap sample and plot it boot_fit <- lm(salary_tot ~ log_endowment, data = boot_sample) abline(boot_fit, col = "red") }
$\$
We can also use parametric methods to plot confidence and prediction intervals for the regression line.
# confidence intervals for the betas (CI_betas <- confint(lm_fit)) # confidence interval for the regression line mu_y CI_regression_line <- predict(lm_fit, interval="confidence", level = 0.95) # prediction interval for the regression line prediction_interval <- predict(lm_fit, interval="predict") # # plot both confidence interval and the prediction interval # let's start by plotting the data plot(assistant_data$log_endowment, assistant_data$salary_tot, xlab = "Log endowment size (log($))", ylab = "Salary ($)", main = "Relationship between log endowment and salary") # plot confidence interval points(assistant_data$log_endowment, CI_regression_line[, 1], col = "red", type = "l") points(assistant_data$log_endowment, CI_regression_line[, 2], col = "green", type = "l") points(assistant_data$log_endowment, CI_regression_line[, 3], col = "green", type = "l") # plot prediction interval points(assistant_data$log_endowment, prediction_interval[, 2], col = "blue", type = "l") points(assistant_data$log_endowment, prediction_interval[, 3], col = "blue", type = "l")
$\$
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.