$\$

# makes sure you have all the packages we have used in class installed
#SDS230::update_installed_packages()


SDS230::download_data("IPED_salaries_2016.rda")
# install.packages("latex2exp")

library(latex2exp)
library(dplyr)
library(ggplot2)
library(plotly)

#options(scipen=999)


knitr::opts_chunk$set(echo = TRUE)

set.seed(123)

$\$

Overview

$\$

Part 0: Calculating confidence intervals for regression

Since we did not get through all the code calculating confidence intervals for regression last class, let's go through it now.

Part 0.0 Recreating the assistant faculty salary data

Let's start by recreating the Assistant professor salary data from the larger IPED data frame. We can then re-fit a linear regression model to the data.

# load the data into R
load("IPED_salaries_2016.rda")

# get the assistant professor data
assistant_data <- filter(IPED_salaries,  endowment > 0,  rank_name == "Assistant") |>
  select(school, salary_tot, endowment, enroll_total) |>
  na.omit() |>
  mutate(log_endowment = log10(endowment)) |>
  arrange(log_endowment)


# recreate our linear regression model from class 16
lm_fit <- lm(salary_tot ~ log_endowment, data = assistant_data)

Part 0.1: t-distribution based confidence intervals on the regression intercept and slope

We can create confidence intervals for the regression coefficients $\beta_0$ and $\beta_1$ using a t-distribution via the confint() function.

confint(lm_fit, level = .95)

Interpretation?

If we repeatedly collected samples and constructed 95% confidence intervals in this way, approximately 95% of the intervals would capture the true intercept $\beta_0$ and slope $\beta_1$.
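
To see where these intervals come from, here is a minimal sketch computing the 95% interval for the slope by hand from the coefficient estimate, its standard error, and a t critical value with n - 2 degrees of freedom (this should match the confint() output above):

# a sketch of computing the 95% CI for the slope by hand
# using the estimate and standard error from the summary table
coefs <- summary(lm_fit)$coefficients
b1 <- coefs["log_endowment", "Estimate"]
se_b1 <- coefs["log_endowment", "Std. Error"]
t_star <- qt(0.975, df = nrow(assistant_data) - 2)

b1 + c(-1, 1) * t_star * se_b1   # should match the confint() output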

$\$

Part 0.2: Plotting confidence intervals and prediction intervals for the regression line

We can also use parametric methods to plot confidence and prediction intervals for the regression line.

# confidence interval for the regression line mu_y
CI_regression_line <- predict(lm_fit, interval="confidence", level = 0.95)


# prediction interval for new individual observations
prediction_interval <- predict(lm_fit, interval = "prediction", level = 0.95)


# plot both the confidence interval and the prediction interval
# let's start by plotting the data
plot(assistant_data$log_endowment, 
     assistant_data$salary_tot,
     xlab = "Log endowment size (log($))", 
     ylab = "Salary ($)", 
     main = "Relationship between log endowment and salary")


# plot the fitted regression line (red) and the confidence interval (green)
points(assistant_data$log_endowment, CI_regression_line[, 1], col = "red", type = "l")
points(assistant_data$log_endowment, CI_regression_line[, 2], col = "green", type = "l")
points(assistant_data$log_endowment, CI_regression_line[, 3], col = "green", type = "l")


# plot the prediction interval (blue)
points(assistant_data$log_endowment, prediction_interval[, 2], col = "blue", type = "l")
points(assistant_data$log_endowment, prediction_interval[, 3], col = "blue", type = "l")
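
We can also get both types of intervals at a specific endowment value using the newdata argument of predict(). Below is a minimal sketch; the log10 endowment value of 9 (i.e., a one billion dollar endowment) is just an illustrative choice:

# intervals for a hypothetical school with a $1 billion endowment (log10 value of 9)
new_school <- data.frame(log_endowment = 9)

predict(lm_fit, newdata = new_school, interval = "confidence", level = 0.95)
predict(lm_fit, newdata = new_school, interval = "prediction", level = 0.95)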

$\$

Part 1: Assessing unusual points

Let's continue to use the faculty salary data to explore how unusual points can affect our regression model. As we discussed, points can be unusual by being:

  1. High leverage: unusual explanatory variable values (i.e. points with unusual x-values)
  2. Outliers: unusual response variable values (i.e., points with unusual y-values)
  3. Influential points: High leverage and outlier points (i.e., points with unusual x and y values)

There are different statistics, and consequently R functions, to examine if data points are unusual along these dimensions.
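
As a quick preview of how these ideas translate into code (using the rule of thumb cutoffs of 4/n and 6/n for leverage, and 3 for studentized residuals, discussed below), here is a minimal sketch counting the points that exceed each threshold:

# count potentially unusual points using common rule of thumb cutoffs
n <- nrow(assistant_data)

sum(hatvalues(lm_fit) > 4 / n)       # high leverage points
sum(hatvalues(lm_fit) > 6 / n)       # very high leverage points
sum(abs(rstudent(lm_fit)) > 3)       # potential outliers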

$\$

Part 1.1: Let's explore high leverage points using the hatvalues() function, which takes a linear model as its input argument.

Are any of the hat values greater than 4/n (high leverage) or 6/n (very high leverage)?

# plot a histogram of the hat values
hist(hatvalues(lm_fit), breaks = 100)
abline(v = 6/nrow(assistant_data), col = "red")



# mutate the hat-values onto the original data and plot them in color
assistant_data |>
  mutate(hatvalues = hatvalues(lm_fit)) |>
  ggplot(aes(endowment, salary_tot)) + 
  geom_point(aes(name = school, color = hatvalues)) +  # name appears in plotly hover text
  geom_smooth(method = "lm", se = FALSE) + 
  scale_x_log10() + 
  scale_y_log10() +
  scale_color_continuous(low = "black", high = "red")
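
To see which schools these points correspond to, a minimal sketch filtering on the very high leverage cutoff:

# list the schools whose hat values exceed the very high leverage cutoff of 6/n
assistant_data |>
  mutate(hatvalues = hatvalues(lm_fit)) |>
  filter(hatvalues > 6 / nrow(assistant_data)) |>
  arrange(desc(hatvalues))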

$\$

Part 1.2: Let's explore the standardized and studentized residuals using the rstandard() and rstudent() functions. Are any of the standardized or studentized residuals greater than 3 in absolute value?

# plot a histogram of the standardized residuals
hist(rstandard(lm_fit), breaks = 100)
abline(v = c(-3, 3), col = "red")


# plot a histogram of the studentized residuals
hist(rstudent(lm_fit), breaks = 100)
abline(v = c(-3, 3), col = "red")



# mutate on the absolute value of studentized residuals and plot them in color
assistant_data |>
  mutate(studentized = abs(rstudent(lm_fit))) |>
  ggplot(aes(endowment, salary_tot)) + 
  geom_point(aes(name = school, color = studentized)) + 
  geom_smooth(method = "lm", se = FALSE) + 
  scale_x_log10() + 
  scale_y_log10() +
  scale_color_continuous(low = "black", high = "red")


# we can explore this in plotly too!
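# a sketch: by default ggplotly() converts the most recently created ggplot into
# an interactive plot, and the name aesthetic we mapped above appears in the hover text
ggplotly()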

$\$

Part 1.3: Let's also examine Cook's distance to see which points are influential.

# use the base R regression diagnostic plots to show Cook's distance
plot(lm_fit, 4)


# let's examine the points with the highest Cook's distances
assistant_data |>
  slice(c(202, 668, 779))
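
# rather than hard coding the row numbers, a sketch of finding the three
# largest Cook's distance values programmatically (the names are row numbers)
sort(cooks.distance(lm_fit), decreasing = TRUE) |> head(3)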


# mutate on the Cook's distance and plot in color
assistant_data |>
  mutate(cooks = cooks.distance(lm_fit)) |>
  ggplot(aes(endowment, salary_tot)) + 
  geom_point(aes(name = school, color = cooks)) + 
  geom_smooth(method = "lm", se = FALSE) + 
  scale_x_log10() + 
  scale_y_log10() +
  scale_color_continuous(low = "black", high = "red")

$\$

Part 2: Analysis of variance (ANOVA) for regression

The code below creates an ANOVA table for our regression model predicting salary as a function of the log of a school's endowment, using the assistant professor data from the IPED data set.

anova(lm_fit)
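
To see where the F-statistic in this table comes from, here is a sketch reconstructing it from the sums of squares (for a simple regression, $F = \frac{SS_{model}/1}{SS_{error}/(n-2)}$):

# a sketch reconstructing the F-statistic from the sums of squares
anova_table <- anova(lm_fit)
ss_model <- anova_table["log_endowment", "Sum Sq"]
ss_error <- anova_table["Residuals", "Sum Sq"]
df_error <- anova_table["Residuals", "Df"]

(f_stat <- (ss_model / 1) / (ss_error / df_error))   # matches the F value column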

$\$

Part 3: Multiple linear regression on the faculty salary data

Let's now use the faculty salary data to explore multiple linear regression for building a model to predict faculty salary from the endowment size of a school and the number of students enrolled.

Part 3.1: Exploring the enrollment data

In our previous analyses we built models predicting faculty salaries based on the size of a school's endowment. Before we start on multiple regression, let's look at how faculty salaries are affected by another variable, namely the number of students enrolled at a school. Before starting an analysis it is often worth thinking about our expectations. Here we might expect that schools with higher enrollment numbers can pay their faculty more, since they have a higher revenue stream from the larger number of students paying tuition. Thus if we model $\text{salary} = \hat{\beta}_0 + \hat{\beta}_1 \cdot \text{enrollment}$ we might expect $\hat{\beta}_1$ to be positive.

Let's start our exploratory analysis by plotting the relationship between faculty salary and the number of students enrolled. If the relationship does not appear linear, we can transform the variables (as is often done in the "choose" step of model building).

# plot the relationship between salary and enrollment
plot(assistant_data$enroll_total, assistant_data$salary_tot)

$\$

From looking at this plot we see that there is a large range of enrollment values, with a few very large values. Thus it might be better to log transform the x-values (when values can only be positive, taking a log transformation often leads to a more linear relationship). Let's mutate a log_enroll variable onto our assistant_data data frame and then plot the relationship between salary and log_enroll. We will use log10 here to be consistent with our transformation of endowment, and also because it is easier to think in terms of orders of magnitude for this problem.

# the relationship does not appear linear, let's mutate on log enrollment
assistant_data <- assistant_data |>
  mutate(log_enroll = log10(enroll_total)) 


# plot the relationship between salary and log enrollment
plot(assistant_data$log_enroll, assistant_data$salary_tot)

Question: Does the relationship appear more linear now?

$\$

Part 3.2: Fitting a simple linear regression model for predicting salary as a function of log enrollment

Let's now fit a simple linear regression model $\text{salary} = \hat{\beta}_0 + \hat{\beta}_1 \cdot \text{log(enrollment)}$. Let's also plot the model and look at some inferential statistics by using the summary() function on our model.

# fit a linear regression model of salary as a function of log enrollment and plot it
plot(assistant_data$log_enroll, assistant_data$salary_tot)
lm_fit_enroll <- lm(salary_tot ~ log_enroll, data = assistant_data)
abline(lm_fit_enroll, col = "red")


# look at some inferential statistics using the summary() function
(summary_enroll <- summary(lm_fit_enroll))

$\$

Part 3.3: Comparing the simple linear regression enrollment and endowment models

Let's compare these models predicting salary from enrollment vs. endowment in terms of which model can explain more of the variability in salaries, as measured by the $r^2$ statistic. Let's also create scatter plots of the relationships between these three variables using the pairs() function.

# compare r^2 and look at all scatter plots
lm_fit_endow <- lm(salary_tot ~ log_endowment, data = assistant_data)
summary_endow <- summary(lm_fit_endow)
#plot(assistant_data$log_endowment, assistant_data$salary_tot)
#abline(lm_fit_endow, col = "red")


paste("Endowment model r^2", round(summary_endow$r.squared, 4))
paste("Enrollment model r^2", round(summary_enroll$r.squared, 4))

pairs(select(assistant_data, salary_tot, log_endowment, log_enroll))

$\$

Part 3.4: Multiple regression

Let's now fit a multiple regression model for predicting salary using both endowment and enrollment as explanatory variables, to see if using both variables allows us to better predict salary than either variable alone. In particular, we are fitting the model $\text{salary} = \hat{\beta}_0 + \hat{\beta}_1 \cdot \text{log(endowment)} + \hat{\beta}_2 \cdot \text{log(enrollment)}$.

(lm_fit_endow_enroll <- lm(salary_tot ~ log_endowment + log_enroll, data = assistant_data))

summary(lm_fit_endow_enroll)

Question: Does this model account for more of the variability than the simple regression models we fit?
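
One way to answer this is to pull the $r^2$ values out of the model summaries; a minimal sketch:

# compare the r^2 values of the three models
paste("Endowment only:", round(summary_endow$r.squared, 4))
paste("Enrollment only:", round(summary_enroll$r.squared, 4))
paste("Endowment and enrollment:", round(summary(lm_fit_endow_enroll)$r.squared, 4))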

$\$

Part 3.5: Test for comparing nested models

When we have nested models, we can use an ANOVA test based on the F-statistic to assess if adding additional explanatory variables leads to a model that can account for more of the variability in a response variable.

A Model 1 is nested in a Model 2 if the parameters in Model 1 are a subset of the parameters in Model 2. Here, our model using only the endowment as the explanatory variable is nested within the model that uses both endowment and enrollment as explanatory variables. Let's use the anova() function to test whether adding the enrollment explanatory variable leads to a statistically significant increase in the amount of variability that can be accounted for.

anova(lm_fit_endow, lm_fit_endow_enroll)
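
To see what anova() is computing here, a sketch reconstructing the F-statistic by hand: $F = \frac{(RSS_1 - RSS_2)/(df_1 - df_2)}{RSS_2/df_2}$, where $RSS_1$ is the residual sum of squares of the smaller model and $RSS_2$ that of the larger model:

# reconstruct the nested model F-statistic from the residual sums of squares
rss_1 <- sum(residuals(lm_fit_endow)^2)         # smaller (nested) model
rss_2 <- sum(residuals(lm_fit_endow_enroll)^2)  # larger model

# the two models differ by one parameter, so the numerator df is 1
(f_stat <- ((rss_1 - rss_2) / 1) / (rss_2 / lm_fit_endow_enroll$df.residual))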

$\$


