```r
# makes sure you have all the packages we have used in class installed
# SDS230::update_installed_packages()
SDS230::download_data("IPED_salaries_2016.rda")

# install.packages("latex2exp")
library(latex2exp)
library(dplyr)
library(ggplot2)
library(plotly)

# options(scipen = 999)
knitr::opts_chunk$set(echo = TRUE)
set.seed(123)
```
Let's continue to use the faculty salary data to explore how unusual points can affect our regression model. As we discussed, points can be unusual by being:

- high leverage points: points with extreme explanatory variable values
- outliers: points with large residuals
- influential points: points that substantially change the fitted model when they are removed

There are different statistics, and consequently different R functions, to examine whether data points are unusual along these dimensions.
Part 1.0: Recreating the assistant faculty salary data
Let's start by recreating the Assistant professor salary data from the larger IPED data frame. We can then re-fit a linear regression model to the data.
```r
# load the data into R
load("IPED_salaries_2016.rda")

# get the assistant professor data
assistant_data <- filter(IPED_salaries, endowment > 0, rank_name == "Assistant") |>
  select(school, salary_tot, endowment, enroll_total) |>
  na.omit() |>
  mutate(log_endowment = log10(endowment))

# recreate our linear regression model from class 16
lm_fit <- lm(salary_tot ~ log_endowment, data = assistant_data)
```
Part 1.1: Let's explore high leverage points using the `hatvalues()` function, which takes a linear model as the input argument. Are any points greater than 4/n (high leverage), or 6/n (very high leverage)?
```r
# plot a histogram of the hat values

# mutate the hat values onto the original data and plot them in color
```
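One possible sketch of these steps (not the official class solution), assuming the `assistant_data` and `lm_fit` objects created above:

```r
# compute the hat values from our fitted model
hat_vals <- hatvalues(lm_fit)
n <- nrow(assistant_data)

# plot a histogram of the hat values
hist(hat_vals, breaks = 50, main = "Hat values", xlab = "Hat value")

# mutate the hat values onto the original data and plot them in color
assistant_data |>
  mutate(hat_value = hat_vals,
         high_leverage = hat_value > 4 / n) |>
  ggplot(aes(log_endowment, salary_tot, color = high_leverage)) +
  geom_point()
```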
Part 1.2: Let's explore the standardized and studentized residuals using the `rstandard()` and `rstudent()` functions. Are any of the standardized or studentized residuals greater than 3?
```r
# plot a histogram of the standardized residuals

# plot a histogram of the studentized residuals

# mutate on the absolute value of the studentized residuals and plot them in color

# we can explore this in plotly too!
```
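A hedged sketch of how this could look, again assuming the objects defined above:

```r
# extract the standardized and studentized residuals
standardized_resids <- rstandard(lm_fit)
studentized_resids <- rstudent(lm_fit)

# histograms of each type of residual
hist(standardized_resids, breaks = 50, main = "Standardized residuals")
hist(studentized_resids, breaks = 50, main = "Studentized residuals")

# color points whose absolute studentized residual exceeds 3
resid_plot <- assistant_data |>
  mutate(abs_studentized = abs(studentized_resids)) |>
  ggplot(aes(log_endowment, salary_tot, color = abs_studentized > 3)) +
  geom_point()

resid_plot

# an interactive version using plotly
ggplotly(resid_plot)
```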
Part 1.3: Let's also examine Cook's distance to see which points are influential.
```r
# use the base R regression diagnostic plots to show Cook's distance

# let's examine the points with a high Cook's distance

# mutate on the Cook's distance and plot in color
```
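A minimal sketch of these steps, assuming the `lm_fit` model from above (the 0.02 cutoff is an arbitrary illustrative choice):

```r
# base R regression diagnostic plot of Cook's distance
plot(lm_fit, which = 4)

# examine the points with a high Cook's distance
cooks_vals <- cooks.distance(lm_fit)
assistant_data[cooks_vals > 0.02, ]

# mutate on the Cook's distance and plot in color
assistant_data |>
  mutate(cooks = cooks_vals) |>
  ggplot(aes(log_endowment, salary_tot, color = cooks)) +
  geom_point()
```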
Part 1.3b: Cook's distance can also be calculated by assessing how much the predicted values $\hat{y}$ change when the $i^{th}$ data point is removed from the model. If the predictions change a lot when the $i^{th}$ point is removed, then the $i^{th}$ point has a large influence on the model and hence a large Cook's distance.
To illustrate this, we can create Cook's distance by fitting the regression model with the ith point removed, and comparing the difference in the predicted values from fitting the model with all the points; i.e.,
$$D_i = \frac{\sum_{j = 1}^{n-1} (\hat{y}_j - \hat{y}_{j(i)})^2}{(k+1) \cdot \hat{\sigma}_e^2}$$
```r
# from the full model of all the data...
n_points <- nrow(assistant_data)
all_residuals <- lm_fit$residuals
p <- 2  # p = 1 + k, where k is the number of predictors
residual_standard_deviation <- summary(lm_fit)$sigma  # sqrt(sum(all_residuals^2) / (n_points - p))

reconstructed_cooks <- c()

for (i in 1:n_points) {
  # fit the model with the ith point removed
  curr_data <- slice(assistant_data, -i)
  curr_fit <- lm(salary_tot ~ log_endowment, data = curr_data)

  # remove the ith residual from the full-model residuals
  other_residuals <- all_residuals[-i]

  reconstructed_cooks[i] <- sum((curr_fit$residuals - other_residuals)^2) / (residual_standard_deviation^2 * p)
}

plot(cooks.distance(lm_fit), reconstructed_cooks,
     xlab = "Using the cooks.distance() function",
     ylab = "Reconstructed by leaving out the ith point")
```
The code below will create an ANOVA table for a regression model predicting salary as a function of the log of a school's endowment. We will use data from assistant professors from the IPED data.
```r
anova(lm_fit)
```
Let's now use the faculty salary data to explore multiple linear regression for building a model to predict faculty salary from the endowment size of a school and the number of students enrolled.
In our previous analyses we built models predicting faculty salaries based on the size of the school's endowment. Before we start on multiple regression, let's look at how faculty salaries are affected by another variable, namely the number of students enrolled at a school. Before starting an analysis it is often worth thinking about our expectations. Here we might expect that schools with higher enrollment numbers can pay their faculty more, since they have a higher revenue stream from the larger number of students paying tuition. Thus if we model $\text{salary} = \hat{\beta}_0 + \hat{\beta}_1 \cdot \text{enrollment}$ we might expect $\hat{\beta}_1$ to be positive.
Let's start our exploratory analysis by plotting the relationship between faculty salary and the number of students enrolled. If the relationship does not appear linear, we can transform the variables (as is often done in the "choose" step of model building).
```r
# plot the relationship between salary and enrollment
```
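A simple sketch, assuming `enroll_total` is the enrollment column (as in the `select()` call above):

```r
# scatter plot of salary as a function of enrollment
ggplot(assistant_data, aes(enroll_total, salary_tot)) +
  geom_point()
```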
From looking at this plot we see that there is a large range of enrollment values, with a few very large numbers. Thus it might be better to log transform the x-values (often when values can only be positive, taking a log transformation leads to a more linear relationship). Let's mutate on a variable `log_enroll` to our assistant data frame and then plot the relationship between salary and `log_enroll`. We will use log10 here to be consistent with our transformation of endowment, and also since it is easier for us to think in terms of "orders of magnitude" for this problem.
```r
# the relationship does not appear linear, let's mutate on log enrollment

# plot the relationship between salary and log enrollment
```
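One way this could be done, continuing the sketch above:

```r
# mutate on the log10 of enrollment
assistant_data <- assistant_data |>
  mutate(log_enroll = log10(enroll_total))

# plot the relationship between salary and log enrollment
ggplot(assistant_data, aes(log_enroll, salary_tot)) +
  geom_point()
```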
Question: Does the relationship appear more linear now?
Let's now fit a simple linear regression model $\text{salary} = \hat{\beta}_0 + \hat{\beta}_1 \cdot \text{log(enrollment)}$. Let's also plot the model and look at some inferential statistics by using the `summary()` function on our model.
```r
# fit a linear regression model of salary as a function of log enrollment and plot it

# look at some inferential statistics using the summary() function
```
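A sketch of these steps; the model name `lm_enroll` is just an illustrative choice:

```r
# fit a linear regression model of salary as a function of log enrollment
lm_enroll <- lm(salary_tot ~ log_enroll, data = assistant_data)

# plot the data with the fitted regression line
ggplot(assistant_data, aes(log_enroll, salary_tot)) +
  geom_point() +
  geom_smooth(method = "lm")

# look at some inferential statistics
summary(lm_enroll)
```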
Let's compare these models predicting salary from enrollment vs. endowment in terms of which model explains more of the variability in salaries, as measured by the $r^2$ statistic. Let's also create scatter plots of the relationships between these three variables using the `pairs()` function.
```r
# compare r^2 and look at all scatter plots
```
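A possible sketch, assuming the `lm_enroll` model from the previous chunk:

```r
# compare the r^2 values of the two simple regression models
summary(lm_fit)$r.squared     # endowment model
summary(lm_enroll)$r.squared  # enrollment model

# scatter plots of all pairwise relationships between the three variables
pairs(select(assistant_data, salary_tot, log_endowment, log_enroll))
```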
Let's now fit a multiple regression model for predicting salary using both endowment and enrollment as explanatory variables, to see if using both of these variables allows us to better predict salary than either variable alone. In particular, we are fitting the model $\text{salary} = \hat{\beta}_0 + \hat{\beta}_1 \cdot \text{log(endowment)} + \hat{\beta}_2 \cdot \text{log(enrollment)}$.
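A sketch of fitting this model; the name `lm_multiple` is an illustrative choice:

```r
# fit the multiple regression model with both explanatory variables
lm_multiple <- lm(salary_tot ~ log_endowment + log_enroll, data = assistant_data)

# compare the r^2 to the simple regression models
summary(lm_multiple)$r.squared
```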
Question: Does this model account for more of the variability than the simple regression models we fit?
When we have nested models, we can use an ANOVA test based on the F-statistic to assess if adding additional explanatory variables leads to a model that can account for more of the variability in a response variable.
A Model 1 is nested in a Model 2 if the parameters in Model 1 are a subset of the parameters in Model 2. Here, our model using only the endowment as the explanatory variable is nested within the model that uses both endowment and enrollment as explanatory variables. Let's use the `anova()` function to test whether adding the enrollment explanatory variable leads to a statistically significant increase in the amount of variability that can be accounted for.
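A minimal sketch of the nested model comparison, assuming the `lm_fit` and (illustrative) `lm_multiple` models from above:

```r
# F-test comparing the nested models
anova(lm_fit, lm_multiple)
```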