In emeyers/SDS230: Tools for the class Data Exploration and Analysis

$\$

# install.packages('car')

SDS230::download_data("IPED_salaries_2016.rda")


SDS230::download_image("tukey_ladder.png")

library(latex2exp)
library(dplyr)
library(ggplot2)
library(plotly)


#options(scipen=999)


knitr::opts_chunk$set(echo = TRUE)

set.seed(123)

$\$

Overview

Exploring faculty salary data
Hypothesis tests for regression coefficients
Confidence intervals for regression coefficients

$\$

Part 1: Review - fitting simple linear regression model to faculty salary data

To explore linear regression models, we will examine data from the Integrated Postsecondary Education Data System (IPEDS) which has information about colleges and universities. I have created a subset of this data from 2016 that has information about college salaries, endowment sizes and other variables. Full datasets from several years are available in Microsoft Access format and can be downloaded here.

Below is code that loads the data. You can use the View() function from the console to look at the data. The glimpse() function in dplyr is also useful for getting a sense of what is in a data frame.

# load the data into R
load("IPED_salaries_2016.rda")

$\$

Part 1.1: Filtering and selecting to get only Assistant Professor data

To look at a more homogeneous data set, let's just look at the data for Assistant Professors. To get this subset of the data we can use dplyr's filter() function. Let's filter that data as follows:

Get only rank_name of Assistant.
Get colleges that have endowments greater than $0
Get only data where there is no missing information using the na.omit() function.

To make things simpler, let's have our data frame only have the following variables:

school: The name of the school
salary_tot: The total salary combined for over both men and women.
endowment: The size of the school's endowment
rank_name: The faculty rank level

assistant_data <- filter(IPED_salaries,  endowment > 0,  rank_name == "Assistant") |>
  select(school, salary_tot, endowment, enroll_total) |>
  na.omit()     # get only the cases that do not have missing data

$\$

Part 1.2: Visualizing relationships

What could be driving the differences in faculty salaries across universities? Let's use linear models to see if we can find variables that predict differences in the salaries professors make at different schools.

One potential factor that could drive differences in faculty salaries is the size of a schools endowment. Let's begin by visualizing the relationship between endowment size and faculty salaries by using a scatter plot.

plot(assistant_data$endowment, assistant_data$salary_tot)

$\$

Part 1.3: Transforming the varaibles

As we can see from the above plot, the relationship between the explanatory and response variables does not appear to be very linear (this non-linear relationship is largely driven by schools with very large endowments). One way to deal with this non-linear relationship is to apply a transformation to our data to make the relationship more linear. If this transformation makes the relationship linear, we can go ahead fit a linear model to our data. A visualization based on Tukey's bulging rule can help up decide on which transformation to use to make the relationship linear.

From looking at the plot of our data in part 1.2, we can see it matches the shape in the upper left quadrant of the visualization above. Thus this suggests if we apply a log transformation to x, the relationship might be more linear. Let's try this by taking the $log_{10}$ of the variable endowment which will tell us how faculty salaries change with the order of magnitude of a schools endowment. To create this transformed variable we can use the dplyr mutate() function.

# use the mutate() function to create a variable that has the log10 of the endowment
assistant_data  <- assistant_data  |>
  mutate(log_endowment = log10(endowment)) |>
  arrange(log_endowment)


# plot the relationship between salaries and the log10 of a schools endowment
plot(assistant_data$log_endowment, assistant_data$salary_tot)

The relationship now appears to curve upward a bit which matches the pattern seen in the lower left quadrant of Tukey's bulging rule visualization. We could deal with this by also applying a log transformation of y but we will abstain from doing this for now.

$\$

Part 1.4: Fitting a linear model

Let's now fit a linear model to our data to predict faculty salaries from the log10 of the endowment.

# fit a regression model
lm_fit <- lm(salary_tot ~ log_endowment, data = assistant_data)


# plot the data and add the regression line to the plot
plot(assistant_data$log_endowment, assistant_data$salary_tot)
abline(lm_fit, col = "red")


# look at the coefficients (beta hat's)
coef(lm_fit)

Question: How do we interpret these coefficients?

Answer: For every order of magnitude change in endowment, the order of magnitude of faculty salaries are predicted to go up by r coef(lm_fit)[2].

$\$

Part 1.5: The regression standard error

Our theoretical model of the data is: $Y = \beta_0 + \beta_1 x + \epsilon$. In this model our errors $\epsilon$ come from a normal distribution with a mean of 0 and a standard deviation that is denoted $\sigma_{\epsilon}$; i.e., $\epsilon \sim N(0, \sigma_{\epsilon})$. The regression standard error $\hat{\sigma}{\epsilon}$ is an estimate of the standard deviation of errors $\sigma{\epsilon}$. This gives us a sense of how much the points are spread from the regression line.

Let's calculate it in R.

# calculate the residual sum of squares
SSRes <- sum(lm_fit$residuals^2)


# to get an unbiased estimate of the standard deviation of errors
# our denominator is given by the regression degrees of freedom
regression_df <- nrow(assistant_data) - 2


# estimate of the residual standard error
(RSE_hat <- sqrt(SSRes/regression_df))

$\$

Part 2: Hypothesis tests for regression coefficients using parametric methods

Let's use the t-statistic and t-distribution to do parametric inference on whether the regression offset $\beta_0$, and the regression slope $\beta_1$, are different than 0. This means we are testing the hypotheses:

$H_0: \beta_0 = 0$ $H_A: \beta_0 \ne 0$

and

$H_0: \beta_1 = 0$ $H_A: \beta_1 \ne 0$

Part 2.1: Assessing the statistical significance of the fit

We can calculate the $\hat{SE}_{\hat{\beta}_1}$ and a t-statistic using these equations:

$\hat{SE}{\hat{\beta}_1} = \frac{\hat{\sigma}\epsilon}{\sqrt{\sum_{i=1}^n(x_i - \bar{x})^2}}$

$t = \frac{\hat{\beta}1}{\hat{SE}{\hat{\beta}_1}}$

Our t-statistic will then come from a t-distribution with n-2 degrees of freedom (if the assumptions roughly met).

We can use R's built in summary() function to see if the regression coefficients are statistically significant.

# Assessing if the regression coefficients are statistically significant
summary(lm_fit)

$\$

Part 3: Calculating confidence intervals for regression coefficients

Part 3.1: t-distribution based confidence intervals on the regression intercept and slope

We can create confidence intervals for the regression coefficients $\beta_0$ and $\beta_1$ using a t-distribution via the confint() function.

Interpretation?

$\$

Part 3.2: Using the bootstrap to plot confidence intervals for the regression line

We can also use resampling methods (i.e., the bootstrap) to get confidence intervals.

Rather than create confidence intervals here, let's instead use the bootstrap to plot several bootstrap regression lines which will give us a visualization of a range of possible regression lines that could reflect the real underlying relationship.

To understand how variable our estimate of the regression line is, we can sample rows of a data frame using dplyr's sample_n() function. We will use this function (setting the replace = TRUE argument) and fit regression lines for 100 different bootstrap samples. By plotting all these regression lines, we can get a sense of the variability of our estimate of the regression line, and hence an understanding of the precision of approximately where the true regression line lies.

# let's start by plotting the data



# get the number of cases



# use a loop to plot 100 regression lines that are consistent

$\$

Part 3.3: Plotting confidence intervals and prediction intervals for the regression line

We can also use parametric methods to plot confidence and prediction intervals for the regression line.

# confidence intervals for the betas



# confidence interval for the regression line mu_y



# prediction interval for the regression line



# # plot both confidence interval and the prediction interval
# let's start by plotting the data




# plot confidence interval






# plot prediction interval

$\$

Part 4: Regression diagnostics

When making inferences about regression coefficients, there are a number of assumptions that need to be met to make these tests/confidence intervals valid. The (LINE) assumptions are:

Linearity: A line can describe the relationship between x and y.
Independence: Each data point is independent from the other points.
Normality: The errors (are normally distributed.
Equal variance (homoscedasticity): The variance of the errors is constant over the whole range of x values.

To check whether these assumptions seem to be met by creating a set of diagnostic plots.

$\$

Part 4.1: To check for linearity and homoscedasticity (conditions 1 and 4), we can create a plot of the residuals as a function of the fitted values.

Question Does it appear that the data is homoscedastic and linear?

$\$

We can examine this further some more of R's built in functions and well as other functions from the car package and getting the studentized residuals using the Mass package.

$\$

Part 4.2: To check whether the residuals are normally distributed (condition 3) we can use create a Q-Q plot. The car package has a nice function to create these plots called qqPlot() to create these plots that uses the residuals from model fit with the lm() function, i.e, the lm_fit object.

# install.packages('car')
#library(car)

# create a Q-Q plot of the residuals


# we can also create a histogram of the residuals

Question: Do the residuals seem normal?

$\$

Part 4.3: To check if the data points are independent (condition 2) requires knowledge of how the data was collected. Do you think the data is independent here? i.e., do you think that the salary of one school effects the salaries at another school?

$\$

Part 5: Assessing unusual points

Let's continue to use the faculty salary data to explore how unusual points can affect our regression model. As we discussed, points can be usual by being:

High leverage: unusual explanatory variable values (i.e. points with unusual x-values)
Outliers: unusual response variable values (i.e., points with unusual y-values)
Influential points: High leverage and outlier points (i.e., points with unusual x and y values)

There are different statistics, and consequently R functions, to examine if data points are unusual along these dimensions.

$\$

Part 5.1: Let's explore high leverage points using the hatvalues() function which takes a linear model as the input argument.

Are any points greater than 4/n (high leverage), or 6/n (very high leverage)?

# plot a histogram of the hat values



# mutate the hat-values onto the original data and plot them in color

$\$

Part 5.2: Let's explore the standardized and studentized residuals using the rstandard() and rstudent() functions. Are any of the standardized or studentized residuals greater than 3?

# plot a histogram of the standarized residuals



# plot a histogram of the studentized residuals




# mutate on the absolute value of studentized residuals and plot them in color









# we can explore this in plotly too!

$\$

Part 5.3: Let's also examine Cook's distance to see which points are influential.

# use the base R regression diagostic plots to show Cooke's distance



# let's examine the points with the high Cook's distance




# mutate on the Cook's distance and plot in color

$\$

emeyers/SDS230 documentation built on Jan. 18, 2024, 1:01 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

emeyers/SDS230
Tools for the class Data Exploration and Analysis

In emeyers/SDS230: Tools for the class Data Exploration and Analysis

Overview

Part 1: Review - fitting simple linear regression model to faculty salary data

Part 1.1: Filtering and selecting to get only Assistant Professor data

Part 1.2: Visualizing relationships

Part 1.3: Transforming the varaibles

Part 1.4: Fitting a linear model

Part 1.5: The regression standard error

Part 2: Hypothesis tests for regression coefficients using parametric methods

Part 2.1: Assessing the statistical significance of the fit

Part 3: Calculating confidence intervals for regression coefficients

Part 3.1: t-distribution based confidence intervals on the regression intercept and slope

Part 3.2: Using the bootstrap to plot confidence intervals for the regression line

Part 3.3: Plotting confidence intervals and prediction intervals for the regression line

Part 4: Regression diagnostics

Part 5: Assessing unusual points

R Package Documentation

Browse R Packages

We want your feedback!

emeyers/SDS230 Tools for the class Data Exploration and Analysis

In emeyers/SDS230: Tools for the class Data Exploration and Analysis

Overview

Part 1: Review - fitting simple linear regression model to faculty salary data

Part 1.1: Filtering and selecting to get only Assistant Professor data

Part 1.2: Visualizing relationships

Part 1.3: Transforming the varaibles

Part 1.4: Fitting a linear model

Part 1.5: The regression standard error

Part 2: Hypothesis tests for regression coefficients using parametric methods

Part 2.1: Assessing the statistical significance of the fit

Part 3: Calculating confidence intervals for regression coefficients

Part 3.1: t-distribution based confidence intervals on the regression intercept and slope

Part 3.2: Using the bootstrap to plot confidence intervals for the regression line

Part 3.3: Plotting confidence intervals and prediction intervals for the regression line

Part 4: Regression diagnostics

Part 5: Assessing unusual points

R Package Documentation

Browse R Packages

We want your feedback!

emeyers/SDS230
Tools for the class Data Exploration and Analysis