$\$

SDS230::download_data("IPED_salaries_2016.rda")
# install.packages("latex2exp")

library(latex2exp)
library(dplyr)

#options(scipen=999)


knitr::opts_chunk$set(echo = TRUE)

set.seed(123)

$\$

Overview

$\$

Part 0: Multiple linear regression with categorical explanatory variables

Let's now examine how much faculty salaries increase as a function of log endowment size, taking into account the rank that different professors hold.

Part 0.1: Creating a new data set

Let's start by creating a data set called IPED_2, which modifies the IPED_salaries data in the following ways:

  1. Only include institutions that have a positive endowment, and add a log_endowment variable that is the log (base 10) of the endowment.

  2. Only include data from institutions with a CARNEGIE classification of 15 or 31 (these correspond to R1 institutions and liberal arts colleges).

  3. Only use the faculty ranks of Assistant, Associate, and Full Professor.

load("IPED_salaries_2016.rda")

IPED_2 <- IPED_salaries |> 
  filter(endowment > 0) |>
  mutate(log_endowment = log10(endowment)) |>
  filter(CARNEGIE %in% c(15, 31)) |>
  filter(rank_name %in% c("Assistant", "Associate", "Full")) 


dim(IPED_2)

$\$

Part 0.2: Visualizing the data

Let's visualize the data by creating a scatter plot (using base R graphics) showing the total salary that faculty are paid (salary_tot) as a function of log endowment size, with each faculty rank shown in a different color.

plot(salary_tot ~ log_endowment, data = filter(IPED_2, rank_name == "Assistant"), ylim = c(0, 250000))

points(salary_tot ~ log_endowment, data = filter(IPED_2, rank_name == "Associate"), col = 'red')

points(salary_tot ~ log_endowment, data = filter(IPED_2, rank_name == "Full"), col = 'blue')

$\$

Part 0.3: Fitting a linear model to the data

Let's now fit a linear regression model for total salary as a function of log endowment size, but use a separate y-intercept for each of the 3 faculty ranks (and use the same slope for all ranks). We can then use the summary() function to extract information about the model.

fit_prof_rank_offset <- lm(salary_tot ~ log_endowment + rank_name, data = IPED_2)

summary(fit_prof_rank_offset)

Question:

How much of the total variability does the model explain?

Are there differences between the Full Professors and the other ranks in terms of their intercepts?

Also, how much less is an Assistant Professor making relative to a Full Professor?

$\$

Part 0.4: Visualizing the model fit

Let's recreate the scatter plot we created in Part 0.2 using the same colors. Now, however, let's also add on the regression lines with different y-intercepts that you fit in Part 0.3 (using colors that match the colors of the points).

# recreate the scatter plot of the data with color indicating faculty rank
plot(salary_tot ~ log_endowment, data = filter(IPED_2, rank_name == "Assistant"), ylim = c(0, 250000), 
     ylab = "Salary ($)", xlab = "Log endowment")

points(salary_tot ~ log_endowment, data = filter(IPED_2, rank_name == "Associate"), col = 'red')

points(salary_tot ~ log_endowment, data = filter(IPED_2, rank_name == "Full"), col = 'blue')



# extract the coefficients and plot separate regression lines for each faculty rank
the_coefs <- coef(fit_prof_rank_offset)

abline(the_coefs[1], the_coefs[2], col = "blue")                  # Full Professors (baseline intercept)
abline(the_coefs[1] + the_coefs[3], the_coefs[2], col = "red")    # Associate Professors (baseline + offset)
abline(the_coefs[1] + the_coefs[4], the_coefs[2], col = "black")  # Assistant Professors (baseline + offset)


# add a legend
legend("topleft", c("Full", "Associate", "Assistant"), text.col = c("blue", "red", "black"), bty = "n")

Question: Does using the same slope but different offsets seem adequate for capturing the trends in the data?

$\$

Part 1: Fitting models with different intercepts and slopes

Now let's fit a linear regression model for total salary as a function of log endowment size, but use separate y-intercepts and slopes for each of the 3 faculty ranks.

# fit an interaction model
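A minimal sketch of one way to fit this is given below; crossing log_endowment with rank_name in the formula gives both main effects and their interaction, and the model name fit_prof_rank_interaction is just an illustrative choice:

# fit separate intercepts and slopes by crossing log_endowment with rank_name
fit_prof_rank_interaction <- lm(salary_tot ~ log_endowment * rank_name, data = IPED_2)

# the R^2 value in the summary output gives the proportion of the total sum of squares captured
summary(fit_prof_rank_interaction)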

$\$

Question: How much of the total sum of squares of faculty salary is this model capturing?

$\$

Part 1.2: Visualizing the model

Now let's again recreate the scatter plot you created in Part 0.2 using the same colors, and let's add on the regression lines with different y-intercepts and slopes based on the interaction model you fit above in Part 1 (again using matching colors).

# recreate the scatter plot of the data with color indicating faculty rank
plot(salary_tot ~ log_endowment, data = filter(IPED_2, rank_name == "Assistant"), ylim = c(0, 250000))

points(salary_tot ~ log_endowment, data = filter(IPED_2, rank_name == "Associate"), col = 'red')

points(salary_tot ~ log_endowment, data = filter(IPED_2, rank_name == "Full"), col = 'blue')




# extract the coefficients and plot separate regression lines for each faculty rank
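One way to do this is sketched below, assuming the interaction model above is named fit_prof_rank_interaction and that, as in Part 0.4, Full is the baseline level of rank_name (check the names returned by coef() and adjust if your baseline level differs):

the_coefs_int <- coef(fit_prof_rank_interaction)

# Full Professors: baseline intercept and slope
abline(the_coefs_int["(Intercept)"], the_coefs_int["log_endowment"], col = "blue")

# Associate Professors: add the intercept and slope offsets to the baseline
abline(the_coefs_int["(Intercept)"] + the_coefs_int["rank_nameAssociate"],
       the_coefs_int["log_endowment"] + the_coefs_int["log_endowment:rank_nameAssociate"],
       col = "red")

# Assistant Professors: add the intercept and slope offsets to the baseline
abline(the_coefs_int["(Intercept)"] + the_coefs_int["rank_nameAssistant"],
       the_coefs_int["log_endowment"] + the_coefs_int["log_endowment:rank_nameAssistant"],
       col = "black")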

Question: Does this model seem like a better fit? Do you think there are ways you could further improve on this model?

$\$

Part 1.3: Comparing models

The model you fit in Part 0.3 is nested within the interaction model you fit in Part 1. We can use the anova() function to compare such nested models. Does adding the additional slopes for each rank seem to improve the model fit?
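A minimal sketch, assuming the two fits are named fit_prof_rank_offset (from Part 0.3) and fit_prof_rank_interaction (the illustrative name used above):

# compare the nested models; a small p-value suggests the extra slope terms improve the fit
anova(fit_prof_rank_offset, fit_prof_rank_interaction)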


$\$

Part 2: Log transformation of the response variable y

We have previously discussed log-transforming an explanatory variable x to make the relationship between the explanatory x and response variable y more linear. For example, we modeled faculty salary as a function of the log of a school's endowment.

In other circumstances, it can be useful to take a log transformation of the response variable y to make the relationship more linear. Let's explore this now by trying to predict a school's endowment size based on how much assistant professors are being paid (i.e., let's try swapping the explanatory and response variables from our previous endowment-salary analyses).

$\$

Part 2.1

The code below loads the IPED data and creates a version that only has assistant professor data. Let's plot schools' endowments as a function of assistant professors' salaries, and also plot the log of the schools' endowments as a function of assistant professors' salaries, which, as expected, will be a more linear relationship.

load("IPED_salaries_2016.rda")


# get the assistant professor data
# we will keep all the variables and all schools in the data frame for this analysis
assistant_data <- filter(IPED_salaries,  endowment > 0,  rank_name == "Assistant")  


# plot the data



# plot the data after taking a log transformation of the response variable y, which is a school's endowment
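A sketch of these two plots using base R graphics (the axis labels are just illustrative choices):

# endowment as a function of assistant professor salary
plot(endowment ~ salary_tot, data = assistant_data,
     xlab = "Assistant professor salary ($)", ylab = "Endowment ($)")

# log (base 10) of endowment as a function of assistant professor salary
plot(log10(endowment) ~ salary_tot, data = assistant_data,
     xlab = "Assistant professor salary ($)", ylab = "Log10 endowment")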

$\$

Part 2.2

Let's now fit a linear regression model of log(endowment) as a function of assistant professors' salaries. Rather than doing the transformation beforehand (e.g., via a dplyr mutate() operation), let's do the transformation inside the lm() function. Let's also transform the y variable using the natural log via the log() function rather than the base-10 log via the log10() function.

To understand the model, recreate the scatter plot of the log of the schools' endowments as a function of assistant professor's salary, add the regression line to the plot, and use the summary() function to get some statistics about the linear regression model.

# recreate our linear regression model from class 16



# plot the data with the regression line




# get statistics about the model fit
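A minimal sketch of these steps (the model name lm_log_endowment is just an illustrative choice):

# fit log(endowment) as a function of salary, doing the natural-log transformation inside lm()
lm_log_endowment <- lm(log(endowment) ~ salary_tot, data = assistant_data)

# plot the data with the regression line
plot(log(endowment) ~ salary_tot, data = assistant_data,
     xlab = "Assistant professor salary ($)", ylab = "log(endowment)")
abline(lm_log_endowment, col = "red")

# get statistics about the model fit
summary(lm_log_endowment)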

$\$

Part 2.3

As we see above, assistant professors' salaries are useful for predicting the log of a school's endowment. This can be seen from the fact that the regression coefficient on salary_tot is statistically significantly different from 0 (i.e., it has a small p-value). This suggests that $\beta_1$ is not zero or, stated another way, that faculty salaries are predictive of a school's endowment. However, how do we interpret what the magnitude of the regression coefficient, $\hat{\beta}_1$, means?

To understand what the value of the regression coefficient is telling us, we can transform the equation to be expressed in terms of y rather than log(y) by exponentiating both sides of the regression equation:

Predicted value:

$\log(\hat{y}) = \hat{\beta}_0 + \hat{\beta}_1 \cdot x$

Exponentiating both sides we get:

$\hat{y} = e^{\hat{\beta}_0 + \hat{\beta}_1 \cdot x} = e^{\hat{\beta}_0} \cdot e^{\hat{\beta}_1 \cdot x}$

This equation shows us that:

  1. If x = 0, then $\hat{y} = e^{\hat{\beta}_0}$

  2. For every one unit increase in x, the predicted value increases by a factor of $e^{\hat{\beta}_1}$. If we write $\hat{y} = \hat{f}(x)$, we can state this another way as $\frac{\hat{f}(x + 1)}{\hat{f}(x)} = e^{\hat{\beta}_1}$.

For example, if $e^{\hat{\beta}_1} = 1.1$, then if $\hat{f}(x) = 100$, $\hat{f}(x + 1) = 110$, and if $\hat{f}(x) = 1,000,000$, then $\hat{f}(x + 1) = 1,100,000$.

Let's exponentiate our coefficient values to get the values for when x = 0, and the factor that $\hat{y}$ increases by for every unit increase in x.
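For example, a minimal sketch, assuming the fitted model from Part 2.2 is named lm_log_endowment:

# exponentiate the coefficients: the exponentiated intercept is the predicted endowment when x = 0,
# and the exponentiated slope is the factor the predicted endowment increases by per $1 increase in salary
exp(coef(lm_log_endowment))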


A reason that the natural log is frequently used instead of, say, $\log_{10}$ is that for small values of $\hat{\beta}$, $e^{\hat{\beta}} \approx 1 + \hat{\beta}$. This allows one to read the (approximate) multiplicative effect of a one unit increase in x directly from the regression coefficient without needing to exponentiate it. Since the exp() function in R is easy to use now that we have modern computers, this is less of an advantage these days.

$\$

Part 3: Multicollinearity

When building multiple regression models, if certain predictors are (approximately) linear combinations of other predictors (i.e., roughly speaking, the predictors are strongly correlated), then issues can arise that affect our ability to make inferences about the regression coefficients for these predictors. Let's explore this now...

$\$

Part 3.1

Previously we tried to predict the log of a school's endowment based on the total salary of assistant professors. The IPED data set also has faculty salaries reported separately for men and women. Let's see how accurately we can predict the log of a school's endowment using men's vs. women's salaries.

Using the $R^2$ statistic, do men's or women's salaries better predict a school's endowment?

# fit a model using only men's salary



# fit a model using only women's salary
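A sketch of the two models is below. The column names salary_men and salary_women are assumptions about how the men's and women's salary variables are named in IPED_salaries; check names(assistant_data) and adjust accordingly:

# fit a model using only men's salary (salary_men is an assumed column name)
lm_men <- lm(log(endowment) ~ salary_men, data = assistant_data)
summary(lm_men)$r.squared

# fit a model using only women's salary (salary_women is an assumed column name)
lm_women <- lm(log(endowment) ~ salary_women, data = assistant_data)
summary(lm_women)$r.squared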

$\$

Part 3.2

Perhaps we can get even better predictions if we build a multiple regression model that includes both men's and women's salaries. To make the model possibly even better, let's also include total salary!

Try this analysis and look at:

  1. What is the $R^2$ value?

  2. Are the regression coefficients statistically significant?

Does anything seem odd?
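A sketch of this model, again assuming the salary_men and salary_women column names from the previous part (the model name lm_all_salaries is just an illustrative choice):

# multiple regression with men's, women's, and total salary as predictors
lm_all_salaries <- lm(log(endowment) ~ salary_men + salary_women + salary_tot, data = assistant_data)

# check the R^2 value and the coefficient p-values
summary(lm_all_salaries)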


$\$

Part 3.3: Multicollinearity

Multicollinearity is a term used to describe when predictors in a linear regression model are (approximately) linear combinations of other predictors (i.e., roughly speaking, the predictors are strongly correlated with each other). When multicollinearity occurs in a regression model, our estimates of the regression coefficients (the $\hat{\beta}_i$'s) become unstable, which leads to large standard errors, and hence large p-values that are not statistically significant.

To understand why this is the case, imagine that we try to predict people's heights based on their left and right shoe sizes, i.e., $\hat{y} = \hat{\beta}_0 + \hat{\beta}_L x_L + \hat{\beta}_R x_R$. Clearly there is a strong relationship between shoe size and height, but the multiple regression model could do about equally well by putting all the coefficient weight on the left foot (i.e., $\hat{\beta}_R = 0$), all the coefficient weight on the right foot (i.e., $\hat{\beta}_L = 0$), or equal weight on both feet (i.e., $\hat{\beta}_L = \hat{\beta}_R$). These multiple possible solutions lead to instability in estimating the regression coefficients: slightly different data sets could produce very different values for $\hat{\beta}_L$ and $\hat{\beta}_R$. As a result, the standard errors of these regression coefficients are large, and both coefficients might not be statistically significant even if the model makes very accurate predictions.

$\$

Part 3.4: Detecting multicollinearity

A statistic that can be used to detect multicollinearity is the variance inflation factor (VIF). The VIF for a given explanatory variable $x_j$ is defined as $VIF_j = \frac{1}{1 - R_j^2}$, where $R_j^2$ is the $R^2$ value one would get from a regression model that predicts the explanatory variable $x_j$ using all the other explanatory variables in the model. A rule of thumb is that a VIF value greater than 5 indicates that a particular explanatory variable has a high degree of multicollinearity with the other explanatory variables in the model.

The car package has a function called vif() that can be used to calculate the VIF for each explanatory variable in a multiple regression model. Let's try it on our model where we predicted log(endowment) as a function of men's, women's, and total salaries.

# install.packages("car")
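A sketch, assuming the model from Part 3.2 is named lm_all_salaries:

# load the car package and compute the variance inflation factor for each predictor
library(car)
vif(lm_all_salaries)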

Do any of the VIF values indicate a high degree of multicollinearity?

$\$

Part 3.5: What to do about multicollinearity?

For making accurate predictions, it does not matter if our model has multicollinearity. If we want a simpler model, and two explanatory variables are equally good and highly correlated, we can just choose one of them. However, if multiple explanatory variables are correlated, we are not going to be able to say which of these variables is "causing" our response variable to change - e.g., was it the large left foot or the large right foot that caused people to be taller? As we've already discussed in this class, it is in general hard/impossible to make causal claims from observational data, and multicollinearity - as well as variables we might have inadvertently omitted from our model that are correlated with the variables in our model - are reasons why.

$\$

Part 4: Polynomial regression

We can also fit models that apply non-linear transformations to our explanatory variables. This allows us to fit a much broader range of models to our data. In particular, we can fit models that are polynomial functions of our original explanatory variables:

$y = \hat{\beta}_0 + \hat{\beta}_1 \cdot x + \hat{\beta}_2 \cdot x^2 + ... + \hat{\beta}_k \cdot x^k$

$\$

Part 4.1: Non-linear transformations of the explanatory variable x

Let's explore fitting a model that predicts salary from log endowment, log endowment squared and log endowment cubed using our salary data from assistant professors:

$\text{salary} = \hat{\beta}_0 + \hat{\beta}_1 \cdot \text{log(endowment)} + \hat{\beta}_2 \cdot \text{log(endowment)}^2 + \hat{\beta}_3 \cdot \text{log(endowment)}^3$

To do this we can use the poly() function, which creates polynomial expansions of our explanatory variables that are "orthogonal", meaning they will not exhibit multicollinearity. If we had instead created the polynomial terms ourselves, say via a dplyr mutate() operation, the predictor values could be highly correlated and thus lead to multicollinearity (e.g., if x = 1:10, then x and x^2 will be highly correlated).

Let's fit a degree 3 polynomial linear regression model to our data.

# refit our original degree 1 model



# fit a cubic model




# get statistics on our cubic fit




# compare the r^2
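A sketch of these steps, assuming we use the assistant_data from Part 2.1 and add a log_endowment column (the model names lm_deg1 and lm_deg3 are illustrative choices):

# add a log (base 10) endowment variable
assistant_data <- assistant_data |>
  mutate(log_endowment = log10(endowment))

# refit our original degree 1 model
lm_deg1 <- lm(salary_tot ~ log_endowment, data = assistant_data)

# fit a cubic model using orthogonal polynomial terms
lm_deg3 <- lm(salary_tot ~ poly(log_endowment, 3), data = assistant_data)

# get statistics on our cubic fit (including whether the higher order terms are significant)
summary(lm_deg3)

# compare the R^2 values of the two models
summary(lm_deg1)$r.squared
summary(lm_deg3)$r.squared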

Question: Are the higher order terms statistically significant?

$\$

Part 4.2: Let's visualize our model

# create a data frame of endowment values at which to generate predictions
predict_df <- data.frame(endowment = 10^(seq(0, 11, by = .1)))
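One way to overlay the fitted curves, assuming the models lm_deg1 and lm_deg3 from the sketch in Part 4.1 (predict() needs the same log_endowment column that was used when fitting):

# add the log endowment values at which to make predictions
predict_df <- predict_df |>
  mutate(log_endowment = log10(endowment))

# recreate the scatter plot of the assistant professor data
plot(salary_tot ~ log_endowment, data = assistant_data,
     xlab = "Log endowment", ylab = "Salary ($)")

# add the linear and cubic fitted curves
lines(predict_df$log_endowment, predict(lm_deg1, newdata = predict_df), col = "red")
lines(predict_df$log_endowment, predict(lm_deg3, newdata = predict_df), col = "blue")

# add a legend
legend("topleft", c("linear fit", "cubic fit"), col = c("red", "blue"), lty = 1, bty = "n")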

Question: Does this model visually appear to fit the data better?

$\$

Part 4.3: Using an ANOVA to compare the nested models

Because our linear model is nested within our cubic model, we can use an ANOVA to evaluate whether our cubic model accounts for more of the variability than our linear model.
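A minimal sketch, again assuming the model names lm_deg1 and lm_deg3 from Part 4.1:

# compare the nested linear and cubic models; a small p-value suggests the cubic terms help
anova(lm_deg1, lm_deg3)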



