In emeyers/SDS230: Tools for the class Data Exploration and Analysis

$\$

# makes sure you have all the packages we have used in class installed
#SDS230::update_installed_packages()


# install.packages("latex2exp")

library(latex2exp)
library(dplyr)
library(ggplot2)
library(plotly)

#options(scipen=999)

knitr::opts_chunk$set(echo = TRUE) set.seed(123) $\$Overview • Multiple linear regression • Multiple linear regression with categorical predictors$\$Part 1: Multiple linear regression on the faculty salary data Let's now use the faculty salary data to explore multiple linear regression for building a model to predict faculty salary from the endowment size of a school and the number of students enrolled.$\$Part 1.0 Recreating the assistant faculty salary data To start let's recreate the Assistant professor salary data from the larger IPED data frame. We will create new variables (columns) in our data frame that have both the log of the endowment and the log of the enrollment. # load the data into R load("IPED_salaries_2016.rda") # get the assistant professor data assistant_data <- filter(IPED_salaries, endowment > 0, rank_name == "Assistant") |> select(school, salary_tot, endowment, enroll_total) |> na.omit() |> mutate(log_endowment = log10(endowment)) |> mutate(log_enroll = log10(enroll_total)) $\$Part 1.1: Exploring the enrollment data In our previous analyses we have built models predicting faculty salaries based on the size of the school's endowment. Before we start on multiple regression, let's look at have faculty salaries are affected by the another variable, namely, number of students enrolled at a school. Before starting an analysis it is often worth thinking about our expectations. Here we might expect that schools that have higher enrollment numbers might be able to pay their faculty more since they have a higher revenue stream from the larger number of students paying tuition. Thus if we model$\text{salary} = \hat{\beta}_0 + \hat{\beta}_1 \cdot \text{enrollment}$we might expect$\hat{\beta}_1$to be positive. Let's start our exploratory analysis by plotting the relationship between faculty salary and the number of students enrolled. If the relationship does not appear linear, we can transform the variables (as is often done in the "choose" step of model building). # plot the relationship between salary and enrollment $\$From looking at this plot we see that the there is a large range of enrollment values with a few large numbers. Thus it might be better to log transform the x-values (often when values can only be positive, taking a log transformation leads to more linear relationship). Let's mutate on a variable log_enroll to our assistant data frame and then plot the relationship between salary and log_enroll. We will use log10 here to be consistent with our transformation of endowment and also since it is easier for us to think in terms of "order of magnitude" for this problem. # the relationship does not appear linear, let's mutate on log enrollment # plot the relationship between salary and log enrollment  Question: Does the relationship appear more linear now?$\$Part 1.2: Fittting a simple linear regression model for predicting salary as a function of log enrollment Let's now fit a simple linear regression model$\text{salary} = \hat{\beta}_0 + \hat{\beta}_1 \cdot \text{log(enrollment)}$. Let's also plot the model and look at some inferential statistics by using the summary() function on our model. # fit a linear regression model of salary as a function of log enrollment and plot it # look at some inferential statistics using the summary() function $\$Part 1.3: Comparing the simple linear regression enrollment and endowment models Let's compare these models predicting salary from enrollment vs. endowment in terms of which model can explain most of the variability in salaries in terms of the$r^2$statistic. Let's also create a scatter plot of the relationships between these three variables using the pairs() function. # compare r^2 values between the endowment and enrollment models # create a scatter plot of the relationship between variables using the pairs() function # we can also use the ggpairs function in the GGally package to create a better looking pairs plot # install.packages("GGally") #GGally::ggpairs(select(assistant_data, salary_tot, log_endowment, log_enroll)) $\$Part 1.4: Multiple regression Let's now fit a multiple regression model for predicting salary using both endowment and enrollment as explanatory variables to see if using both these variables allows us to better predict salary than either variable along. In particular, we are fitting the model$\text{salary} = \hat{\beta}_0 + \hat{\beta}_1 \cdot \text{log(endowment)} + \hat{\beta}_1 \cdot \text{log(enrollment)}$.   Question: Does this model account for more of the variability than the simple regression models we fit?$\$Part 1.5: Test for comparing nested models When we have nested models, we can use an ANOVA test based on the F-statistic to assess if adding additional explanatory variables leads to a model that can account for more of the variability in a response variable. A Model 1 is nested in a Model 2 if the parameters in Model 1 are a subset of the parameters in Model 2. Here, our model using only the endowment as the explanatory variable is nested within in the model that uses endowment and enrollment as explanatory variables. Let's uses the anova() function to test if adding the enrollment explanatory leads to a statistically significant increase in the amount of variability that can be accounted for.  $\$Part 3: Multiple linear regression with categorical explanatory variables Let's now examine how much faculty salaries increase as a function of log endowment size taking into account the rank that different professors have. Part 3.1: Creating a new data set Let's start by creating a data set called IPED_2 which is modified IPED_2 in the following way: 1. Only include data from institutions with a CARNEGIE classification of 15 or 31 (these correspond to R1 institutions and liberal arts colleges). 2. Only use the faculty ranks of Assistant, Associate, and Full professors IPED_2 <- IPED_salaries |> filter(endowment > 0) |> mutate(log_endowment = log10(endowment)) |> filter(CARNEGIE %in% c(15, 31)) |> filter(rank_name %in% c("Assistant", "Associate", "Full")) dim(IPED_2) $\$Part 3.2: Visualizing the data Let's visualize the data by creating a scatter plot showing the total salary that faculty get paid (salary_tot) as a function of the log endowment size, where each faculty rank is in a different color using base R graphics.  $\$Part 3.3: Fitting a linear model to the data Let's now fit a linear regression model for total salary as a function of log endowment size, but use a separate y-intercept for each of the 3 faculty ranks (and use the same slope for all ranks). We can then use the summary() function to extract information about the model.   Question: How much of the total variability does the model explain? Are there differences between the Full Professors and the other ranks in terms of their intercepts? Also, how much less is an Assistant Professor making relative to a Full Professor?$\$Part 3.4: Visualizing the model fit Let's recreate the scatter plot we created in part 3.2 using the same colors. Now, however, let's also add on the regression lines with different y-intercepts that you fit in part 3.3 (using the appropriate colors to match the colors of the points). # recreate the scatter plot of the data with color indicating faculty rank # extract the coefficients and plot separate regression lines for each faculty rank # add a legend  Question Does using the same slope but different offsets seem to be adequate for capturing the trends in the data?$\$Part 3.5: Fitting models with different intercepts and slopes Now let's fit a linear regression model for total salary as a function of log endowment size, but use separate y-intercepts and slopes for each of the 3 faculty ranks.  $\$Question: How much of the total sum of squares of faculty salary is this model capturing?$\$Part 3.6: Visualizing the model Now let's again recreate the scatter plot you created in part 3.2 using the same colors and let's add on the regression line with different y-intercepts and slopes based on the model you fit in part 3.5 (again use the appropriate colors). # recreate the scatter plot of the data with color indicating faculty rank # extract the coefficients and plot separate regression lines for each faculty rank  Question Does this model seem like a better fit and do you think there are ways you could further improve on this model?$\$Part 3.7 Comparing models The model you fit in Part 3.3 is nested within the model you fit in Part 3.5. We can use the anova() function to compare such nested models. Does adding the additional slopes for each rank seem to improve the model fit?  $\$Bonus: Relating simple and multiple regression coefficients$\$Part b.1: Comparing simple and multiple regression coefficients When running a multiple regression model$y = \hat{\beta}{0(2)} + \hat{\beta}{1(2)} x_1 + \hat{\beta}{2(2)} x_2$, we can view the regression coefficient$\hat{\beta}{1(2)}$as how much$y$changes for a unit change in$x_1$when holding$x_2$at a fixed value. When running a simple linear regression model,$y = \hat{\beta}{0(1)} + \hat{\beta}{1(1)}x_1$, we can view the coefficient$\hat{\beta}_{1(1)}$as how much$y$changes for a unit change in$x_1$without controlling for changes in$x_2$'s value. We can see this leads to different regression coefficients estimates for$\hat{\beta}{1(1)}$and$\hat{\beta}{1(2)}$. # compare the beta-hat coefficient for log(endowment) in simple linear regression to when log(enroll) is in the model # directly compare the coefficients on x_1 i.e., the coefficient on log(endowment) # get the regression coefficient on log(enroll) for our multiple regression model $\$Part b.2: Relating simple and multiple regression coefficients Q: How are the coefficients$\hat{\beta}{1(1)}$from simple regression and$\hat{\beta}{1(2)}$from multiple regression related? If$x_1$and$x_2$are correlated, then a change in$x_1$will be associated with a change in$x_2$. Thus, in the simple linear regression model$y = \hat{\beta}{0(1)} + \hat{\beta}{1(1)}x_1$, the change seen in$y$is due to the change in$x_1$plus how much$x_1$is associated with a change in$x_2$times how much$x_2$is associated with a change in$y$(which is given by$\hat{\beta}_{2(2)}$). We can measure the change in$x_2$that is associated with a change in$x_1$using a regression equation:$x_2 = \delta_{0} + \delta_{1}x_1$. This allows us to related our simple and multiple regression coefficients in terms of how much$y$changes with a change in$x_1$as:$\hat{\beta}{1(1)} x_1 = \hat{\beta}{1(2)} x_1 + \hat{\beta}{2(2)} \delta{1} x_1\$,

Let's examine this in R:

# predict log(enroll) as a function of log(endow) to get the delta_1 coefficients

# get the regression coefficient delta_1 from predicting x_2 from x_1

# reconstruct the simple regression coefficient beta_11 from beta12, delta_1 and beta_22


emeyers/SDS230 documentation built on Jan. 18, 2024, 1:01 a.m.