# In emeyers/SDS230: Tools for the class Data Exploration and Analysis


# makes sure you have all the packages we have used in class installed
#SDS230::update_installed_packages()


# install.packages("latex2exp")

library(latex2exp)
library(dplyr)
library(ggplot2)
library(plotly)

#options(scipen=999)

knitr::opts_chunk$set(echo = TRUE)

set.seed(123)

## Overview

• Assessing unusual points
• Analysis of variance for regression
• Multiple linear regression

## Part 1: Assessing unusual points

Let's continue to use the faculty salary data to explore how unusual points can affect our regression model. As we discussed, points can be unusual by being:

1. High leverage: unusual explanatory variable values (i.e., points with unusual x-values)
2. Outliers: unusual response variable values (i.e., points with unusual y-values)
3. Influential points: high leverage and outlier points (i.e., points with unusual x and y values)

There are different statistics, and consequently R functions, to examine if data points are unusual along these dimensions.

Part 1.0: Recreating the assistant faculty salary data

Let's start by recreating the assistant professor salary data from the larger IPED data frame. We can then re-fit a linear regression model to the data.

# load the data into R
load("IPED_salaries_2016.rda")

# get the assistant professor data
assistant_data <- filter(IPED_salaries, endowment > 0, rank_name == "Assistant") |>
  select(school, salary_tot, endowment, enroll_total) |>
  na.omit() |>
  mutate(log_endowment = log10(endowment))

# recreate our linear regression model from class 16
lm_fit <- lm(salary_tot ~ log_endowment, data = assistant_data)

Part 1.1: Let's explore high leverage points using the hatvalues() function, which takes a linear model as its input argument. Are any hat values greater than 4/n (high leverage) or 6/n (very high leverage)?

# plot a histogram of the hat values

# mutate the hat-values onto the original data and plot them in color

Part 1.2: Let's explore the standardized and studentized residuals using the rstandard() and rstudent() functions. Are any of the standardized or studentized residuals greater than 3?
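As a minimal sketch of the checks in Parts 1.1 and 1.2, the code below computes hat values and standardized/studentized residuals using base R only. It runs on simulated data: `sim_data` and `sim_fit` are hypothetical stand-ins for `assistant_data` and `lm_fit`, so whichever points get flagged here say nothing about the real IPED salaries.

```r
# Simulated stand-in for assistant_data (hypothetical values, NOT the IPED data)
set.seed(123)
n <- 200
sim_data <- data.frame(log_endowment = rnorm(n, mean = 8, sd = 0.8))
sim_data$salary_tot <- 30000 + 5000 * sim_data$log_endowment + rnorm(n, sd = 8000)

sim_fit <- lm(salary_tot ~ log_endowment, data = sim_data)

# leverage: hat values compared with the 4/n and 6/n rules of thumb
sim_data$hat_val <- hatvalues(sim_fit)
high_leverage <- which(sim_data$hat_val > 4 / n)
very_high_leverage <- which(sim_data$hat_val > 6 / n)

# outliers: standardized and studentized residuals
sim_data$std_resid <- rstandard(sim_fit)
sim_data$stud_resid <- rstudent(sim_fit)
large_resid <- which(abs(sim_data$stud_resid) > 3)

# histogram of the hat values with the two cutoffs marked
hist(sim_data$hat_val, breaks = 30, main = "Hat values", xlab = "leverage")
abline(v = c(4, 6) / n, col = c("orange", "red"))
```

With the real data, the same calls would be applied to `lm_fit`, and the flagged row indices could be used to look up the corresponding schools.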
# plot a histogram of the standardized residuals

# plot a histogram of the studentized residuals

# mutate on the absolute value of studentized residuals and plot them in color

# we can explore this in plotly too!

Part 1.3: Let's also examine Cook's distance to see which points are influential.

# use the base R regression diagnostic plots to show Cook's distance

# let's examine the points with the high Cook's distance

# mutate on the Cook's distance and plot in color

## Part 2: Analysis of variance (ANOVA) for regression

The code below will create an ANOVA table for a regression model predicting salary as a function of the log of a school's endowment. We will use the assistant professor data from the IPED data.

anova(lm_fit)

## Part 2: Multiple linear regression on the faculty salary data

Let's now use the faculty salary data to explore multiple linear regression, building a model to predict faculty salary from the endowment size of a school and the number of students enrolled.

#### Part 2.1: Exploring the enrollment data

In our previous analyses we have built models predicting faculty salaries based on the size of the school's endowment. Before we start on multiple regression, let's look at how faculty salaries are affected by another variable, namely, the number of students enrolled at a school.

Before starting an analysis it is often worth thinking about our expectations. Here we might expect that schools with higher enrollment can pay their faculty more, since they have a larger revenue stream from more students paying tuition. Thus if we model $\text{salary} = \hat{\beta}_0 + \hat{\beta}_1 \cdot \text{enrollment}$ we might expect $\hat{\beta}_1$ to be positive.

Let's start our exploratory analysis by plotting the relationship between faculty salary and the number of students enrolled.
If the relationship does not appear linear, we can transform the variables (as is often done in the "choose" step of model building).

# plot the relationship between salary and enrollment

From looking at this plot we see that there is a large range of enrollment values, with a few very large numbers. Thus it might be better to log transform the x-values (often when values can only be positive, taking a log transformation leads to a more linear relationship).

Let's mutate a variable log_enroll onto our assistant data frame and then plot the relationship between salary and log_enroll. We will use log10 here to be consistent with our transformation of endowment, and also since it is easier for us to think in terms of "order of magnitude" for this problem.

# the relationship does not appear linear, let's mutate on log enrollment

# plot the relationship between salary and log enrollment

Question: Does the relationship appear more linear now?

#### Part 2.2: Fitting a simple linear regression model for predicting salary as a function of log enrollment

Let's now fit a simple linear regression model $\text{salary} = \hat{\beta}_0 + \hat{\beta}_1 \cdot \text{log(enrollment)}$. Let's also plot the model and look at some inferential statistics by using the summary() function on our model.

# fit a linear regression model of salary as a function of log enrollment and plot it

# look at some inferential statistics using the summary() function

#### Part 2.3: Comparing the simple linear regression enrollment and endowment models

Let's compare the models predicting salary from enrollment vs. endowment to see which one explains more of the variability in salaries, as measured by the $r^2$ statistic. Let's also create scatter plots of the relationships between these three variables using the pairs() function.
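The comparison described above can be sketched as follows. This is only an illustration on simulated data: `sim_data` and the coefficients used to generate it are hypothetical, not the IPED data, so the $r^2$ values it prints are not the answer to the exercise.

```r
# Simulated stand-in for the assistant professor data (hypothetical values)
set.seed(123)
n <- 200
log_enroll <- rnorm(n, mean = 3.5, sd = 0.5)
log_endowment <- 5 + 0.8 * log_enroll + rnorm(n, sd = 0.6)
salary_tot <- 20000 + 6000 * log_endowment + 2000 * log_enroll + rnorm(n, sd = 7000)
sim_data <- data.frame(salary_tot, log_endowment, log_enroll)

# one simple linear regression per explanatory variable
fit_endow <- lm(salary_tot ~ log_endowment, data = sim_data)
fit_enroll <- lm(salary_tot ~ log_enroll, data = sim_data)

# compare the proportion of variability each model accounts for
r2_endow <- summary(fit_endow)$r.squared
r2_enroll <- summary(fit_enroll)$r.squared
c(endowment = r2_endow, enrollment = r2_enroll)

# scatter plots of every pair of variables
pairs(sim_data)
```

For a simple linear regression, $r^2$ is the squared correlation between the two variables, so the same comparison could be read off `cor(sim_data)^2`.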
# compare r^2 and look at all scatter plots

#### Part 2.4: Multiple regression

Let's now fit a multiple regression model for predicting salary using both endowment and enrollment as explanatory variables, to see if using both variables allows us to better predict salary than either variable alone.

In particular, we are fitting the model $\text{salary} = \hat{\beta}_0 + \hat{\beta}_1 \cdot \text{log(endowment)} + \hat{\beta}_2 \cdot \text{log(enrollment)}$.

Question: Does this model account for more of the variability than the simple regression models we fit?

#### Part 2.5: Test for comparing nested models

When we have nested models, we can use an ANOVA test based on the F-statistic to assess if adding additional explanatory variables leads to a model that can account for more of the variability in a response variable.

A Model 1 is nested in a Model 2 if the parameters in Model 1 are a subset of the parameters in Model 2. Here, our model using only the endowment as the explanatory variable is nested within the model that uses endowment and enrollment as explanatory variables.

Let's use the anova() function to test if adding the enrollment explanatory variable leads to a statistically significant increase in the amount of variability that can be accounted for.

### Part 3a: Relating simple and multiple regression coefficients

When running a multiple regression model $y = \hat{\beta}_{0(2)} + \hat{\beta}_{1(2)} x_1 + \hat{\beta}_{2(2)} x_2$, we can view the regression coefficient $\hat{\beta}_{1(2)}$ as how much $y$ changes for a unit change in $x_1$ when holding $x_2$ at a fixed value.

When running a simple linear regression model, $y = \hat{\beta}_{0(1)} + \hat{\beta}_{1(1)} x_1$, we can view the coefficient $\hat{\beta}_{1(1)}$ as how much $y$ changes for a unit change in $x_1$ without controlling for changes in $x_2$'s value.

We can see this leads to different regression coefficient estimates for $\hat{\beta}_{1(1)}$ and $\hat{\beta}_{1(2)}$.
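A minimal sketch of the nested-model test from Part 2.5 and the coefficient comparison from Part 3a is given below. It again uses simulated data (`sim_data` is a hypothetical stand-in, not the IPED data), with $x_1$ playing the role of log(endowment) and $x_2$ the role of log(enrollment).

```r
# Simulated stand-in for the assistant professor data (hypothetical values)
set.seed(123)
n <- 200
log_enroll <- rnorm(n, mean = 3.5, sd = 0.5)
log_endowment <- 5 + 0.8 * log_enroll + rnorm(n, sd = 0.6)
salary_tot <- 20000 + 6000 * log_endowment + 2000 * log_enroll + rnorm(n, sd = 7000)
sim_data <- data.frame(salary_tot, log_endowment, log_enroll)

# the simple model is nested within the multiple regression model
fit_simple <- lm(salary_tot ~ log_endowment, data = sim_data)
fit_multi <- lm(salary_tot ~ log_endowment + log_enroll, data = sim_data)

# the larger model can never account for less variability
r2_simple <- summary(fit_simple)$r.squared
r2_multi <- summary(fit_multi)$r.squared

# F-test comparing the nested models; row 2 holds the test of the added term
nested_test <- anova(fit_simple, fit_multi)
p_value <- nested_test[["Pr(>F)"]][2]

# coefficient on x_1 without and with x_2 in the model
beta_11 <- coef(fit_simple)[["log_endowment"]]
beta_12 <- coef(fit_multi)[["log_endowment"]]
c(simple = beta_11, multiple = beta_12)
```

Because the simulated explanatory variables are correlated, the two coefficients on log(endowment) differ, which is the situation Part 3a describes.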
# compare the beta-hat coefficient for log(endowment) in simple linear regression to when log(enroll) is in the model

# directly compare the coefficients on x_1, i.e., the coefficient on log(endowment)

# get the regression coefficient on log(enroll) for our multiple regression model

### Part 3b: Relating simple and multiple regression coefficients

Q: How are the coefficients $\hat{\beta}_{1(1)}$ from simple regression and $\hat{\beta}_{1(2)}$ from multiple regression related?

If $x_1$ and $x_2$ are correlated, then a change in $x_1$ will be associated with a change in $x_2$. Thus, in the simple linear regression model $y = \hat{\beta}_{0(1)} + \hat{\beta}_{1(1)} x_1$, the change seen in $y$ is due to the change in $x_1$, plus how much $x_1$ is associated with a change in $x_2$ times how much $x_2$ is associated with a change in $y$ (which is given by $\hat{\beta}_{2(2)}$).

We can measure the change in $x_2$ that is associated with a change in $x_1$ using a regression equation: $x_2 = \delta_{0} + \delta_{1} x_1$.

This allows us to relate our simple and multiple regression coefficients in terms of how much $y$ changes with a change in $x_1$ as:

$\hat{\beta}_{1(1)} x_1 = \hat{\beta}_{1(2)} x_1 + \hat{\beta}_{2(2)} \delta_{1} x_1$

Dividing both sides by $x_1$ gives $\hat{\beta}_{1(1)} = \hat{\beta}_{1(2)} + \hat{\beta}_{2(2)} \delta_{1}$.

Let's examine this in R:

# predict log(enroll) as a function of log(endow) to get the delta_1 coefficients

# get the regression coefficient delta_1 from predicting x_2 from x_1

# reconstruct the simple regression coefficient beta_11 from beta_12, delta_1 and beta_22
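The three steps above can be sketched as follows, assuming simulated data in place of the IPED data (`sim_data` and every fitted object below are hypothetical names used for illustration). For least squares fits the relationship holds exactly, up to floating point error, so the reconstructed value should match the simple regression coefficient.

```r
# Simulated stand-in for the assistant professor data (hypothetical values)
set.seed(123)
n <- 200
log_enroll <- rnorm(n, mean = 3.5, sd = 0.5)
log_endowment <- 5 + 0.8 * log_enroll + rnorm(n, sd = 0.6)
salary_tot <- 20000 + 6000 * log_endowment + 2000 * log_enroll + rnorm(n, sd = 7000)
sim_data <- data.frame(salary_tot, log_endowment, log_enroll)

fit_simple <- lm(salary_tot ~ log_endowment, data = sim_data)
fit_multi <- lm(salary_tot ~ log_endowment + log_enroll, data = sim_data)

# predict x_2 = log(enroll) from x_1 = log(endowment) to get delta_1
fit_delta <- lm(log_enroll ~ log_endowment, data = sim_data)
delta_1 <- coef(fit_delta)[["log_endowment"]]

# coefficients from the simple and multiple regression models
beta_11 <- coef(fit_simple)[["log_endowment"]]
beta_12 <- coef(fit_multi)[["log_endowment"]]
beta_22 <- coef(fit_multi)[["log_enroll"]]

# reconstruct beta_11 from beta_12, beta_22 and delta_1
beta_11_reconstructed <- beta_12 + beta_22 * delta_1
all.equal(beta_11, beta_11_reconstructed)
```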


emeyers/SDS230 documentation built on Jan. 13, 2023, 5:16 a.m.