# In emeyers/SDS230: Tools for the class Data Exploration and Analysis

$\$

# makes sure you have all the packages we have used in class installed
#SDS230::update_installed_packages()


# install.packages("latex2exp")

library(latex2exp)
library(dplyr)
library(ggplot2)
library(plotly)

#options(scipen=999)

knitr::opts_chunk$set(echo = TRUE) set.seed(123) $\$## Overview • Continuation of confidence intervals for regression • Assessing unusual points • Analysis of variance for regression • Multiple linear regression$\$## Part 0: Calculating confidence intervals for regression Since we did not get through all the code calculating confidence intervals for regression last class, let's go through it now. Part 0.0 Recreating the assistant faculty salary data Let's start by recreating the Assistant professor salary data from the larger IPED data frame. We can then re-fit a linear regression model to the data. # load the data into R load("IPED_salaries_2016.rda") # get the assistant professor data assistant_data <- filter(IPED_salaries, endowment > 0, rank_name == "Assistant") |> select(school, salary_tot, endowment, enroll_total) |> na.omit() |> mutate(log_endowment = log10(endowment)) |> arrange(log_endowment) # recreate our linear regression model from class 16 lm_fit <- lm(salary_tot ~ log_endowment, data = assistant_data)  ### Part 0.1: t-distribution based confidence intervals on the regression intercept and slope We can create confidence intervals for the regression coefficients$\beta_0$and$\beta_1$using a t-distribution via the confint() function.   Interpretation?$\$### Part 0.2: Plotting confidence intervals and prediction intervals for the regression line We can also use parametric methods to plot confidence and prediction intervals for the regression line. # confidence interval for the regression line mu_y # prediction interval for the regression line # # plot both confidence interval and the prediction interval # let's start by plotting the data # plot confidence interval # plot prediction interval $\$## Part 1: Assessing unusual points Let's continue to use the faculty salary data to explore how unusual points can affect our regression model. As we discussed, points can be usual by being: 1. High leverage: unusual explanatory variable values (i.e. points with unusual x-values) 2. Outliers: unusual response variable values (i.e., points with unusual y-values) 3. Influential points: High leverage and outlier points (i.e., points with unusual x and y values) There are different statistics, and consequently R functions, to examine if data points are unusual along these dimensions.$\$Part 1.0 Recreating the assistant faculty salary data Let's start by recreating the Assistant professor salary data from the larger IPED data frame. We can then re-fit a linear regression model to the data. # load the data into R load("IPED_salaries_2016.rda") # get the assistant professor data assistant_data <- filter(IPED_salaries, endowment > 0, rank_name == "Assistant") |> select(school, salary_tot, endowment, enroll_total) |> na.omit() |> mutate(log_endowment = log10(endowment)) # recreate our linear regression model from class 16 lm_fit <- lm(salary_tot ~ log_endowment, data = assistant_data) $\$Part 1.1: Let's explore high leverage points using the hatvalues() function which takes a linear model as the input argument. Are any points greater than 4/n (high leverage), or 6/n (very high leverage)? # plot a histogram of the hat values # mutate the hat-values onto the original data and plot them in color $\$Part 1.2: Let's explore the standardized and studentized residuals using the rstandard() and rstudent() functions. Are any of the standardized or studentized residuals greater than 3? # plot a histogram of the standarized residuals # plot a histogram of the studentized residuals # mutate on the absolute value of studentized residuals and plot them in color # we can explore this in plotly too! $\$Part 1.3: Let's also examine Cook's distance to see which points are influential. # use the base R regression diagostic plots to show Cooke's distance # let's examine the points with the high Cook's distance # mutate on the Cook's distance and plot in color $\$## Part 2: Analysis of variance (ANOVA) for regression The code below will create an ANOVA table for regression for a model predicting salary as a function of the log of a school's endowment. We will use data from assistant professors from the IPED data. anova(lm_fit) $\$## Part 2: Multiple linear regression on the faculty salary data Let's now use the faculty salary data to explore multiple linear regression for building a model to predict faculty salary from the endowment size of a school and the number of students enrolled. #### Part 2.1: Exploring the enrollment data In our previous analyses we have built models predicting faculty salaries based on the size of the school's endowment. Before we start on multiple regression, let's look at have faculty salaries are affected by the another variable, namely, number of students enrolled at a school. Before starting an analysis it is often worth thinking about our expectations. Here we might expect that schools that have higher enrollment numbers might be able to pay their faculty more since they have a higher revenue stream from the larger number of students paying tuition. Thus if we model$\text{salary} = \hat{\beta}_0 + \hat{\beta}_1 \cdot \text{enrollment}$we might expect$\hat{\beta}_1$to be positive. Let's start our exploratory analysis by plotting the relationship between faculty salary and the number of students enrolled. If the relationship does not appear linear, we can transform the variables (as is often done in the "choose" step of model building). # plot the relationship between salary and enrollment $\$From looking at this plot we see that the there is a large range of enrollment values with a few large numbers. Thus it might be better to log transform the x-values (often when values can only be positive, taking a log transformation leads to more linear relationship). Let's mutate on a variable log_enroll to our assistant data frame and then plot the relationship between salary and log_enroll. We will use log10 here to be consistent with our transformation of endowment and also since it is easier for us to think in terms of "order of magnitude" for this problem. # the relationship does not appear linear, let's mutate on log enrollment # plot the relationship between salary and log enrollment  Question: Does the relationship appear more linear now?$\$#### Part 2.2: Fittting a simple linear regression model for predicting salary as a function of log enrollment Let's now fit a simple linear regression model$\text{salary} = \hat{\beta}_0 + \hat{\beta}_1 \cdot \text{log(enrollment)}$. Let's also plot the model and look at some inferential statistics by using the summary() function on our model. # fit a linear regression model of salary as a function of log enrollment and plot it # look at some inferential statistics using the summary() function $\$#### Part 2.3: Comparing the simple linear regression enrollment and endowment models Let's compare these models predicting salary from enrollment vs. endowment in terms of which model can explain most of the variability in salaries in terms of the$r^2$statistic. Let's also create a scatter plot of the relationships between these three variables using the paris() function. # compare r^2 and look at all scatter plots $\$#### Part 2.4: Multiple regression Let's now fit a multiple regression model for predicting salary using both endowment and enrollment as explanatory variables to see if using both these variables allows us to better predict salary than either variable along. In particular, we are fitting the model$\text{salary} = \hat{\beta}_0 + \hat{\beta}_1 \cdot \text{log(endowment)} + \hat{\beta}_1 \cdot \text{log(enrollment)}$.   Question: Does this model account for more of the variability than the simple regression models we fit?$\$#### Part 2.5: Test for comparing nested models When we have nested models, we can use an ANOVA test based on the F-statistic to assess if adding additional explanatory variables leads to a model that can account for more of the variability in a response variable. A Model 1 is nested in a Model 2 if the parameters in Model 1 are a subset of the parameters in Model 2. Here, our model using only the endowment as the explanatory variable is nested within in the model that uses endowment and enrollment as explanatory variables. Let's uses the anova() function to test if adding the enrollment explanatory leads to a statistically significant increase in the amount of variability that can be accounted for.  $\\$

emeyers/SDS230 documentation built on Jan. 18, 2024, 1:01 a.m.