$\$

# makes sure you have all the packages we have used in class installed
#SDS230::update_installed_packages()

SDS230::download_data("IPED_salaries_2016.rda")
# install.packages("latex2exp")

library(latex2exp)
library(dplyr)
library(ggplot2)

#options(scipen=999)


knitr::opts_chunk$set(echo = TRUE)

set.seed(123)

$\$

Overview

$\$

Part 1: Polynomial regression

We can also fit models that apply non-linear transformations to our explanatory variables. This allows us to fit a much broader range of models to our data. In particular, we can fit models that are polynomial functions of our original explanatory variables:

$y = \hat{\beta}_0 + \hat{\beta}_1 \cdot x + \hat{\beta}_2 \cdot x^2 + \ldots + \hat{\beta}_k \cdot x^k$

$\$

Let's start by recreating our assistant professor salary data...

load("IPED_salaries_2016.rda")


# get the assistant professor data
# we will keep all the variables and all schools in the data frame for this analysis
assistant_data <- filter(IPED_salaries,  endowment > 0,  rank_name == "Assistant")  

$\$

Part 1.1: Non-linear transformations of the explanatory variable x

Let's explore fitting a model that predicts salary from log endowment, log endowment squared and log endowment cubed using our salary data from assistant professors:

$\text{salary} = \hat{\beta}_0 + \hat{\beta}_1 \cdot \text{log(endowment)} + \hat{\beta}_2 \cdot \text{log(endowment)}^2 + \hat{\beta}_3 \cdot \text{log(endowment)}^3$

To do this we can use the poly() function, which creates polynomial expansions of our explanatory variables that are "orthogonal", which means that they will not suffer from multicollinearity. If we had instead created polynomial expansions of our data using, say, a dplyr mutate() operation, the predictor values could be highly correlated and thus lead to multicollinearity (e.g., if x = 1:10, then x and x^2 will be highly correlated).
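For example, here is a quick check with a toy x of 1:10 showing that the raw terms x and x^2 are highly correlated, while the terms created by poly() are not:

x <- 1:10

# raw polynomial terms are highly correlated
cor(x, x^2)

# the orthogonal terms created by poly() are essentially uncorrelated
poly_terms <- poly(x, degree = 2)
cor(poly_terms[, 1], poly_terms[, 2])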

Let's fit a degree 3 polynomial linear regression model to our data.

# refit our original degree 1 model




# fit a cubic model




# get statistics on our cubic fit



# compare the r^2
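
One way these steps might be filled in, as a minimal sketch (this assumes the salary variable in assistant_data is named salary_tot; adjust the name if your copy of the data differs):

# refit our original degree 1 model
lm_linear <- lm(salary_tot ~ log(endowment), data = assistant_data)

# fit a cubic model using orthogonal polynomial terms
lm_cubic <- lm(salary_tot ~ poly(log(endowment), 3), data = assistant_data)

# get statistics on our cubic fit
summary(lm_cubic)

# compare the r^2 values of the two models
summary(lm_linear)$r.squared
summary(lm_cubic)$r.squared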

Question: Are the higher order terms statistically significant?

$\$

Part 1.2: Let's visualize our model

# create a data frame with a sequence of x-values to make predictions on




# plot the data





# plot the polynomial predictions
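
A minimal sketch of these steps (assuming the lm_cubic model and the salary_tot column from above):

# create a data frame with a sequence of x-values to make predictions on
endowment_df <- data.frame(endowment = seq(min(assistant_data$endowment),
                                           max(assistant_data$endowment),
                                           length.out = 200))

# plot the data
plot(salary_tot ~ log(endowment), data = assistant_data,
     xlab = "log(endowment)", ylab = "Salary")

# plot the polynomial predictions as a curve over the data
cubic_predictions <- predict(lm_cubic, newdata = endowment_df)
lines(log(endowment_df$endowment), cubic_predictions, col = "red")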

Question: Does this model visually appear to fit the data better?

$\$

Part 1.3: Using an ANOVA to compare the nested models

Because our linear model is nested within our cubic model, we can use an ANOVA to evaluate whether our cubic model accounts for significantly more of the variability than our linear model.
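
A one-line sketch, assuming the lm_linear and lm_cubic models fit above:

# compare the nested degree 1 and degree 3 models
anova(lm_linear, lm_cubic)

If the p-value on the F-statistic is small, the cubic model accounts for significantly more of the variability than the linear model.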


$\$

Part 1.4: Higher order polynomials

Let's try fitting even higher order polynomials!

# try degree 5, 10, etc. polynomial fits



# fit a higher degree model
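
For example, a degree 5 fit might look like this (again assuming the salary_tot column):

lm_degree5 <- lm(salary_tot ~ poly(log(endowment), 5), data = assistant_data)
summary(lm_degree5)$r.squared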

$\$

Part 1.5: Visualizing higher order polynomial fits

Let's visualize our degree 50 polynomial fit.

# fit 50 degree polynomial



# data frame with x-values to get predicted y-hat values 



# predicted y values (i.e., the regression curve)



# visualize the data
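
A sketch of these steps (same salary_tot assumption as above):

# fit a degree 50 polynomial
lm_degree50 <- lm(salary_tot ~ poly(log(endowment), 50), data = assistant_data)

# data frame with x-values to get predicted y-hat values
endowment_df <- data.frame(endowment = seq(min(assistant_data$endowment),
                                           max(assistant_data$endowment),
                                           length.out = 500))

# predicted y values (i.e., the regression curve)
predictions_50 <- predict(lm_degree50, newdata = endowment_df)

# visualize the data along with the degree 50 curve
plot(salary_tot ~ log(endowment), data = assistant_data)
lines(log(endowment_df$endowment), predictions_50, col = "red")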

What would Michael Jackson say about this?

https://www.youtube.com/watch?v=DQWI1kvmwRg

$\$

Part 2: Comparing models with R^2, adjusted R^2, AIC, BIC and cross-validation

We can compare models using several statistics including $R^2$, $R^2_{adj}$, $AIC$, $BIC$ and cross-validation. Let's try comparing models using these statistics to see how well each statistic captures our intuition about the best model to use.

$\$

Part 2.1: Comparing models with R^2, adjusted R^2, AIC, BIC

Let's compare models using these statistics:

We can use either the synthetic x, y data we created above or the assistant professor data. We will fit models with different degree polynomials and see which model each statistic selects.

library(dplyr)


# fit models of degree 1, 3, and 5





# summarizing these models






# printing the R^2 values (higher is better)
cat('R^2 \n')






# printing the adjusted R^2 values (higher is better)
cat("\n Adjusted R^2 \n")






# printing the AIC values (lower is better)
cat("\n AIC \n")





# printing the BIC values (lower is better)
cat("\n BIC \n")
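
One way to carry out these comparisons, as a sketch (assuming assistant_data and the salary_tot column):

# fit models of degree 1, 3, and 5
fit_1 <- lm(salary_tot ~ poly(log(endowment), 1), data = assistant_data)
fit_3 <- lm(salary_tot ~ poly(log(endowment), 3), data = assistant_data)
fit_5 <- lm(salary_tot ~ poly(log(endowment), 5), data = assistant_data)

# summarizing these models
summary_1 <- summary(fit_1)
summary_3 <- summary(fit_3)
summary_5 <- summary(fit_5)

# printing the R^2 values (higher is better)
cat('R^2 \n')
print(c(summary_1$r.squared, summary_3$r.squared, summary_5$r.squared))

# printing the adjusted R^2 values (higher is better)
cat("\n Adjusted R^2 \n")
print(c(summary_1$adj.r.squared, summary_3$adj.r.squared, summary_5$adj.r.squared))

# printing the AIC values (lower is better)
cat("\n AIC \n")
print(c(AIC(fit_1), AIC(fit_3), AIC(fit_5)))

# printing the BIC values (lower is better)
cat("\n BIC \n")
print(c(BIC(fit_1), BIC(fit_3), BIC(fit_5)))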

$\$

Part 2.2: Comparing models using cross-validation

Let's now compare the models using cross-validation. To keep things simple, we are just going to split the data once into a training set and a test set rather than do k-fold cross-validation.

We will compare the models based on their mean squared prediction error (MSPE) where a smaller MSPE is better.
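
For a test set with $m$ points, the MSPE is:

$MSPE = \frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2$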

# create the training set and the test set






# fit both models on the training data, and calculate the MSPE based on their predictions on the test set









# use a for loop to compare the MSPE for models of degree 1 to 7
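
A sketch of this cross-validation comparison (same salary_tot assumption as above; the 50/50 train/test split is one arbitrary choice):

# create the training set and the test set
total_num_points <- nrow(assistant_data)
num_training_points <- floor(total_num_points / 2)

training_indices <- sample(total_num_points, num_training_points)
training_data <- assistant_data[training_indices, ]
test_data <- assistant_data[-training_indices, ]

# fit both models on the training data, and calculate the MSPE
# based on their predictions on the test set
lm_linear_train <- lm(salary_tot ~ log(endowment), data = training_data)
lm_cubic_train <- lm(salary_tot ~ poly(log(endowment), 3), data = training_data)

mspe_linear <- mean((test_data$salary_tot - predict(lm_linear_train, newdata = test_data))^2)
mspe_cubic <- mean((test_data$salary_tot - predict(lm_cubic_train, newdata = test_data))^2)

mspe_linear
mspe_cubic

# use a for loop to compare the MSPE for models of degree 1 to 7
for (degree in 1:7) {
  curr_fit <- lm(salary_tot ~ poly(log(endowment), degree), data = training_data)
  curr_mspe <- mean((test_data$salary_tot - predict(curr_fit, newdata = test_data))^2)
  cat("degree:", degree, "  MSPE:", curr_mspe, "\n")
}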

$\$


