$\$
# makes sure you have all the packages we have used in class installed #SDS230::update_installed_packages() SDS230::download_data("IPED_salaries_2016.rda")
# install.packages("latex2exp") library(latex2exp) library(dplyr) library(ggplot2) #options(scipen=999) knitr::opts_chunk$set(echo = TRUE) set.seed(123)
$\$
$\$
We can also fit models that have non-linear transformations to our explanatory variables. This allows us to fit a much broader range of models to our data. In particular, we can fit models that are polynomial functions of our original explanatory variables:
$y = \hat{\beta}_0 + \hat{\beta}_1 \cdot x + \hat{\beta}_2 \cdot x^2 + ... + \hat{\beta}_k \cdot + x^k$
$\$
Let's start by recreating our assistant professor salary data...
load("IPED_salaries_2016.rda") # get the assistant professor data # we will keep all the variables and all schools in the data frame for this analysis assistant_data <- filter(IPED_salaries, endowment > 0, rank_name == "Assistant")
$\$
Let's explore fitting a model that predicts salary from log endowment, log endowment squared and log endowment cubed using our salary data from assistant professors:
$\text{salary} = \hat{\beta}_0 + \hat{\beta}_1 \cdot \text{log(endowment)} + \hat{\beta}_2 \cdot \text{log(endowment)}^2 + \hat{\beta}_2 \cdot + \text{log(endowment)}^3$
To do this we can use the poly()
function which creates polynomial expansions
of our explanatory variables that are "orthogonal", which means that they will
not have multicolinearity. If we had instead just created polynomial expansions
of our data using say a dplyr mutate()
operation, then the predictor values
could be highly correlated and thus lead to multicolinearity (e.g., if x = 1:10,
then x and x^2 will be highly correlated leading to multicolinearity).
Let's fit a degree 3 polynomial linear regression model to our data.
# refit our original degree 1 model # fit a cubic model # get statistics on our cubic fit # compare the r^2
Question: Are the higher order terms statistically significant?
$\$
# create a data frame with a sequence of x-values to make predictions on # plot the data # plot the polynomial predictions
Question: Does this model visually appear to fit the data better?
$\$
Because our linear model is nested within our cubic model, we can use an ANOVA to evaluate whether our cubic model accounts for more of the variability than our linear model.
$\$
Let's try fitting even higher order polynomials!
# try degrees, 5, 10, etc. # fit a cubic model
$\$
Let's visualize our 50 degree polynomial fit
# fit 50 degree polynomial # data frame with x-values to get predicted y-hat values # predicted y values (i.e., the regression curve) # visualize the data
What would Michael Jackson say about this?
https://www.youtube.com/watch?v=DQWI1kvmwRg
$\$
We can compare models using several statistics including $R^2$, $R^2_{adj}$, $AIC$, $BIC$ and cross-validation. Let's try comparing models using these statistics to see how well each statistic captures our intuition about the best model to use.
$\$
Let's compare models using these statistics:
R^2
: where a higher value means a better fitAdjusted R^2
: where higher value means a better fitAIC
: where a lower value means a better modelBIC
: where a lower value means a better modelWe can use either the synthetic x, y, data we created above, or the assistant professor data. We will fit the model with different degree polynomials and wee which modle each statistic selects.
library(dplyr) # fit models of degree 1, 3, and 5 # summarizing these models # printing the R^2 values (higher is better) cat('R^2 \n') # printing the adjusted R^2 values (higher is better) cat("\n Adjusted R^2 \n") # printing the AIC values (lower is better) cat("\n AIC \n") # printing the BIC values (lower is better) cat("\n BIC \n")
$\$
Let's now compare the models using cross-validation. To keep things simple, we are just going to split the data once into a training set and a test set rather than do k-fold cross-validation.
We will compare the models based on their mean squared prediction error (MSPE) where a smaller MSPE is better.
# create the training set and the test set # fit both models on the training data, and calculate the MSPE based on their predictions on the test set # use a for loop to compare the MSPE for models of degree 1 to 7
$\$
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.