library(learnr)
library(gradethis)
library(knitr)
library(tidyverse)
mobility <- UBCstat406labs::mobility

tutorial_options(exercise.timelimit = 5, exercise.checker = gradethis::grade_learnr)

knitr::opts_chunk$set(echo = FALSE)

The Data

Your goal is to use the mobility dataset to predict economic mobility.

Mobility The probability that a child born in 1980–1982 into the lowest quintile (20%) of household income will be in the top quintile at age 30. Individuals are assigned to the community they grew up in, not the one they were in as adults.

Population Population of the community in 2000.

Urban Is the community primarily urban or rural?

Black Percentage of individuals who marked black (and nothing else) on census forms.

Racial Segregation A measure of residential segregation by race.

Income Segregation A similar measure, but of residential segregation by income.

Segregation of poverty Specifically a measure of residential segregation for those in the bottom quarter of the national income distribution.

Segregation of affluence Residential segregation for those in the top quarter.

Commute Fraction of workers with a commute of less than 15 minutes.

Mean income Average income per capita in 2000.

Gini A measure of income inequality, which would be 0 if all incomes were perfectly equal, and tends towards 100 as all the income is concentrated among the richest individuals (see Wikipedia, s.v. "Gini coefficient").

Share 1% Share of the total income of a community going to its richest 1%.

Gini bottom 99% Gini coefficient among the lower 99% of that community.

Fraction middle class Fraction of parents whose income is between the national 25th and 75th percentiles.

Local tax rate Fraction of all income going to local taxes.

Local government spending Per-capita spending by local governments.

Progressivity Measure of how much state income tax rates increase with income.

EITC Measure of how much the state contributed to the Earned Income Tax Credit (a sort of negative income tax for very low-paid wage earners).

School expenditures Average spending per pupil in public schools.

Student/teacher ratio Number of students in public schools divided by number of teachers.

Test scores Residuals from a linear regression of mean math and English test scores on household income per capita.

High school dropout rate Residuals from a linear regression of the dropout rate on per-capita income.

Colleges per capita Number of colleges divided by the population.

College tuition In-state tuition for full-time students.

College graduation rate Again, residuals from a linear regression of the actual graduation rate on household income per capita.

Labor force participation Fraction of adults in the workforce.

Manufacturing Fraction of workers in manufacturing.

Chinese imports Growth rate in imports from China per worker between 1990 and 2000.

Teenage labor Fraction of those aged 14–16 who were in the labor force.

Migration in Migration into the community from elsewhere, as a fraction of 2000 population.

Migration out Ditto for migration into other communities.

Foreign Fraction of residents born outside the US.

Social capital Index combining voter turnout, participation in the census, and participation in community organizations.

Religious Share of the population claiming to belong to an organized religious body.

Violent crime Arrests per person per year for violent crimes.

Single motherhood Number of single female households with children divided by the total number of households with children.

Divorced Fraction of adults who are divorced.

Married Fraction of adults who are married.

Longitude Geographic coordinate for the center of the community.

ID A numerical code identifying the community.

Name The name of the principal city or town.

State The state of the principal city or town of the community.

Note that there are generally more observations than predictors. You should definitely exclude some variables in the data (which ones?).
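
Before fitting anything, it is worth a quick look at the data itself. A minimal sketch that prints the dimensions, the variable names, and the missing-value counts, which should make clear which columns are identifiers rather than predictors:

# Sketch: inspect the data before modelling
dim(mobility)             # number of observations versus number of columns
names(mobility)           # identifier columns (ID, Name, State) stand out here
colSums(is.na(mobility))  # missing values by variable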

Estimating Models

Estimate 4 models: the linear model (with lm()) and, using glmnet, ridge regression, the lasso, and the elastic net ($\alpha=.5$).
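
For reference, for a Gaussian response glmnet minimizes the penalized least-squares criterion

$$\min_{\beta_0,\beta}\ \frac{1}{2n}\sum_{i=1}^{n}\left(y_i-\beta_0-x_i^\top\beta\right)^2+\lambda\left[\frac{1-\alpha}{2}\lVert\beta\rVert_2^2+\alpha\lVert\beta\rVert_1\right],$$

so $\alpha=0$ gives ridge regression, $\alpha=1$ gives the lasso, and intermediate values give the elastic net.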

Linear Model

Start by computing the linear model. Don't use the variables ID, Name, or State.

library(glmnet)
linmod = __________________
coef(linmod)

# Solution
library(glmnet)
linmod = lm(Mobility~.-ID-Name-State, data=mobility, y=TRUE)
linmodWRONG = lm(Mobility~., data=mobility, y=TRUE)
sol <- coef(linmod)
wrong <- coef(linmodWRONG)


grade_result(
  pass_if(~ identical(.result, sol), "Correct!"),
  fail_if(~ identical(.result, wrong), "Did you forget to remove ID, Name, and State?"),
  fail_if(~ TRUE, "Incorrect.")
)

Ridge Regression

library(glmnet)
linmod = lm(Mobility~.-ID-Name-State, data=mobility, y=TRUE)
X = model.matrix(linmod)[,-1]
y = linmod$y
set.seed(1234)
ridge = cv.glmnet(x=_____, y=______, alpha=______, lambda.min.ratio=1e-6)
coef(ridge)

# Solution
library(glmnet)
linmod = lm(Mobility~.-ID-Name-State, data=mobility, y=TRUE)
X = model.matrix(linmod)[,-1]
y = linmod$y
set.seed(1234)
ridge = cv.glmnet(X, y, alpha=0, lambda.min.ratio=1e-6)
sol <- coef(ridge)


grade_result(
  pass_if(~ identical(.result, sol), "Correct!"),
  fail_if(~ TRUE, "Incorrect.")
)
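
A note on the lambda.min.ratio = 1e-6 argument in the ridge solution: it extends glmnet's default sequence of penalties to much smaller values, since for ridge the cross-validated error can still be decreasing at the smallest penalty in the default path. A quick check, sketched with the fields that cv.glmnet returns:

# Sketch: confirm the CV minimum is interior to the lambda path
range(ridge$lambda)   # smallest and largest penalties tried
ridge$lambda.min      # penalty minimizing the CV error
ridge$lambda.1se      # largest penalty within one standard error of the minimum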

Lasso Regression

library(glmnet)
linmod = lm(Mobility~.-ID-Name-State, data=mobility, y=TRUE)
X = model.matrix(linmod)[,-1]
y = linmod$y
set.seed(1234)
lasso = cv.glmnet(x=_____, y=_______, alpha=_______)
coef(lasso)

# Solution
library(glmnet)
linmod = lm(Mobility~.-ID-Name-State, data=mobility, y=TRUE)
X = model.matrix(linmod)[,-1]
y = linmod$y
set.seed(1234)
lasso = cv.glmnet(X, y)
sol <- coef(lasso)


grade_result(
  pass_if(~ identical(.result, sol), "Correct!"),
  fail_if(~ TRUE, "Incorrect.")
)

Elastic Net

library(glmnet)
linmod = lm(Mobility~.-ID-Name-State, data=mobility, y=TRUE)
X = model.matrix(linmod)[,-1]
y = linmod$y
set.seed(1234)
enet = cv.glmnet(x=_____,y=_____,alpha=_____)
coef(enet)

# Solution
library(glmnet)
linmod = lm(Mobility~.-ID-Name-State, data=mobility, y=TRUE)
X = model.matrix(linmod)[,-1]
y = linmod$y
set.seed(1234)
enet = cv.glmnet(X,y,alpha=.5)
sol <- coef(enet)


grade_result(
  pass_if(~ identical(.result, sol), "Correct!"),
  fail_if(~ TRUE, "Incorrect.")
)

Plotting

Plot the CV curves for each of the three regularized models.

library(glmnet)
linmod = lm(Mobility~.-ID-Name-State, data=mobility, y=TRUE)
X = model.matrix(linmod)[,-1]
y = linmod$y
lasso = cv.glmnet(X, y)
ridge = cv.glmnet(X, y, alpha=0, lambda.min.ratio=1e-6)
enet = cv.glmnet(X,y,alpha=.5)
par(mfrow=c(2,2))
plot(lasso)
plot(ridge)
plot(enet)

For the elastic net and the lasso, lambda.1se gives sparser models; for ridge, use lambda.min (which behaves more like GCV).

library(glmnet)
linmod = lm(Mobility~.-ID-Name-State, data=mobility, y=TRUE)
X = model.matrix(linmod)[,-1]
y = linmod$y
lasso = cv.glmnet(X, y)
ridge = cv.glmnet(X, y, alpha=0, lambda.min.ratio=1e-6)
enet = cv.glmnet(X,y,alpha=.5)
lasso1 = as.numeric(coef(lasso, 'lambda.min'))
enet1 = as.numeric(coef(enet, 'lambda.min'))
ridge1 = as.numeric(coef(ridge, 'lambda.min'))
lasso1
enet1
ridge1
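
To see the sparsity difference mentioned above, count the nonzero coefficients under each selection rule; a sketch:

# Sketch: number of nonzero coefficients at lambda.min versus lambda.1se
sum(coef(lasso, s = "lambda.min") != 0)
sum(coef(lasso, s = "lambda.1se") != 0)
sum(coef(enet, s = "lambda.min") != 0)
sum(coef(enet, s = "lambda.1se") != 0)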

Plot the coefficients

Plot the coefficients for each of the 4 models on one figure. What do you notice? Which features are most important?

library(glmnet)
linmod = lm(Mobility~.-ID-Name-State, data=mobility, y=TRUE)
X = model.matrix(linmod)[,-1]
y = linmod$y
lasso = cv.glmnet(X, y)
ridge = cv.glmnet(X, y, alpha=0, lambda.min.ratio=1e-6)
enet = cv.glmnet(X,y,alpha=.5)
lasso1 = as.numeric(coef(lasso, 'lambda.min'))
enet1 = as.numeric(coef(enet, 'lambda.min'))
ridge1 = as.numeric(coef(ridge, 'lambda.min'))
ord = order(coef(linmod))
df = data.frame(lm = coef(linmod)[ord], lasso = lasso1[ord],
                elnet = enet1[ord], ridge = ridge1[ord])
df$var = rownames(df) 
df %>% mutate(var = str_replace_all(var, "_"," ")) %>%
  pivot_longer(names_to='method',values_to ='estimate', -var) %>%
  ggplot(aes(y=var,x=estimate,color=method)) + geom_point() + 
  geom_vline(xintercept = 0) +
  cowplot::theme_cowplot() + 
  scale_color_viridis_d(direction = -1) +
  theme(axis.title.y = element_blank())
  #scale_x_continuous(trans = scales::pseudo_log_trans(sigma=.1))
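
One rough way to answer which features matter most is to list the nonzero lasso coefficients by absolute size. A sketch (keep in mind the coefficients are reported on the original variable scales, so magnitudes are only loosely comparable across predictors):

# Sketch: nonzero lasso coefficients, largest magnitude first
b <- coef(lasso, s = "lambda.min")
b <- setNames(as.numeric(b), rownames(b))
sort(abs(b[b != 0]), decreasing = TRUE)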

