Instructions

  1. Rename this document with your student ID (not the 10-digit number, your IU username, e.g. dajmcdon). Include your buddy in the author field if you are working together.
  2. I have given you code to generate data and fit a linear model.
  3. Follow the instructions to replicate the bootstrap from the text/lecture on this new data.
  4. Answer the question at the end.

Weighted least squares, fake data

  1. Generate 250 observations from a linear model as follows: $x_i$ should uniform between -2 and 2, the $\epsilon_i$ should be normally distributed with mean zero and variance $x_i^2$, $y_i=3 + 2x_i + \epsilon_i$.

  2. Plot $y$ against $x$.

library(tidyverse)
library(cowplot)
set.seed(12032020)
n = 250
df = tibble(
  x = runif(n, -2, 2),
  eps = rnorm(n, 0, sqrt(x^2)),
  y = 3 + 2*x + eps)
ggplot(df, aes(x,y)) + geom_point() + theme_cowplot()
  1. Estimate the model using OLS and WLS (with appropriate weights). This is easy to do. Try ?lm if you're lost.

  2. Produce a plot that shows the original data and both estimated regression lines.

  3. Produce confidence intervals for both methods. What do you notice?

ls = lm(y~x, data=df)
wls = lm(y~x, data=df, weights = 1/x^2)
df$ls = fitted(ls)
df$wls = fitted(wls)
ggplot(pivot_longer(select(df, x, ls, wls), -x), aes(x,value,color=name)) +
  geom_line(size=2) +
  scale_color_brewer(palette = 'Set1') +
  geom_point(data=df, aes(x,y), color='purple', size=.5) + theme_cowplot()
confint(ls)
confint(wls)

Weighted least squares, real data

  1. Load the 301gradedist data set. This was downloaded from IU's grade distribution database.

  2. Regress avg_grade on instructor and avg_student_gpa using OLS without intercept. Perform the same regression using n_student as the weights. Why is this an appropriate weighting? Again, I suggest you consult the documentation ?lm.

s301 = read_csv("301gradedist.csv") %>% mutate(instructor = as.factor(instructor))
ls301 = lm(avg_grade ~ instructor + avg_student_gpa-1, data=s301)
wls301 = lm(avg_grade ~ instructor + avg_student_gpa-1, data=s301, weights = n_students)
  1. How do you interpret the coefficients on the instructor? Which instructor seems to be the best (in the sense of their students getting the highest grades)?

  2. Make one plot that shows all the data and the regression line (from WLS) for each instructor.

s301$preds = predict(wls301)
ggplot(s301, aes(avg_student_gpa, avg_grade, color=instructor)) +
  geom_point() + geom_line(aes(y=preds)) + scale_color_viridis_d() +
  theme_cowplot() 
  1. Produce confidence intervals for the weighted least squares version. What have you learned? Are there other variables you would want to include? Why did I have you do this without intercept?
cis = confint(wls301)
cis
  1. This many confidence intervals are best displayed in a graphic. Produce such a graphic for the different instructors.
conf301 = tibble(
  lower = cis[-nrow(cis),1],
  upper = cis[-nrow(cis),2],
  ests = coef(wls301)[-nrow(cis)],
  instructor = str_replace(names(ests), "instructor", "")
) %>% mutate(instructor = fct_reorder(instructor, ests))
ggplot(conf301, aes(instructor,ests,color=instructor)) + 
  geom_segment(aes(xend=instructor,y=lower,yend=upper)) +
  geom_point(color='black') + coord_flip() +
  scale_color_viridis_d() +
  theme_cowplot() + theme(legend.position = 'none')


dajmcdon/ubc-stat406-labs documentation built on Aug. 18, 2020, 1:23 p.m.