```r
library(learnr)
library(gradethis)
tutorial_options(exercise.timelimit = 5, exercise.checker = gradethis::grade_learnr)
knitr::opts_chunk$set(echo = FALSE)
```
You have seen the equation
$$ y_i = x_i^\top\beta + \epsilon_i $$
many times while learning about linear models.
In the first part of this lab, we're going to generate data from this model.
Why generate data? Because it forces us to think about where these objects come from.
First, let's talk about notation.
quiz( question("What does the $i$ subscript indicate?", answer("the ith predictor"), answer("the ith observation", correct=TRUE), answer("the ith covariate"), answer("the ith random noise draw") ), question("Under standard assumptions, what distribution does $y_i$ have?", answer("Normal with mean 0 and variance 1"), answer("the same as $\\epsilon_i$"), answer("Normal with mean 0 and variance $\\sigma^2$"), answer("Normal with mean $x_i^\\top\\beta$ and variance $\\sigma^2$", correct = TRUE) ) )
We almost always think of the design matrix $X=[x_1\ x_2\ \cdots]^\top$ as fixed and $\beta$ as a population parameter. This means that the only randomness comes from $\epsilon$. Unfortunately, in order to create data, we need numbers in $X$ and in $\beta$.
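To make that separation concrete, here is a minimal sketch (the particular values of `n`, `X`, `beta`, and `sigma` below are placeholders, not part of the lab): re-running only the `rnorm()` line produces a new $\epsilon$, and hence a new $y$, while $X$ and $\beta$ never change.

```r
# A minimal sketch: X and beta are held fixed; only epsilon (and therefore y)
# changes from one simulated data set to the next. All values are placeholders.
set.seed(1)
n <- 5
X <- matrix(c(0.5, -0.2, 0.1, 0.9, -0.7), ncol = 1)  # fixed design, one predictor
beta <- 2                                            # fixed population parameter
sigma <- 1
epsilon <- rnorm(n, mean = 0, sd = sigma)            # the only random ingredient
y <- X %*% beta + epsilon
```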
Write the R code required to generate a sequence of integers from 1 to 10.
```r
grade_result(
  pass_if(~ identical(.result, 1:10), "Nicely done.")
)
```
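If you are stuck, note that the check above uses `identical()`, which compares type as well as value, so your answer needs to be an integer vector. The colon operator is the most direct route (shown here only as a hint):

```r
# The colon operator builds the integer sequence 1, 2, ..., 10
1:10
```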
Think of those integers as your population $\beta$, if we had 10 predictors. What if we had 20? Or $p$?
```r
p = ___
beta = ___
beta
```
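One way the blanks could be filled in; the choice `p = 20` is arbitrary, echoing the "what if we had 20?" question above:

```r
# Arbitrary illustrative choice of p; beta is then the integers 1 through p
p = 20
beta = 1:p
beta
```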
Now, what about our design matrix $X$? How big should it be? Write code to create $X$ with $n=10$ rows and $p=5$ columns, where each entry is an independent uniform draw between $-1$ and $1$.
```r
n = ___
p = ___
X = ___
X
```
```r
grade_result(
  fail_if(~ all(.result > 0), "Did you accidentally use U(0,1)?"),
  pass_if(~ nrow(.result) == 10L && ncol(.result) == 5L &&
            all(.result < 1) && all(.result > -1)),
  fail_if(~ any(.result > 1), "Whoops, some Xij are too big."),
  fail_if(~ ncol(.result) != 5L, "Not the right number of predictors"),
  fail_if(~ nrow(.result) != 10L, "Wrong number of observations")
)
```
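For reference, here is one fill-in that should satisfy the checker above; it is a sketch, and other equivalent constructions work too:

```r
# n observations, p predictors, each entry an independent Uniform(-1, 1) draw
n = 10
p = 5
X = matrix(runif(n * p, min = -1, max = 1), nrow = n, ncol = p)
X
```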
Finally, let's write a function that generates data from a linear model. Our function should take 3 arguments:

* `n`, the number of observations
* `p`, the number of predictors
* `sigma`, the standard deviation of $\epsilon$

The starter code below has a default value for `sigma =`. Set that to `1`.

* Make $\beta$ a sequence of integers from 1 to `p`.
* Make $X$ random uniform numbers between $-1$ and $1$.
* Make $\epsilon$ satisfy our usual assumptions with the correct standard deviation.
* Return a `data.frame` that contains `y` and `X`.
```r
generate_data <- function(___, ___, sigma = ___){
  beta = ___
  ___ = matrix(runif(___, -1, 1), nrow = n)
  epsilon = ___
  y = ___ %*% ___ + ___
  df = data.frame(___, ___)
  return(___)
}
```
quiz( question("Does the linear model you just created have an intercept?", answer("Yes."), answer("No.", correct=TRUE) ) )
`r fontawesome::fa('exclamation-triangle', fill="#ff4900", height=25)`
For discussion: how would you estimate a linear model without an intercept in R?
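One direction the discussion could go (a sketch with made-up placeholder data, not part of the lab): R's formula interface drops the intercept when you append `- 1` or `+ 0` to the formula.

```r
# Placeholder data frame with a response y and two predictors
df <- data.frame(y = rnorm(10), X1 = runif(10), X2 = runif(10))
fit_no_int <- lm(y ~ . - 1, data = df)  # "- 1" removes the intercept
# lm(y ~ . + 0, data = df) is equivalent
summary(fit_no_int)
```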
The graphic below uses the function you wrote before:
```r
generate_data <- function(n, p, sigma = 1){
  beta = 1:p
  X = matrix(runif(n * p, -1, 1), nrow = n)
  epsilon = rnorm(n, 0, sigma)
  y = X %*% beta + epsilon
  df = data.frame(y, X)
  return(df)
}
```
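For example, assuming you have run the definition above, a quick sanity check might look like this (the seed and the values of `n` and `p` are arbitrary):

```r
set.seed(406)                       # arbitrary seed for reproducibility
df <- generate_data(n = 25, p = 3)  # 25 observations, 3 predictors, sigma = 1
head(df)                            # columns y, X1, X2, X3
summary(lm(y ~ . - 1, data = df))   # no-intercept fit, as in the discussion above
```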
We want to know what happens to $R^2$ if we add more predictors that are useless, that is, predictors whose true coefficient is 0. Use the sliders to see what happens when you change the parameters of our `generate_data()` function.
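Before you move the sliders, it helps to recall why $R^2$ cannot punish useless predictors. With the usual definition

$$ R^2 = 1 - \frac{\sum_{i=1}^n (y_i - \hat y_i)^2}{\sum_{i=1}^n (y_i - \bar y)^2}, $$

adding a predictor can never increase the residual sum of squares in the numerator, because the smaller model is a special case of the larger one (set the new coefficient to 0). So $R^2$ can only stay the same or go up as predictors are added, even garbage ones.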
sliderInput("n", "Number of observations:", min = 40, max = 150, value = 50) sliderInput("p", "Number of true predictors:", min = 1, max = 10, value = 5) sliderInput("fakep", "Number of garbage predictors:", min = 1, max=30, value=10) sliderInput("sig", "sigma-squared = ", min = .01, max=10, value=1) plotOutput("rsq_plot")
```r
library(ggplot2)

output$rsq_plot <- renderPlot({
  generate_data <- function(n, p1, p2, sigma = 1){
    beta = 1:p1
    X = matrix(runif(n * (p1 + p2), -1, 1), nrow = n)
    epsilon = rnorm(n, 0, sqrt(sigma))  # the slider supplies a variance, hence sqrt()
    y = X[, 1:p1, drop = FALSE] %*% beta + epsilon  # only the first p1 predictors matter
    cbind(y, X)
  }
  df = generate_data(input$n, input$p, input$fakep, input$sig)
  totp = input$p + input$fakep
  # R-squared from a no-intercept fit on the first x predictors (columns 2 to x+1)
  rsq = sapply(1:totp, function(x)
    summary(lm(df[, 1] ~ df[, 2:(x + 1), drop = FALSE] - 1))$r.squared)
  ggplot(data.frame(nvars = 1:totp, rsq = rsq), aes(x = nvars, y = rsq)) +
    xlab("number of predictors") +
    ylab("R-squared") +
    geom_path() +
    geom_point(color = 'orange') +
    cowplot::theme_cowplot()
})
```
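If you want to reproduce the plot outside the interactive tutorial, here is a sketch of a non-Shiny version of the same experiment; the parameter values simply mirror the slider defaults and are otherwise arbitrary.

```r
# Non-interactive version of the R-squared experiment above
library(ggplot2)
set.seed(406)                           # arbitrary seed
n <- 50; p <- 5; fakep <- 10; sig <- 1  # mirror the slider defaults
beta <- 1:p
X <- matrix(runif(n * (p + fakep), -1, 1), nrow = n)
y <- drop(X[, 1:p, drop = FALSE] %*% beta) + rnorm(n, 0, sqrt(sig))
rsq <- sapply(1:(p + fakep), function(k)
  summary(lm(y ~ X[, 1:k, drop = FALSE] - 1))$r.squared)
ggplot(data.frame(nvars = 1:(p + fakep), rsq = rsq), aes(nvars, rsq)) +
  geom_path() +
  geom_point(color = "orange") +
  xlab("number of predictors") +
  ylab("R-squared")
```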
`r fontawesome::fa('exclamation-triangle', fill="#ff4900", height=25)`
For discussion: what happens to $R^2$ as you increase the number of garbage predictors, and why?