Put your name here!

require(mosaic)
require(NIMBIOS)
require(knitr)
opts_chunk$set( tidy=FALSE )

Background

You already know that adding new terms or variables to a model will tend to increase $R^2$. Certainly $R^2$ will never go down when you add more terms. That is, the nested model cannot have a higher $R^2$ than the model in which it is nested.

To illustrate, here are three models of the heights in Galton's data:

mod1 <- lm( height ~ sex, data=Galton )
mod2 <- lm( height ~ sex + mother + father, data=Galton )
mod3 <- lm( height ~ sex * mother + sex * father, data=Galton )

Task 1

Explain --- right here --- which of the models are nested in which. Calculate $R^2$ on each model and show whether or not they follow the pattern described in the Background section.

Task 2

Create another set of two or three nested models from whatever data set you like. Show that the expected $R^2$ pattern holds.

Here are some datasets you might choose from:

Use help() to get an explanation.

Statistical Significance

Sometimes, the purpose of a model is to determine whether and how potential explanatory variables or the terms constructed from variables are related to the response variable. For prediction models, it's helpful to choose meaningful explanatory variables: ones that are known before the response and which are relevant to the prediction.

One widely accepted standard to establish that an explanatory variable is related to the response is the "significance" of a hypothesis test. "Significance" is a very low standard of evidence. To be significant, the variable needs to be better than junk. That's all. Hardly "significant" in the everyday sense.

As a class, we're going to develop the theory behind a widely used test for significance. Here's the idea:

baseMod <- lm( wage ~ age + sex + educ, data=CPS85 )
coef(baseMod)
Questions

How many coefficients are there? What's the $R^2$?

exampleMod <- lm( wage ~ age + sex + educ + rand(5), data=CPS85 )
coef(exampleMod)

How many coefficients were added?

Generating Junk

Pick a number of coefficients between, say, 10 and 600. Use rand() to add that many of coefficients to the base model and find $R^2$.

The result should be a number $m$ giving the number of coefficients and the corresponding $R^2$. Enter your results into the spreadsheet at this link.

Do this several times with different $m$.

After a while, the class will assemble a good number of such instances.

Building a Theory of $R^2$

Read in the class's data.

classData <- fetchGoogle("https://docs.google.com/spreadsheet/pub?key=0Am13enSalO74dEtyWGxpWWFsN3Z0OUlZNG5xYmRVWWc&single=true&gid=0&output=csv")

Plot out the relationship between $R^2$ and $m$.

## Your plot here!  Hint: xyplot( R2 ~ m, data=classData )

Task

Explain (right here) what's going on in the relationship between $m$ and $R^2$.

Consider this particular model which might be used to investigate the possibility that race, union status, and sector of the economy might explain some of the wage.

newMod <- lm( wage ~ sex*age*educ + 
                sex*sector*race + educ*sector*union, 
              data=CPS85)
length(coef(newMod)) # this is m
rsquared(newMod)

See where newMod falls, with respect to junk, on your graph of $R^2$ versus $m$.

Task

Based on the graph, make up some test statistic that might reasonably be used in a significance test. Describe it here.

Your explanation goes here!

Talk with your colleagues and figure out how the F-test corresponds to the graph.

Your F-test description goes here

When you have this figured out, estimate F and the corresponding p-value entirely by eye.

Write down your estimate

Finally, use anova() to calculate the actual F statistic and p-value.

Remember to publish via Rpubs



dtkaplan/NIMBIOS documentation built on May 15, 2019, 4:58 p.m.