Home

/

GitHub

/

In dtkaplan/NIMBIOS: Data and editing support for the Teaching with R tutorial

Put your name here!

require(mosaic)
require(NIMBIOS)
require(knitr)
opts_chunk$set( tidy=FALSE )

Background

You already know that adding new terms or variables to a model will tend to increase $R^2$. Certainly $R^2$ will never go down when you add more terms. That is, the nested model cannot have a higher $R^2$ than the model in which it is nested.

To illustrate, here are three models of the heights in Galton's data:

mod1 <- lm( height ~ sex, data=Galton )
mod2 <- lm( height ~ sex + mother + father, data=Galton )
mod3 <- lm( height ~ sex * mother + sex * father, data=Galton )

Task 1

Explain --- right here --- which of the models are nested in which. Calculate $R^2$ on each model and show whether or not they follow the pattern described in the Background section.

Task 2

Create another set of two or three nested models from whatever data set you like. Show that the expected $R^2$ pattern holds.

Here are some datasets you might choose from:

CPS85 giving data from the Current Population Survey from 1985.
SwimRecords giving world-record times in the 100m freestyle.

Use help() to get an explanation.

Statistical Significance

Sometimes, the purpose of a model is to determine whether and how potential explanatory variables or the terms constructed from variables are related to the response variable. For prediction models, it's helpful to choose meaningful explanatory variables: ones that are known before the response and which are relevant to the prediction.

One widely accepted standard to establish that an explanatory variable is related to the response is the "significance" of a hypothesis test. "Significance" is a very low standard of evidence. To be significant, the variable needs to be better than junk. That's all. Hardly "significant" in the everyday sense.

As a class, we're going to develop the theory behind a widely used test for significance. Here's the idea:

Start with some model. We'll use

baseMod <- lm( wage ~ age + sex + educ, data=CPS85 )
coef(baseMod)

Questions

How many coefficients are there? What's the $R^2$?

Make some junk and use that as the explanatory variable. The rand() function lets you do this with little effort, simply specifying the number of new coefficients you want to add to the model. For instance:

exampleMod <- lm( wage ~ age + sex + educ + rand(5), data=CPS85 )
coef(exampleMod)

How many coefficients were added?

Generating Junk

Pick a number of coefficients between, say, 10 and 600. Use rand() to add that many of coefficients to the base model and find $R^2$.

The result should be a number $m$ giving the number of coefficients and the corresponding $R^2$. Enter your results into the spreadsheet at this link.

Do this several times with different $m$.

After a while, the class will assemble a good number of such instances.

Building a Theory of $R^2$

Read in the class's data.

classData <- fetchGoogle("https://docs.google.com/spreadsheet/pub?key=0Am13enSalO74dEtyWGxpWWFsN3Z0OUlZNG5xYmRVWWc&single=true&gid=0&output=csv")

Plot out the relationship between $R^2$ and $m$.

## Your plot here!  Hint: xyplot( R2 ~ m, data=classData )

Task

Explain (right here) what's going on in the relationship between $m$ and $R^2$.

Consider this particular model which might be used to investigate the possibility that race, union status, and sector of the economy might explain some of the wage.

newMod <- lm( wage ~ sex*age*educ + 
                sex*sector*race + educ*sector*union, 
              data=CPS85)
length(coef(newMod)) # this is m
rsquared(newMod)

See where newMod falls, with respect to junk, on your graph of $R^2$ versus $m$.

Task

Based on the graph, make up some test statistic that might reasonably be used in a significance test. Describe it here.

Your explanation goes here!

Talk with your colleagues and figure out how the F-test corresponds to the graph.

Your F-test description goes here

When you have this figured out, estimate F and the corresponding p-value entirely by eye.

Write down your estimate

Finally, use anova() to calculate the actual F statistic and p-value.

Remember to publish via Rpubs

dtkaplan/NIMBIOS documentation built on May 15, 2019, 4:58 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

dtkaplan/NIMBIOS
Data and editing support for the Teaching with R tutorial

In dtkaplan/NIMBIOS: Data and editing support for the Teaching with R tutorial

Put your name here!

Background

Task 1

Task 2

Statistical Significance

Questions

Generating Junk

Building a Theory of $R^2$

Task

Task

Remember to publish via Rpubs

R Package Documentation

Browse R Packages

We want your feedback!

dtkaplan/NIMBIOS Data and editing support for the Teaching with R tutorial

In dtkaplan/NIMBIOS: Data and editing support for the Teaching with R tutorial

Put your name here!

Background

Task 1

Task 2

Statistical Significance

Questions

Generating Junk

Building a Theory of $R^2$

Task

Task

Remember to publish via Rpubs

R Package Documentation

Browse R Packages

We want your feedback!

dtkaplan/NIMBIOS
Data and editing support for the Teaching with R tutorial