knitr::opts_chunk$set(echo = TRUE, warning=FALSE)
suppressWarnings(suppressMessages(suppressPackageStartupMessages(library(ggplot2))))
suppressWarnings(suppressMessages(suppressPackageStartupMessages(library(mosaic))))
suppressWarnings(suppressMessages(suppressPackageStartupMessages(library(plotly))))

Introduction

Linear regression is a useful technique in many sciences. One of the first studies was performed by Francis Galton, the nephew of Charles Darwin. Galton wanted to study the relationship between the height of parents and their children. We will discover that there is a significant linear trend for the population but that the individual variability means that the prediction is not very certain for a specific individual.

We can get the data from the mosaic package.

library(mosaic)
library(pander)
data(Galton)
pander(head(Galton, 10))

We want to add a column of data for the average parent height

Galton$midparentheight=(Galton$father+Galton$mother)/2

Plot the data for both males and females

Let's plot the data and let ggplot2() add the fits so we can easily see the results. We will use the facet_wrap() option to separate male and female genders.

library(grid)
plt <- ggplot(data = Galton, aes(x = height, y = midparentheight)) +
       geom_point(aes(colour = sex), size = 1.5) +
       geom_smooth(aes(colour = sex, fill = sex), method = "lm" ,size = .5) +
       facet_wrap(~ sex) +
       ggtitle("Galton height data with linear models") +
       theme(panel.spacing = unit(2, "lines"),
             axis.text=element_text(size=12),
             axis.title=element_text(size=12),
             plot.title=element_text(hjust = 0.5)) # center the title

print(plt)

Look more closely at the fits

The fit for females

female <- Galton[which(Galton$sex=='F'), ]
femaleLM <- lm(height ~ midparentheight, data = female)
pander(summary(femaleLM))

The fit for males

male <- Galton[which(Galton$sex=='M'), ]
maleLM <- lm(height ~ midparentheight, data = male)
pander(summary(maleLM))

Summary

We see that both fits are statistically significant for their respective populations. That said, the variance in the population tells us that the model cannot tell us much about a single individual.



jrminter/statshelpR documentation built on May 2, 2020, 12:08 a.m.