require(s20x) knitr::opts_chunk$set(echo = TRUE)
The data in this case study were collected in an investigation of environmental causes of disease. They show the annual mortality rate per 100,000 for males, averaged over the years 1958--1964, and the calcium concentration (in parts per million) in the drinking water supply for 61 large towns in England and Wales. (The higher the calcium concentration, the harder the water.) Towns at least as far north as Derby are identified in the data set with the code N. In this study we will use R to investigate how are mortality and water hardness related, and if there a geographical factor in the relationship. The data is in the file water.csv on Canvas
load(system.file("extdata", "water.df.rda", package = "s20x"))
water.df = read.csv("WATER.csv") head(water.df)
head(water.df)
It's always useful to check for missing values.
sum(is.na(water.df))
No missing values. All good. How about a plot? The ideal plot would use Location as a plotting symbol, however we can see that the towns to the South are coded as blanks. We should change that. How? How about making all the values that are not N be S?
water.df$Location[water.df$Location != 'N'] = 'S'
Hmm. That did not work, as planned. That is because Location is a factor. We have a number of choices. One is to make a new variable. The other is to make Location a character vector, make the change and, then make it a factor again. Although this second option sounds like a lot of work it is really only one line of code. It is preferable because we will not clutter our workspace with redundant information.
water.df$Location = with(water.df, { Location = as.character(Location); Location[is.na(Location)] = 'S'; Location = factor(Location) }) water.df$Location
Much better. Now how about that plot?
plot(Mortality~Ca, pch = as.character(water.df$Location), data = water.df)
water.fit = lm(Mortality ~ Ca * Location, data = water.df) plot(water.fit, which = 1)
normcheck(water.fit)
cooks20x(water.fit)
All looks pretty good.
summary(water.fit)
It does not look like there is any evidence of different slopes. Do we get any additional info from the ANOVA table?
anova(water.fit)
Nope. We only have two levels so we do not need the ANOVA table for this analysis.
So we do not have interaction, but it looks like there is a relationship with hardness and it looks like there is a North/South effect (Location). We should refit the model.
water.fit2 = lm(Mortality ~ Ca + Location, data = water.df) summary(water.fit2)
Minimal change in $R^2$. How do the fitted lines look on the plot?
b = coef(water.fit2) plot(Mortality~Ca, pch = as.character(water.df$Location), data = water.df) abline(b[1:2], col = "blue", lty = 2, lwd = 2) abline(b[1] + b[3], b[2], col = "red", lty = 2, lwd = 2) legend("topright", lty = 2, lwd = 2, col = c("blue", "red"), legend = c("North", "South"), bty = "n") # This code puts some confidence bounds around the lines. # It's pretty complicated if you don't know R and it isn't examinable pred.df = data.frame(Ca = rep(seq(0, 160, by = 20), 2), Location = rep(c('N', 'S'), c(9, 9))) water.pred = predict(water.fit, newdata = pred.df, interval = "confidence") for(i in 1:2){ idx = 1:9 + (i - 1) * 9 for(j in c("lwr", "upr")){ lines(pred.df$Ca[idx], water.pred[idx, j], lty = 3, lwd = 2, col = "lightgrey"); } }
So it looks like mortality decreases as the calcium concentration in the water increases, and that mortality is lower in the South than the North.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.