Case Study 5.1: Exam vs Attendance
In s20x: Functions for University of Auckland Course STATS 201/208 Data Analysis

## Do not delete this!
## It loads the s20x library for you. If you delete it 
## your document may not compile it.
require(s20x)

knitr::opts_chunk$set(
  dev = "png",
  fig.ext = "png",
  dpi = 96
)

Problem

We wish to determine if regular attendance in class is associated with exam mark. In particular, we want to estimate the expected exam marks of attendees and non-attendees, and predict the actual exam scores of individual attendees and non-attendees.

The variables of interest are:

Exam: Exam mark out of 100.
Attend: A two-level categorical variable which has the levels Yes and No.

Question of Interest

We want to quantify the relationship between exam marks and attendance. In particular, we want to estimate the expected exam marks of attendees and non-attendees, and predict the actual exam scores of individual attendees and non-attendees.

Read in and Inspect the Data

load(system.file("extdata", "Stats20x.df.rda", package = "s20x"))

Stats20x.df = read.table("STATS20x.txt", header = T)
# Could of also used plot(Exam ~ Attend, data = Stats20x.df)
boxplot(Exam ~ Attend, data = Stats20x.df)

boxplot(Exam ~ Attend, data = Stats20x.df)

The exam marks for the students attended seem to be centred than the students who did not attend lectures regularly. The boxplots look reasonably symmetric in both groups.

Model Building and Check Assumptions

examAttend.fit = lm(Exam ~ Attend, data = Stats20x.df)
eovcheck(examAttend.fit)
normcheck(examAttend.fit)
cooks20x(examAttend.fit)
summary(examAttend.fit)
confint(examAttend.fit)

Get additional Prediction Intervals

predAttend.df = data.frame(Attend = c("No", "Yes"))

# Using `interval = "confidence"' to get CI for expected exam score
# for each level of Attend
predict(examAttend.fit, predAttend.df, interval = "confidence")
# Using `interval = "predition"' to get PI for individual student's exam score 
# for each level of Attend
predict(examAttend.fit, predAttend.df, interval = "prediction")

conf = as.data.frame(predict(examAttend.fit, predAttend.df, interval = "confidence"))
resultStr1 = paste0(sprintf("%.1f", conf$lwr), " to ", sprintf("%.1f", conf$upr))

pre = as.data.frame(predict(examAttend.fit, predAttend.df, interval = "prediction"))
resultStr2 = paste0(sprintf("%.1f", pre$lwr), " to ", sprintf("%.1f", pre$upr))

Method and Assumption Checks

We wish to explain exam marks with attendance, a two-level factor. So, we have fitted a linear model with a single explanatory dummy variable. (Note, this is equivalent to conducting a two-sample t-test).

Four non-attending students did unusually well (i.e., large positive residuals), but since the sample size was large, this will be of little consequence. Hence, all model assumptions were satisfied.

Our final model is $$Exam_i=\beta_0 +\beta_1 \times Attend.Yes_i+\epsilon_i,$$ where $\epsilon_i \sim iid ~ N(0,\sigma^2)$. Here $Attend.Yes_i=1$ if the student regularly attended (i.e., answered "Yes"), otherwise it is zero (i.e. "No").

Our model explained a small 15% of the variability in the students' final exam marks.

Executive Summary

We wanted to quantify the relationship between exam marks and attendance.

There was strong evidence that exam marks were higher for students who attend class versus students who didn't attend class (P-value $\approx 10^{-6}$). We estimate that regular attendance could increase their expected exam mark between 9.5 to 21.6 exam marks (out of 100).

The expected exam marks of non-attendees and attendees are between r resultStr1[1], and r resultStr1[2], respectively.

The predicted exam marks for individual non-attendees and attendees are between r resultStr2[1], and r resultStr2[2], respectively.

Our model only explains 15% of the variability in the students' final exam marks, this would not be very good for prediction. We can see this in how wide our prediction intervals are.