Case Study 2.1: Exam vs Test
In s20x: Functions for University of Auckland Course STATS 201/208 Data Analysis

# Do not delete this!
# It loads the s20x library for you. If you delete it,
# your document may not compile.
require(s20x)

knitr::opts_chunk$set(
  dev = "png",
  fig.ext = "png",
  dpi = 96
)

Problem

We wish to quantify the relationship between test mark and exam mark, mainly so that we can predict a student's exam mark from their test mark (to aid in making decisions about aegrotat passes for students who do not sit the exam). In particular, we want to predict a student's exam mark when their test mark is 0, 10, or 20.

The variables of interest are:

Exam: Exam mark out of 100.
Test: Test mark out of 20.

Question of Interest

We want to build a model to predict exam marks from test marks. In particular, we want to predict a student's exam mark when their test mark is 0, 10, or 20.

load(system.file("extdata", "Stats20x.df.rda", package = "s20x"))

Read in and Inspect the Data

Stats20x.df = read.table("STATS20x.txt", header = TRUE)
plot(Exam ~ Test, data = Stats20x.df)


The plot reveals a positive linear relationship between exam marks and test marks.

Model Building and Check Assumptions

examTest.fit = lm(Exam ~ Test, data = Stats20x.df)
plot(examTest.fit, which = 1)
normcheck(examTest.fit)
cooks20x(examTest.fit)
summary(examTest.fit)
confint(examTest.fit)

cf = as.data.frame(confint(examTest.fit))
resultConf = paste0(sprintf("%.1f", cf[2,1]), " to ", sprintf("%.1f", cf[2,2]))
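
The slope interval reported by `confint` can be reproduced by hand as estimate plus or minus a t multiplier times the standard error. The sketch below illustrates this on simulated stand-in data, not the Stats20x data; the names `sim.df` and `sim.fit` and the parameters used to generate the marks are arbitrary assumptions made for illustration only.

```r
# Simulated stand-in data (NOT the Stats20x data); generating values are arbitrary.
set.seed(1)
sim.df = data.frame(Test = round(runif(100, 0, 20), 1))
sim.df$Exam = 10 + 4 * sim.df$Test + rnorm(100, sd = 12)

sim.fit = lm(Exam ~ Test, data = sim.df)

# Slope estimate and its standard error from the coefficient table.
coefs = summary(sim.fit)$coefficients
est = coefs["Test", "Estimate"]
se = coefs["Test", "Std. Error"]

# 95% interval: estimate +/- (t quantile on the residual df) * standard error.
tMult = qt(0.975, df = df.residual(sim.fit))
manualCI = c(lower = est - tMult * se, upper = est + tMult * se)

manualCI
confint(sim.fit)["Test", ]  # built-in interval for comparison
```

The two intervals should agree up to floating-point error, which is a quick way to confirm what the numbers behind `resultConf` represent.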

Prediction Output

predTest.df = data.frame(Test = c(0, 10, 20))
predict(examTest.fit, predTest.df, interval = "prediction")

p = as.data.frame(predict(examTest.fit, predTest.df, interval = "prediction"))
resultStr = paste0(sprintf("%.1f", p$lwr), " to ", sprintf("%.1f", p$upr))
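
To see why prediction intervals for individual students are wide, it can help to rebuild one by hand. The sketch below uses the standard simple-linear-regression formula, in which the standard error for a new observation includes the residual standard deviation itself (the `1 +` term) as well as the uncertainty in the fitted line. It runs on simulated stand-in data; `sim.df`, `sim.fit`, `new.df`, and the generating parameters are assumptions for illustration, not part of the case study.

```r
# Simulated stand-in data (NOT the Stats20x data); generating values are arbitrary.
set.seed(1)
sim.df = data.frame(Test = round(runif(100, 0, 20), 1))
sim.df$Exam = 10 + 4 * sim.df$Test + rnorm(100, sd = 12)

sim.fit = lm(Exam ~ Test, data = sim.df)
new.df = data.frame(Test = c(0, 10, 20))

# Ingredients of the prediction-interval formula.
fitVals = predict(sim.fit, new.df)          # point predictions
s = summary(sim.fit)$sigma                  # residual standard deviation
n = nrow(sim.df)
xbar = mean(sim.df$Test)
Sxx = sum((sim.df$Test - xbar)^2)

# Standard error for a NEW observation: the leading 1 is the individual scatter.
sePred = s * sqrt(1 + 1 / n + (new.df$Test - xbar)^2 / Sxx)
tMult = qt(0.975, df = n - 2)

manualPI = cbind(fit = fitVals,
                 lwr = fitVals - tMult * sePred,
                 upr = fitVals + tMult * sePred)
builtinPI = predict(sim.fit, new.df, interval = "prediction")

manualPI
builtinPI  # built-in result for comparison
```

Dropping the leading `1` inside the square root gives the narrower confidence interval for the mean exam mark at each test mark, which is what `interval = "confidence"` returns.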

Method and Assumption Checks

A scatter plot of exam marks vs test marks showed a linear association with approximately constant scatter and so a linear model was fitted.

All model assumptions appear to be satisfied - a slight trend in the residual plot was observed but does not seem to be of major concern.
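
For readers without s20x installed, the same kinds of checks can be approximated with base R. The sketch below is a rough analogue only and is not a description of how `normcheck` or `cooks20x` work internally. It uses simulated stand-in data; `sim.df`, `sim.fit`, and the 0.4 Cook's distance cut-off are assumptions made for illustration.

```r
# Simulated stand-in data (NOT the Stats20x data); generating values are arbitrary.
set.seed(1)
sim.df = data.frame(Test = round(runif(100, 0, 20), 1))
sim.df$Exam = 10 + 4 * sim.df$Test + rnorm(100, sd = 12)

sim.fit = lm(Exam ~ Test, data = sim.df)
res = residuals(sim.fit)

# Residuals vs fitted values: look for trend or changing spread.
plot(fitted(sim.fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normal Q-Q plot of the residuals: look for departures from the line.
qqnorm(res)
qqline(res)

# Cook's distance for each observation; 0.4 is an assumed rule-of-thumb cut-off.
cd = cooks.distance(sim.fit)
which(cd > 0.4)  # indices of unduly influential points, if any
```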

Our final model is $$\text{Exam}_i=\beta_0 +\beta_1\times \text{Test}_i+\epsilon_i,$$ where $\epsilon_i \overset{\text{iid}}{\sim} N(0,\sigma^2)$.

Our model explained a modest 59% of the variability in the students' final exam marks.
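
The "variability explained" figure is the model's $R^2$, which can be computed directly from the residual and total sums of squares. The sketch below shows the calculation on simulated stand-in data, so the value it produces is unrelated to the figure quoted above; `sim.df` and `sim.fit` are hypothetical names.

```r
# Simulated stand-in data (NOT the Stats20x data); generating values are arbitrary.
set.seed(1)
sim.df = data.frame(Test = round(runif(100, 0, 20), 1))
sim.df$Exam = 10 + 4 * sim.df$Test + rnorm(100, sd = 12)

sim.fit = lm(Exam ~ Test, data = sim.df)

# R-squared = 1 - (residual sum of squares) / (total sum of squares).
rss = sum(residuals(sim.fit)^2)
tss = sum((sim.df$Exam - mean(sim.df$Exam))^2)
r2 = 1 - rss / tss

r2
summary(sim.fit)$r.squared         # value reported by summary()
cor(sim.df$Test, sim.df$Exam)^2    # squared correlation, for a single predictor
```

With a single explanatory variable, all three quantities coincide.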

Executive Summary

We were interested in building a model to predict exam mark from test mark.

There was a significant linear relationship between test mark and exam mark (P-value $\approx$ 0). We estimate that each additional test mark (out of 20) is associated with an increase in exam mark of between `r resultConf[1]` marks (out of 100), on average.

For test marks of 0, 10 and 20, we predict exam marks (for individual students) of `r resultStr[1]`, `r resultStr[2]`, and `r resultStr[3]`, respectively. These intervals are very wide\footnote{Due to considerable variability remaining even after taking into account the test mark.} and some of them have bounds outside the feasible range of exam marks (0-100). The model is therefore not reliable for predicting an individual student's exam mark.
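
A quick way to check which prediction intervals stray outside the feasible range is to tabulate their widths and flag any bound below 0 or above 100. The sketch below does this on simulated stand-in data, so its output says nothing about the actual Stats20x intervals; `sim.df`, `sim.fit`, `new.df`, and the column names `width`, `belowZero`, and `above100` are hypothetical.

```r
# Simulated stand-in data (NOT the Stats20x data); generating values are arbitrary.
set.seed(1)
sim.df = data.frame(Test = round(runif(100, 0, 20), 1))
sim.df$Exam = 10 + 4 * sim.df$Test + rnorm(100, sd = 12)

sim.fit = lm(Exam ~ Test, data = sim.df)
new.df = data.frame(Test = c(0, 10, 20))

# Prediction intervals as a data frame with columns fit, lwr, upr.
pred = as.data.frame(predict(sim.fit, new.df, interval = "prediction"))
pred$Test = new.df$Test

# Interval width, and flags for bounds outside the 0-100 exam mark scale.
pred$width = pred$upr - pred$lwr
pred$belowZero = pred$lwr < 0
pred$above100 = pred$upr > 100

pred
```

Flagged bounds are a symptom of the normal-error model ignoring the bounded mark scale; truncating them to 0 or 100 would tidy the output but would not make the intervals any narrower.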