Case Study 10.3: STATS 201/8 Extra Case Study - Multiple Regression Model

title: 'Case Study 15.1: Haddock retention in a trawl'
output: html_document
vignette: >
  %\VignetteIndexEntry{CS10_3}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}


knitr::opts_chunk$set(fig.height=3)
## Do not delete this!
## It loads the s20x library for you. If you delete it 
## your document may not compile
library(s20x)

Question 1

A leading car distributor invited visitors to its website to complete a survey about how much they were willing to spend on a new car. It was of interest to see how this depended on the participant’s annual income, marital status, dependents, gender and age.

The resulting data is in the file CarSpend.csv, which contains the variables:

Variable | Description
------------|--------------------------------------------------------
LogMaxSpend | The natural log of the maximum amount the participant will spend on a new car (\$).
LogIncome | The natural log of the participant's annual income (\$).
Partner | Yes if the participant had a partner, No if they did not.
Dependents | Yes if the participant had financial dependents, No if they had none.
Sex | M if the participant was male or F if they were female. (Nobody who participated in the survey selected any other option.)
Age | The age of the participant (in years).

Notes: It is better to work with both MaxSpend and Income logged. We have already logged these in the data set so you do not need to do this. You do not need to comment on the need to log these in the Method and Assumptions section.

Instructions:

Question of interest/goal of the study

It was of interest to see how maximum spend on a new car depended on annual income, partnership status, dependents, gender and age.

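Inspect the data:

## Load the copy of the data that ships with the s20x package
load(system.file("extdata", "CarSpend.df.rda", package = "s20x"))
## Alternatively, if CarSpend.csv is in your working directory:
## CarSpend.df=read.csv("CarSpend.csv", header=TRUE, stringsAsFactors = TRUE)
pairs20x(CarSpend.df)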

Comment on plot

There is a moderate increasing relationship between LogIncome and LogMaxSpend. Males tend to spend more than females. People without partners or dependents tend to spend more, but this difference isn't as great as the gender difference. There is also a weaker increasing relationship between Age and LogMaxSpend, but there is a hint of curvature in this, with the relationship starting fairly flat. Age and LogIncome have the greatest correlation among the explanatory variables (0.42).

Fit an appropriate linear model and Check Assumptions

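## Full model with all five explanatory variables
car.fit1=lm(LogMaxSpend ~ LogIncome+Partner+Dependents+Sex+Age,data=CarSpend.df)
modelcheck(car.fit1)

summary(car.fit1)

## Age has the largest p-value, so refit without it
car.fit2=lm(LogMaxSpend ~ LogIncome+Partner+Dependents+Sex,data=CarSpend.df)

summary(car.fit2)

## Dependents is not significant either, so refit without it
car.fit3=lm(LogMaxSpend ~ LogIncome+Partner+Sex,data=CarSpend.df)
modelcheck(car.fit3)

summary(car.fit3)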

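## Confidence intervals for the coefficients (log scale)
confint(car.fit3)

## Multiplicative effect on median MaxSpend of a 50% increase in income (row 2 = LogIncome)
1.5^confint(car.fit3)[2,]

## The same effect expressed as a percentage change
100*(1.5^confint(car.fit3)[2,]-1)

## Multiplicative effect on median MaxSpend of being male rather than female (row 4 = Sex)
exp(confint(car.fit3)[4,])

## The same effect as a percentage change, formatted for the Executive Summary
100*(exp(confint(car.fit3)[4,])-1)
conf1 = as.data.frame(t(100*(exp(confint(car.fit3)[4,])-1)))
resultStr1 = paste0(sprintf("%.1f%%", conf1$`2.5 %`), " and ", sprintf("%.1f%%", conf1$`97.5 %`))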

Methods and assumption checks

We fitted a multiple linear regression model to the LogMaxSpend response variable, with explanatory variables LogIncome, Sex, Age, Partner and Dependents. Age had the highest p-value (0.78), so the model was refitted with Age removed. Dependents was then removed (p-value of 0.08) and the model refitted once more. All remaining terms were highly significant.

The residual plot showed approximately constant variability and no trend. Normality looks good and no influential points were detected. Model assumptions are satisfied.

Our model is:

$\log(MaxSpend_i) = \beta_0 + \beta_1 \times \log(Income_i) + \beta_2 \times SexMale_i + \beta_3 \times PartnerYes_i +\epsilon_i$

where $\epsilon_i \sim iid ~ N(0,\sigma^2)$, $SexMale_i=1$ if the participant is male (0 if female), and $PartnerYes_i=1$ if the participant has a partner (0 if not).

Our model explained 37.5% of the variability in the logged maximum spend.

Write sentences as if for an Executive Summary to answer the following:

Estimate the change in MaxSpend when income is increased by 50%.

Holding all other variables constant, we estimate that an increase in income of 50% is associated with an increase of between 5.4% and 11.1% in the median maximum amount that people are prepared to spend on a new car.
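The back-transformation behind this interval can be sketched as follows, assuming the fitted model stated above with $\beta_1$ as the coefficient of $\log(Income_i)$ and all other variables held fixed. The superscripts $new$ and $old$ are labels introduced here for illustration only.

$$
\log(MaxSpend^{new}) - \log(MaxSpend^{old}) = \beta_1 \left[\log(1.5 \times Income) - \log(Income)\right] = \beta_1 \log(1.5)
$$

$$
\frac{MaxSpend^{new}}{MaxSpend^{old}} = e^{\beta_1 \log(1.5)} = 1.5^{\beta_1}
$$

The percentage change is therefore $100 \times (1.5^{\beta_1} - 1)$, and applying this to each end of the confidence interval for $\beta_1$ gives the interval quoted. Because the model is fitted on the log scale, the back-transformed statement is about the median rather than the mean of MaxSpend.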

Estimate the change in MaxSpend between males and females.

Holding all other variables constant, we estimate that the median maximum amount that males are prepared to spend on a new car is between `r resultStr1[1]` higher than that for females.
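Both executive summary statements rest on the same conversion from a log-scale confidence interval to a percentage change in the median. A minimal sketch of that conversion is given here, assuming a hypothetical helper `pct_change()` that is not part of s20x; the interval used is made up for illustration and is not taken from the model output.

```r
# Hypothetical helper (not part of s20x): convert a confidence interval for a
# coefficient on the log scale into a percentage change in the median response.
# multiplier = exp(1) suits an indicator variable such as Sex;
# multiplier = 1.5 suits a 50% increase in a logged predictor such as income.
pct_change <- function(ci, multiplier = exp(1)) {
  100 * (multiplier^ci - 1)
}

# Illustrative interval only; NOT taken from the case study output
ci_example <- c(lower = 0.10, upper = 0.20)

pct_change(ci_example)                    # indicator variable
pct_change(ci_example, multiplier = 1.5)  # 50% increase in a logged predictor
```

In the case study itself, the argument would be a row of `confint(car.fit3)`, as in the calculations earlier in this question.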

Question 2

This question revisits the data from question 1.

Fit a linear regression model containing only the Age explanatory variable to explain LogMaxSpend.

car.fit4=lm(LogMaxSpend ~ Age,data=CarSpend.df)
summary(car.fit4)

What does this model reveal about the relationship between Age and LogMaxSpend?

There is strong evidence of a positive association between Age and LogMaxSpend.

You will (hopefully) have reached a different conclusion about the effect of age here than in Question 1. Provide an explanation of why this happened.

As age increases, so does income, and as income increases so does spend amount. Hence there is an indirect association between age and spend. However, once income is taken into account in explaining spend amount there is no longer an effect of age. In other words, any effect that age has on spend amount is already explained by using income as an explanatory variable.
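This explanation can be illustrated with a small simulation. The sketch below uses made-up data, not CarSpend.df, and the variable names and coefficient values are assumptions chosen only for illustration: income is generated to rise with age, and spend is generated to depend on income alone.

```r
# Simulated data for illustration only; these numbers are NOT from CarSpend.df
set.seed(201)
n <- 500
age <- runif(n, min = 20, max = 70)

# Income rises with age ...
log_income <- 10 + 0.02 * age + rnorm(n, sd = 0.3)
# ... and spend depends on income only (no direct age effect)
log_spend <- 5 + 0.5 * log_income + rnorm(n, sd = 0.3)

# Age as the only explanatory variable, then age adjusted for income
age_only <- lm(log_spend ~ age)
age_and_income <- lm(log_spend ~ log_income + age)

# Age coefficient (estimate, standard error, t value, p-value) in each model
coef(summary(age_only))["age", ]
coef(summary(age_and_income))["age", ]
```

Under this setup the age-only slope should be close to 0.5 × 0.02 = 0.01, because age stands in for income, while the age coefficient in the second model should be close to zero once income is included.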



s20x documentation built on Jan. 14, 2026, 9:07 a.m.