title: 'Case Study 15.1: Haddock retention in a trawl'
output: html_document
vignette: >
  %\VignetteIndexEntry{CS10_3}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
knitr::opts_chunk$set(fig.height=3)
## Do not delete this!
## It loads the s20x library for you. If you delete it
## your document may not compile
library(s20x)
A leading car distributor invited visitors to its website to complete a survey about how much they were willing to spend on a new car. It was of interest to see how this depended on the participant’s annual income, marital status, dependents, gender and age.
The resulting data are in the file CarSpend.csv, which contains the variables:
Variable | Description
------------|--------------------------------------------------------
logMaxSpend | The natural log of the maximum amount the participant will spend on a new car (\$).
logIncome | The natural log of the participant's annual income (\$).
Partner | Yes if the participant had a partner, No if they did not.
Dependents | Yes if the participant had financial dependents, No if they had none.
Sex | M if the participant was male, F if they were female. (Nobody who participated in the survey selected any other option.)
Age | The age of the participant (in years).
Notes: It is better to work with both MaxSpend and Income logged. We have already logged these in the data set so you do not need to do this. You do not need to comment on the need to log these in the Method and Assumptions section.
Instructions:
It was of interest to see how maximum spend on a new car depended on annual income, partnership status, dependents, gender and age.
load(system.file("extdata", "CarSpend.df.rda", package = "s20x"))
CarSpend.df=read.csv("CarSpend.csv", header=TRUE, stringsAsFactors = TRUE)
pairs20x(CarSpend.df)
There is a moderate increasing relationship between LogIncome and LogMaxSpend. Males tend to spend more than females. People without partners or dependents tend to spend more, but this difference isn't as great as the gender difference. There is also a weaker increasing relationship between Age and LogMaxSpend, but there is a hint of curvature in this, with the relationship starting fairly flat. Among the explanatory variables, Age and LogIncome are the most strongly correlated (0.42).
car.fit1=lm(LogMaxSpend ~ LogIncome+Partner+Dependents+Sex+Age,data=CarSpend.df)
modelcheck(car.fit1)
summary(car.fit1)
car.fit2=lm(LogMaxSpend ~ LogIncome+Partner+Dependents+Sex,data=CarSpend.df)
summary(car.fit2)
car.fit3=lm(LogMaxSpend ~ LogIncome+Partner+Sex,data=CarSpend.df)
modelcheck(car.fit3)
summary(car.fit3)
confint(car.fit3)
1.5^confint(car.fit3)[2,]
100*(1.5^confint(car.fit3)[2,]-1)
exp(confint(car.fit3)[4,])
100*(exp(confint(car.fit3)[4,])-1)
conf1 = as.data.frame(t(100*(exp(confint(car.fit3)[4,])-1)))
resultStr1 = paste0(sprintf("%.1f%%", conf1$`2.5 %`), " and ", sprintf("%.1f%%", conf1$`97.5 %`))
We fitted a multiple linear regression model to the logMaxSpend response variable, with explanatory variables logIncome, Sex, Age, Partner and Dependents. Age had the highest p-value (0.78), so it was removed and the model refitted. Dependents was then removed (p-value of 0.08) and the model refitted once more. All remaining terms were highly significant.
The residual plot showed approximately constant variability and no trend. Normality looks good and no influential points were detected. Model assumptions are satisfied.
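The term-by-term removal described above can be sketched as a single elimination step on simulated data. Everything in this block (the data frame `elim.df`, the variables `x1` to `x3` and `y`, and the fit names) is hypothetical and only illustrates the mechanics; it is not the analysis of the car spend data.

```r
# Simulated data: only x1 truly affects y (hypothetical example).
set.seed(1)
n <- 200
elim.df <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
elim.df$y <- 1 + 0.8 * elim.df$x1 + rnorm(n)

# Fit the full model and pull out the p-value of every term except the intercept.
fit.full <- lm(y ~ x1 + x2 + x3, data = elim.df)
pvals <- summary(fit.full)$coefficients[-1, "Pr(>|t|)"]
print(pvals)

# Identify the term with the largest p-value. In practice, check that this
# p-value really is non-significant before dropping the term.
worst <- names(which.max(pvals))

# Refit without that term. Coefficient names equal term names here because
# all predictors are numeric; factor terms would need the term name instead.
fit.reduced <- update(fit.full, as.formula(paste(". ~ . -", worst)))
summary(fit.reduced)
```

Repeating the last two steps until every remaining term is significant reproduces the sequence of refits used in this case study.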
Our model is:
$\log(MaxSpend_i) = \beta_0 + \beta_1 \times \log(Income_i) + \beta_2 \times SexMale_i + \beta_3 \times PartnerYes_i +\epsilon_i$
where $\epsilon_i \sim iid ~ N(0,\sigma^2)$, $SexMale_i=1$ if the participant is male (0 if female), and $PartnerYes_i=1$ if the participant has a partner (0 if not).
Our model explained 37.5% of the variability in the logged maximum spend.
Holding all other variables constant, we estimate that an increase in income of 50% is associated with an increase of between 5.4% and 11.1% in the median maximum amount that people are prepared to spend for a new car.
Holding all other variables constant, we estimate that the median maximum amount that males are prepared to spend on a new car is between `r resultStr1[1]` higher than that for females.
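The back-transformations behind these two intervals can be written out using the model's own symbols. This is a sketch of the standard reasoning for a logged response, not output from the fitted model. Because the errors are symmetric on the log scale and the exponential is monotone, exponentiating the fitted line gives the median of $MaxSpend$.

If income is multiplied by 1.5 while sex and partnership status are held fixed, the change in the median log spend is

$$\beta_1 \left[\log(1.5 \times Income) - \log(Income)\right] = \beta_1 \log(1.5),$$

so the median maximum spend is multiplied by

$$e^{\beta_1 \log(1.5)} = 1.5^{\beta_1},$$

a percentage change of $100 \times (1.5^{\beta_1} - 1)$. Similarly, moving from female to male with the other variables fixed adds $\beta_2$ on the log scale, so the median is multiplied by $e^{\beta_2}$, a percentage change of $100 \times (e^{\beta_2} - 1)$.

Applying these transformations to the endpoints of the confidence intervals for $\beta_1$ and $\beta_2$ gives the intervals quoted above. Note that the row indices used with `confint(car.fit3)` follow the order of terms in the model formula (intercept, LogIncome, Partner, Sex), not the subscripts in the equation, so the Sex coefficient is in row 4.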
This question revisits the data from question 1.
What does this model reveal about the relationship between Age and logMaxSpend?
You will have (hopefully) reached a different conclusion regarding the effect of age in (b) above than in the executive summary from question 1. Provide an explanation of why this happened.
car.fit4=lm(LogMaxSpend ~ Age,data=CarSpend.df)
summary(car.fit4)
There is strong evidence of a positive association between Age and logMaxSpend.
As age increases, so does income, and as income increases so does spend amount. Hence there is an indirect association between age and spend. However, once income is taken into account in explaining spend amount there is no longer an effect of age. In other words, any effect that age has on spend amount is already explained by using income as an explanatory variable.
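The explanation above can be illustrated with a small simulation in which age influences spend only through income. The data frame `conf.df`, its variable names and all the numeric settings are assumptions chosen for illustration; they are not estimates from the car spend data.

```r
# Simulated data: age raises income, income raises spend,
# and age has no direct effect on spend (hypothetical settings).
set.seed(42)
n <- 2000
age <- runif(n, 20, 70)
log_income <- 10 + 0.02 * age + rnorm(n, sd = 0.5)
log_spend <- 5 + 0.5 * log_income + rnorm(n, sd = 0.3)
conf.df <- data.frame(age, log_income, log_spend)

# Age on its own picks up the indirect effect that flows through income.
fit.age <- lm(log_spend ~ age, data = conf.df)
summary(fit.age)$coefficients

# Once income is in the model, the age coefficient shrinks towards zero.
fit.both <- lm(log_spend ~ log_income + age, data = conf.df)
summary(fit.both)$coefficients
```

Comparing the `age` rows of the two coefficient tables shows the same pattern as `car.fit4` versus `car.fit1`: a clear association when age is the only predictor, and little left to explain once income is included.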