# Do not delete this! # It loads the s20x library for you. If you delete it # your document may not compile it. require(s20x) knitr::opts_chunk$set( dev = "png", fig.ext = "png", dpi = 96 )
The question of interest is "Do females spend more money on their hair than males?" Also, "how much do students typically spend on haircuts?" To answer these questions, a lecturer carried out a survey on 200 students. Also, A variety of variables were measured including hair and sex.
The variables of interest are:
hair: The student's estimated monthly expenditure on haircuts.sex: The student's gender, Male or Female.Do females spend more money on their hair than males? Also, how much do students typically spend on haircuts?
load(system.file("extdata", "survey.df.rda", package = "s20x"))
survey.df = read.table("survey.txt", header = TRUE, stringsAsFactors = T) # To make things a little less cluttered we put the data in its own dataframe hair.df = with(survey.df, data.frame(hair = hair, sex = sex)) plot(hair ~ sex, data = hair.df)
hair.df = with(survey.df, data.frame(hair = hair, sex = sex)) plot(hair ~ sex, data = hair.df)
Females seem to spend more money on haircuts than males. We can see the data is quite right-skewed. There also appears to be a problem with equality of variance. Maybe taking logs will help.
plot(log(hair) ~ sex, data = hair.df)
Definitely makes things nicer, we should stick with the log-scale. So perhaps we should perform our inference on the log-scale. Even on the log-scale females appear to be spending more money on haircuts than males.
hair.fit = lm(log(hair) ~ sex, data = hair.df) plot(hair.fit, which = 1) normcheck(hair.fit) cooks20x(hair.fit) summary(hair.fit) # Column bind the backtransformed output together cbind(exp(coef(hair.fit)), exp(confint(hair.fit)))
pred.df = data.frame(sex = c("female", "male")) exp(predict(hair.fit, pred.df, interval = "confidence"))
conf1=as.data.frame(exp(predict(hair.fit, pred.df, interval = "confidence"))) resultStr0 = sprintf("$%s", format(round(conf1$fit,0), big.mark = ",", trim = TRUE) ) resultStr1 = sprintf("$%s and $%s", format(round(conf1$lwr,0), big.mark = ",", trim = TRUE), format(round(conf1$upr,0), big.mark = ",", trim = TRUE) ) conf2 = as.data.frame(abs(100*(exp(confint(hair.fit)) - 1))) resultStr2 = sprintf("%s%% and %s%%", format(round(conf2$`97.5 %`,0), big.mark = ",", trim = TRUE), format(round(conf2$`2.5 %`,0), big.mark = ",", trim = TRUE) )
The boxplot of hair vs sex revealed that the data were right-skewed. Hence, we logged hair. The boxplot of log(hair) vs sex show visible improvement. So, we have fitted a linear model with log(hair) being explained by sex.
We probably wouldn't say the data is normal from the Q-Q plot. What we have is evidence of a lot of rounding---people rounding to the nearest $5, $10, etc. However, the CLT should make things okay. The histogram of the residuals was unimodal and reasonably symmetric. The rest of our model assumptions have been satisfied.
Our final model is $$log(Hair_i) = \beta_0 + \beta_1 \times SexMale_i + \epsilon_i,$$ where $\epsilon_i \sim iid ~ N(0,\sigma^2)$. Here $SexMale_i=1$ if the student was male, otherwise it was zero.
Our model explained 29% of the variability in the logged students' hair expenditure.
We have strong evidence that females typically spend more money on their hair than males.
We estimate that males spend approximately half as much as females do on their hair. We are confident this factor is between r resultStr2[2].
We estimate the median amount females spend on their hair each month is r resultStr0[1] and are confident it is between r resultStr1[1].
For males we estimate a median spend of r resultStr0[2] and are confident it is between r resultStr1[2].
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.