This case study is from Chapter 6.7 of Roback, P. & Legler, J. Beyond Multiple Linear Regression. (2021). https://bookdown.org/roback/bookdown-BeyondMLR/.
For the background of this study, and information about the data organisation, please see the reading.
library(tidyverse) # get the data from the textbook risk2009 <- read_csv( "https://raw.githubusercontent.com/proback/BeyondMLR/master/data/risk2009.csv") # create binary sex variable that takes 1 if respondent is 'female' risk2009 <- risk2009 %>% mutate(female = ifelse(sex=="Female", 1, 0))
# What proportion of the sample are trying to lose weight? # What proportion are female? # What proportion play on one or more sports teams? # What proportion report each level of TV watching? # What are the mean and median BMIs for this sample? # Are the proportions of respondents seeiking to lose weight similar across sex? # sport? # media?
# This code creates a nicely formatted table # with the information from the above table. # We could also have done this more programmatially, sex <- c("Female"," ","Male"," ") wls <- c("No weight loss","Lose weight","No weight loss","Lose weight") mbmip <- c("43.2","72.4","58.8","85.7") sds <- c("25.8", "23.0", "28.2", "18.0") ncol <- c(89, 125, 157, 74) #ccbmiXsex.tab table3chp6 <- data.frame(sex,wls,mbmip,sds,ncol) colnames(table3chp6) <- c("Sex","Weight loss status","mean BMI percentile","SD","n") knitr::kable(table3chp6, caption="Mean BMI percentile by sex and desire to lose weight.")
The text in this section is directly from the reading Chapter 6 (ยง 6.7) of Roback, P. & Legler, J. Beyond Multiple Linear Regression. (2021). https://bookdown.org/roback/bookdown-BeyondMLR/.
If we consider including a BMI term in our model(s), the logit should be linearly related to BMI. We can investigate this assumption by constructing an empirical logit plot. In order to calculate empirical logits, we first divide our data by sex. Within each sex, we generate 10 groups of equal sizes, the first holding the bottom 10\% in BMI percentile for that sex, the second holding the next lowest 10\%, etc. Within each group, we calculate the proportion, $\hat{p}$ that reported wanting to lose weight, and then the empirical log odds, $log(\frac{\hat{p}}{1-\hat{p}})$, that a young person in that group wants to lose weight.
# create 10 groups of BMI levels risk2009 <- risk2009 %>% group_by(sex) %>% mutate(BMIcuts = cut_number(bmipct,10)) emplogit1 <- risk2009 %>% group_by(sex, BMIcuts) %>% summarise(prop.lose = mean(lose.wt.01), n = n(), midpoint = median(bmipct), .groups = "drop") %>% mutate(prop.lose = ifelse(prop.lose==0, .01, prop.lose), emplogit = log(prop.lose / (1 - prop.lose))) ggplot(emplogit1, aes(x = midpoint, y = emplogit, color = sex)) + geom_point(aes(shape = sex)) + geom_smooth(aes(linetype = sex), method = "lm", formula = "y~x") + xlab("BMI percentile") + ylab("Empirical logits") + theme_minimal() + scale_color_brewer(palette = "Dark2")
Figure \@ref(fig:logitBMIsex) presents the empirical logits for the BMI intervals by sex. Both males and females exhibit an increasing linear trend on the logit scale indicating that increasing BMI is associated with a greater desire to lose weight and that modeling log odds as a linear function of BMI is reasonable. The slope for the females appears to be similar to the slope for males, so we do not need to consider an interaction term between BMI and sex in the model.
ggplot(data = risk2009, aes(x = as.factor(media), fill = lose.wt)) + geom_bar(position = "fill") + facet_wrap(~sex) + ylab("Proportion") + scale_fill_grey() + theme_minimal() + xlab("Media")
emplogit2 <- risk2009 %>% group_by(sex, media) %>% summarise(prop.lose = mean(lose.wt.01), n = n()) %>% mutate(prop.lose = ifelse(prop.lose==0, .01, prop.lose)) %>% mutate(emplogit = log(prop.lose / (1 - prop.lose))) ggplot(emplogit2, aes(x = media, y = emplogit, color = sex)) + geom_point(aes(shape = sex)) + geom_smooth(aes(linetype = sex), method = "lm") + xlab("TV hours per day") + ylab("Empirical logits") + theme_minimal()
female
as the only predictor.model1 <- glm(lose.wt.01 ~ female, family = binomial, data = risk2009) coef(summary(model1))
Our estimated binomial regression model is:
[\log\left(\frac{\hat{p}}{1-\hat{p}}\right)=-0.75+1.09\cdot \textrm{female}]
exp(coef(model1)) exp(confint(model1))
female
, BMI and the interaction of these varaibles as predictorsfemale
, BMI and NO interaction of these varaibles as predictors.Our estimated binomial regression model is:
[\log\left(\frac{\hat{p}}{1-\hat{p}}\right)= -4.26+1.86\cdot\textrm{female}+0.047\cdot\textrm{bmipct}]
Note from the the reading: Sports participation was considered for inclusion in the model in three ways: an indicator of sports participation (0 = no teams, 1 = one or more teams), treating the number of teams (0, 1, 2, or 3) as numeric, and treating the number of teams as a factor. The models below treat sports participation using an indicator variable, but all three models produced similar results.
(The reading also discusses the media
variable and further interactions with sport
, these has been omitted here.)
Comparing models using differences in deviances requires that the models be nested, meaning each smaller model is a simplified version of the larger model. In our case, Models 1, 2, and 3 are nest. You'll see additional models fit in the reading, where Models 3 and 4 cannot be compared using a deviance test. There is an example of applying AIC.
Pay special attention to the Discussion section of this reading. It is a good example of discussing limitations for an analysis, which you will have to do for future weekly writing (TBC) and in your final project.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.