This case study is from Chapter 6.7 of Roback, P. & Legler, J. Beyond Multiple Linear Regression. (2021). https://bookdown.org/roback/bookdown-BeyondMLR/.

For the background of this study, and information about the data organisation, please see the reading.

Exploratory Data Analysis

library(tidyverse)

# get the data from the textbook
risk2009 <- read_csv(
  "https://raw.githubusercontent.com/proback/BeyondMLR/master/data/risk2009.csv")

# create binary sex variable that takes 1 if respondent is 'female'
risk2009 <- risk2009 %>%
  mutate(female = ifelse(sex=="Female", 1, 0))

Task 1: Glimpse the data.


Task 2: Calculate some basic summaries.

# What proportion of the sample are trying to lose weight?

# What proportion are female?


# What proportion play on one or more sports teams?

# What proportion report each level of TV watching?


# What are the mean and median BMIs for this sample?


# Are the proportions of respondents seeiking to lose weight similar across sex?

# sport?

# media?

Task 3: Create a stacked bar chart that displays the proportion of respondents with each waitlost group, for each sex


Task 4: Create a summary table with mean, sd and count of BMI for each combination of sex and weightloss group.


# This code creates a nicely formatted table
# with the information from the above table. 
# We could also have done this more programmatially, 

sex <- c("Female"," ","Male"," ")
wls <- c("No weight loss","Lose weight","No weight loss","Lose weight")
mbmip <- c("43.2","72.4","58.8","85.7")
sds <- c("25.8", "23.0", "28.2", "18.0")
ncol <- c(89, 125, 157, 74)
#ccbmiXsex.tab
table3chp6 <- data.frame(sex,wls,mbmip,sds,ncol)
colnames(table3chp6) <- c("Sex","Weight loss status","mean BMI percentile","SD","n")
knitr::kable(table3chp6, caption="Mean BMI percentile by sex and desire to lose weight.")

Empirical logit plot

The text in this section is directly from the reading Chapter 6 (ยง 6.7) of Roback, P. & Legler, J. Beyond Multiple Linear Regression. (2021). https://bookdown.org/roback/bookdown-BeyondMLR/.

If we consider including a BMI term in our model(s), the logit should be linearly related to BMI. We can investigate this assumption by constructing an empirical logit plot. In order to calculate empirical logits, we first divide our data by sex. Within each sex, we generate 10 groups of equal sizes, the first holding the bottom 10\% in BMI percentile for that sex, the second holding the next lowest 10\%, etc. Within each group, we calculate the proportion, $\hat{p}$ that reported wanting to lose weight, and then the empirical log odds, $log(\frac{\hat{p}}{1-\hat{p}})$, that a young person in that group wants to lose weight.

# create 10 groups of BMI levels
risk2009 <- risk2009 %>%
  group_by(sex) %>%
  mutate(BMIcuts = cut_number(bmipct,10))

emplogit1 <- risk2009 %>%
  group_by(sex, BMIcuts) %>%
  summarise(prop.lose = mean(lose.wt.01), 
            n = n(),
            midpoint = median(bmipct), .groups = "drop") %>%
  mutate(prop.lose = ifelse(prop.lose==0, .01, prop.lose),
         emplogit = log(prop.lose / (1 - prop.lose)))
ggplot(emplogit1, aes(x = midpoint, y = emplogit, color = sex)) +
  geom_point(aes(shape = sex)) +
  geom_smooth(aes(linetype = sex), 
              method = "lm", formula = "y~x") +
  xlab("BMI percentile") + 
  ylab("Empirical logits") +
  theme_minimal() +
  scale_color_brewer(palette = "Dark2")

Figure \@ref(fig:logitBMIsex) presents the empirical logits for the BMI intervals by sex. Both males and females exhibit an increasing linear trend on the logit scale indicating that increasing BMI is associated with a greater desire to lose weight and that modeling log odds as a linear function of BMI is reasonable. The slope for the females appears to be similar to the slope for males, so we do not need to consider an interaction term between BMI and sex in the model.

Task 5: Create a stacked bar chart showing both weight loss plans by sports participation and sex.


ggplot(data = risk2009, aes(x = as.factor(media), 
                            fill = lose.wt)) +
  geom_bar(position = "fill") +
  facet_wrap(~sex) +
  ylab("Proportion") + 
  scale_fill_grey() +
  theme_minimal() +
  xlab("Media")
emplogit2 <- risk2009 %>%
  group_by(sex, media) %>%
  summarise(prop.lose = mean(lose.wt.01), n = n()) %>%
  mutate(prop.lose = ifelse(prop.lose==0, .01, prop.lose)) %>%
  mutate(emplogit = log(prop.lose / (1 - prop.lose)))
ggplot(emplogit2, aes(x = media, y = emplogit, color = sex)) +
  geom_point(aes(shape = sex)) +
  geom_smooth(aes(linetype = sex), method = "lm") +
  xlab("TV hours per day") + 
  ylab("Empirical logits") + theme_minimal()

Modelling

Task 6: Fit an appropriate model for weight loss plan with female as the only predictor.

model1 <- glm(lose.wt.01 ~ female, family = binomial, 
              data = risk2009)

coef(summary(model1))

Our estimated binomial regression model is:

[\log\left(\frac{\hat{p}}{1-\hat{p}}\right)=-0.75+1.09\cdot \textrm{female}]

Task 7: Interpret the odds difference between female and male students

exp(coef(model1))
exp(confint(model1))

Task 8: Fit an appropriate model with female, BMI and the interaction of these varaibles as predictors


Task 9: Fit an appropriate model with female, BMI and NO interaction of these varaibles as predictors.


Our estimated binomial regression model is:

[\log\left(\frac{\hat{p}}{1-\hat{p}}\right)= -4.26+1.86\cdot\textrm{female}+0.047\cdot\textrm{bmipct}]

Task 10: For a 10 unit increase in BMI percentile, how do the odds of wanting lose weight change?


Task 11: Add sport to our model.

Note from the the reading: Sports participation was considered for inclusion in the model in three ways: an indicator of sports participation (0 = no teams, 1 = one or more teams), treating the number of teams (0, 1, 2, or 3) as numeric, and treating the number of teams as a factor. The models below treat sports participation using an indicator variable, but all three models produced similar results.


(The reading also discusses the media variable and further interactions with sport, these has been omitted here.)

Model comparisons

Task 12: Compare models 1 through 3 with a likelihood ratio test (also a 'drop-in-deviance' test).


Comparing models using differences in deviances requires that the models be nested, meaning each smaller model is a simplified version of the larger model. In our case, Models 1, 2, and 3 are nest. You'll see additional models fit in the reading, where Models 3 and 4 cannot be compared using a deviance test. There is an example of applying AIC.

Final comments

Pay special attention to the Discussion section of this reading. It is a good example of discussing limitations for an analysis, which you will have to do for future weekly writing (TBC) and in your final project.



sta303-bolton/sta303w8 documentation built on April 11, 2021, 12:44 p.m.