library(learnr) knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE)
Because the Framingham study is a prospective cohort, with certain limits to the data and with three data collection visits, there are restrictions to the types of questions we can ask and reliably answer.
question( "Choose the most valid and most appropriate question that we could ask of the Framingham data.", answer( "Will lower body mass during adolescence increase the risk for CVD?", correct = FALSE, message = "Incorrect. While cohorts could answer this question, Framingham participants are all in middle age so we can't answer questions outside of that time frame." ), answer( "Does smoking increase the risk for CVD?", correct = TRUE, message = "Correct! The Framingham dataset collected information on smoking status and can assess relative risk between exposure status." ), answer( "Does having many close friends lower the risk for CVD?", correct = FALSE, message = "Incorrect. While cohorts could answer this question, the Framingham Study did not collect this information." ), answer( "All of the above.", correct = FALSE, message = "Incorrect. One of the above is a valid question." ), allow_retry = TRUE )
"Remember, these are questions to ask *of the Framingham study*. The variables in the question must exist in the dataset." "Use `glimpse(tidied_framingham)` to see which variables are available." "Use `summary(tidied_framingham)` to see the minimum and maximum ages of the participants."
Let's get you familiar with using and running glmer()
models. There is some
tweaking involved when running glmer()
models, such as transforming variables
before hand. Often this requires some trial and error to get right. For now,
practice running some models.
The lme4
package has been loaded for you. Since glmer()
is computationally
expensive, the Framingham dataset has been reduced in size and is loaded as
sample_tidied_framingham
.
Recall that the pattern for using glmer()
is:
model <- glmer( outcome_var ~ predictor_var + (1 | person_id), data = dataset, family = binomial )
capture.output({ data(sample_tidied_framingham, package = "acdcourse") library(lme4, quietly = TRUE) }, file = tempfile())
Instructions:
total_cholesterol_scaled
relates to the outcome
got_cvd
(have subject_id
as the random term).# Model scaled cholesterol on CVD model <- glmer( ___ ~ ___ + ___, data = ___, # Use distribution for binary outcome family = ___ ) # View the model output summary(model)
"The formula should be `got_cvd ~ total_cholesterol_scaled + (1 | subject_id)`."
# Model scaled cholesterol on CVD model <- glmer( got_cvd ~ total_cholesterol_scaled + (1 | subject_id), data = sample_tidied_framingham, # Use distribution for binary outcome family = binomial ) # View the model output summary(model)
"Amazing!"
Instructions:
fasting_blood_glucose_scaled
as a
predictor instead of cholesterol.# Model scaled fasting blood glucose on CVD model <- glmer( ___ ~ ___ + ___, data = ___, # Use distribution for binary outcome family = ___ ) # View the model output summary(model)
"Using the same formula as the previous step, replace `total_cholesterol_scaled` with `fasting_blood_glucose_scaled`."
# Model scaled fasting blood glucose on CVD model <- glmer( got_cvd ~ fasting_blood_glucose_scaled + (1 | subject_id), data = sample_tidied_framingham, # Use distribution for binary outcome family = binomial ) # View the model output summary(model)
"Great job! You've become a bit more familiar with coding and running mixed effects models in R. You ran two models to get practice on setting predictors and the formula. Now we can get to more complicated modeling aspects."
You need to consider many things with glmer()
models, e.g. large variable
variances. Often glmer()
will warn you of a problem, which you must fix using
your knowledge of transformations. Getting it right often involves trial and
error.
These exercises will (likely) generate warnings or errors. Compare the different
transformations and notice why problems occur. Recall that we are using a
smaller dataset, sample_tidied_framingham
. The general template for glmer()
is:
model <- glmer( outcome_var ~ predictor_var + (1 | person_id), data = dataset, family = binomial )
capture.output({ data(sample_tidied_framingham, package = "acdcourse") library(lme4) }, file = tempfile())
Instructions:
total_cholesterol
instead (got_cvd
as the
outcome); you will get a warning.# Model total cholesterol on CVD model <- glmer( ___ ~ ___ + (1 | ___), data = sample_tidied_framingham, family = binomial ) # View the model output summary(model)
"Don't forget the random term: `(1 | subject_id)`."
# Model total cholesterol on CVD model <- glmer( got_cvd ~ total_cholesterol + (1 | subject_id), data = sample_tidied_framingham, family = binomial ) # View the model output summary(model)
"Amazing job!"
Instructions:
total_cholesterol_centered
in the model; you will get a warning.# Model with centered cholesterol on CVD model <- glmer( got_cvd ~ ___ + (1 | subject_id), data = sample_tidied_framingham, family = binomial ) # View the model output summary(model)
"Replace the original cholesterol variable with `total_cholesterol_centered` in the formula."
# Model with centered cholesterol on CVD model <- glmer( got_cvd ~ total_cholesterol_centered + (1 | subject_id), data = sample_tidied_framingham, family = binomial ) # View the model output summary(model)
"Great!"
Instructions:
total_cholesterol_scaled
variable instead; the warning should now be
fixed.# Model with scaled cholesterol model <- glmer( got_cvd ~ ___ + (1 | subject_id), data = sample_tidied_framingham, family = binomial ) # View the model output summary(model)
"Include `total_cholesterol_scaled` in the formula."
# Model with scaled cholesterol model <- glmer( got_cvd ~ total_cholesterol_scaled + (1 | subject_id), data = sample_tidied_framingham, family = binomial ) # View the model output summary(model)
"Amazing! You've solved the warnings about non-convergence and rescaling issue! Often it requires some trial and error to find which transformations are optimal for the model technique."
Before the development of mixed effects modeling, analyzing longitudinal data was fairly difficult because repeated measures violated the assumption of independent observations. This time component is a key strength of longitudinal studies like with prospective cohorts. But to use that strength you need to, well, include time in the model!
Include an additional predictor (followup_visit_number
) in the glmer()
formula. Recall that got_cvd
is the outcome and subject_id
is the random
term.
Instructions:
total_cholesterol_scaled
and
followup_visit_number
.capture.output({ data(sample_tidied_framingham, package = "acdcourse") library(lme4) }, file = tempfile())
# Include followup visit number with cholesterol model <- glmer( # Add scaled cholesterol and visit got_cvd ~ ___ + ___ + (1 | subject_id), data = sample_tidied_framingham, family = binomial ) # View the model summary summary(model)
"The formula should be `total_cholesterol_scaled + followup_visit_number`."
# Include followup visit number with cholesterol model <- glmer( # Add scaled cholesterol and visit got_cvd ~ total_cholesterol_scaled + followup_visit_number + (1 | subject_id), data = sample_tidied_framingham, family = binomial ) # View the model summary summary(model)
"Awesome! Adding a time component to any analysis that has repeated measurements is quite important. It reduces model bias and allows you to interpret the results in the context of time."
Building a DAG that approximates the biology is difficult. It requires domain knowledge, so consult experts to confirm the DAG. Remember, you will build an incomplete DAG. This is but one step to finding confounders.
Let's determine which variables to adjust for when systolic blood pressure
(SBP
) is the exposure and CVD
is the outcome. Assume that:
Sex
influences SBP
and Smoking
Smoking
influences SBP
and CVD
BMI
influences CVD
, SBP
, and FastingGlucose
FastingGlucose
influences CVD
Create a dagitty()
object to identify what to adjust for.
Recall that for dagitty
: x -> y
means "x influences y" and that x -> {y z}
means "x influences y and z"; dagitty
is already loaded.
capture.output({ library(dagitty, quietly = TRUE) variable_pathways <- dagitty("dag { SBP -> CVD Sex -> {SBP Smoking} Smoking -> {SBP CVD} BMI -> {SBP CVD FastingGlucose} FastingGlucose -> CVD }") }, file = tempfile()) plot(graphLayout(variable_pathways))
Instructions:
variable_pathway
graph.# Include the links between variables variable_pathways <- dagitty("dag { SBP -> CVD ___ -> {___ ___} ___ -> {___ ___ ___} ___ -> ___ }") # Plot potential confounding pathways plot(graphLayout(variable_pathways))
"The form for a pathway is `start_variable -> {one or more end variables}`."
# Include the links between variables variable_pathways <- dagitty("dag { SBP -> CVD Sex -> {SBP Smoking} Smoking -> {SBP CVD} BMI -> {SBP CVD FastingGlucose} FastingGlucose -> CVD }") # Plot potential confounding pathways plot(graphLayout(variable_pathways))
"Great!"
Instructions:
adjustmentSets()
of variables from the
variable_pathways
graph, selecting "SBP"
as exposure and "CVD"
as outcome.# Include the links between variables variable_pathways <- dagitty("dag { SBP -> CVD Sex -> {SBP Smoking} Smoking -> {SBP CVD} BMI -> {SBP CVD FastingGlucose} FastingGlucose -> CVD}") # Plot potential confounding pathways plot(graphLayout(variable_pathways)) # Identify some confounders to adjust for adjustmentSets(___, exposure = ___, outcome = ___)
"The `adjustmentSets()` requires the DAG object and the outcome (CVD) and the predictor (SBP)."
# Include the links between variables variable_pathways <- dagitty("dag { SBP -> CVD Sex -> {SBP Smoking} Smoking -> {SBP CVD} BMI -> {SBP CVD FastingGlucose} FastingGlucose -> CVD}") # Plot potential confounding pathways plot(graphLayout(variable_pathways)) # Identify some confounders to adjust for adjustmentSets(variable_pathways, exposure = "SBP", outcome = "CVD")
"Excellent job!"
question("According to the output of the `adjustmentSets()`, what variables does DAG suggest you adjust for in the model to get less biased results?", answer("Sex and FastingGlucose", correct = FALSE), answer("BMI and Smoking", correct = TRUE), answer("Smoking, Sex, BMI", correct = FALSE), answer("All variables", correct = FALSE), answer("No variables", correct = FALSE), allow_retry = TRUE )
It's best to use multiple methods to decide on which variables to include in a
model. The information criterion methods are powerful tools for choosing
variables to adjust for. Using the functions from the MuMIn
package, determine
which model has the best fit for the models being compared by using AIC to rank
them. A smaller AIC is better.
As many models will be computed and compared, for these lesson purposes only,
we kept computing time short by: greatly reducing the sample size and number of
variables in the data, called model_sel_df
; and, setting nAQG = 0
(reduces
estimation precision, but increases speed). MuMIn
also requires na.action =
"na.fail"
to be set in glmer()
.
capture.output({ data(model_sel_df, package = "acdcourse") library(MuMIn) library(lme4) }, file = tempfile())
Instructions:
systolic_blood_pressure_scaled
, sex
, body_mass_index_scaled
,
currently_smokes
, and followup_visit_number
to the formula.# Set the model formula model <- glmer( got_cvd ~ ___ + ___ + ___ + ___ + ___ + (1 | subject_id), data = model_sel_df, family = binomial, na.action = "na.fail", # Speeds up computation, reduces precision nAGQ = 0 )
"Model formulas are in the form: `got_cvd ~ predictor1 + predictor2 + (1 | subject_id)`."
# Set the model formula model <- glmer( got_cvd ~ systolic_blood_pressure_scaled + body_mass_index_scaled + currently_smokes + sex + followup_visit_number + (1 | subject_id), data = model_sel_df, family = binomial, na.action = "na.fail", # Speeds up computation, reduces precision nAGQ = 0 )
"Great job!"
Instructions:
dredge()
through the combinations of variables, subset by
systolic_blood_pressure_scaled
in the model and rank by "AIC"
.selection
models.model <- glmer( got_cvd ~ systolic_blood_pressure_scaled + body_mass_index_scaled + currently_smokes + sex + followup_visit_number + (1 | subject_id), data = model_sel_df, family = binomial, na.action = "na.fail", nAGQ = 0 ) # Set the ranking method and subset selection <- dredge(___, rank = ___, subset = ___) # Print the top 3 head(as.data.frame(selection), 3)
"Give `model` as the first argument to `dredge()`." "Both `rank` and `subset` should be a character string."
model <- glmer( got_cvd ~ systolic_blood_pressure_scaled + body_mass_index_scaled + currently_smokes + sex + followup_visit_number + (1 | subject_id), data = model_sel_df, family = binomial, na.action = "na.fail", nAGQ = 0 ) # Set the ranking method and subset selection <- dredge(model, rank = "AIC", subset = "systolic_blood_pressure_scaled") # Print the top 3 head(as.data.frame(selection), 3)
"Great job!"
question("Based on the output of `dredge()`, what variables are adjusted for in the top model?", answer("BMI and smoking", correct = FALSE), answer("Sex and smoking", correct = FALSE), answer("Sex and BMI", correct = TRUE), answer("All of the variables", correct = FALSE), allow_retry = TRUE )
Hint: Check which variables have missingness in the rows selection
.
In the past (and still very common today), most research was done with mostly or entirely males. Clinical trials, experimental animal models, and observational studies tended to explicitly study males, as female hormonal cycles can act as a confounding factor. This often had harmful consequences, since there are massive gender differences in responses to drug treatment and other disease interventions. Most journals and funding agencies now require that differences in sex, and ethnicity, are investigated.
Since the Framingham study has almost entirely individuals of European-ancestry, we can only test sex interactions. Compare models without and with interactions for sex.
capture.output({ data("sample_tidied_framingham", package = "acdcourse") library(lme4) library(MuMIn) }, file = tempfile())
Instructions:
glmer()
models with total_cholesterol_scaled
, sex
, and
followup_visit_number
, without an interaction term.# Model without interaction model_no_interaction <- glmer( got_cvd ~ ___ + ___ + ___ + (1 | subject_id), data = sample_tidied_framingham, family = binomial) summary(model_no_interaction)
"The predictors should be added together like `predictor1 + predictor2`."
# Model without interaction model_no_interaction <- glmer( got_cvd ~ total_cholesterol_scaled + sex + followup_visit_number + (1 | subject_id), data = sample_tidied_framingham, family = binomial) summary(model_no_interaction)
"Great! Next step."
Instructions:
*
,
between total_cholesterol_scaled
and sex
.# Model without interaction model_no_interaction <- glmer( got_cvd ~ total_cholesterol_scaled + sex + followup_visit_number + (1 | subject_id), data = sample_tidied_framingham, family = binomial) # Model with sex interaction model_sex_interaction <- glmer( got_cvd ~ total_cholesterol_scaled ___ sex + followup_visit_number + (1 | subject_id), data = sample_tidied_framingham, family = binomial) summary(model_sex_interaction)
"The interaction should be `total_cholesterol_scaled * sex`."
# Model without interaction model_no_interaction <- glmer( got_cvd ~ total_cholesterol_scaled + sex + followup_visit_number + (1 | subject_id), data = sample_tidied_framingham, family = binomial) # Model with sex interaction model_sex_interaction <- glmer( got_cvd ~ total_cholesterol_scaled * sex + followup_visit_number + (1 | subject_id), data = sample_tidied_framingham, family = binomial) summary(model_sex_interaction)
"Great! Next step."
Instructions:
model.sel()
, with an "AIC"
rank.# Model without interaction model_no_interaction <- glmer( got_cvd ~ total_cholesterol_scaled + sex + followup_visit_number + (1 | subject_id), data = sample_tidied_framingham, family = binomial) # Model with sex interaction model_sex_interaction <- glmer( got_cvd ~ total_cholesterol_scaled * sex + followup_visit_number + (1 | subject_id), data = sample_tidied_framingham, family = binomial) # Test if interaction adds to model model.sel(___, ___, rank = ___)
"Include both models, `model_no_interaction` and `model_sex_interaction`, in the `model.sel()` function, separated by a comma."
# Model without interaction model_no_interaction <- glmer( got_cvd ~ total_cholesterol_scaled + sex + followup_visit_number + (1 | subject_id), data = sample_tidied_framingham, family = binomial) # Model with sex interaction model_sex_interaction <- glmer( got_cvd ~ total_cholesterol_scaled * sex + followup_visit_number + (1 | subject_id), data = sample_tidied_framingham, family = binomial) # Test if interaction adds to model model.sel(model_no_interaction, model_sex_interaction, rank = "AIC")
"Great!"
question("Does including a cholesterol by sex interaction provide more information for the model?", answer("Yes, but only by a bit.", correct = FALSE), answer("No, since the models are not different.", correct = TRUE), answer("No, but we should still add a sex interaction.", correct = FALSE), answer("None of the above.", correct = FALSE), allow_retry = TRUE )
Hint: Check which model has a higher weight
or lower AIC
.
Often times we make assumptions about our data and the participants that make up that data. For instance, with body mass index (BMI), we assume that the value represents a person regardless of how sick or healthy they are. However, usually if someone's BMI is really low (below around 18.5, which is considered underweight) or really high (above 40 which is considered morbidly obese), this could indicate a serious health problem that they may have. For example, people who are very ill usually lose a lot of weight. So if we include them in the model, we might get a biased estimate for the association of BMI on CVD. Run a sensitivity analysis removing these observations and compare the results.
Use the sample_tidied_framingham
dataset.
capture.output({ data("sample_tidied_framingham", package = "acdcourse") library(lme4) library(dplyr) }, file = tempfile())
Instructions:
body_mass_index
equal to or above 18.5 and equal to
or below 40.# Remove low and high body masses bmi_check_data <- sample_tidied_framingham %>% filter(___ >= ___, ___ <= ___)
"Include two conditions (`>=` and `<=`) to restrict the range of `body_mass_index`, separated by a comma."
# Remove low and high body masses bmi_check_data <- sample_tidied_framingham %>% filter(body_mass_index >= 18.5, body_mass_index <= 40)
"Excellent! Next step."
Instructions:
body_mass_index_scaled
and followup_visit_number
in the formula
and run the model with the sample_tidied_framingham
.bmi_check_data <- sample_tidied_framingham %>% filter(body_mass_index >= 18.5, body_mass_index <= 40) # Run and check model with original dataset original_model <- glmer( got_cvd ~ ___ + ___ + (1 | subject_id), data = ___, family = binomial) # Fix effect estimates fixef(original_model)
"Use `sample_tidied_framingham` in the data argument."
bmi_check_data <- sample_tidied_framingham %>% filter(body_mass_index >= 18.5, body_mass_index <= 40) # Run and check model with original dataset original_model <- glmer( got_cvd ~ body_mass_index_scaled + followup_visit_number + (1 | subject_id), data = sample_tidied_framingham, family = binomial) # Fix effect estimates fixef(original_model)
"Excellent! Next step."
Instructions:
bmi_check_data <- sample_tidied_framingham %>% filter(body_mass_index >= 18.5, body_mass_index <= 40) original_model <- glmer( got_cvd ~ body_mass_index_scaled + followup_visit_number + (1 | subject_id), data = sample_tidied_framingham, family = binomial) # Run and check model with the body mass checking bmi_check_model <- glmer( got_cvd ~ body_mass_index_scaled + followup_visit_number + (1 | subject_id), data = ___, family = binomial) # Fix effect estimates fixef(original_model) fixef(bmi_check_model)
"Run the same model but use the newly created `bmi_check_data`."
bmi_check_data <- sample_tidied_framingham %>% filter(body_mass_index >= 18.5, body_mass_index <= 40) original_model <- glmer( got_cvd ~ body_mass_index_scaled + followup_visit_number + (1 | subject_id), data = sample_tidied_framingham, family = binomial) # Run and check model with the body mass checking bmi_check_model <- glmer( got_cvd ~ body_mass_index_scaled + followup_visit_number + (1 | subject_id), data = bmi_check_data, family = binomial) # Fix effect estimates fixef(original_model) fixef(bmi_check_model)
"Amazing!"
question("Look at the fixed effects estimates between each model, how do they differ?", answer("The estimates for followup visit changes a lot.", correct = FALSE), answer("The estimates for body mass index decreases a bit.", correct = TRUE), answer("There is no difference between model estimates for any variable.", correct = FALSE, message = "FIXME"), answer("All of the above.", correct = FALSE), answer("None of the above.", correct = FALSE), allow_retry = TRUE )
Now that you've created several models, you need to do some tidying, adding
confidence intervals, and transforming. Tidying mixed effects models requires
the broom.mixed
package. You'll also need to transform the estimates by
exponentiating, since the model uses a binary outcome. Exponentiating converts
the estimates from log-odds to odds ratios.
A model has been created for you already called main_model
.
capture.output({ data("main_model", package = "acdcourse") library(dplyr) options(digits = 3, scipen = 4) }, file = tempfile())
Instructions:
tidy()
function on the main_model
object and set conf.int
and
exponentiate
to TRUE
.library(broom.mixed) # Tidy up main_model, include conf.int and exponentiate tidy_model <- ___(___, conf.int = ___, exponentiate = ___)
"Place `main_model` as the first argument."
library(broom.mixed) # Tidy up main_model, include conf.int and exponentiate tidy_model <- tidy(main_model, conf.int = TRUE, exponentiate = TRUE)
"Amazing!"
Instructions:
select()
only the most important results: effect
, terms
, estimates
,
conf.low
, and conf.high
.library(broom.mixed) tidy_model <- tidy(main_model, conf.int = TRUE, exponentiate = TRUE) # Select the important variables relevant_results <- tidy_model %>% select(___, ___, ___, ___, ___)
"Use the `select()` function as you have done in previous exercises."
library(broom.mixed) tidy_model <- tidy(main_model, conf.int = TRUE, exponentiate = TRUE) # Select the important variables relevant_results <- tidy_model %>% select(effect, term, estimate, conf.low, conf.high)
"Amazing! You tidied up the model and extracted the most important results!"
You've now created a tidied model output and kept the most relevant results. Now time to interpret!
Note: SD means standard deviation, CI means confidence interval, and CVD means cardiovascular disease.
capture.output({ data("main_model", package = "acdcourse") library(dplyr) library(broom.mixed) options(digits = 3, scipen = 4) relevant_results <- tidy(main_model, conf.int = TRUE, exponentiate = TRUE) %>% select(effect, term, estimate, conf.low, conf.high) }, file = tempfile()) relevant_results
question( "Which of the responses below is the *most* accurate interpretation of the results?", answer( "1 SD higher cholesterol has 1.1 times more CVD risk, but uncertain (0.5 to 2.8 CI).", correct = FALSE ), answer( "1 SD higher cholesterol has 1.1 times more CVD risk (ranges 0.5 to 2.8 times), adjusted for time.", correct = TRUE ), answer("No significant relationships exist: CI passes 1.", correct = FALSE), answer( "Cholesterol's relation to CVD is uncertain. Need more research.", correct = FALSE ), allow_retry = TRUE )
Hint: There are several that are right, but only one is the most accurate statement.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.