library(learnr)
knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE)

Statistical analyses for cohort studies


What questions can be asked from Framingham?

Because the Framingham study is a prospective cohort, with certain limits to the data and with three data collection visits, there are restrictions to the types of questions we can ask and reliably answer.

question(
  "Choose the most valid and most appropriate question that we could ask of the Framingham data.",
  answer(
    "Will lower body mass during adolescence increase the risk for CVD?",
    correct = FALSE,
    message = "Incorrect. While cohorts could answer this question, Framingham participants are all in middle age so we can't answer questions outside of that time frame."
  ),
  answer(
    "Does smoking increase the risk for CVD?",
    correct = TRUE,
    message = "Correct! The Framingham dataset collected information on smoking status and can assess relative risk between exposure status."
  ),
  answer(
    "Does having many close friends lower the risk for CVD?",
    correct = FALSE,
    message = "Incorrect. While cohorts could answer this question, the Framingham Study did not collect this information."
  ),
  answer(
    "All of the above.", 
    correct = FALSE, 
    message = "Incorrect. One of the above is a valid question."
  ),
  allow_retry = TRUE
)
"Remember, these are questions to ask *of the Framingham study*. The variables in the question must exist in the dataset."
"Use `glimpse(tidied_framingham)` to see which variables are available."
"Use `summary(tidied_framingham)` to see the minimum and maximum ages of the participants."

Get familiar with mixed effects models {data-progressive=TRUE}

Let's get you familiar with using and running glmer() models. There is some tweaking involved when running glmer() models, such as transforming variables before hand. Often this requires some trial and error to get right. For now, practice running some models.

The lme4 package has been loaded for you. Since glmer() is computationally expensive, the Framingham dataset has been reduced in size and is loaded as sample_tidied_framingham.

Recall that the pattern for using glmer() is:

model <- glmer(
    outcome_var ~ predictor_var + (1 | person_id),
    data = dataset,
    family = binomial
)
capture.output({
data(sample_tidied_framingham, package = "acdcourse")
library(lme4, quietly = TRUE)
}, file = tempfile())

Exercise step

Instructions:

# Model scaled cholesterol on CVD
model <- glmer(
    ___ ~ ___ + ___,
    data = ___,
    # Use distribution for binary outcome
    family = ___
    )

# View the model output
summary(model)
"The formula should be `got_cvd ~ total_cholesterol_scaled + (1 | subject_id)`."
# Model scaled cholesterol on CVD
model <- glmer(
    got_cvd ~ total_cholesterol_scaled + (1 | subject_id),
    data = sample_tidied_framingham,
    # Use distribution for binary outcome
    family = binomial
    )

# View the model output
summary(model)
"Amazing!"

Exercise step

Instructions:

# Model scaled fasting blood glucose on CVD
model <- glmer(
    ___ ~ ___ + ___,
    data = ___,
    # Use distribution for binary outcome
    family = ___
    )

# View the model output
summary(model)
"Using the same formula as the previous step, replace `total_cholesterol_scaled` with `fasting_blood_glucose_scaled`."
# Model scaled fasting blood glucose on CVD
model <- glmer(
    got_cvd ~ fasting_blood_glucose_scaled + (1 | subject_id), 
    data = sample_tidied_framingham,
    # Use distribution for binary outcome
    family = binomial
    ) 

# View the model output
summary(model)
"Great job! You've become a bit more familiar with coding and running mixed effects models in R. You ran two models to get practice on setting predictors and the formula. Now we can get to more complicated modeling aspects."

Why transforming may be required {data-progressive=TRUE}

You need to consider many things with glmer() models, e.g. large variable variances. Often glmer() will warn you of a problem, which you must fix using your knowledge of transformations. Getting it right often involves trial and error.

These exercises will (likely) generate warnings or errors. Compare the different transformations and notice why problems occur. Recall that we are using a smaller dataset, sample_tidied_framingham. The general template for glmer() is:

model <- glmer(
    outcome_var ~ predictor_var + (1 | person_id),
    data = dataset,
    family = binomial
)
capture.output({
data(sample_tidied_framingham, package = "acdcourse")
library(lme4)
}, file = tempfile())

Exercise step

Instructions:

# Model total cholesterol on CVD
model <- glmer(
    ___ ~ ___ + (1 | ___),
    data = sample_tidied_framingham, 
    family = binomial
    )

# View the model output
summary(model)
"Don't forget the random term: `(1 | subject_id)`."
# Model total cholesterol on CVD
model <- glmer(
    got_cvd ~ total_cholesterol + (1 | subject_id),
    data = sample_tidied_framingham, 
    family = binomial
    )

# View the model output
summary(model)
"Amazing job!"

Exercise step

Instructions:

# Model with centered cholesterol on CVD
model <- glmer(
    got_cvd ~ ___ + (1 | subject_id), 
    data = sample_tidied_framingham, 
    family = binomial
) 

# View the model output
summary(model)
"Replace the original cholesterol variable with `total_cholesterol_centered` in the formula."
# Model with centered cholesterol on CVD
model <- glmer(
    got_cvd ~ total_cholesterol_centered + (1 | subject_id), 
    data = sample_tidied_framingham, 
    family = binomial
) 

# View the model output
summary(model)
"Great!"

Exercise step

Instructions:

# Model with scaled cholesterol
model <- glmer(
    got_cvd ~ ___ + (1 | subject_id),
    data = sample_tidied_framingham, 
    family = binomial
) 

# View the model output
summary(model)
"Include `total_cholesterol_scaled` in the formula."
# Model with scaled cholesterol
model <- glmer(
    got_cvd ~ total_cholesterol_scaled + (1 | subject_id),
    data = sample_tidied_framingham, 
    family = binomial
) 

# View the model output
summary(model)
"Amazing! You've solved the warnings about non-convergence and rescaling issue! Often it requires some trial and error to find which transformations are optimal for the model technique."

Include time in the mixed effect model

Before the development of mixed effects modeling, analyzing longitudinal data was fairly difficult because repeated measures violated the assumption of independent observations. This time component is a key strength of longitudinal studies like with prospective cohorts. But to use that strength you need to, well, include time in the model!

Include an additional predictor (followup_visit_number) in the glmer() formula. Recall that got_cvd is the outcome and subject_id is the random term.

Instructions:

capture.output({
data(sample_tidied_framingham, package = "acdcourse")
library(lme4)
}, file = tempfile())
# Include followup visit number with cholesterol
model <- glmer(
    # Add scaled cholesterol and visit
    got_cvd ~ ___ + ___ + (1 | subject_id),
    data = sample_tidied_framingham, 
    family = binomial
    )

# View the model summary
summary(model)
"The formula should be `total_cholesterol_scaled + followup_visit_number`."
# Include followup visit number with cholesterol
model <- glmer(
    # Add scaled cholesterol and visit
    got_cvd ~ total_cholesterol_scaled + followup_visit_number + (1 | subject_id),
    data = sample_tidied_framingham, 
    family = binomial
    )

# View the model summary
summary(model)
"Awesome! Adding a time component to any analysis that has repeated measurements is quite important. It reduces model bias and allows you to interpret the results in the context of time."

Adjustment, confounding, and model building


Model selection using DAGs {data-progressive=TRUE}

Building a DAG that approximates the biology is difficult. It requires domain knowledge, so consult experts to confirm the DAG. Remember, you will build an incomplete DAG. This is but one step to finding confounders.

Let's determine which variables to adjust for when systolic blood pressure (SBP) is the exposure and CVD is the outcome. Assume that:

Create a dagitty() object to identify what to adjust for.

Recall that for dagitty: x -> y means "x influences y" and that x -> {y z} means "x influences y and z"; dagitty is already loaded.

capture.output({
library(dagitty, quietly = TRUE)

variable_pathways <- dagitty("dag {
    SBP -> CVD
    Sex -> {SBP Smoking}
    Smoking -> {SBP CVD}
    BMI -> {SBP CVD FastingGlucose}
    FastingGlucose -> CVD
}")
}, file = tempfile())
plot(graphLayout(variable_pathways))

Exercise step

Instructions:

# Include the links between variables
variable_pathways <- dagitty("dag {
    SBP -> CVD
    ___ -> {___ ___}
    ___ -> {___ ___ ___}
    ___ -> ___
    }")

# Plot potential confounding pathways
plot(graphLayout(variable_pathways))
"The form for a pathway is `start_variable -> {one or more end variables}`."
# Include the links between variables
variable_pathways <- dagitty("dag {
    SBP -> CVD
    Sex -> {SBP Smoking}
    Smoking -> {SBP CVD}
    BMI -> {SBP CVD FastingGlucose}
    FastingGlucose -> CVD
    }")

# Plot potential confounding pathways
plot(graphLayout(variable_pathways))
"Great!"

Exercise step

Instructions:

# Include the links between variables
variable_pathways <- dagitty("dag {
    SBP -> CVD
    Sex -> {SBP Smoking}
    Smoking -> {SBP CVD}
    BMI -> {SBP CVD FastingGlucose}
    FastingGlucose -> CVD}")

# Plot potential confounding pathways
plot(graphLayout(variable_pathways))

# Identify some confounders to adjust for
adjustmentSets(___, exposure = ___, outcome = ___)
"The `adjustmentSets()` requires the DAG object and the outcome (CVD) and the predictor (SBP)."
# Include the links between variables
variable_pathways <- dagitty("dag {
    SBP -> CVD
    Sex -> {SBP Smoking}
    Smoking -> {SBP CVD}
    BMI -> {SBP CVD FastingGlucose}
    FastingGlucose -> CVD}")

# Plot potential confounding pathways
plot(graphLayout(variable_pathways))

# Identify some confounders to adjust for
adjustmentSets(variable_pathways, exposure = "SBP", outcome = "CVD")
"Excellent job!"

Exercise step

question("According to the output of the `adjustmentSets()`, what variables does DAG suggest you adjust for in the model to get less biased results?",
  answer("Sex and FastingGlucose", correct = FALSE),
  answer("BMI and Smoking", correct = TRUE),
  answer("Smoking, Sex, BMI", correct = FALSE),
  answer("All variables", correct = FALSE),
  answer("No variables", correct = FALSE),
  allow_retry = TRUE
)

Model selection using Information Criterion {data-progressive=TRUE}

It's best to use multiple methods to decide on which variables to include in a model. The information criterion methods are powerful tools for choosing variables to adjust for. Using the functions from the MuMIn package, determine which model has the best fit for the models being compared by using AIC to rank them. A smaller AIC is better.

As many models will be computed and compared, for these lesson purposes only, we kept computing time short by: greatly reducing the sample size and number of variables in the data, called model_sel_df; and, setting nAQG = 0 (reduces estimation precision, but increases speed). MuMIn also requires na.action = "na.fail" to be set in glmer().

capture.output({
data(model_sel_df, package = "acdcourse")
library(MuMIn)
library(lme4)
}, file = tempfile())

Exercise step

Instructions:

# Set the model formula
model <- glmer(
    got_cvd ~ ___ + ___ +
        ___ + ___ + ___ + (1 | subject_id),
    data = model_sel_df, 
    family = binomial,
    na.action = "na.fail",
    # Speeds up computation, reduces precision
    nAGQ = 0 
)
"Model formulas are in the form: `got_cvd ~ predictor1 + predictor2 + (1 | subject_id)`."
# Set the model formula
model <- glmer(
    got_cvd ~ systolic_blood_pressure_scaled + body_mass_index_scaled +
        currently_smokes + sex + followup_visit_number + (1 | subject_id),
    data = model_sel_df, 
    family = binomial,
    na.action = "na.fail",
    # Speeds up computation, reduces precision
    nAGQ = 0 
)
"Great job!"

Exercise step

Instructions:

model <- glmer(
    got_cvd ~ systolic_blood_pressure_scaled + body_mass_index_scaled +
        currently_smokes + sex + followup_visit_number + (1 | subject_id),
    data = model_sel_df, 
    family = binomial, 
    na.action = "na.fail",
    nAGQ = 0
)

# Set the ranking method and subset
selection <- dredge(___, rank = ___, subset = ___)

# Print the top 3
head(as.data.frame(selection), 3)
"Give `model` as the first argument to `dredge()`."
"Both `rank` and `subset` should be a character string."
model <- glmer(
    got_cvd ~ systolic_blood_pressure_scaled + body_mass_index_scaled +
        currently_smokes + sex + followup_visit_number + (1 | subject_id),
    data = model_sel_df, 
    family = binomial,
    na.action = "na.fail",
    nAGQ = 0
)

# Set the ranking method and subset
selection <- dredge(model, rank = "AIC", subset = "systolic_blood_pressure_scaled")

# Print the top 3
head(as.data.frame(selection), 3)
"Great job!"

Exercise step

question("Based on the output of `dredge()`, what variables are adjusted for in the top model?",
  answer("BMI and smoking", correct = FALSE),
  answer("Sex and smoking", correct = FALSE),
  answer("Sex and BMI", correct = TRUE),
  answer("All of the variables", correct = FALSE),
  allow_retry = TRUE
)

Hint: Check which variables have missingness in the rows selection.

Testing for interactions and sensitivity analyses


Determining sex interaction with the predictor {data-progressive=TRUE}

In the past (and still very common today), most research was done with mostly or entirely males. Clinical trials, experimental animal models, and observational studies tended to explicitly study males, as female hormonal cycles can act as a confounding factor. This often had harmful consequences, since there are massive gender differences in responses to drug treatment and other disease interventions. Most journals and funding agencies now require that differences in sex, and ethnicity, are investigated.

Since the Framingham study has almost entirely individuals of European-ancestry, we can only test sex interactions. Compare models without and with interactions for sex.

capture.output({
data("sample_tidied_framingham", package = "acdcourse")
library(lme4)
library(MuMIn)
}, file = tempfile())

Exercise step

Instructions:

# Model without interaction
model_no_interaction <- glmer(
    got_cvd ~ ___ + ___ + ___ + (1 | subject_id), 
    data = sample_tidied_framingham, 
    family = binomial)
summary(model_no_interaction)
"The predictors should be added together like `predictor1 + predictor2`."
# Model without interaction
model_no_interaction <- glmer(
    got_cvd ~ total_cholesterol_scaled + sex + followup_visit_number + (1 | subject_id), 
    data = sample_tidied_framingham, 
    family = binomial)
summary(model_no_interaction)
"Great! Next step."

Exercise step

Instructions:

# Model without interaction
model_no_interaction <- glmer(
    got_cvd ~ total_cholesterol_scaled + sex + followup_visit_number + (1 | subject_id), 
    data = sample_tidied_framingham,
    family = binomial)

# Model with sex interaction
model_sex_interaction <- glmer(
    got_cvd ~ total_cholesterol_scaled ___ sex + followup_visit_number + (1 | subject_id), 
    data = sample_tidied_framingham, 
    family = binomial)
summary(model_sex_interaction)
"The interaction should be `total_cholesterol_scaled * sex`."
# Model without interaction
model_no_interaction <- glmer(
    got_cvd ~ total_cholesterol_scaled + sex + followup_visit_number + (1 | subject_id), 
    data = sample_tidied_framingham, 
    family = binomial)

# Model with sex interaction
model_sex_interaction <- glmer(
    got_cvd ~ total_cholesterol_scaled * sex + followup_visit_number + (1 | subject_id), 
    data = sample_tidied_framingham, 
    family = binomial)
summary(model_sex_interaction)
"Great! Next step."

Exercise step

Instructions:

# Model without interaction
model_no_interaction <- glmer(
    got_cvd ~ total_cholesterol_scaled + sex + followup_visit_number + (1 | subject_id), 
    data = sample_tidied_framingham, 
    family = binomial)

# Model with sex interaction
model_sex_interaction <- glmer(
    got_cvd ~ total_cholesterol_scaled * sex + followup_visit_number + (1 | subject_id), 
    data = sample_tidied_framingham, 
    family = binomial)

# Test if interaction adds to model
model.sel(___, ___, rank = ___)
"Include both models, `model_no_interaction` and `model_sex_interaction`, in the `model.sel()` function, separated by a comma."
# Model without interaction
model_no_interaction <- glmer(
    got_cvd ~ total_cholesterol_scaled + sex + followup_visit_number + (1 | subject_id), 
    data = sample_tidied_framingham,
    family = binomial)

# Model with sex interaction
model_sex_interaction <- glmer(
    got_cvd ~ total_cholesterol_scaled * sex + followup_visit_number + (1 | subject_id), 
    data = sample_tidied_framingham,
    family = binomial)

# Test if interaction adds to model
model.sel(model_no_interaction, model_sex_interaction, rank = "AIC")
"Great!"

Exercise step

question("Does including a cholesterol by sex interaction provide more information for the model?",
  answer("Yes, but only by a bit.", correct = FALSE),
  answer("No, since the models are not different.", correct = TRUE),
  answer("No, but we should still add a sex interaction.", correct = FALSE),
  answer("None of the above.", correct = FALSE),
  allow_retry = TRUE
)

Hint: Check which model has a higher weight or lower AIC.

Running sensitivity analyses with body mass index {data-progressive=TRUE}

Often times we make assumptions about our data and the participants that make up that data. For instance, with body mass index (BMI), we assume that the value represents a person regardless of how sick or healthy they are. However, usually if someone's BMI is really low (below around 18.5, which is considered underweight) or really high (above 40 which is considered morbidly obese), this could indicate a serious health problem that they may have. For example, people who are very ill usually lose a lot of weight. So if we include them in the model, we might get a biased estimate for the association of BMI on CVD. Run a sensitivity analysis removing these observations and compare the results.

Use the sample_tidied_framingham dataset.

capture.output({
data("sample_tidied_framingham", package = "acdcourse")
library(lme4)
library(dplyr)
}, file = tempfile())

Exercise step

Instructions:

# Remove low and high body masses
bmi_check_data <- sample_tidied_framingham %>% 
    filter(___ >= ___, ___ <= ___)
"Include two conditions (`>=` and `<=`) to restrict the range of `body_mass_index`, separated by a comma."
# Remove low and high body masses
bmi_check_data <- sample_tidied_framingham %>% 
    filter(body_mass_index >= 18.5, body_mass_index <= 40)
"Excellent! Next step."

Exercise step

Instructions:

bmi_check_data <- sample_tidied_framingham %>% 
    filter(body_mass_index >= 18.5, body_mass_index <= 40)

# Run and check model with original dataset
original_model <- glmer(
    got_cvd ~ ___ + ___ + (1 | subject_id),
    data = ___, family = binomial)

# Fix effect estimates
fixef(original_model)
"Use `sample_tidied_framingham` in the data argument."
bmi_check_data <- sample_tidied_framingham %>% 
    filter(body_mass_index >= 18.5, body_mass_index <= 40)

# Run and check model with original dataset
original_model <- glmer(
    got_cvd ~ body_mass_index_scaled + followup_visit_number + (1 | subject_id),
    data = sample_tidied_framingham, family = binomial)

# Fix effect estimates
fixef(original_model)
"Excellent! Next step."

Exercise step

Instructions:

bmi_check_data <- sample_tidied_framingham %>% 
    filter(body_mass_index >= 18.5, body_mass_index <= 40)

original_model <- glmer(
    got_cvd ~ body_mass_index_scaled + followup_visit_number + (1 | subject_id),
    data = sample_tidied_framingham, family = binomial)

# Run and check model with the body mass checking
bmi_check_model <- glmer(
    got_cvd ~ body_mass_index_scaled + followup_visit_number + (1 | subject_id),
    data = ___, family = binomial)

# Fix effect estimates
fixef(original_model)
fixef(bmi_check_model)
"Run the same model but use the newly created `bmi_check_data`."
bmi_check_data <- sample_tidied_framingham %>% 
    filter(body_mass_index >= 18.5, body_mass_index <= 40)

original_model <- glmer(
    got_cvd ~ body_mass_index_scaled + followup_visit_number + (1 | subject_id),
    data = sample_tidied_framingham, family = binomial)

# Run and check model with the body mass checking
bmi_check_model <- glmer(
    got_cvd ~ body_mass_index_scaled + followup_visit_number + (1 | subject_id),
    data = bmi_check_data, family = binomial)

# Fix effect estimates
fixef(original_model)
fixef(bmi_check_model)
"Amazing!"

Exercise step

question("Look at the fixed effects estimates between each model, how do they differ?",
  answer("The estimates for followup visit changes a lot.", correct = FALSE),
  answer("The estimates for body mass index decreases a bit.", correct = TRUE),
  answer("There is no difference between model estimates for any variable.", correct = FALSE, message = "FIXME"),
  answer("All of the above.", correct = FALSE),
  answer("None of the above.", correct = FALSE),
  allow_retry = TRUE
)

Tidying and interpreting model results


Tidy up with broom and interpret the results {data-progressive=TRUE}

Now that you've created several models, you need to do some tidying, adding confidence intervals, and transforming. Tidying mixed effects models requires the broom.mixed package. You'll also need to transform the estimates by exponentiating, since the model uses a binary outcome. Exponentiating converts the estimates from log-odds to odds ratios.

A model has been created for you already called main_model.

capture.output({
data("main_model", package = "acdcourse")
library(dplyr)
options(digits = 3, scipen = 4)
}, file = tempfile())

Exercise step

Instructions:

library(broom.mixed)

# Tidy up main_model, include conf.int and exponentiate
tidy_model <- ___(___, conf.int = ___, exponentiate = ___)
"Place `main_model` as the first argument."
library(broom.mixed)

# Tidy up main_model, include conf.int and exponentiate
tidy_model <- tidy(main_model, conf.int = TRUE, exponentiate = TRUE)
"Amazing!"

Exercise step

Instructions:

library(broom.mixed)

tidy_model <- tidy(main_model, conf.int = TRUE, exponentiate = TRUE)

# Select the important variables
relevant_results <- tidy_model %>% 
    select(___, ___, ___, ___, ___) 
"Use the `select()` function as you have done in previous exercises."
library(broom.mixed)

tidy_model <- tidy(main_model, conf.int = TRUE, exponentiate = TRUE)

# Select the important variables
relevant_results <- tidy_model %>% 
    select(effect, term, estimate, conf.low, conf.high) 
"Amazing! You tidied up the model and extracted the most important results!"

Interpreting the tidied results

You've now created a tidied model output and kept the most relevant results. Now time to interpret!

Note: SD means standard deviation, CI means confidence interval, and CVD means cardiovascular disease.

capture.output({
data("main_model", package = "acdcourse")
library(dplyr)
library(broom.mixed)
options(digits = 3, scipen = 4)
relevant_results <- tidy(main_model, conf.int = TRUE, exponentiate = TRUE) %>% 
    select(effect, term, estimate, conf.low, conf.high) 
}, file = tempfile())
relevant_results
question(
  "Which of the responses below is the *most* accurate interpretation of the results?",
  answer(
    "1 SD higher cholesterol has 1.1 times more CVD risk, but uncertain (0.5 to 2.8 CI).",
    correct = FALSE
  ),
  answer(
    "1 SD higher cholesterol has 1.1 times more CVD risk (ranges 0.5 to 2.8 times), adjusted for time.",
    correct = TRUE
  ),
  answer("No significant relationships exist: CI passes 1.", correct = FALSE),
  answer(
    "Cholesterol's relation to CVD is uncertain. Need more research.",
    correct = FALSE
  ),
  allow_retry = TRUE
)

Hint: There are several that are right, but only one is the most accurate statement.



lwjohnst86/acdcourse documentation built on June 18, 2019, 8:26 p.m.