library(learnr)
knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE)

Introduction to cohort studies


What makes it a cohort?

The Framingham cohort was set up to study what factors may influence the risk of cardiovascular disease (CVD). People from the town of Framingham, USA were recruited and followed over time. Data was collected on risk factors and CVD outcomes every few years.

The framingham dataset has been loaded for you to explore if you wish. Note, however, that the dataset has not yet been tidied; we'll go through tidying it in Chapter 2.

question("What specifically distinguishes the Framingham study as a *cohort*?",
  answer("It studies a disease (CVD).", correct = FALSE, message = "Almost, but incorrect. Many types of scientific studies study a disease, but that alone doesn't distinguish them as a cohort study."),
  answer("Participants all came from the town of Framingham, USA.", correct = TRUE, message = "Cohorts are people who share a common characteristic. In this case, the participants come from the same town and so have a similar environment."),
  answer("Participants were followed over time.", correct = FALSE, message = "Almost, but incorrect. Many types of scientific studies follow their subjects over time (e.g. clinical trials), but that alone doesn't distinguish them as a cohort study."),
  answer("Participants had risk factors measured.", correct = FALSE, message = "Incorrect. Many types of scientific studies measure risk factors, but that alone doesn't distinguish them as a cohort."),
  allow_retry = TRUE
)
"Cohorts are people who have *something in common*."

What cohort type is the Framingham Heart Study? {data-progressive=TRUE}

It's usually possible to determine the cohort design from the variables within the dataset. At least two variables in the Framingham Heart Study give some indication of the cohort design. Recall that cohorts involve a data collection period.

The dplyr package has been loaded, as well as the framingham dataset. Again, note that framingham has not yet been tidied up, which we will do later in the course.

capture.output({
  library(dplyr)
  library(acdcourse)
  data("framingham")
  framingham$time <- NULL
}, file = tempfile())

Exercise step

Instructions:

# Check out the variable names
names(framingham)

# Select the two columns that indicate design
framingham %>% 
    select(___, ___)
"The Framingham cohort was designed to study the disease `cvd`."
# Check out the variable names
names(framingham)

# Select two columns that indicate design
framingham %>% 
    select(period, cvd)
"Great job!"

Exercise step

question("What is Framingham's cohort design? Remember, *when* the disease occurs is what
distinguishes prospective from retrospective cohorts.",
  answer("Prospective.", correct = TRUE),
  answer("Retrospective.", correct = FALSE, message = "Incorrect. In a retrospective cohort, participants already have the disease when they enter the study. In Framingham, participants did not have the disease at the start."),
  answer("Neither.", correct = FALSE, message = "Incorrect."),
  answer("Both.", correct = FALSE, message = "Incorrect. It can't be both!"),
  allow_retry = TRUE
)
"The study was designed to investigate how people *develop* CVD over time (i.e. they don't have the disease when the study starts)."

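As a rough illustration of how variables can hint at the design, the sketch below builds a small made-up dataset and counts the rows per participant. The data, the `participant_id` column, and the object names are hypothetical and are not taken from framingham; only `period` and `cvd` mirror the variable names used above.

```r
library(dplyr)

# Hypothetical toy data: three participants, up to three visits each
toy_cohort <- tibble(
    participant_id = c(1, 1, 1, 2, 2, 3, 3, 3),
    period = c(1, 2, 3, 1, 2, 1, 2, 3),
    cvd = c(0, 0, 0, 1, 1, 0, 0, 0)
)

# Several rows per participant suggest repeated data collection visits
visits_per_participant <- toy_cohort %>%
    count(participant_id, name = "number_of_visits")

visits_per_participant
```

A visit variable with more than one value per participant, together with a disease variable, is the kind of pattern that points to follow-up over time.
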
Cohort types, variables, and the Framingham Study


Select the outcome and some exposures

To properly analyze the data you need to know what each variable represents. Usually it's fairly easy to identify the outcome (the disease). However, knowing which variables are potential exposures to investigate can be tricky, since modern cohort studies often measure hundreds of variables on each participant.

Initially, it can be helpful to keep only the variables of interest. For now, select a few interesting variables, rename them so they are more descriptive, and then explore them further.

Instructions:

capture.output({
  library(dplyr)
  library(acdcourse)
  data("framingham")
}, file = tempfile())
# Select and rename the potential predictors and outcome
explore_framingham <- framingham %>%
    select(
        # Format: new_variable_name = old_variable_name
        # Outcome
        _____ = cvd,
        # Predictors
        _____ = totchol,
        _____ = bmi,
        _____ = cursmoke,
        # Visit number
        _____ = period 
    )
"Rename `bmi` to `body_mass_index`, `totchol` to `total_cholesterol`, and `cursmoke` to `currently_smokes`."
# Select and rename the potential predictors and outcome
explore_framingham <- framingham %>%
    select(
        # Format: new_variable_name = old_variable_name
        # Outcome
        got_cvd = cvd,
        # Predictors
        total_cholesterol = totchol,
        body_mass_index = bmi,
        currently_smokes = cursmoke,
        # Visit number
        followup_visit_number = period 
    )
"Great job! You've selected and renamed the variables correctly."

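The `new_variable_name = old_variable_name` format used in this exercise can be tried on a small made-up tibble. This is a minimal sketch with hypothetical values; it also contrasts `select()`, which keeps only the named columns, with `rename()`, which keeps every column.

```r
library(dplyr)

# Hypothetical toy data with three columns
toy_data <- tibble(
    cvd = c(0, 1),
    totchol = c(195, 250),
    sex = c(1, 2)
)

# select() renames and keeps only the listed columns
renamed_data <- toy_data %>%
    select(got_cvd = cvd, total_cholesterol = totchol)

# rename() renames but keeps all columns
kept_all <- toy_data %>%
    rename(got_cvd = cvd)

names(renamed_data)
names(kept_all)
```

Using `select()` is convenient when only a handful of variables are needed, while `rename()` suits cases where the remaining variables will still be used later.
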
Simple summary of the exposures by outcome

Getting some simple summaries of the exposures, split by those with and without the disease, should be done early in any analysis of cohort datasets. This matters even more when the study has a time component, because the summaries show how variables change over time or differ between groups.

Using what was shown in the video, calculate some means based on some groupings.

Instructions:

capture.output({
  library(dplyr)
  library(acdcourse)
  data("framingham")
  explore_framingham <- framingham %>%
      select(
          got_cvd = cvd,
          total_cholesterol = totchol,
          body_mass_index = bmi,
          currently_smokes = cursmoke,
          followup_visit_number = period
      )
}, file = tempfile())
explore_framingham %>% 
    # Group by visit and CVD status
    group_by(___, ___) %>% 
    # Mean of body mass, smoking, and cholesterol
    summarize(
        body_mass_mean = mean(___, na.rm = ___),
        smokes_mean = ___,
        cholesterol_mean = ___
    )
"Use `na.rm = TRUE` with `mean()` to exclude `NA` from the mean calculation."
explore_framingham %>% 
    # Group by visit and CVD status
    group_by(followup_visit_number, got_cvd) %>% 
    # Mean of body mass, smoking, and cholesterol
    summarize(
        body_mass_mean = mean(body_mass_index, na.rm = TRUE),
        smokes_mean = mean(currently_smokes, na.rm = TRUE),
        cholesterol_mean = mean(total_cholesterol, na.rm = TRUE)
    )
"Awesome! You learned how to compare the difference in some basic predictors in those who did and did not get CVD over the study duration."

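Two details of grouped means are easy to miss: a single `NA` makes `mean()` return `NA` unless `na.rm = TRUE` is set, and the mean of a 0/1 variable is the proportion of 1s. The sketch below shows both on made-up values; the data and the `toy_` object names are hypothetical.

```r
library(dplyr)

# Hypothetical toy data: two visits, three rows each
toy_visits <- tibble(
    followup_visit_number = c(1, 1, 1, 2, 2, 2),
    currently_smokes = c(1, 0, 1, 0, 0, 1),
    body_mass_index = c(24, NA, 30, 26, 28, NA)
)

toy_summary <- toy_visits %>%
    group_by(followup_visit_number) %>%
    summarize(
        # Without na.rm = TRUE, one NA makes the whole group mean NA
        body_mass_mean_with_na = mean(body_mass_index),
        body_mass_mean = mean(body_mass_index, na.rm = TRUE),
        # The mean of a 0/1 variable is the proportion of 1s
        smokes_mean = mean(currently_smokes)
    )

toy_summary
```

Read this way, a `smokes_mean` value can be interpreted as the share of rows in that group recorded as currently smoking.
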
Prevalence and incidence in cohorts


Count number of participants and cases per visit {data-progressive=TRUE}

Here, you will count the number of cases and non-cases of prevalent myocardial infarction (MI), stored in prevalent_mi, and prevalent coronary heart disease (CHD), stored in prevalent_chd, at each visit. Remember, for longitudinal data, like that in prospective cohorts, you need to count by time period, since each participant has several rows, one for each data collection visit.

The dplyr package is loaded and all variables have been added back into explore_framingham.

capture.output({
library(dplyr)
library(acdcourse)
data("framingham")
explore_framingham <- framingham %>%
    rename(
        got_cvd = cvd, 
        total_cholesterol = totchol,
        body_mass_index = bmi,
        currently_smokes = cursmoke,
        followup_visit_number = period,
        prevalent_chd = prevchd,
        prevalent_mi = prevmi
    )
}, file = tempfile())

Exercise step

Instructions:

# Count number of participants per visit
explore_framingham %>%
    count(___)
"The code is `count(followup_visit_number)`."
# Count number of participants per visit
explore_framingham %>% 
    count(followup_visit_number)
"Great!"

Exercise step

Instructions:

explore_framingham %>% 
    count(followup_visit_number)

# Count by visit, then prevalent cases of MI
explore_framingham %>% 
    count(___, ___)
"Include both variables in `count()`, separated by a comma."
explore_framingham %>% 
    count(followup_visit_number)

# Count by visit, then prevalent cases of MI
explore_framingham %>% 
    count(followup_visit_number, prevalent_mi)
"Amazing!"

Exercise step

Instructions:

explore_framingham %>% 
    count(followup_visit_number)

explore_framingham %>% 
    count(followup_visit_number, prevalent_mi)

# Count by visit, then prevalent cases of CHD
explore_framingham %>% 
    count(___, ___)
"Use the same syntax as for the `prevalent_mi` code."
explore_framingham %>% 
    count(followup_visit_number)

explore_framingham %>% 
    count(followup_visit_number, prevalent_mi)

# Count by visit, then prevalent cases of CHD
explore_framingham %>% 
    count(followup_visit_number, prevalent_chd)
"Woohoo! Nice job. You now know how to count the number of cases by visit."

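Counts by visit and case status can be turned into proportions within each visit, which makes visits with different numbers of participants easier to compare. The following is a minimal sketch on made-up data; the values and the `toy_prevalence`, `case_counts`, and `case_proportions` names are hypothetical.

```r
library(dplyr)

# Hypothetical toy data: four rows at visit 1, three rows at visit 2
toy_prevalence <- tibble(
    followup_visit_number = c(1, 1, 1, 1, 2, 2, 2),
    prevalent_mi = c(0, 0, 0, 1, 0, 1, 1)
)

# Count rows for each combination of visit and case status
case_counts <- toy_prevalence %>%
    count(followup_visit_number, prevalent_mi)

# Divide each count by the total number of rows at that visit
case_proportions <- case_counts %>%
    group_by(followup_visit_number) %>%
    mutate(proportion = n / sum(n)) %>%
    ungroup()

case_proportions
```

The `n` column is created by `count()`, and grouping by visit before `mutate()` makes `sum(n)` the per-visit total rather than the overall total.
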
Remove prevalent cases at the baseline

From the previous exercise, we know that there are prevalent cases of cardiovascular events at the first visit. Prevalent cases of disease at the recruitment visit can introduce bias, so we need to remove these cases before continuing with any further analyses.

Instructions:

capture.output({
library(dplyr)
library(acdcourse)
data("framingham")
explore_framingham <- framingham %>%
    rename(
        got_cvd = cvd, 
        total_cholesterol = totchol,
        body_mass_index = bmi,
        participant_age = age,
        currently_smokes = cursmoke,
        followup_visit_number = period,
        prevalent_chd = prevchd,
        prevalent_mi = prevmi
    )
}, file = tempfile())
# Drop prevalent chd cases from first visit
no_prevalent_cases <- explore_framingham %>% 
    filter(!(___ == ___ & ___ == ___)) 

# Confirm the number by counting visit then chd cases
no_prevalent_cases %>% 
    count(___, ___) 
"Filtering logic has the form `variable == condition`, for instance `followup_visit_number == 1`."
# Drop prevalent chd cases from first visit
no_prevalent_cases <- explore_framingham %>% 
    filter(!(followup_visit_number == 1 & prevalent_chd == 1)) 

# Confirm the number by counting visit then chd cases
no_prevalent_cases %>% 
    count(followup_visit_number, prevalent_chd) 
"Excellent! You've dropped baseline prevalent cases of CHD and started making sure that you've reduced bias in the final results!"

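The filter in this exercise negates a combined condition, which is easy to confuse with negating each part separately. The sketch below compares the two forms on made-up data; the values, the `participant_id` column, and the object names are hypothetical.

```r
library(dplyr)

# Hypothetical toy data: three participants with two visits each
toy_cases <- tibble(
    participant_id = c(1, 1, 2, 2, 3, 3),
    followup_visit_number = c(1, 2, 1, 2, 1, 2),
    prevalent_chd = c(0, 0, 1, 1, 0, 1)
)

# Drops only rows that are BOTH from visit 1 AND a prevalent case
dropped_baseline_cases <- toy_cases %>%
    filter(!(followup_visit_number == 1 & prevalent_chd == 1))

# Not equivalent: keeps only rows that are NEITHER from visit 1 NOR a case
too_strict <- toy_cases %>%
    filter(followup_visit_number != 1, prevalent_chd != 1)

nrow(dropped_baseline_cases)
nrow(too_strict)
```

In this toy data the first form removes a single row (participant 2 at visit 1), while the second removes every visit 1 row and every case. Because the filter works on rows rather than on participants, the visit 2 row for participant 2 is still present after the first form.
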

lwjohnst86/acdcourse documentation built on June 18, 2019, 8:26 p.m.