library(learnr)
knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE)
The Framingham cohort was set up to study what factors may influence the risk of cardiovascular disease (CVD). People from the town of Framingham, USA were recruited and followed over time. Data was collected on risk factors and CVD outcomes every few years.
The `framingham` dataset has been loaded for you to optionally explore. Note, however, that the dataset has not yet been tidied. We'll go through tidying it in Chapter 2.
question("What specifically distinguishes the Framingham study as a *cohort*?",
  answer("It studies a disease (CVD).", correct = FALSE,
    message = "Almost, but incorrect. Many types of scientific studies study a disease, but that alone doesn't distinguish them as a cohort study."),
  answer("Participants all came from the town of Framingham, USA.", correct = TRUE,
    message = "Cohorts are people who share a common characteristic. In this case, the participants come from the same town and so have a similar environment."),
  answer("Participants were followed over time.", correct = FALSE,
    message = "Almost, but incorrect. Many types of scientific studies follow their subjects over time (e.g. clinical trials), but that alone doesn't distinguish them as a cohort study."),
  answer("Participants had risk factors measured.", correct = FALSE,
    message = "Incorrect. Many types of scientific studies measure risk factors, but that alone doesn't distinguish them as a cohort."),
  allow_retry = TRUE
)
"Cohorts are people who have *something in common*."
It's usually possible to determine the cohort design from the variables within the dataset. At least two variables in the Framingham Heart Study give us some indication of the cohort design. Recall that cohorts involve a data collection period.
The `dplyr` package has been loaded, as well as the `framingham` dataset. Again, note that `framingham` has not yet been tidied up, which we will do later in the course.
capture.output({
  library(dplyr)
  library(acdcourse)
  data("framingham")
  framingham$time <- NULL
}, file = tempfile())
Instructions:
- Check out the variable names of the `framingham` dataset.
- Select the two columns that indicate `framingham`'s cohort design.

# Check out the variable names
names(framingham)

# Select the two columns that indicate design
framingham %>%
  select(___, ___)
"The Framingham cohort was designed to study the disease `cvd`."
# Check out the variable names
names(framingham)

# Select two columns that indicate design
framingham %>%
  select(period, cvd)
"Great job!"
question("What is Framingham's cohort design? Remember, *when* the disease occurs is what distinguishes prospective from retrospective cohorts.",
  answer("Prospective.", correct = TRUE),
  answer("Retrospective.", correct = FALSE,
    message = "Incorrect. In a retrospective cohort, participants already have the disease when they enter the study. In Framingham, participants did not have the disease at entry."),
  answer("Neither.", correct = FALSE, message = "Incorrect."),
  answer("Both.", correct = FALSE, message = "Incorrect. It can't be both!"),
  allow_retry = TRUE
)
"The study was designed to investigate how people *develop* CVD over time (i.e. they don't have the disease when the study starts)."
To properly analyze the data you need to know what each variable represents. Usually it's fairly easy to identify the outcome (the disease). However, knowing which variables are potential exposures to investigate can be tricky, since modern cohort studies often measure hundreds of variables on each participant.
Initially, it can be helpful to keep only the variables of interest. For now, select a few interesting variables, rename them so they are more descriptive, and explore them further.
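As a rough illustration of selecting and renaming in one step, the sketch below uses a small made-up data frame (the `toy` object and its values are hypothetical; only the column names mirror `framingham`). It contrasts `select()`, which keeps only the columns you list, with `rename()`, which keeps every column.

```r
library(dplyr)

# Hypothetical toy data: column names mirror framingham, values are invented
toy <- tibble(
  cvd = c(0, 1, 0),
  totchol = c(195, 250, 210),
  bmi = c(26.9, 28.7, 25.3),
  sex = c(1, 2, 1)
)

# select() keeps only the listed columns, renaming them as it goes
# Format: new_variable_name = old_variable_name
selected <- toy %>%
  select(got_cvd = cvd, total_cholesterol = totchol)

# rename() changes the names but keeps every column in place
renamed <- toy %>%
  rename(got_cvd = cvd, total_cholesterol = totchol)

names(selected)
names(renamed)
```

Use `select()` when you want a narrower dataset to explore, and `rename()` when you need the other columns later.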
Instructions:
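- Run `names(framingham)` in the console to find the exact names of the variables.
- Select the outcome variable `cvd`, renaming it to `got_cvd`.
- Select `totchol`, `bmi`, and `cursmoke`, renaming them to `total_cholesterol`, `body_mass_index`, and `currently_smokes`.
- Rename the `period` variable to `followup_visit_number`.

capture.output({
  library(dplyr)
  library(acdcourse)
  data("framingham")
}, file = tempfile())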
# Select and rename the potential predictors and outcome
explore_framingham <- framingham %>%
  select(
    # Format: new_variable_name = old_variable_name
    # Outcome
    _____ = cvd,
    # Predictors
    _____ = totchol,
    _____ = bmi,
    _____ = cursmoke,
    # Visit number
    _____ = period
  )
"Rename `bmi` to `body_mass_index`, `totchol` to `total_cholesterol`, and `cursmoke` to `currently_smokes`."
# Select and rename the potential predictors and outcome
explore_framingham <- framingham %>%
  select(
    # Format: new_variable_name = old_variable_name
    # Outcome
    got_cvd = cvd,
    # Predictors
    total_cholesterol = totchol,
    body_mass_index = bmi,
    currently_smokes = cursmoke,
    # Visit number
    followup_visit_number = period
  )
"Great job! You've selected and renamed the variables correctly."
Early in any analysis of a cohort dataset, you should get some simple summaries of the exposures, comparing those with and without the disease. This matters even more when the study has a time component, because the summaries show how variables change over time or differ between groups.
Using what was shown in the video, calculate some means based on some groupings.
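A grouped mean with missing values might look like the sketch below. The `toy` data frame and its values are invented for illustration, and `n_missing` is an extra column added here to show how many values were dropped. The `.groups = "drop"` argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical toy data with one missing BMI value at visit 1
toy <- tibble(
  followup_visit_number = c(1, 1, 1, 2, 2, 2),
  got_cvd = c(0, 0, 1, 0, 0, 1),
  body_mass_index = c(24, NA, 30, 25, 27, 31)
)

toy_means <- toy %>%
  # One row of output per visit and CVD status
  group_by(followup_visit_number, got_cvd) %>%
  summarize(
    # Without na.rm, a single NA makes the whole group mean NA
    mean_with_na = mean(body_mass_index),
    # With na.rm = TRUE, NA values are dropped before averaging
    mean_na_removed = mean(body_mass_index, na.rm = TRUE),
    # Track how many values were missing in each group
    n_missing = sum(is.na(body_mass_index)),
    .groups = "drop"
  )

toy_means
```

Reporting the number of missing values next to each mean helps you judge how much data each summary is based on.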
Instructions:
- Group by `followup_visit_number` and `got_cvd` using the `dplyr` function `group_by()`.
- Calculate the mean of `body_mass_index`, `currently_smokes`, and `total_cholesterol` using `summarize()` and `mean()`.
- Make sure `mean()` drops `NA` values by setting the `na.rm` argument to `TRUE`.

capture.output({
  library(dplyr)
  library(acdcourse)
  data("framingham")
  explore_framingham <- framingham %>%
    select(
      got_cvd = cvd,
      total_cholesterol = totchol,
      body_mass_index = bmi,
      currently_smokes = cursmoke,
      followup_visit_number = period
    )
}, file = tempfile())
explore_framingham %>%
  # Group by visit and CVD status
  group_by(___, ___) %>%
  # Mean of body mass, smoking, and cholesterol
  summarize(
    body_mass_mean = mean(___, na.rm = ___),
    smokes_mean = ___,
    cholesterol_mean = ___
  )
"Use `na.rm = TRUE` with `mean()` to exclude `NA` from the mean calculation."
explore_framingham %>%
  # Group by visit and CVD status
  group_by(followup_visit_number, got_cvd) %>%
  # Mean of body mass, smoking, and cholesterol
  summarize(
    body_mass_mean = mean(body_mass_index, na.rm = TRUE),
    smokes_mean = mean(currently_smokes, na.rm = TRUE),
    cholesterol_mean = mean(total_cholesterol, na.rm = TRUE)
  )
"Awesome! You learned how to compare the difference in some basic predictors in those who did and did not get CVD over the study duration."
Here, you will count the number of cases and non-cases for both prevalent myocardial infarction (MI), or `prevalent_mi`, and coronary heart disease (CHD), or `prevalent_chd`, at each visit. Remember, for longitudinal data, like that in prospective cohorts, you need to count by the time period, since each participant will have several rows, one for each data collection visit.

Both `dplyr` and `tidyr` are loaded and all variables have been added back into `explore_framingham`.
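To see what counting by time period produces, here is a minimal sketch on invented data (the `toy` data frame is hypothetical). It also shows that `count()` gives the same result as `group_by()` followed by `summarize(n = n())`.

```r
library(dplyr)

# Hypothetical long-format data: one row per participant per visit
toy <- tibble(
  followup_visit_number = c(1, 1, 1, 2, 2, 3),
  prevalent_mi = c(0, 0, 1, 0, 1, 1)
)

# count() tallies rows for each combination of the listed variables
by_count <- toy %>%
  count(followup_visit_number, prevalent_mi)

# The same tally written out with group_by() and summarize()
by_group <- toy %>%
  group_by(followup_visit_number, prevalent_mi) %>%
  summarize(n = n(), .groups = "drop")

by_count
```

Note that combinations with no rows (here, visit 3 with `prevalent_mi` equal to 0) do not appear in the output at all, so a missing row means a count of zero.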
capture.output({
  library(dplyr)
  library(acdcourse)
  data("framingham")
  explore_framingham <- framingham %>%
    rename(
      got_cvd = cvd,
      total_cholesterol = totchol,
      body_mass_index = bmi,
      currently_smokes = cursmoke,
      followup_visit_number = period,
      prevalent_chd = prevchd,
      prevalent_mi = prevmi
    )
}, file = tempfile())
Instructions:
- Use `count()` to find the number of participants at each `followup_visit_number`.

# Count number of participants per visit
explore_framingham %>%
  count(___)
"The code is `count(followup_visit_number)`."
# Count number of participants per visit
explore_framingham %>%
  count(followup_visit_number)
"Great!"
Instructions:
- Count the number of cases of `prevalent_mi` at each `followup_visit_number`.

explore_framingham %>%
  count(followup_visit_number)

# Count by visit, then prevalent cases of MI
explore_framingham %>%
  count(___, ___)
"Include both variables in `count()`, separated by a comma."
explore_framingham %>%
  count(followup_visit_number)

# Count by visit, then prevalent cases of MI
explore_framingham %>%
  count(followup_visit_number, prevalent_mi)
"Amazing!"
Instructions:
- Do the same for `prevalent_chd`.

explore_framingham %>%
  count(followup_visit_number)

explore_framingham %>%
  count(followup_visit_number, prevalent_mi)

# Count by visit, then prevalent cases of CHD
explore_framingham %>%
  count(___, ___)
"Use the same syntax as for the `prevalent_mi` code."
explore_framingham %>%
  count(followup_visit_number)

explore_framingham %>%
  count(followup_visit_number, prevalent_mi)

# Count by visit, then prevalent cases of CHD
explore_framingham %>%
  count(followup_visit_number, prevalent_chd)
"Woohoo! Nice job. You now know how to count the number of cases by visit."
From the previous exercise, we know that there are prevalent cases of cardiovascular events at the first visit. Prevalent cases of disease at the recruitment visit can introduce bias, so we need to remove these cases before continuing with any further analyses.
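One way to drop baseline prevalent cases is a negated `filter()` condition. The sketch below uses invented data (the `toy` data frame and its values are hypothetical) and also illustrates a detail worth checking in real data: `filter()` keeps only rows where the condition is `TRUE`, so rows where the condition evaluates to `NA` are dropped as well.

```r
library(dplyr)

# Hypothetical toy data, including one missing CHD status at visit 1
toy <- tibble(
  followup_visit_number = c(1, 1, 1, 2, 2),
  prevalent_chd = c(0, 1, NA, 1, 0)
)

# Drop rows that are both at visit 1 and a prevalent CHD case
kept <- toy %>%
  filter(!(followup_visit_number == 1 & prevalent_chd == 1))

# Equivalent form: NOT (A AND B) is the same as (NOT A) OR (NOT B)
kept_alt <- toy %>%
  filter(followup_visit_number != 1 | prevalent_chd != 1)

# %in% returns FALSE for NA, so this variant keeps the row
# with unknown CHD status at visit 1
kept_with_na <- toy %>%
  filter(!(followup_visit_number == 1 & prevalent_chd %in% 1))

nrow(kept)
nrow(kept_with_na)
```

Whether participants with unknown baseline status should stay in the analysis is a study design decision. The point here is only that the two forms treat them differently.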
Instructions:
- Drop (using `!`) observations where `followup_visit_number` is equal to 1 and where `prevalent_chd` is equal to 1.

capture.output({
  library(dplyr)
  library(acdcourse)
  data("framingham")
  explore_framingham <- framingham %>%
    rename(
      got_cvd = cvd,
      total_cholesterol = totchol,
      body_mass_index = bmi,
      participant_age = age,
      currently_smokes = cursmoke,
      followup_visit_number = period,
      prevalent_chd = prevchd,
      prevalent_mi = prevmi
    )
}, file = tempfile())
# Drop prevalent chd cases from first visit
no_prevalent_cases <- explore_framingham %>%
  filter(!(___ == ___ & ___ == ___))

# Confirm the number by counting visit then chd cases
no_prevalent_cases %>%
  count(___, ___)
"Filtering logic has the form `variable == condition`, for instance `followup_visit_number == 1`."
# Drop prevalent chd cases from first visit
no_prevalent_cases <- explore_framingham %>%
  filter(!(followup_visit_number == 1 & prevalent_chd == 1))

# Confirm the number by counting visit then chd cases
no_prevalent_cases %>%
  count(followup_visit_number, prevalent_chd)
"Excellent! You've dropped baseline prevalent cases of CHD and started making sure that you've reduced bias in the final results!"