In lwjohnst86/acdcourse: Course on Analyzing Cohort Datasets in R

library(learnr)
knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE)

Introduction to cohort studies

What makes it a cohort?

The Framingham cohort was set up to study what factors may influence the risk of cardiovascular disease (CVD). People from the town of Framingham, USA were recruited and followed over time. Data was collected on risk factors and CVD outcomes every few years.

The framingham dataset has been loaded for you to optionally explore, however, do note that the dataset has not yet been tidied. We'll go through tidying it in Chapter 2.

question("What specifically distinguishes the Framingham study as a *cohort*?",
  answer("It studies a disease (CVD).", correct = FALSE, message = "Almost, but incorrect. Many types of scientific studies study a disease, but that alone doesn't distinguish them as a cohort study."),
  answer("Participants all came from the town of Framingham, USA.", correct = TRUE, message = "Cohorts are people who share a common characteristic. In this case, the participants come from the same town and so have a similar environment."),
  answer("Participants were followed over time.", correct = FALSE, message = "Almost, but incorrect. Many types of scientific studies follow their subjects over time (e.g. clinical trials), but that alone doesn't distinguish them as a cohort study."),
  answer("Participants had risk factors measured.", correct = FALSE, message = "Incorrect. Many types of scientific studies measure risk factors, but that alone doesn't distinguish them as a cohort."),
  allow_retry = TRUE
)

"Cohorts are people who have *something in common*."

What cohort type is the Framingham Heart Study? {data-progressive=TRUE}

It's usually possible to determine the cohort design from the variables within the dataset. There are at least two variables in the Framingham Heart study that give us some indication of the cohort design. Recall that cohorts involve a data collection period.

The dplyr package has been loaded, as well as the framingham dataset. Again, note that framingham has not yet been tidied up, which we will do later in the course.

capture.output({
  library(dplyr)
  library(acdcourse)
  data("framingham")
  framingham$time <- NULL
}, file = tempfile())

Exercise step

Instructions:

Familiarize yourself with the variables in the framingham dataset.
Select the two variables that give the most indication on framingham's cohort design.

# Check out the variable names
names(framingham)

# Select the two columns that indicate design
framingham %>% 
    select(___, ___)

"The Framingham cohort was designed to study the disease `cvd`."

# Check out the variable names
names(framingham)

# Select two columns that indicate design
framingham %>% 
    select(period, cvd)

"Great job!"

Exercise step

question("What is Framingham's cohort design? Remember, *when* the disease occurs is what
distinguishes prospective from retrospective cohorts.",
  answer("Prospective.", correct = TRUE),
  answer("Retrospective.", correct = FALSE, message = "Incorrect. Participants enter the study with a disease. In Framingham, participants did not have the disease."),
  answer("Neither.", correct = FALSE, message = "Incorrect."),
  answer("Both.", correct = FALSE, message = "Incorrect. It can't be both!"),
  allow_retry = TRUE
)

"The study was designed to investigate how people *develop* CVD over time (i.e. they don't have the disease when the study starts)."

Cohort types, variables, and the Framingham Study

Select the outcome and some exposures

To properly analyze the data you need to know what each variable represents. Usually it's fairly easy to identify the outcome (the disease). However, knowing which variables are potential exposures to investigate can be tricky, since modern cohort studies often measure hundreds of variables on each participant.

Initially, it can be helpful to keep only the variables of interest. For now, select a few interesting variables, renaming them so they are more descriptive, and exploring them more.

Instructions:

Run names(framingham) in the console to find the exact names of the variables.
Choose the correct outcome for cardiovascular disease (CVD). Rename it to got_cvd.
Rename the three predictors to total_cholesterol, body_mass_index and currently_smokes.
Rename the period variable to followup_visit_number.

capture.output({
  library(dplyr)
  library(acdcourse)
  data("framingham")
}, file = tempfile())

# Select and rename the potential predictors and outcome
explore_framingham <- framingham %>%
    select(
        # Format: new_variable_name = old_variable_name
        # Outcome
        _____ = cvd,
        # Predictors
        _____ = totchol,
        _____ = bmi,
        _____ = cursmoke,
        # Visit number
        _____ = period 
    )

"Rename `bmi` to `body_mass_index`, `totchol` to `total_cholesterol`, and `cursmoke` to `currently_smokes`."

# Select and rename the potential predictors and outcome
explore_framingham <- framingham %>%
    select(
        # Format: new_variable_name = old_variable_name
        # Outcome
        got_cvd = cvd,
        # Predictors
        total_cholesterol = totchol,
        body_mass_index = bmi,
        currently_smokes = cursmoke,
        # Visit number
        followup_visit_number = period 
    )

"Great job! You've selected and renamed the variables correctly."

Simple summary of the exposures by outcome

Getting some simple summaries of the exposures by those with and without the disease should be done early in any analysis of cohort datasets. Even more so when there is a time component to the study, so you can identify how variables change over time or are different between groups.

Using what was shown in the video, calculate some means based on some groupings.

Instructions:

Group the data by followup_visit_number and got_cvd using the dplyr function group_by().
Calculate the mean for body_mass_index, currently_smokes, and total_cholesterol using summarize() and mean().
Make sure that mean() drops NA values by setting the na.rm argument to TRUE.

capture.output({
  library(dplyr)
  library(acdcourse)
  data("framingham")
  explore_framingham <- framingham %>%
      select(
          got_cvd = cvd,
          total_cholesterol = totchol,
          body_mass_index = bmi,
          currently_smokes = cursmoke,
          followup_visit_number = period
      )
}, file = tempfile())

explore_framingham %>% 
    # Group by visit and CVD status
    group_by(___, ___) %>% 
    # Mean of body mass, smoking, and cholesterol
    summarize(
        body_mass_mean = mean(___, na.rm = ___),
        smokes_mean = ___,
        cholesterol_mean = ___
    )

"Use `na.rm = TRUE` with `mean()` to exclude `NA` from the mean calculation."

explore_framingham %>% 
    # Group by visit and CVD status
    group_by(followup_visit_number, got_cvd) %>% 
    # Mean of body mass, smoking, and cholesterol
    summarize(
        body_mass_mean = mean(body_mass_index, na.rm = TRUE),
        smokes_mean = mean(currently_smokes, na.rm = TRUE),
        cholesterol_mean = mean(total_cholesterol, na.rm = TRUE)
    )

"Awesome! You learned how to compare the difference in some basic predictors in those who did and did not get CVD over the study duration."

Prevalence and incidence in cohorts

Count number of participants and cases per visit {data-progressive=TRUE}

Here, you will count the number of cases and non-cases for both prevalent myocardial infarction (MI), or prevalent_mi, and coronary heart disease (CHD), or prevalent_chd, at each visit. Remember, for longitudinal data, like that in prospective cohorts, you need to count by the time period since each participant will have several rows for each of the data collection visits.

Both dplyr and tidyr are loaded and all variables have been added back into explore_framingham.

capture.output({
library(dplyr)
library(acdcourse)
data("framingham")
explore_framingham <- framingham %>%
    rename(
        got_cvd = cvd, 
        total_cholesterol = totchol,
        body_mass_index = bmi,
        currently_smokes = cursmoke,
        followup_visit_number = period,
        prevalent_chd = prevchd,
        prevalent_mi = prevmi
    )
}, file = tempfile())

Exercise step

Instructions:

Use count() to find the number of participants at each followup_visit_number.

# Count number of participants per visit
explore_framingham %>%
    count(___)

"The code is `count(followup_visit_number)`."

# Count number of participants per visit
explore_framingham %>% 
    count(followup_visit_number)

"Great!"

Exercise step

Instructions:

Count the number of participants with prevalent_mi at each followup_visit_number.

explore_framingham %>% 
    count(followup_visit_number)

# Count by visit, then prevalent cases of MI
explore_framingham %>% 
    count(___, ___)

"Include both variables in `count()`, separated by a comma."

explore_framingham %>% 
    count(followup_visit_number)

# Count by visit, then prevalent cases of MI
explore_framingham %>% 
    count(followup_visit_number, prevalent_mi)

"Amazing!"

Exercise step

Instructions:

Lastly, do the same thing for prevalent_chd.

explore_framingham %>% 
    count(followup_visit_number)

explore_framingham %>% 
    count(followup_visit_number, prevalent_mi)

# Count by visit, then prevalent cases of CHD
explore_framingham %>% 
    count(___, ___)

"Use the same syntax as for the `prevalent_mi` code."

explore_framingham %>% 
    count(followup_visit_number)

explore_framingham %>% 
    count(followup_visit_number, prevalent_mi)

# Count by visit, then prevalent cases of CHD
explore_framingham %>% 
    count(followup_visit_number, prevalent_chd)

"Woohoo! Nice job. You now know how to count the number of cases by visit."

Remove prevalent cases at the baseline

From the previous exercise, we know that there are prevalent cases of cardiovascular events at the first visit. Prevalent cases of disease at the recruitment visit can introduce bias, so we need to remove these cases before continuing with any further analyses.

Instructions:

Exclude (with !) observations where followup_visit_number is equal to 1 and where prevalent_chd is equal to 1.
Count the number of observations to make sure that patients with CHD at the first visit were dropped.

capture.output({
library(dplyr)
library(acdcourse)
data("framingham")
explore_framingham <- framingham %>%
    rename(
        got_cvd = cvd, 
        total_cholesterol = totchol,
        body_mass_index = bmi,
        participant_age = age,
        currently_smokes = cursmoke,
        followup_visit_number = period,
        prevalent_chd = prevchd,
        prevalent_mi = prevmi
    )
}, file = tempfile())

# Drop prevalent chd cases from first visit
no_prevalent_cases <- explore_framingham %>% 
    filter(!(___ == ___ & ___ == ___)) 

# Confirm the number by counting visit then chd cases
no_prevalent_cases %>% 
    count(___, ___)

"Filtering logic has the form `variable == condition`, for instance `followup_visit_number == 1`."

# Drop prevalent chd cases from first visit
no_prevalent_cases <- explore_framingham %>% 
    filter(!(followup_visit_number == 1 & prevalent_chd == 1)) 

# Confirm the number by counting visit then chd cases
no_prevalent_cases %>% 
    count(followup_visit_number, prevalent_chd)

"Excellent! You've dropped baseline prevalent cases of CHD and started making sure that you've reduced bias in the final results!"

lwjohnst86/acdcourse documentation built on June 18, 2019, 8:26 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

lwjohnst86/acdcourse
Course on Analyzing Cohort Datasets in R

In lwjohnst86/acdcourse: Course on Analyzing Cohort Datasets in R

Introduction to cohort studies

What makes it a cohort?

What cohort type is the Framingham Heart Study? {data-progressive=TRUE}

Exercise step

Exercise step

Cohort types, variables, and the Framingham Study

Select the outcome and some exposures

Simple summary of the exposures by outcome

Prevalence and incidence in cohorts

Count number of participants and cases per visit {data-progressive=TRUE}

Exercise step

Exercise step

Exercise step

Remove prevalent cases at the baseline

R Package Documentation

Browse R Packages

We want your feedback!

lwjohnst86/acdcourse Course on Analyzing Cohort Datasets in R

In lwjohnst86/acdcourse: Course on Analyzing Cohort Datasets in R

Introduction to cohort studies

What makes it a cohort?

What cohort type is the Framingham Heart Study? {data-progressive=TRUE}

Exercise step

Exercise step

Cohort types, variables, and the Framingham Study

Select the outcome and some exposures

Simple summary of the exposures by outcome

Prevalence and incidence in cohorts

Count number of participants and cases per visit {data-progressive=TRUE}

Exercise step

Exercise step

Exercise step

Remove prevalent cases at the baseline

R Package Documentation

Browse R Packages

We want your feedback!

lwjohnst86/acdcourse
Course on Analyzing Cohort Datasets in R