In lwjohnst86/acdcourse: Course on Analyzing Cohort Datasets in R

The purpose of epidemiology is to study disease. Two important terms related to diseases in cohort settings are prevalence and incidence.

Prevalent vs incident cases of disease

Prevalent vs incident cases

What are prevalence and incidence? Prevalence is a single snapshot in time and counts the number of people who have a disease. Incidence is the number of new cases of a disease over a specific period of time. Because of the time component, incidence data is much more powerful at uncovering how exposures can influence disease. This graph illustrates prevalence compared to incidence. At visit zero, person one has a disease, so there is one total case, which is also the one prevalent case.

Prevalent vs incident cases of disease

Prevalent vs incident cases

At the next visit, person two develops the disease. There are now two prevalent cases, but one of these cases is an incident case.

Prevalent vs incident cases of disease

Prevalent vs incident cases

At visit two, person three also gets the disease. There are now three total cases, or rather three prevalent cases at visit two, and only one incident case since visit one.

Prevalence is always the number of cases at a specific time. For incident cases though, we need to know the person's previous status, so it is more difficult to count.

Over the entire period of follow-up, there are two incident cases due to person two and person three.

Incident data gives stronger scientific evidence

Temporal aspect of an association
- Exposure preceding disease occurrence
Participants at recruitment shouldn't have the disease
- If they do, it isn't incident data

It's this incidence information from prospective cohorts that is it's main strength and what makes the study design such a powerful scientific tool.

With this temporal aspect of the design and analysis, evidence for a particular finding is stronger compared to other study designs. Since the exposure occurs before the disease develops, you are able to state with more certainty whether it influenced the disease or not.

For this to happen, participants cannot have the disease at the recruitment visit. Otherwise, it won't be incident data, as we saw in the figure. So, for prospective cohort analyses, it's important to confirm prevalent cases during the first visit.

Counting number of cases at recruitment

framingham %>% 
    select(prevalent_chd = prevchd,
           followup_visit_number = period) %>% 
    filter(followup_visit_number == 1) %>% 
    count(prevalent_chd)

# A tibble: 2 x 2
  prevalent_chd     n
          <int> <int>
1             0  4240
2             1   194

filter() keeps rows matching TRUE logical condition
Drops rows with !, converting TRUE to FALSE, e.g. !TRUE equals FALSE

Let's check the number of cases at the recruitment, or first, visit. First we select and rename the variables to be more descriptive, then we use filter to keep only baseline data when visit number equals one. Next, we count the number of disease cases. Here we are counting the prevalent cases of coronary heart disease, or CHD, which is a subset of CVD. One of the exclusion criteria for the Framingham study was to not have CVD, but these participants were, likely accidentally, included in the study.

Often during the baseline data collection, researchers find out that some participants actually do have the disease and were accidentally recruited. This happens even with the most rigorous practices, and is why exploratory data analysis is so important. Part of your analysis must be to check these things out and to confirm your assumptions. You'll also notice that we didn't use the got cvd variable. One of the reasons for this is that CVD is composed of several related diseases, such as stroke or CHD, so there are multiple variables that contain information on CVD. These are variables we will need to check into as we explore the data more and tidy it up.

Note that, in our code, filter kept the rows that match the true condition. In this case, only the first visit. But if you want to drop rows, you can do the opposite by using the exclamation mark to drop the rows that do not match the true condition.