Splitting cohorts

knitr::opts_chunk$set(
  collapse = TRUE,
  eval = TRUE, message = FALSE, warning = FALSE,
  comment = "#>"
)

Introduction

In this vignette we show how to split existing cohorts. We use the GiBleed database for all the examples. To make sure the GiBleed database is available, you can call the function requireEunomia(). Let's get started.

Load necessary packages:

library(duckdb)
library(CDMConnector)
library(PatientProfiles)
library(CohortConstructor)
library(dplyr, warn.conflicts = FALSE)
library(clock)

Create a cdm_reference object from the GiBleed database:

requireEunomia(datasetName = "GiBleed")
con <- dbConnect(drv = duckdb(), dbdir = eunomiaDir())
cdm <- cdmFromCon(
  con = con, cdmSchema = "main", writeSchema = "main", writePrefix = "my_study_"
)

Let's start by creating two drug cohorts, one for users of diclofenac and another for users of acetaminophen.

cdm$medications <- conceptCohort(cdm = cdm, 
                                 conceptSet = list("diclofenac" = 1124300L,
                                                   "acetaminophen" = 1127433L), 
                                 name = "medications")
cohortCount(cdm$medications)
settings(cdm$medications)

stratifyCohorts

If we want to create separate cohorts by sex we could use the function requireSex():

cdm$medications_female <- cdm$medications |>
  requireSex(sex = "Female", name = "medications_female") |>
  renameCohort(
    cohortId = c("acetaminophen", "diclofenac"), 
    newCohortName = c("acetaminophen_female", "diclofenac_female")
  )
cdm$medications_male <- cdm$medications |>
  requireSex(sex = "Male", name = "medications_male") |>
  renameCohort(
    cohortId = c("acetaminophen", "diclofenac"), 
    newCohortName = c("acetaminophen_male", "diclofenac_male")
  )
cdm <- bind(cdm$medications_female, cdm$medications_male, name = "medications_sex")
cohortCount(cdm$medications_sex)
settings(cdm$medications_sex)

The stratifyCohorts() function produces a similar output, but it relies on a column that already exists in the cohort table, so let's first add a sex column to our existing cohort:

cdm$medications <- cdm$medications |>
  addSex()
cdm$medications

Now we can use stratifyCohorts() to create new cohorts based on the sex column. One new cohort is created for each value of the sex column:

cdm$medications_sex_2 <- cdm$medications |>
  stratifyCohorts(strata = "sex", name = "medications_sex_2")
cohortCount(cdm$medications_sex_2)
settings(cdm$medications_sex_2)

Note that the two results can differ slightly. In the first case four cohorts are always created, whereas the second depends on what is in the data: if the diclofenac cohort has no 'Female' records, the diclofenac_female cohort is not created, and if some individuals have sex 'None', a {cohort_name}_none cohort is created.
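The difference between fixed and data-driven strata can be illustrated on a toy table with plain dplyr. This is only a sketch: the `toy` data are hypothetical, and the naming rule (cohort name, underscore, lower-cased value) is inferred from the cohort names mentioned above, not taken from the package internals.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical toy cohort: no diclofenac record has sex "Female",
# and one acetaminophen record has sex "None"
toy <- tibble(
  cohort_name = c("acetaminophen", "acetaminophen", "diclofenac", "acetaminophen"),
  subject_id = 1:4,
  sex = c("Female", "Male", "Male", "None")
)

# Data-driven strata: one cohort per combination actually observed
observed <- toy |>
  distinct(cohort_name, sex) |>
  mutate(new_cohort_name = paste0(cohort_name, "_", tolower(sex))) |>
  arrange(new_cohort_name)

# Fixed strata: every cohort crossed with a fixed list of sexes,
# whether or not any record falls in that combination
fixed <- expand.grid(
  cohort_name = c("acetaminophen", "diclofenac"),
  sex = c("female", "male"),
  stringsAsFactors = FALSE
)
fixed$new_cohort_name <- paste0(fixed$cohort_name, "_", fixed$sex)

observed$new_cohort_name
sort(fixed$new_cohort_name)
```

In the toy data the observed set contains acetaminophen_none but no diclofenac_female, while the fixed set always has the same four names.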

The function is very powerful: several sets of cohorts can be created in one go. In this example we create cohorts by "age and sex" and by "year".

cdm$stratified <- cdm$medications |>
  addAge(ageGroup = list("child" = c(0, 17), "18_to_64" = c(18, 64), "65_and_over" = c(65, Inf))) |>
  addSex() |>
  mutate(year = get_year(cohort_start_date)) |>
  stratifyCohorts(strata = list(c("sex", "age_group"), "year"), name = "stratified")

cohortCount(cdm$stratified)
settings(cdm$stratified)

A total of 232 cohorts were created in one go: 12 for the combinations of sex and age group, and 220 by year.
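To see how the elements of the `strata` list translate into a number of cohorts, the counting can be sketched on a small hypothetical table. Each list element is treated as one stratification, and the columns inside an element are crossed. The `toy` data and the counting helper are illustrative assumptions, not the package's implementation.

```r
# Hypothetical cohort records with the stratification columns already added
toy <- data.frame(
  cohort_name = c("diclofenac", "diclofenac", "acetaminophen"),
  sex = c("Female", "Male", "Female"),
  age_group = c("child", "child", "65_and_over"),
  year = c(1990L, 1990L, 2001L)
)

strata <- list(c("sex", "age_group"), "year")

# For each element of `strata`, count the distinct combinations observed
# within each original cohort
n_per_stratification <- vapply(
  strata,
  function(cols) nrow(unique(toy[c("cohort_name", cols)])),
  integer(1)
)
names(n_per_stratification) <- vapply(strata, paste, character(1), collapse = " & ")

n_per_stratification
sum(n_per_stratification)
```

The same logic applied to the real data gives the 12 + 220 = 232 cohorts reported above.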

Note that these year cohorts were created based on the prescription start date, but they can have end dates after that year. If you want to split the cohorts on yearly contributions see the next section.

yearCohorts

yearCohorts() splits the contribution of a cohort into the different calendar years it is spread across. Let's look at a simple example:

library(ggplot2)
x <- tibble(
  time = as.Date(c("2010-05-01", "2012-06-12", "2010-05-01", "2010-12-31", "2011-01-01", "2011-12-31", "2012-01-01", "2012-06-12")),
  y = c(rep(1, 2), rep(0.8, 2), rep(0.78, 2), rep(0.76, 2)),
  colour = c(rep("1", 2), rep("2", 6)),
  group = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L)
)
ggplot(data = x, mapping = aes(x = time, y = y, colour = colour, group = group)) +
  geom_line() +
  geom_point() +
  scale_y_continuous(limits = c(0.56, 1.2), breaks = NULL, labels = NULL) +
  theme_bw() +
  theme(
    axis.title.y = element_blank(),
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    legend.position = "none"
  )

In this example an individual has a cohort entry that starts on 2010-05-01 and ends on 2012-06-12. Its contribution is split into three: from 2010-05-01 to 2010-12-31, from 2011-01-01 to 2011-12-31, and from 2012-01-01 to 2012-06-12.
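A minimal sketch of that split in base R, for a single cohort entry. The helper `split_by_year()` is hypothetical and written only for illustration; it assumes inclusive start and end dates and calendar-year boundaries, and it is not how yearCohorts() is implemented.

```r
# Hypothetical helper: cut one [start, end] interval at calendar-year boundaries
split_by_year <- function(start, end) {
  years <- seq(as.integer(format(start, "%Y")), as.integer(format(end, "%Y")))
  year_start <- as.Date(paste0(years, "-01-01"))
  year_end <- as.Date(paste0(years, "-12-31"))
  data.frame(
    year = years,
    # Each piece starts at the later of the entry start and 1 January
    cohort_start_date = pmax(start, year_start),
    # Each piece ends at the earlier of the entry end and 31 December
    cohort_end_date = pmin(end, year_end)
  )
}

pieces <- split_by_year(as.Date("2010-05-01"), as.Date("2012-06-12"))
pieces
```

The result has one row per calendar year touched by the entry: three rows for 2010, 2011 and 2012.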

Let's apply it to our medications cohort:

cdm$medications_year <- cdm$medications |>
  yearCohorts(years = c(1990:1993), name = "medications_year")
settings(cdm$medications_year)
cohortCount(cdm$medications_year)

Note that we can choose the years of interest. Let's look more closely at one individual (subject_id = 4383) who has 6 records:

cdm$medications |> 
  filter(subject_id == 4383)

Of the 6 records, only 3 fall within our period of interest, 1990-1993. Two of them start and end in the same year, so they are left unaltered and simply assigned to that year. The third starts in 1990 and ends in 1991, so its contribution is split across the two years. We therefore expect to see 4 cohort contributions for this subject (2 in 1990, 1 in 1991 and 1 in 1992):

cdm$medications_year |>
  dplyr::filter(subject_id == 4383)

Let's disconnect from our cdm object to finish.

cdmDisconnect(cdm)


CohortConstructor documentation built on June 8, 2025, 12:49 p.m.