History File Stratification
In LTASR: Functions to Replicate the Center for Disease Control and Prevention's 'LTAS' Software in R

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(dplyr)
library(tidyr)
library(ggplot2)
library(readr)
library(knitr)
library(LTASR)
per_form <- tibble(Variable = c('id',
'begin_dt',
'end_dt',
'\\<daily exposure variables\\>'
),
Description = c('Unique identifier for each person',
'Beginning date of exposure period',
'End date of exposure period',
'Exposure variable(s)'
),
Format = c('',
'character',
'character',
'numeric'
)
)

History File

In order to stratify a cohort by a time-dependent exposure covariate (aside from age and calendar period), a history file must be created and read in. This file contains one row per person per exposure period. An exposure period is a period of time in which all daily values/levels of an exposure variable are assumed to be constant.

Below are the required variables to be found within the history file:

kable(per_form)

Below is an example layout of a history file with multiple exposures:

options(knitr.kable.NA = '')
history_example %>%
  kable()

The above example contains 3 persons with 2 exposure variables, employed and exposure_level. Person/id 1 contains 2 non-overlapping exposure periods in which employed is 1 for both but exposure_level drops from 71 to 5 units per day.

Calculation of Cumulative Exposure From History File

LTASR comes with an example history file, called history_example, that can be used in conjunction with person_example for testing. Below reads in both example files and formats dates appropriately:

person <- person_example %>%
  mutate(dob = as.Date(dob, format='%m/%d/%Y'),
         pybegin = as.Date(pybegin, format='%m/%d/%Y'),
         dlo = as.Date(dlo, format='%m/%d/%Y'))

history <- history_example %>%
  mutate(begin_dt = as.Date(begin_dt, format='%m/%d/%Y'),
         end_dt = as.Date(end_dt, format='%m/%d/%Y')) %>%
  group_by(id)

For the remainder of this section, we will consider Person/id 1 to demonstrate how exposure is calculated over time. Below is the information found within the person file for person/id 1:

options(knitr.kable.NA = '')
person_example %>%
  filter(id == 1) %>%
  kable()

This example person's follow-up starts on r filter(person_example, id==1)$pybegin and continues through r filter(person_example, id==1)$dlo. Below plots their cumulative exposure for both employed and exposure_level variables:

pid <- 1
history %>%
  filter(id ==pid) %>%
  bind_rows(tibble(id=pid, 
                   begin_dt=filter(history, id==pid, row_number()==n())$end_dt + 1,
                   end_dt=filter(person, id==pid)$dlo,
                   employed=0,
                   exposure_level=0)) %>%
  expand_dates(begin_dt, end_dt) %>%
  filter(!is.na(period)) %>%
  mutate(Date = date, 
         Employed = cumsum(employed),
         `Exposure Level` = cumsum(exposure_level)) %>%
  select(id, Date, Employed, `Exposure Level`) %>%
  pivot_longer(c(Employed, `Exposure Level`)) %>%

  ggplot() +
  geom_line(aes(x=Date,
                y=value)) +
  facet_wrap(vars(name), nrow = 1, scales = "free") +
  theme_bw()

Both exposures start at 0, then employed increases by 1 unit per day for both periods. This can therefore be thought of as a duration variable (in days) of all periods. The exposure_level increases rapidly (71 units per day) during the first period and then increases slower (5 units per day).

NOTE: Any gaps within the history file and the follow-up times (for example, the period between the last exposure period within the history file through the end of follow-up) is assumed to be 0. That is, exposure values do not change during these periods.

Stratifying person time

Once the person file and history file have been read in (see Demo for basic stratification vignette for additional information on how to read in files), information on how to stratify the exposure variables must be defined using the exp_strata() function.

Below specifies which exposure variables to consider, what cut-points to use for stratification and any lag (in years) to apply to the cumulative exposure variable:

exp1 <- exp_strata(var = 'employed',
                   cutpt = c(-Inf, 365, Inf),
                   lag = 0)
exp2 <- exp_strata(var = 'exposure_level',
                   cutpt = c(-Inf, 0, 10000, 20000, Inf),
                   lag = 10)

The employed variable will contain 2 strata: (-Inf, 365] and (365, Inf]. Or, put alternatively, ≤ 1 year and > 1 year.

The exposure_level will contain 5 strata: (-Inf, 0], (0, 10000], (10000, 20000] and (20000, Inf). Therefore, the first category defines unexposed person-time. Additionally, a 10 year lag will be applied when defining strata.

Once the exposure strata have been defined, LTASR provides two functions for stratifying the cohort. One is get_table_history whose usage is:

py_table <- get_table_history(persondf = person,
                              rateobj = us_119ucod_recent,
                              historydf = history,
                              exps = list(exp1, exp2))

This creates the below table:

py_table %>%
  head() %>%
  kable()

This function is very fast, and replicates how the original LTAS behaved. It also exactly stratifies the person-days into the appropriate strata. However, it may be desired to calculate mean exposure values for each strata to be used in a Poisson regression later. To implement this exactly is very slow.

Therefore, a separate function, get_table_history_est, calculates these mean exposure values and also allows for a step parameter to be specified defining the number of days to calculate the cumulative exposure.

An example usage is:

py_table_est <- get_table_history_est(persondf = person,
                                  rateobj = us_119ucod_recent,
                                  historydf = history,
                                  exps = list(exp1, exp2),
                                  step = 7)

By specifying step = 7, person time is considered every 7 days when allocating person-time to strata. This results in a significant increase in speed at the cost of a (generally) trivial amount of inaccuracy.

Specifying step = 1 will calculate strata exactly for each individual day, but is significantly slower.

Below is the result of this specification:

py_table_est %>%
  head() %>%
  kable(digits = 1)

As can be seen, the pday are slightly different than the previous table. However, the effects on results will generally be trivial.

In addition, two additional variables are available: employed and exposure_level indicating the person-time weighted mean values.

Step Specifications

When specifying the step parameter, there is a trade-off between computation speed and accuracy. Specifying step = 1 will result in the most accurate stratification, but can be extremely slow.

To investigate this further, below plots the time (in minutes) taken to stratify a cohort of 5,200 people with 2 exposure variables for various specifications of the step parameter:

times <- c(360.33, 192.92, 138.79, 116.28, 101.35,  85.00,  79.82,  71.31,  68.42,  63.43,  43.82,  33.28) 

ggplot() +
  geom_point(aes(x = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 30, 365),
                 y = times/60)) +
  theme_bw() +
  scale_x_continuous(name="Step", 
                     trans = 'log10',
                     breaks = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 30, 365))+
  scale_y_continuous(name="Minutes",
                     breaks = 0:6,
                     limits = c(0, 6.5))

There are dramatic savings in computation time when increasing the step parameter in the low end. In this example, after about step = 10, improvements in computation time diminish. It seems a step parameter of about 5-10 is a good compromise.

Exact times will depend upon:

the size of the cohort,\
the number of exposure variables and\
the number of strata per exposure variable.

An additional consideration is the level of detail of the exposure variable. That is, if exposure is dramatically changing, relative to its specified strata, the loss of accuracy will be more dramatic for small increases of the step parameter. For example, age is stratified by 5-year increments, therefore, a step value of 1-week (step = 7) will cause a trivial amount of inaccuracy.

One option is to use a crude step value during initial investigations, but when results are to be published/presented, the function can be run again with a smaller step value.

Any scripts or data that you put into this service are public.

LTASR documentation built on Sept. 11, 2024, 6:51 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

LTASR
Functions to Replicate the Center for Disease Control and Prevention's 'LTAS' Software in R

History File Stratification
In LTASR: Functions to Replicate the Center for Disease Control and Prevention's 'LTAS' Software in R

History File

Calculation of Cumulative Exposure From History File

Stratifying person time

Step Specifications

Try the LTASR package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

LTASR Functions to Replicate the Center for Disease Control and Prevention's 'LTAS' Software in R

History File Stratification In LTASR: Functions to Replicate the Center for Disease Control and Prevention's 'LTAS' Software in R

History File

Calculation of Cumulative Exposure From History File

Stratifying person time

Step Specifications

Try the LTASR package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

LTASR
Functions to Replicate the Center for Disease Control and Prevention's 'LTAS' Software in R

History File Stratification
In LTASR: Functions to Replicate the Center for Disease Control and Prevention's 'LTAS' Software in R