collapse_timesteps: Collapse timesteps for cohorts

View source: R/calculate.R

collapse_timesteps    R Documentation

Collapse timesteps for cohorts

Description

collapse_timesteps creates a vector that flags successive records in a cohort whose gap falls within a threshold difference (usually in days), so that those records can be collapsed into a single timestep.

Usage

collapse_timesteps(
  data,
  grp_id,
  date_start,
  date_end,
  threshold = 1,
  preserve_id = FALSE
)

Arguments

data

A data object (tibble or data.frame).

grp_id

Unique ID for each member of the cohort (unquoted).

date_start

Column holding the entry (start) date of each record, in Date format (e.g. YYYY-mm-dd); unquoted.

date_end

Column holding the exit (end) date of each record, in Date format (e.g. YYYY-mm-dd); unquoted.

threshold

Integer value for the acceptable difference in days between successive records (defaults to 1).

preserve_id

Logical value; if TRUE, the output is a list that includes the original IDs so that the column can be merged back correctly (defaults to FALSE).

Details

Cohort data are typically stored in long format, with repeated records for each individual and a date span for each record. Successive records are often very close together in time and can be collapsed into a single record to simplify the cohort.

The logic compares the date_end of the previous record with the date_start of the next record. If the difference between those two dates is greater than the threshold, the records are flagged as different groups in the progression of cohort steps; otherwise they are flagged as the same group and can be collapsed.

To make this comparison, the data provided are sorted by id and dates, so the output is returned in that order. If joining back to the original data set, ensure it is sorted by the same columns. Because the logic requires looping by individual, the function is written using data.table; versions using dplyr and Rcpp were also trialed.
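The comparison described above can be sketched in base R. This is only an illustration of the stated logic, not the package's data.table implementation: the helper name sketch_collapse is hypothetical, and numbering the groups from 1 within each individual is an assumption.

```r
# Hypothetical helper illustrating the described logic (not the package code).
sketch_collapse <- function(grp_id, date_start, date_end, threshold = 1) {
  # Sort by id and dates, as the function is described as doing
  ord <- order(grp_id, date_start, date_end)
  grp_id <- grp_id[ord]
  start_day <- as.numeric(date_start[ord])  # days since 1970-01-01
  end_day <- as.numeric(date_end[ord])

  n <- length(grp_id)
  if (n == 0L) return(integer(0))

  # Gap in days between each record's start and the previous record's end
  gap <- start_day[-1] - end_day[-n]

  # A new step begins at the first record of an individual,
  # or when the gap is more than the threshold
  first_in_grp <- c(TRUE, grp_id[-1] != grp_id[-n])
  new_step <- first_in_grp | c(TRUE, gap > threshold)

  # Assumption: group numbers restart at 1 for each individual
  as.integer(ave(as.integer(new_step), grp_id, FUN = cumsum))
}

# Made-up records for a single individual
ids <- c(2, 2, 2)
starts <- as.Date(c("2020-01-01", "2020-09-10", "2020-09-21"))
ends <- as.Date(c("2020-09-02", "2020-09-20", "2020-09-22"))
steps <- sketch_collapse(ids, starts, ends, threshold = 1)
```

Here the second record starts 8 days after the first ends, so it opens a new group, while the third starts 1 day after the second ends and stays in the same group.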

Value

An integer vector (ordered by grp_id and dates) or, if preserve_id = TRUE, a list containing the original id and the collapse id.
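Because the returned vector follows the sorted order, one way to attach it positionally is to sort the data first. The sketch below uses made-up data and assumes the sort keys are grp_id, then date_start, then date_end; the documentation only says "id and dates", so the exact tie-breaking is an assumption.

```r
# Made-up cohort data, deliberately out of order
cohort <- data.frame(
  grp_id = c(2, 1, 1),
  date_start = as.Date(c("2020-01-01", "2020-01-03", "2020-01-01")),
  date_end = as.Date(c("2020-09-02", "2020-01-04", "2020-01-02"))
)

# Sort by id and dates (assumed key order) so rows line up with the output
cohort_sorted <- cohort[order(cohort$grp_id, cohort$date_start, cohort$date_end), ]
rownames(cohort_sorted) <- NULL

# The flag vector could then be assigned directly, for example:
# cohort_sorted$timestep_group <- collapse_timesteps(cohort_sorted, grp_id,
#                                                    date_start, date_end)
```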

Examples

# Load libraries
library(dplyr); library(data.table); library(lubridate); library(magrittr); library(tibble)
# Create fake data for scenarios
test_data <- tribble(~grp_id, ~date_start, ~date_end,
                     1, '2020-01-01', '2020-01-02',
                     1, '2020-01-03', '2020-01-04',
                     1, '2020-01-04', '2020-09-02',
                     2, '2020-01-01', '2020-09-02',
                     2, '2020-09-10', '2020-09-20',
                     2, '2020-09-21', '2020-09-22',
                     3, '2020-01-01', '2020-01-02',
                     3, '2020-01-02', '2020-01-20',
                     3, '2020-01-21', '2020-01-22',
                     3, '2020-01-22', '2020-04-02',
                     3, '2020-04-22', '2021-04-22',
                     3, '2021-06-09', '2021-06-22') %>%
  # Convert the character date columns to Date
  mutate(across(contains('date'), ymd))

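# Create the vector of timestep groups. The output is ordered by grp_id and dates,
# so the original data set must already be sorted the same way before assigning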
test_data$timestep_group <- collapse_timesteps(data = test_data,
                                             grp_id = grp_id,
                                             date_start = date_start,
                                             date_end = date_end,
                                             threshold = 1)

# Show the earliest start and latest end within each collapsed group
test_data %>%
  group_by(grp_id, timestep_group) %>%
  mutate(min = min(date_start),
         max = max(date_end))



al-obrien/farrago documentation built on April 14, 2023, 6:20 p.m.