identify_overlap: Identify overlapping timesteps for cohorts
In al-obrien/farrago: A random collection of helpful code baubles

identify_overlap

R Documentation

Identify overlapping timesteps for cohorts

Description

identify_overlap creates a vector that flags rows in a cohort whose timesteps overlap.

Usage

identify_overlap(data, grp_id, date_start, date_end, preserve_id = F)

Arguments

`data`	A data object (tibble, data.frame, data.table).
`grp_id`	Unique ID for each member of the cohort (unquoted).
`date_start`	Date column (e.g. YYYY-mm-dd) marking the entry point of each record (unquoted).
`date_end`	Date column (e.g. YYYY-mm-dd) marking the exit point of each record (unquoted).
`preserve_id`	Logical value; if set to `TRUE`, the output is a list that also contains the original ID, so that the column merges back correctly.

Details

Data organized as a cohort typically has a long format with repeated records for an individual, each covering a particular date span. Sometimes consecutive records overlap (through data entry errors or otherwise); identifying them allows the records to be collapsed so that only the min/max time points are preserved. This is an important step in ensuring a cohort has monotonic (i.e. ever-increasing) timesteps.

The logic sorts the records in each group by date_start and checks whether each date_start is larger than the preceding date_end. When FALSE, the record overlaps the one before it and keeps the same flag; when TRUE, the flag increments. This function does not perform the collapse itself, as that can have nuanced implications with NA values, but it provides the groupings required to do so. The original data should be sorted by group and dates so that the returned flag aligns with its rows. For performance, this function is written primarily in data.table.
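The sort-and-compare logic described above can be illustrated with a minimal base R sketch. This is not the package's data.table implementation: flag_overlap_sketch is a hypothetical helper, and it assumes a strict "later than" comparison, a flag that restarts at 1 within each group, and that a missing date in the comparison starts a new span. The real function may differ on any of these points.

```r
# Hypothetical helper for illustration only; not the farrago implementation.
flag_overlap_sketch <- function(grp_id, date_start, date_end) {
  stopifnot(length(grp_id) == length(date_start),
            length(grp_id) == length(date_end))
  flag <- integer(length(grp_id))
  # Visit the row positions belonging to each group
  for (idx in split(seq_along(grp_id), grp_id)) {
    # Sort the group's rows by start date (end date breaks ties)
    idx <- idx[order(date_start[idx], date_end[idx])]
    n <- length(idx)
    # TRUE where a record starts after the preceding record ends
    new_span <- c(FALSE, date_start[idx[-1]] > date_end[idx[-n]])
    # Assumption: a comparison involving NA starts a new span
    new_span[is.na(new_span)] <- TRUE
    # The flag increments at every TRUE and is written back in original row order
    flag[idx] <- cumsum(new_span) + 1L
  }
  flag
}

# Records taken from groups 1 and 3 of the example data (rows without NA)
grp   <- c(1, 1, 1, 3, 3, 3, 3)
start <- as.Date(c('2020-01-01', '2020-01-01', '2020-01-05',
                   '2020-01-01', '2020-01-02', '2020-01-15', '2020-01-22'))
end   <- as.Date(c('2020-01-02', '2020-01-04', '2020-09-02',
                   '2020-01-02', '2020-01-20', '2020-01-19', '2020-04-02'))

flags <- flag_overlap_sketch(grp, start, end)
print(flags)
```

Rows that share a flag within a group form one overlapping run, which is the grouping a later collapse step would use.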

To find the exact overlapping ranges, one option is to build intervals with lubridate::interval() and compare them with lubridate::intersect().
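As a rough illustration of that approach, the sketch below builds intervals for the first two records of group 2 in the example data and extracts the range they share. The variable names (span_a, span_b, shared) are chosen here for illustration; int_overlaps(), int_start() and int_end() are additional lubridate helpers not mentioned above.

```r
library(lubridate)

# Two records for the same cohort member (group 2 in the example data)
span_a <- interval(ymd('2020-01-01'), ymd('2020-09-15'))
span_b <- interval(ymd('2020-09-10'), ymd('2020-09-20'))

# Logical check: do the two intervals share any time?
overlaps <- int_overlaps(span_a, span_b)
print(overlaps)

# The shared range itself, returned as an interval
shared <- lubridate::intersect(span_a, span_b)
print(int_start(shared))
print(int_end(shared))
```

Calling the function with the lubridate:: prefix avoids ambiguity with the base R set function of the same name.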

Value

An integer vector (ordered by grp_id and dates) or, when preserve_id is `TRUE`, a list containing the original ID and the collapse ID.

Examples

# Load libraries
library(dplyr); library(data.table); library(lubridate); library(magrittr)
# Create fake data for scenarios
test_data <- tribble(~grp_id, ~date_start, ~date_end,
                     1, '2020-01-01', '2020-01-02',
                     1, '2020-01-01', '2020-01-04',
                     1, '2020-01-05', '2020-09-02',
                     2, '2020-01-01', '2020-09-15',
                     2, '2020-09-10', '2020-09-20',
                     2, '2020-09-21', NA,
                     3, '2020-01-01', '2020-01-02',
                     3, '2020-01-02', '2020-01-20',
                     3, '2020-01-15', '2020-01-19',
                     3, '2020-01-22', '2020-04-02',
                     3, '2020-04-22', NA,
                     3, '2021-06-09', '2021-06-22') %>%
  dplyr::mutate_at(vars(contains('date')), ymd)

# Create vector of outputs (ensure original dataset is sorted)
test_data$overlap_group <- identify_overlap(data = test_data,
                                             grp_id = grp_id,
                                             date_start = date_start,
                                             date_end = date_end)

# Attach the min/max dates of each overlap group to every row.
# The is.infinite() checks avoid -/+ Inf for groups that hold only NA dates; skip if not required.
test_data %>%
  group_by(grp_id, overlap_group) %>%
  mutate(min = min(date_start, na.rm = TRUE),
         max = max(date_end, na.rm = TRUE),
         min = if_else(is.infinite(min), NA_Date_, min),
         max = if_else(is.infinite(max), NA_Date_, max)) %>%
  ungroup()

al-obrien/farrago documentation built on April 14, 2023, 6:20 p.m.