identify_overlap: Identify overlapping timesteps for cohorts

identify_overlap    R Documentation

Identify overlapping timesteps for cohorts

Description

identify_overlap creates a vector that flags the rows in a cohort whose timesteps overlap.

Usage

identify_overlap(data, grp_id, date_start, date_end, preserve_id = F)

Arguments

data

A data object (tibble, data.frame, data.table).

grp_id

Unique ID for each member of the cohort (unquoted).

date_start

Column of dates (e.g. YYYY-mm-dd) marking the entry point of each record (unquoted).

date_end

Column of dates (e.g. YYYY-mm-dd) marking the exit point of each record (unquoted).

preserve_id

Logical value; if TRUE, the output also includes the original IDs so that the new column can be merged back correctly.

Details

Data organized as a cohort typically take a long format, with repeated records for each individual, each covering a particular date span. Sometimes consecutive records overlap (through data entry errors or otherwise); these can be identified so that, when collapsed, only the min/max time points are preserved. This is an important step in ensuring a cohort has monotonic (i.e. ever-increasing) timesteps.

The logic sorts records by date_start within each group and checks whether each date_start is later than the preceding date_end. When this check is FALSE, the record overlaps its predecessor and keeps the same flag; when TRUE, the flag increments. The function does not perform the collapse itself, since that can have nuanced implications for NA values, but it provides the groupings required to do so. Sort the original data by group and dates before calling so that the returned flag aligns with the rows. For performance, the function is written primarily in data.table.
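The comparison described above can be sketched in base R. This is an illustration only, not the package's data.table implementation: the helper name sketch_overlap_flag is hypothetical, the flag is assumed to start at 1, a missing date_end is assumed to mean an open-ended record, and each start is compared with the latest end seen so far in the group (a running maximum, which also catches records nested inside an earlier, longer one).

```r
# Hypothetical helper for illustration; NOT the farrago implementation.
sketch_overlap_flag <- function(grp_id, date_start, date_end) {
  # Process rows in sorted order (group, start, end); input order is restored at the end
  ord <- order(grp_id, date_start, date_end)
  g <- grp_id[ord]
  s <- as.numeric(date_start[ord])
  e <- as.numeric(date_end[ord])
  e[is.na(e)] <- Inf  # assumption: a missing end date means the record is still open

  flag <- integer(length(g))
  running_end <- -Inf
  for (i in seq_along(g)) {
    if (i == 1 || g[i] != g[i - 1]) {
      # First record of a group starts a new sequence of flags
      flag[i] <- 1L
      running_end <- e[i]
    } else {
      # Start strictly later than the latest end so far: no overlap, so increment
      flag[i] <- if (s[i] > running_end) flag[i - 1] + 1L else flag[i - 1]
      running_end <- max(running_end, e[i])
    }
  }

  # Return flags aligned with the original row order
  out <- integer(length(g))
  out[ord] <- flag
  out
}

grp   <- c(1, 1, 1, 2, 2)
start <- as.Date(c("2020-01-01", "2020-01-01", "2020-01-05", "2020-01-01", "2020-09-21"))
end   <- as.Date(c("2020-01-02", "2020-01-04", "2020-09-02", "2020-09-15", NA))
flags <- sketch_overlap_flag(grp, start, end)
```

With the strict comparison used here, a record that starts on the same day the previous one ends counts as overlapping, while one that starts the following day does not. Whether that matches identify_overlap exactly should be checked against the package source.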

One way to find the exact overlapping ranges is to use lubridate::interval() and lubridate::intersect().
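A minimal sketch of that approach, assuming lubridate is installed. The two date spans are illustrative values and the variable names are placeholders.

```r
library(lubridate)

# Two records for the same individual whose date spans overlap
span_a <- interval(ymd("2020-01-01"), ymd("2020-09-15"))
span_b <- interval(ymd("2020-09-10"), ymd("2020-09-20"))

# TRUE when the two intervals share any time
spans_overlap <- int_overlaps(span_a, span_b)

# The shared range as a new interval (NA when the spans do not overlap)
shared <- lubridate::intersect(span_a, span_b)

# Boundaries of the shared range, converted back to Date
shared_start <- as.Date(int_start(shared))
shared_end   <- as.Date(int_end(shared))
```

Calling intersect with the lubridate:: prefix makes the intended function explicit, since several attached packages may export a function with that name.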

Value

An integer vector (ordered by grp_id and dates) or, when preserve_id = TRUE, a list containing the original ID and the collapse ID.

See Also

intersect, interval

Examples

# Load libraries
library(dplyr); library(data.table); library(lubridate); library(magrittr)
# Create fake data for scenarios
test_data <- tribble(~grp_id, ~date_start, ~date_end,
                     1, '2020-01-01', '2020-01-02',
                     1, '2020-01-01', '2020-01-04',
                     1, '2020-01-05', '2020-09-02',
                     2, '2020-01-01', '2020-09-15',
                     2, '2020-09-10', '2020-09-20',
                     2, '2020-09-21', NA,
                     3, '2020-01-01', '2020-01-02',
                     3, '2020-01-02', '2020-01-20',
                     3, '2020-01-15', '2020-01-19',
                     3, '2020-01-22', '2020-04-02',
                     3, '2020-04-22', NA,
                     3, '2021-06-09', '2021-06-22') %>%
  dplyr::mutate(across(contains('date'), ymd)) # across() replaces the superseded mutate_at()

# Create vector of outputs (ensure original dataset is sorted)
test_data$overlap_group <- identify_overlap(data = test_data,
                                             grp_id = grp_id,
                                             date_start = date_start,
                                             date_end = date_end)

test_data %>%
  group_by(grp_id, overlap_group) %>%
  mutate(min = min(date_start, na.rm = TRUE),
         max = max(date_end, na.rm = TRUE),
         min = if_else(is.infinite(min), NA_Date_, min),
         max = if_else(is.infinite(max), NA_Date_, max)) # To avoid -/+ Inf on only NA groupings; can skip if not required
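The example above attaches the group-level min and max to every row. If one row per non-overlapping span is wanted instead, the collapse could be sketched with summarise(). This is one possible approach rather than part of the package: the overlap_group values below are typed by hand for illustration (not produced by identify_overlap), and treating any missing date_end as an open-ended span is an assumption.

```r
library(dplyr)

# Small illustrative dataset; overlap_group is hand-written, not package output
flagged <- tibble(
  grp_id        = c(1, 1, 1, 2, 2),
  date_start    = as.Date(c("2020-01-01", "2020-01-01", "2020-01-05",
                            "2020-01-01", "2020-09-21")),
  date_end      = as.Date(c("2020-01-02", "2020-01-04", "2020-09-02",
                            "2020-09-15", NA)),
  overlap_group = c(1L, 1L, 2L, 1L, 2L)
)

# Collapse each (grp_id, overlap_group) pair to a single date span
collapsed <- flagged %>%
  group_by(grp_id, overlap_group) %>%
  summarise(
    date_start = min(date_start),
    # Assumption: any missing end date means the whole span is still open
    date_end   = if (anyNA(date_end)) as.Date(NA) else max(date_end),
    .groups    = "drop"
  )
```

Keeping NA for open-ended spans avoids the Inf values that min() and max() return with na.rm = TRUE on all-NA groups, at the cost of discarding known end dates in groups that mix missing and non-missing values.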


al-obrien/farrago documentation built on April 14, 2023, 6:20 p.m.