identify_overlap | R Documentation |
identify_overlap
Creates a vector that flags rows in a cohort whose timesteps overlap.
identify_overlap(data, grp_id, date_start, date_end, preserve_id = F)
data |
A data object (tibble, data.frame, data.table). |
grp_id |
Unique ID for each member of the cohort (unquoted). |
date_start |
Date format (e.g. YYYY-mm-dd) for entry point for record (unquoted). |
date_end |
Date format (e.g. YYYY-mm-dd) for exit point for record (unquoted). |
preserve_id |
Logical value, if set to |
Data organized as a cohort will typically be in long format, with repeated records for an individual, each covering a particular date span. Sometimes consecutive records overlap (through data entry errors or otherwise); these can be identified so that, when collapsed, only the min/max time points are preserved. This is an important step in ensuring a cohort process has monotonic (i.e. ever-increasing) timesteps.
The logic involves sorting by date_start within each group and checking whether each value is larger than the preceding date_end. When FALSE, an overlap occurs; when TRUE, the flag increments. This function does not perform the collapse itself, as that can have nuanced implications with NA values, but it provides the groupings required to do so. It is recommended to sort the original data by group and dates so that the returned flag aligns correctly. For performance, this function is written primarily in data.table.
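The comparison described above can be sketched roughly as follows. This is not the package's implementation, only a minimal illustration. It assumes a strict "starts after the preceding end" test, numbering that restarts within each group, and toy data without NA dates; the column names new_span and overlap_group are hypothetical.

```r
library(data.table)

# Hypothetical toy data (no NA dates, to keep the comparison simple)
dt <- data.table(
  grp_id     = c(1, 1, 1, 2, 2),
  date_start = as.Date(c("2020-01-01", "2020-01-01", "2020-01-05",
                         "2020-01-01", "2020-09-10")),
  date_end   = as.Date(c("2020-01-02", "2020-01-04", "2020-09-02",
                         "2020-09-15", "2020-09-20"))
)

# Sort by group and dates so the flag lines up with the rows
setorder(dt, grp_id, date_start, date_end)

# TRUE when a record starts after the preceding record in its group ended
# (assumption: a start equal to the preceding end counts as an overlap)
dt[, new_span := date_start > shift(date_end), by = grp_id]

# The first record of each group has no predecessor (NA); start a span there
dt[is.na(new_span), new_span := TRUE]

# A running count of TRUE values gives the grouping within each grp_id
dt[, overlap_group := cumsum(new_span), by = grp_id]

dt$overlap_group
# 1 1 2 1 1
```

Rows that share a grp_id and overlap_group are the ones that would be merged in a later collapse step. Comparing against only the immediately preceding date_end follows the description above; a record nested inside a much longer earlier span may need a running maximum instead.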
A method to find the exact overlapping ranges is to leverage lubridate::interval() and lubridate::intersect().
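As a small illustration of that approach, the sketch below intersects two date spans taken from the example data. The object names span_a, span_b and overlap are hypothetical.

```r
library(lubridate)

# Two overlapping spans for the same individual
span_a <- interval(ymd("2020-01-01"), ymd("2020-09-15"))
span_b <- interval(ymd("2020-09-10"), ymd("2020-09-20"))

# The intersection is itself an interval covering the shared range
overlap <- lubridate::intersect(span_a, span_b)

# Pull out the bounds of the shared range
int_start(overlap)
int_end(overlap)
```

Here the shared range runs from 2020-09-10 to 2020-09-15. When two intervals do not overlap, lubridate::intersect() returns an NA interval.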
An integer vector (ordered by grp_id and dates) or a list containing the original id and collapse id.
intersect
interval
# Load libraries
library(dplyr); library(data.table); library(lubridate); library(magrittr)
# Create fake data for scenarios
test_data <- tribble(~grp_id, ~date_start, ~date_end,
                     1, '2020-01-01', '2020-01-02',
                     1, '2020-01-01', '2020-01-04',
                     1, '2020-01-05', '2020-09-02',
                     2, '2020-01-01', '2020-09-15',
                     2, '2020-09-10', '2020-09-20',
                     2, '2020-09-21', NA,
                     3, '2020-01-01', '2020-01-02',
                     3, '2020-01-02', '2020-01-20',
                     3, '2020-01-15', '2020-01-19',
                     3, '2020-01-22', '2020-04-02',
                     3, '2020-04-22', NA,
                     3, '2021-06-09', '2021-06-22') %>%
  dplyr::mutate_at(vars(contains('date')), ymd)
# Create vector of outputs (ensure original dataset is sorted)
test_data$overlap_group <- identify_overlap(data = test_data,
                                            grp_id = grp_id,
                                            date_start = date_start,
                                            date_end = date_end)
# Find the min/max dates within each overlap grouping
# The is.infinite() checks avoid -/+ Inf on only-NA groupings; can skip if not required
test_data %>%
  group_by(grp_id, overlap_group) %>%
  mutate(min = min(date_start, na.rm = TRUE),
         max = max(date_end, na.rm = TRUE),
         min = if_else(is.infinite(min), NA_Date_, min),
         max = if_else(is.infinite(max), NA_Date_, max))