collapse_timesteps | R Documentation |
collapse_timesteps
Creates a vector that flags rows in a cohort whose subsequent steps fall within a threshold difference (usually measured in days).
collapse_timesteps(
data,
grp_id,
date_start,
date_end,
threshold = 1,
preserve_id = FALSE
)
data |
A data object (tibble or data.frame). |
grp_id |
Unique ID for each member of the cohort (unquoted). |
date_start |
Date format (e.g. YYYY-mm-dd) for entry point for record (unquoted). |
date_end |
Date format (e.g. YYYY-mm-dd) for exit point for record (unquoted). |
threshold |
Integer value for the acceptable difference in days between successive records (defaults to 1). |
preserve_id |
Logical value; if set to TRUE, the original id is returned alongside the collapse id (defaults to FALSE). |
Data organized as a cohort will typically be in long format, with repeated records for an individual, each covering a particular date span.
Often, successive records are very close in time and can be collapsed into a single record to simplify the cohort. The logic compares
the previous record's date_end with the subsequent record's date_start. If the difference between those two dates exceeds the threshold, the records are flagged as
different groups in a progression of cohort steps; otherwise, the two records are flagged as the same group to collapse. To make this comparison, the data provided
is sorted by id and dates. Consequently, the output is also in that order; if joining back to the original dataset, ensure that dataset is sorted by the same columns.
Since the logic requires looping by individuals, the function is written using data.table; however, other versions using dplyr and Rcpp were trialed as well.
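The comparison described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's data.table implementation: the function name `collapse_sketch` is hypothetical, the column names are fixed rather than passed unquoted, and the group counter is assumed to restart at 1 for each individual.

```r
# Hypothetical helper illustrating the gap-based grouping logic (not the package code)
collapse_sketch <- function(data, threshold = 1) {
  # Sort by id and dates, mirroring the ordering described in the details
  data <- data[order(data$grp_id, data$date_start, data$date_end), ]

  # Gap in days between each date_start and the previous row's date_end
  prev_end <- c(as.Date(NA), head(data$date_end, -1))
  gap <- as.numeric(difftime(data$date_start, prev_end, units = "days"))

  # The first row of each individual always opens a new group;
  # later rows open a new group only when the gap exceeds the threshold
  first_row <- !duplicated(data$grp_id)
  new_group <- first_row | gap > threshold

  # Running count of group openings within each individual (assumed numbering)
  ave(as.integer(new_group), data$grp_id, FUN = cumsum)
}

demo <- data.frame(
  grp_id     = c(2, 2, 2, 1, 1),
  date_start = as.Date(c("2020-01-01", "2020-09-10", "2020-09-21",
                         "2020-01-01", "2020-01-03")),
  date_end   = as.Date(c("2020-09-02", "2020-09-20", "2020-09-22",
                         "2020-01-02", "2020-01-04"))
)

groups <- collapse_sketch(demo, threshold = 1)
groups  # 1 1 1 2 2, in sorted order (individual 1 first, then individual 2)
```

Note that the result follows the sorted order rather than the input order, which is why the sort matters when attaching the vector to a data set.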
An integer vector (ordered by grp_id and dates) or a list containing the original id and collapse id.
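Because the returned vector follows the sorted order, one cautious approach is to sort the data the same way before attaching the result. The following is a minimal base R sketch with a hypothetical data frame `cohort`; the sort keys (grp_id, then date_start, then date_end) are an assumption about the ordering the function applies internally.

```r
# Hypothetical, deliberately unsorted cohort data
cohort <- data.frame(
  grp_id     = c(2, 1, 1),
  date_start = as.Date(c("2020-01-01", "2020-01-03", "2020-01-01")),
  date_end   = as.Date(c("2020-09-02", "2020-01-04", "2020-01-02"))
)

# Sort by id and dates so that row order matches the returned vector
cohort_sorted <- cohort[order(cohort$grp_id, cohort$date_start, cohort$date_end), ]
rownames(cohort_sorted) <- NULL

# With the package loaded, the flag could then be attached row by row:
# cohort_sorted$timestep_group <- collapse_timesteps(cohort_sorted, grp_id,
#                                                    date_start, date_end)
```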
# Load libraries
library(dplyr); library(data.table); library(lubridate); library(magrittr); library(tibble)
# Create fake data for scenarios
test_data <- tribble(~grp_id, ~date_start, ~date_end,
                     1, '2020-01-01', '2020-01-02',
                     1, '2020-01-03', '2020-01-04',
                     1, '2020-01-04', '2020-09-02',
                     2, '2020-01-01', '2020-09-02',
                     2, '2020-09-10', '2020-09-20',
                     2, '2020-09-21', '2020-09-22',
                     3, '2020-01-01', '2020-01-02',
                     3, '2020-01-02', '2020-01-20',
                     3, '2020-01-21', '2020-01-22',
                     3, '2020-01-22', '2020-04-02',
                     3, '2020-04-22', '2021-04-22',
                     3, '2021-06-09', '2021-06-22') %>%
  dplyr::mutate_at(vars(contains('date')), ymd)
# Create vector of outputs (ensure original dataset is sorted)
test_data$timestep_group <- collapse_timesteps(data = test_data,
                                               grp_id = grp_id,
                                               date_start = date_start,
                                               date_end = date_end,
                                               threshold = 1)
test_data %>%
  group_by(grp_id, timestep_group) %>%
  mutate(min = min(date_start),
         max = max(date_end))
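Once rows are flagged, each group can be reduced to a single record spanning its earliest start and latest end. The sketch below uses base R and a hypothetical data frame `flagged`, whose `timestep_group` values are filled in by hand for illustration rather than produced by the function; a grouped `dplyr::summarise()` would be an equivalent route.

```r
# Hypothetical flagged data; timestep_group is hand-assigned for illustration
flagged <- data.frame(
  grp_id         = c(1, 1, 1, 2, 2, 2),
  date_start     = as.Date(c("2020-01-01", "2020-01-03", "2020-01-04",
                             "2020-01-01", "2020-09-10", "2020-09-21")),
  date_end       = as.Date(c("2020-01-02", "2020-01-04", "2020-09-02",
                             "2020-09-02", "2020-09-20", "2020-09-22")),
  timestep_group = c(1L, 1L, 1L, 1L, 2L, 2L)
)

# Split into one piece per (individual, group) pair, dropping empty combinations
pieces <- split(flagged, list(flagged$grp_id, flagged$timestep_group), drop = TRUE)

# Keep the earliest start and latest end within each piece
collapsed <- do.call(rbind, lapply(pieces, function(p) {
  data.frame(grp_id         = p$grp_id[1],
             timestep_group = p$timestep_group[1],
             date_start     = min(p$date_start),
             date_end       = max(p$date_end))
}))

# Restore id and group order, and drop the split-generated row names
collapsed <- collapsed[order(collapsed$grp_id, collapsed$timestep_group), ]
rownames(collapsed) <- NULL
collapsed
```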