knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
This vignette explains how to use the following functions:
* calc_futime() calculates the follow-up time from the index event until the next event, death, or the end-of-follow-up date
* pat_status() determines patient status at the end of follow-up
* renumber_time_id() calculates a consecutive index of events per case ID
* reshape_long() transposes a dataset from wide format to long format
* reshape_wide() transposes a dataset from long format to wide format (the wide format is required by many package functions)
* sir_byfutime() calculates standardized incidence ratios (SIRs) with custom grouping variables, stratified by follow-up time
* summarize_sir_results() summarizes the detailed SIR results produced by sir_byfutime()
* vital_status() determines whether the patient is alive or dead at the end of follow-up

Some functions come in multiple variants that use different frameworks, for example renumber_time_id() and renumber_time_id_dt(), or reshape_wide(), reshape_wide_dt() and reshape_wide_tidyr(). The variants give the same results but differ in execution time and memory use.
Run the following steps in the order given to obtain accurate follow-up time calculations:
Filter the long version of the dataset so that it contains all cases relevant for your analysis. Make sure that:
* for each case_id, the index event (e.g. First Cancer, FC) is still included and is the remaining row with the smallest time_id (TUMID3 variable for ZfKD data, and SEQ_NUM for SEER data)
* if a case_id has a countable incident event (e.g. Second Primary Cancer, SPC) and it is to be counted, this event is the second entry per case_id (second smallest time_id). Not every case_id will have such an event
* in the long version of the dataset, a count_var indicates whether the countable incident event (SPC) has occurred. It is coded 0 for non-occurrence (or an event that is not counted) and 1 for a counted incident event
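The filtering requirements above can be checked before moving on. The following is a minimal base R sketch on a toy dataset; the column names case_id, time_id and count_var are placeholders, and the third check rests on the assumption that an index event never carries count_var = 1. It is not part of msSPChelpR.

```r
# Toy long-format data; all column names are illustrative assumptions
long_df <- data.frame(
  case_id   = c(1, 1, 2, 3, 3),
  time_id   = c(1, 2, 1, 2, 3),   # case 3 lost its index event during filtering
  count_var = c(0, 1, 0, 1, 0)
)

# Check 1: count_var is coded 0/1 only
count_ok <- all(long_df$count_var %in% c(0, 1))

# Check 2: no duplicated time_id within a case
ids_unique <- !any(duplicated(long_df[, c("case_id", "time_id")]))

# Check 3 (assumption): the first remaining row per case should be the index
# event, so it should not carry count_var = 1
ordered_df <- long_df[order(long_df$case_id, long_df$time_id), ]
first_rows <- ordered_df[!duplicated(ordered_df$case_id), ]
suspect_cases <- first_rows$case_id[first_rows$count_var == 1]

count_ok
ids_unique
suspect_cases
```

Cases listed in suspect_cases are worth inspecting by hand, since their index event may have been removed by the filter.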
Renumber the filtered long dataset: On the filtered long dataset, run the helper function msSPChelpR::renumber_time_id_dt() (or the non-data.table variant msSPChelpR::renumber_time_id()). It renumbers all events per case_id and, if the filtering conditions above are met, assigns time_var_new = 1 to each index event and time_var_new = 2 to each second event (the possibly countable incident event). SIR-related functions count the second event only if count_var = 1 also holds for the row with time_var_new = 2.
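To illustrate the idea behind renumbering, here is a base R sketch on toy data. It is not the package implementation, and the column names (including time_id_new and counted) are hypothetical.

```r
# Toy long-format data with gaps in time_id left over from filtering
long_df <- data.frame(
  case_id   = c(1, 1, 2, 3, 3),
  time_id   = c(1, 3, 2, 2, 5),
  count_var = c(0, 1, 0, 0, 1)
)

# Sort by case and original time_id, then number events 1, 2, ... within each case
long_df <- long_df[order(long_df$case_id, long_df$time_id), ]
long_df$time_id_new <- ave(long_df$time_id, long_df$case_id, FUN = seq_along)

# An event is counted only if it is the second event AND count_var = 1
long_df$counted <- long_df$time_id_new == 2 & long_df$count_var == 1

long_df
```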
Reshape the dataset: Run msSPChelpR::reshape_wide_dt() or the non-data.table variant msSPChelpR::reshape_wide() so that the dataset is transposed to wide format (one row per case_id, creating variables such as count_var.2).
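The resulting layout can be previewed with base R's stats::reshape(). This sketch uses toy data with hypothetical column names and only illustrates the naming pattern (variable name, dot, time_id); the package functions may differ in column order and options.

```r
# Toy long-format data, already renumbered per case
long_df <- data.frame(
  case_id   = c(1, 1, 2, 3, 3, 3),
  time_id   = c(1, 2, 1, 1, 2, 3),
  site      = c("C34", "C50", "C34", "C34", "C18", "C61"),
  count_var = c(0, 1, 0, 0, 1, 0),
  stringsAsFactors = FALSE
)

# One row per case_id; time-varying columns get the time_id as suffix
wide_df <- reshape(
  long_df,
  idvar     = "case_id",
  timevar   = "time_id",
  direction = "wide"
)

names(wide_df)
wide_df
```

Cases without a second event get NA in the columns ending in .2, which is what the p_spc flag in the next step relies on.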
Set the flag for a Second Primary Cancer diagnosis: After filtering and reshaping, it is essential to set p_spc again. This variable is used by later steps of the analysis.
Determine patient status at a defined end of follow-up with the msSPChelpR::pat_status() function. The end-of-follow-up date must:
* be set with the fu_end = parameter
* precede the end of data collection. For example, if the last incident events in your dataset were collected at the end of 2014, fu_end must be fu_end = "2014-12-15" or earlier.
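A simple guard can catch an fu_end that lies after the end of data collection. This is a sketch with hypothetical values; data_collection_end is not a package parameter, and whether the boundary date itself is acceptable is an assumption here.

```r
# Hypothetical values: last date covered by the data, and the chosen fu_end
data_collection_end <- as.Date("2014-12-31")
fu_end <- as.Date("2014-12-15")

# fu_end must not lie after the end of data collection
fu_end_valid <- fu_end <= data_collection_end
if (!fu_end_valid) {
  stop("fu_end lies after the end of data collection")
}

fu_end_valid
```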
Based on the newly calculated patient status, you might want to exclude cases for which patient status cannot be determined.
Calculate follow-up time for the same dataset with the msSPChelpR::calc_futime() function, using the same fu_end as for msSPChelpR::pat_status(). By default, all functions of the msSPChelpR package require follow-up times as numeric years.
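Conceptually, follow-up time runs from the index event to whichever comes first: the next event, death, or fu_end. The base R sketch below shows that logic on toy data. The column names are hypothetical, and dividing by 365.25 is an assumed day-to-year conversion; calc_futime() may convert differently.

```r
# Toy wide-format data; column names are illustrative assumptions
wide_df <- data.frame(
  case_id    = c(1, 2, 3),
  fc_date    = as.Date(c("2010-01-01", "2015-06-01", "2016-01-01")),
  spc_date   = as.Date(c(NA, "2016-06-01", NA)),
  death_date = as.Date(c("2012-01-01", NA, NA))
)
fu_end <- as.Date("2017-12-31")

# End of observation: earliest of SPC diagnosis, death, and fu_end
# (dates are compared as day counts; missing dates are ignored)
end_num <- pmin(
  as.numeric(wide_df$spc_date),
  as.numeric(wide_df$death_date),
  as.numeric(fu_end),
  na.rm = TRUE
)

futime_days <- end_num - as.numeric(wide_df$fc_date)
wide_df$futime_yrs <- futime_days / 365.25   # assumed conversion to years

wide_df
```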
To calculate SIRs with the package functions, the following data structure is needed:
* Wide format data wide_df with one row per patient who has experienced the index event (i.e. was diagnosed with a first primary cancer, FC)
* wide_df needs to contain the following variables (columns) per patient (row):
  * region_var - variable in df that contains information on the region where the case was incident.
  * agegroup_var - variable in df that contains information on age group.
  * sex_var - variable in df that contains information on biological sex.
  * year_var - variable in df that contains information on the year or year period when the case was incident.
  * site_var - variable in df that contains information on the diagnosis of the case (count event). Cases are usually the second cancers. Diagnoses can use any coding system (e.g. ICD), but the coding system must be coherent between the dataset and the reference data.
  * futime_var - variable in df that contains the follow-up time per person in years, from the date of first cancer to death, the date of the event (case), or the end-of-FU date, whichever comes first. If you have not calculated the FU time yet, you can use the workflow described in the previous chapter.

If your data has the required structure, you can calculate and summarize SIRs with the following two steps:
* Calculate SIRs with the msSPChelpR::sir_byfutime() function. This calculation usually requires a reference dataset refrates_df that defines the population standard rates. refrates_df must use the same category coding of age, sex, region, year and cancer_site as agegroup_var, sex_var, region_var, year_var and site_var. The theory behind calculating stratified SIRs is explained in the chapter on the basics of SIRs.
* Summarize the SIR results by running the msSPChelpR::summarize_sir_results() function on the stratified SIR results produced by the previous step.
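As a rough illustration of what an SIR is, the base R sketch below divides observed events by the events expected from reference rates and adds a commonly used exact Poisson confidence interval. All data, column names and the per-100,000 rate scale are hypothetical, and the interval method is an assumption rather than a description of sir_byfutime().

```r
# Hypothetical stratum-level input: person-years at risk and observed SPCs
strata <- data.frame(
  agegroup = c("60-64", "65-69", "60-64", "65-69"),
  sex      = c("Male", "Male", "Female", "Female"),
  pyar     = c(1200, 800, 1500, 900),
  observed = c(6, 5, 4, 5)
)

# Hypothetical reference rates per 100,000 person-years, same category coding
refrates <- data.frame(
  agegroup      = c("60-64", "65-69", "60-64", "65-69"),
  sex           = c("Male", "Male", "Female", "Female"),
  rate_per_100k = c(250, 400, 150, 300)
)

merged <- merge(strata, refrates, by = c("agegroup", "sex"), all.x = TRUE)
# A stratum without a matching reference rate signals incoherent coding
stopifnot(!anyNA(merged$rate_per_100k))

# Expected events = person-years x reference rate, summed over strata
merged$expected <- merged$pyar * merged$rate_per_100k / 1e5
O <- sum(merged$observed)
E <- sum(merged$expected)
sir <- O / E

# Exact Poisson confidence limits for O, divided by E
alpha <- 0.05
ci_lower <- qchisq(alpha / 2, 2 * O) / (2 * E)
ci_upper <- qchisq(1 - alpha / 2, 2 * (O + 1)) / (2 * E)

c(observed = O, expected = E, sir = sir, lower = ci_lower, upper = ci_upper)
```

The merge step is where mismatched category coding between the data and the reference rates would show up as missing rates.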
In the next version of this vignette, this chapter will explain the theoretical considerations behind how SIRs are calculated.
library(dplyr)
library(magrittr)
library(msSPChelpR)

#Load synthetic dataset of patients with cancer to demonstrate package functions
data("us_second_cancer")

#This dataset is in long format, so each tumor is a separate row in the data
us_second_cancer
#filter for lung cancer
ids <- us_second_cancer %>%
  #detect ids with any lung cancer
  filter(t_site_icd == "C34") %>%
  select(fake_id) %>%
  as.vector() %>%
  unname() %>%
  unlist()

filtered_usdata <- us_second_cancer %>%
  #filter according to above detected ids with any lung cancer diagnosis
  filter(fake_id %in% ids) %>%
  arrange(fake_id)

filtered_usdata
#renumber the time_id so that events are numbered consecutively per case
renumbered_usdata <- filtered_usdata %>%
  renumber_time_id(new_time_id_var = "t_tumid",
                   dattype = "seer",
                   case_id_var = "fake_id")

renumbered_usdata %>%
  select(fake_id, sex, t_site_icd, t_datediag, t_tumid)
usdata_wide <- renumbered_usdata %>%
  reshape_wide_tidyr(case_id_var = "fake_id",
                     time_id_var = "t_tumid",
                     timevar_max = 10)

#now the data is in the wide format as required by many package functions.
#This means each case is a row, and several tumors per case ID are
#added as new columns to the data, using the time_id as column name suffix.
usdata_wide
#set the p_spc flag based on whether a second tumor was recorded
usdata_wide <- usdata_wide %>%
  dplyr::mutate(p_spc = dplyr::case_when(is.na(t_site_icd.2)  ~ "No SPC",
                                         !is.na(t_site_icd.2) ~ "SPC developed",
                                         TRUE                 ~ NA_character_)) %>%
  #create the same information as numeric variable count_spc
  #(0 = no SPC, 1 = SPC developed)
  dplyr::mutate(count_spc = dplyr::case_when(is.na(t_site_icd.2) ~ 0,
                                             TRUE                ~ 1))

usdata_wide %>%
  dplyr::select(fake_id, sex.1, p_spc, count_spc, t_site_icd.1, t_datediag.1,
                t_site_icd.2, t_datediag.2)
usdata_wide <- usdata_wide %>%
  pat_status(., fu_end = "2017-12-31", dattype = "seer",
             status_var = "p_status", life_var = "p_alive.1",
             spc_var = "p_spc", birthdat_var = "datebirth.1",
             lifedat_var = "datedeath.1", fcdat_var = "t_datediag.1",
             spcdat_var = "t_datediag.2",
             life_stat_alive = "Alive", life_stat_dead = "Dead",
             spc_stat_yes = "SPC developed", spc_stat_no = "No SPC",
             lifedat_fu_end = "2019-12-31", use_lifedatmin = FALSE,
             check = TRUE, as_labelled_factor = TRUE)

usdata_wide %>%
  dplyr::select(fake_id, p_status, p_alive.1, datedeath.1, t_site_icd.1,
                t_datediag.1, t_site_icd.2, t_datediag.2)

#alternatively, you can impute the date of death using lifedatmin_var
usdata_wide %>%
  pat_status(., fu_end = "2017-12-31", dattype = "seer",
             status_var = "p_status", life_var = "p_alive.1",
             spc_var = "p_spc", birthdat_var = "datebirth.1",
             lifedat_var = "datedeath.1", fcdat_var = "t_datediag.1",
             spcdat_var = "t_datediag.2",
             life_stat_alive = "Alive", life_stat_dead = "Dead",
             spc_stat_yes = "SPC developed", spc_stat_no = "No SPC",
             lifedat_fu_end = "2019-12-31", use_lifedatmin = TRUE,
             lifedatmin_var = "p_dodmin.1",
             check = TRUE, as_labelled_factor = TRUE)
usdata_wide <- usdata_wide %>%
  dplyr::filter(!p_status %in% c("NA - Patient not born before end of FU",
                                 "NA - Patient did not develop cancer before end of FU",
                                 "NA - Patient date of death is missing"))

usdata_wide %>%
  dplyr::count(p_status)
usdata_wide <- usdata_wide %>%
  calc_futime(., futime_var_new = "p_futimeyrs", fu_end = "2017-12-31",
              dattype = "seer", time_unit = "years",
              lifedat_var = "datedeath.1",
              fcdat_var = "t_datediag.1",
              spcdat_var = "t_datediag.2")

usdata_wide %>%
  dplyr::select(fake_id, p_status, p_futimeyrs, p_alive.1, datedeath.1,
                t_datediag.1, t_datediag.2)
sircalc_results <- usdata_wide %>%
  sir_byfutime(
    dattype = "seer",
    ybreak_vars = c("race.1", "t_dco.1"),
    xbreak_var = "none",
    futime_breaks = c(0, 1/12, 2/12, 1, 5, 10, Inf),
    count_var = "count_spc",
    refrates_df = us_refrates_icd2,
    calc_total_row = TRUE,
    calc_total_fu = TRUE,
    region_var = "registry.1",
    age_var = "fc_agegroup.1",
    sex_var = "sex.1",
    year_var = "t_yeardiag.1",
    race_var = "race.1",
    site_var = "t_site_icd.1", #using grouping by second cancer incidence
    futime_var = "p_futimeyrs",
    alpha = 0.05)

sircalc_results %>%
  print(n = 100)
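To make the role of futime_breaks more concrete, the following base R sketch splits one patient's follow-up time across the intervals defined by the breaks used above. The helper split_futime() is hypothetical, and whether sir_byfutime() allocates person-time in exactly this way internally is an assumption.

```r
# Same breaks as in the sir_byfutime() call above (in years)
futime_breaks <- c(0, 1/12, 2/12, 1, 5, 10, Inf)
lower <- head(futime_breaks, -1)   # lower bound of each interval
upper <- tail(futime_breaks, -1)   # upper bound of each interval

# Hypothetical helper: person-time that one patient with `futime` years of
# follow-up contributes to each interval
split_futime <- function(futime) {
  pmax(0, pmin(futime, upper) - lower)
}

# A patient followed for 3 years contributes to the first four intervals only
contrib <- split_futime(3)
contrib
```

The contributions always add up to the patient's total follow-up time, so no person-time is lost or counted twice.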
#The summarize function is versatile. Here is one example:
sircalc_results %>%
  #summarize results across region, age, year and t_site
  summarize_sir_results(.,
                        summarize_groups = c("region", "age", "year", "race"),
                        summarize_site = TRUE,
                        output = "long", output_information = "minimal",
                        add_total_row = "only", add_total_fu = "no",
                        collapse_ci = FALSE, shorten_total_cols = TRUE,
                        fubreak_var_name = "fu_time", ybreak_var_name = "yvar_name",
                        xbreak_var_name = "none", site_var_name = "t_site",
                        alpha = 0.05
  ) %>%
  dplyr::select(-region, -age, -year, -race, -sex, -yvar_name)
sessionInfo()