Introduction to the msSPChelpR package - from long dataset to SIR analyses
In msSPChelpR: Helper Functions for Second Primary Cancer Analyses

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

This vignette explains how to use the functions:

calc_futime() to calculate follow-up time from index event until next event, death or end of follow-up date
pat_status() to determine patient status at end of follow-up
renumber_time_id() to calculate a consecutive index of events per case ID
reshape_long() to transpose dataset in wide format to data in long format
reshape_wide() to transpose dataset in long format to data in wide format (the wide format is required for many package functions)
sir_byfutime() to calculate standardized incidence ratios (SIRs) with custom grouping variables stratified by follow-up time
summarize_sir_results() to summarize detailed SIR results produced by sir_byfutime()
vital_status() to determine vital status whether patient is alive or dead at end of follow-up

Some functions come in several variants that are built on different frameworks. The variants return the same results but differ in execution time and memory use:

tidytable variants carry the suffix _tt()
tidyr variants carry the suffix _tidyr()

Recommended Workflows

Calculate follow-up times {#wf-calc-fu}

Run the following steps in the order given to obtain accurate follow-up times:

1. Use the long version of the dataset.
2. Filter all cases in the long version of the dataset that are relevant for your analysis. Make sure that:
    * for each case_id the index event (e.g. First Cancer FC) is still included and is the remaining row with the smallest time_id (TUMID3 variable for ZfKD data, and SEQ_NUM for SEER data)
    * each case_id may or may not have a countable incident event (e.g. Second Primary Cancer SPC). If it is to be counted, this event should be the second entry per case_id (second smallest time_id)
    * a count_var in the long dataset indicates whether the countable incident event (SPC) has occurred. It is coded 0 for non-occurrence (or an event that is not counted) and 1 for a counted incident event.
3. Renumber the filtered long dataset: Run the helper function msSPChelpR::renumber_time_id_dt() (or the non-data.table variant msSPChelpR::renumber_time_id()) on the filtered long dataset. It renumbers all events per case_id and (if step 2 is fulfilled) assigns time_var_new = 1 to each index event and time_var_new = 2 to each second event (the possibly countable incident event). Any SIR-related function counts the second event only if the row has count_var = 1 in addition to time_var_new = 2.
4. Reshape the dataset: Run msSPChelpR::reshape_wide_dt() or the non-data.table variant msSPChelpR::reshape_wide(), so that the dataset is transposed to wide format (1 row per case_id, creating variables such as count_var.2).
5. Set the flag for Second Primary Cancer diagnosis: After filtering and reshaping it is essential to set p_spc again. This variable is used by later steps of the analysis.
6. Determine patient status at a defined end of follow-up by using the msSPChelpR::pat_status() function. The date for end of follow-up must:
    * be in "YYYY-MM-DD" format. It is always defined via the fu_end = parameter
    * precede the end of data collection. E.g. if the last incident events in the dataset you are using were collected at the end of 2014, your fu_end must be fu_end = "2014-12-15" or earlier.
7. Based on the newly calculated patient status, you might want to exclude cases for which patient status cannot be determined.
8. Calculate follow-up time for the same dataset by using the msSPChelpR::calc_futime() function and the same fu_end as for step 6. By default, all functions of the msSPChelpR package require follow-up times as numeric years.
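
The renumbering in step 3 can be pictured with a few lines of base R. This is only a sketch on made-up data, not the package implementation; the column names case_id, time_id, time_id_new and counted are chosen for illustration.

```r
# Toy long-format data: one row per tumor (all names and values are made up)
long_df <- data.frame(
  case_id   = c(1, 1, 1, 2, 3, 3),
  time_id   = c(1, 3, 4, 1, 1, 2),   # case 1 has gaps left behind by filtering
  count_var = c(0, 1, 1, 0, 0, 1)
)

# Sort by case and original event order, then renumber consecutively per case
long_df <- long_df[order(long_df$case_id, long_df$time_id), ]
long_df$time_id_new <- ave(long_df$time_id, long_df$case_id, FUN = seq_along)

# Only the second event of a case is a candidate for counting,
# and only if its count_var is 1
long_df$counted <- long_df$time_id_new == 2 & long_df$count_var == 1
long_df
```

In this toy data the third tumor of case 1 has count_var = 1 but is not counted, because it is not the second event of that case.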

Calculate Standardized Incidence Ratios (SIR)

In order to calculate SIR using the package functions, the following data structure is needed:

* Wide format data wide_df with one row per patient that has encountered the index event (i.e. diagnosed with a first primary cancer FC)

The dataset wide_df needs to contain the following variables (columns) per patient (row):
region_var - variable in wide_df that contains information on the region where the case was incident.
age_var - variable in wide_df that contains information on age group.
sex_var - variable in wide_df that contains information on biological sex.
year_var - variable in wide_df that contains information on the year or year-period when the case was incident.
site_var - variable in wide_df that contains information on the diagnosis of the case (count event). Cases are usually the second cancers. Diagnoses can use any coding system (e.g. ICD), but the coding system must be consistent between the dataset and the reference data.
futime_var - variable in wide_df that contains the follow-up time per person in years, from the date of the first cancer until death, the date of the event (case) or the end of FU date, whichever comes first. If you have not calculated the FU time yet, you can use the workflow described in the previous chapter.
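
Before starting the SIR calculation it can help to verify that all required columns are present. The following base R sketch assumes the column names used in the SEER example later in this vignette; the data values are made up and the check itself is not part of the package.

```r
# Column names as used in the SEER example below; adapt them to your data
required_cols <- c("registry.1", "fc_agegroup.1", "sex.1",
                   "t_yeardiag.1", "t_site_icd.1", "p_futimeyrs")

# Made-up wide dataset with one row per patient
wide_df <- data.frame(
  registry.1    = c("Region A", "Region B"),
  fc_agegroup.1 = c("60 - 64", "70 - 74"),
  sex.1         = c("Male", "Female"),
  t_yeardiag.1  = c("2000 - 2004", "2005 - 2009"),
  t_site_icd.1  = c("C34", "C34"),
  p_futimeyrs   = c(3.2, 0.7),
  check.names   = FALSE
)

# Stop early if a required column is missing
missing_cols <- setdiff(required_cols, names(wide_df))
if (length(missing_cols) > 0) {
  stop("Missing required columns: ", paste(missing_cols, collapse = ", "))
}

# Follow-up time must be numeric (years) and non-negative
stopifnot(is.numeric(wide_df$p_futimeyrs),
          all(wide_df$p_futimeyrs >= 0, na.rm = TRUE))
```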

If your data has the required structure, you can calculate and summarize SIRs with the following two steps:

1. Calculate SIR per SPC diagnosis with age, sex, region, period-specific strata using the msSPChelpR::sir_byfutime() function. This calculation usually requires a reference dataset refrates_df that defines the population standard rates. refrates_df must use the same category coding of age, sex, region, year and cancer_site as age_var, sex_var, region_var, year_var and site_var.
    * The theory behind calculating stratified SIRs is covered in the chapter Theory behind SIRs.
2. Summarize SIR results using the msSPChelpR::summarize_sir_results() function on the stratified SIR results produced by the previous step.
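
Why the category coding of the reference rates must match can be seen in a small base R sketch. It is not the package implementation: the column names, the numbers and the rate unit (per 100,000 person-years) are assumptions made for this illustration.

```r
# Toy cohort strata: observed events and person-years at risk (made-up numbers)
cohort <- data.frame(
  agegroup    = c("60 - 64", "65 - 69", "60 - 64", "65 - 69"),
  sex         = c("Male", "Male", "Female", "Female"),
  observed    = c(4, 6, 2, 3),
  personyears = c(1200, 900, 1500, 1100)
)

# Toy reference rates per 100,000 person-years with identical category coding
refrates <- data.frame(
  agegroup      = c("60 - 64", "65 - 69", "60 - 64", "65 - 69"),
  sex           = c("Male", "Male", "Female", "Female"),
  rate_per_100k = c(250, 400, 100, 150)
)

# Join on the stratum variables; a coding mismatch would show up as NA rates
strata <- merge(cohort, refrates, by = c("agegroup", "sex"), all.x = TRUE)
stopifnot(!anyNA(strata$rate_per_100k))

# Expected events per stratum, then the overall ratio observed / expected
strata$expected <- strata$personyears * strata$rate_per_100k / 1e5
sir <- sum(strata$observed) / sum(strata$expected)
sir
```

If, for example, the cohort coded sex as "M" and "F" while the reference rates used "Male" and "Female", the merge would leave the rates empty and no expected counts could be computed.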

Theory behind SIRs {#theo-sir}

The next version of this vignette will explain in this chapter how SIRs are calculated.
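
As orientation until then, the standard textbook definition of the SIR and of an exact Poisson confidence interval is given here. The notation is introduced for this illustration only ($O$ observed events, $T_s$ person-time at risk in stratum $s$, $\lambda_s$ reference rate in stratum $s$), and whether the package computes its confidence limits in exactly this way is an assumption.

```latex
\[
\mathrm{SIR} = \frac{O}{E}, \qquad E = \sum_{s} T_s \, \lambda_s
\]
\[
\mathrm{SIR}_{\mathrm{lower}} = \frac{\chi^2_{\alpha/2,\; 2O}}{2E}, \qquad
\mathrm{SIR}_{\mathrm{upper}} = \frac{\chi^2_{1-\alpha/2,\; 2(O+1)}}{2E}
\]
```

Here $\chi^2_{p,\,k}$ denotes the $p$-quantile of the chi-squared distribution with $k$ degrees of freedom.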

Examples

SEER lung cancer

Step 1 - Long dataset {#step-long}

library(dplyr)
library(magrittr)
library(msSPChelpR)
#Load synthetic dataset of patients with cancer to demonstrate package functions
data("us_second_cancer")

#This dataset is in long format, so each tumor is a separate row in the data
us_second_cancer

Step 2 - Filter long dataset {#step-filter}

#filter for lung cancer
ids <- us_second_cancer %>%
  #detect ids with any lung cancer
  filter(t_site_icd == "C34") %>%
  #extract the ids as a plain vector
  pull(fake_id)

filtered_usdata <- us_second_cancer %>%
  #filter according to above detected ids with any lung cancer diagnosis
  filter(fake_id %in% ids) %>%
   arrange(fake_id)

filtered_usdata

Step 3 - Renumber `time_id` {#step-renumber}

renumbered_usdata <- filtered_usdata %>%
  renumber_time_id(new_time_id_var = "t_tumid", 
                   dattype = "seer",
                   case_id_var = "fake_id")

renumbered_usdata %>%
   select(fake_id, sex, t_site_icd, t_datediag, t_tumid)

Step 4 - Reshape to wide dataset {#step-reshape-wide}

usdata_wide <- renumbered_usdata %>%
  reshape_wide_tidyr(case_id_var = "fake_id", time_id_var = "t_tumid", timevar_max = 10)

#now the data is in the wide format as required by many package functions. 
#This means that each case is one row and additional tumors per case ID are 
#added as new columns to the data, using the time_id as column name suffix.
usdata_wide
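
What the reshaping does can be reproduced on a toy dataset with stats::reshape() from base R. This sketch only illustrates the resulting column layout with the time_id suffix; the data are made up and it is not how the package functions are implemented.

```r
# Toy long dataset: case 1 has two tumors, case 2 has one
long_df <- data.frame(
  fake_id    = c(1, 1, 2),
  t_tumid    = c(1, 2, 1),
  t_site_icd = c("C34", "C50", "C34"),
  t_datediag = as.Date(c("2001-03-01", "2005-06-15", "2003-07-20"))
)

# One row per case; tumor-level variables get the time_id as suffix
wide_df <- reshape(long_df, idvar = "fake_id", timevar = "t_tumid",
                   direction = "wide", sep = ".")
names(wide_df)
wide_df
```

Case 2 has no second tumor, so its t_site_icd.2 and t_datediag.2 are NA.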

Step 5 - Recalculate `p_spc` {#step-spc}

usdata_wide <- usdata_wide %>%
  dplyr::mutate(p_spc = dplyr::case_when(is.na(t_site_icd.2)   ~ "No SPC",
                         !is.na(t_site_icd.2)           ~ "SPC developed",
                         TRUE ~ NA_character_)) %>%
  #create the same information as numeric variable count_spc (1 = SPC developed, 0 = no SPC)
  dplyr::mutate(count_spc = dplyr::case_when(is.na(t_site_icd.2)   ~ 0,
                            TRUE ~ 1))
usdata_wide %>%
   dplyr::select(fake_id, sex.1, p_spc, count_spc, t_site_icd.1, 
                 t_datediag.1, t_site_icd.2, t_datediag.2)
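
A quick cross-tabulation can confirm that the text flag and the numeric count variable agree. The sketch below uses made-up data and follows the coding described in the workflow (0 for non-occurrence, 1 for a counted event).

```r
# Toy wide dataset: second tumor site is NA when no SPC was diagnosed
wide_df <- data.frame(
  fake_id      = 1:4,
  t_site_icd.2 = c(NA, "C50", NA, "C18")
)

wide_df$p_spc     <- ifelse(is.na(wide_df$t_site_icd.2), "No SPC", "SPC developed")
wide_df$count_spc <- as.numeric(!is.na(wide_df$t_site_icd.2))

# Every "SPC developed" row should have count_spc == 1, every "No SPC" row 0
check_tab <- table(wide_df$p_spc, wide_df$count_spc)
check_tab
```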

Step 6 - Determine patient status at end of FU {#step-pat-status}

usdata_wide <- usdata_wide %>%
  pat_status(., fu_end = "2017-12-31", dattype = "seer",
             status_var = "p_status", life_var = "p_alive.1",
             spc_var = "p_spc", birthdat_var = "datebirth.1",
             lifedat_var = "datedeath.1", fcdat_var = "t_datediag.1",
             spcdat_var = "t_datediag.2", life_stat_alive = "Alive",
             life_stat_dead = "Dead", spc_stat_yes = "SPC developed",
             spc_stat_no = "No SPC", lifedat_fu_end = "2019-12-31",
             use_lifedatmin = FALSE, check = TRUE, 
             as_labelled_factor = TRUE)

usdata_wide %>%
   dplyr::select(fake_id, p_status, p_alive.1, datedeath.1, t_site_icd.1, t_datediag.1, 
                 t_site_icd.2, t_datediag.2)

#alternatively, you can impute the date of death using lifedatmin_var
usdata_wide %>%
  pat_status(., fu_end = "2017-12-31", dattype = "seer",
             status_var = "p_status", life_var = "p_alive.1",
             spc_var = "p_spc", birthdat_var = "datebirth.1",
             lifedat_var = "datedeath.1", fcdat_var = "t_datediag.1",
             spcdat_var = "t_datediag.2", life_stat_alive = "Alive",
             life_stat_dead = "Dead", spc_stat_yes = "SPC developed",
             spc_stat_no = "No SPC", lifedat_fu_end = "2019-12-31",
             use_lifedatmin = TRUE, lifedatmin_var = "p_dodmin.1", 
             check = TRUE, as_labelled_factor = TRUE)

Step 6b - Remove patients irrelevant to analysis depending on status {#step-filter-pat-status}

usdata_wide <- usdata_wide %>%
  dplyr::filter(!p_status %in% c("NA - Patient not born before end of FU",
                                 "NA - Patient did not develop cancer before end of FU",
                                 "NA - Patient date of death is missing"))

usdata_wide %>%
  dplyr::count(p_status)

Step 7 - Calculate FU time {#step-calc-futime}

usdata_wide <- usdata_wide %>%
   calc_futime(., futime_var_new = "p_futimeyrs", fu_end = "2017-12-31",
               dattype = "seer", time_unit = "years", 
               lifedat_var = "datedeath.1", 
               fcdat_var = "t_datediag.1", spcdat_var = "t_datediag.2")

usdata_wide %>%
   dplyr::select(fake_id, p_status, p_futimeyrs, p_alive.1, datedeath.1, t_datediag.1, t_datediag.2)
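
The idea behind the follow-up time can be sketched in base R: observation ends at the earliest of SPC diagnosis, death and fu_end. The data are made up, and the conversion from days to years with a divisor of 365.25 is an assumption of this sketch, not necessarily what calc_futime() does.

```r
fu_end <- as.Date("2017-12-31")

# Toy patients: first cancer date, optional SPC date, optional date of death
toy <- data.frame(
  fc_date    = as.Date(c("2010-01-01", "2012-06-30", "2015-03-15")),
  spc_date   = as.Date(c("2014-01-01", NA, NA)),
  death_date = as.Date(c(NA, "2016-06-30", NA))
)

# End of observation: earliest of SPC diagnosis, death and fu_end
end_date <- pmin(toy$spc_date, toy$death_date, fu_end, na.rm = TRUE)

# Follow-up time in years (365.25 days per year is an assumption)
toy$futime_yrs <- as.numeric(end_date - toy$fc_date) / 365.25
toy
```

The first patient is followed until the SPC diagnosis, the second until death, and the third is censored at fu_end.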

Step 8 - Calculate SIR {#step-sir-byfutime}

sircalc_results <- usdata_wide %>%
  sir_byfutime(
    dattype = "seer",
    ybreak_vars = c("race.1", "t_dco.1"),
    xbreak_var = "none",
    futime_breaks = c(0, 1/12, 2/12, 1, 5, 10, Inf),
    count_var = "count_spc",
    refrates_df = us_refrates_icd2,
    calc_total_row = TRUE,
    calc_total_fu = TRUE,
    region_var = "registry.1",
    age_var = "fc_agegroup.1",
    sex_var = "sex.1",
    year_var = "t_yeardiag.1",
    race_var = "race.1",
    site_var = "t_site_icd.1", #using grouping by second cancer incidence
    futime_var = "p_futimeyrs",
    alpha = 0.05)

sircalc_results %>% print(n = 100)
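
How the futime_breaks partition the follow-up time can be pictured by splitting each patient's follow-up into the person-time that falls into each interval. This base R sketch uses made-up follow-up times; whether sir_byfutime() splits person-time in exactly this way, and on which side it closes the intervals, is an assumption.

```r
futime_breaks <- c(0, 1/12, 2/12, 1, 5, 10, Inf)
futime_yrs    <- c(0.05, 0.5, 7.5)   # made-up follow-up times of three patients

lower <- head(futime_breaks, -1)
upper <- tail(futime_breaks, -1)

# Person-years each patient contributes to each interval
# (rows = patients, columns = follow-up intervals)
py <- sapply(seq_along(lower), function(i) {
  pmax(0, pmin(futime_yrs, upper[i]) - lower[i])
})
colnames(py) <- paste0("[", round(lower, 2), ", ", round(upper, 2), ")")
py

# The contributions of each patient add up to the total follow-up time
rowSums(py)
```

A patient with 7.5 years of follow-up therefore contributes person-time to the first five intervals and none to the last.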

Step 9 - Summarize SIR results {#step-summarize-sir}

#The summarize function is versatile. Here is one example:

sircalc_results %>%
  #summarize results across region, age, year, race and t_site
  summarize_sir_results(.,
                        summarize_groups = c("region", "age", "year", "race"),
                        summarize_site = TRUE,
                        output = "long",  output_information = "minimal",
                        add_total_row = "only",  add_total_fu = "no",
                        collapse_ci = FALSE,  shorten_total_cols = TRUE,
                        fubreak_var_name = "fu_time", ybreak_var_name = "yvar_name",
                        xbreak_var_name = "none", site_var_name = "t_site",
                        alpha = 0.05
                        ) %>%
  dplyr::select(-region, -age, -year, -race, -sex, -yvar_name)

Built with

sessionInfo()

msSPChelpR documentation built on June 11, 2022, 1:11 a.m.
