In biostat3: Utility Functions, Datasets and Extended Examples for Survival Analysis

Downloading user-written Stata commands and data files

[gettingstarted] Stata will be used throughout the course. This section describes how to download and install the files required for the computing exercises (e.g., data files) as well as how to install user-written commands for extending Stata. If you are working in a computer lab during a course it’s possible these files may have already been installed for you.

Downloading the course files

It is suggested that you create a new directory, change the Stata working directory to the new directory (e.g., cd c:\survival\), and then download the files. You can create a new directory in Windows Explorer or you can do it from within Stata as follows.

mkdir c:\survival
cd c:\survival

Use the pwd command to confirm you are in the working directory you wish to use for the course. The course files are available on the web as a ZIP archive:\ http://biostat3.net/download/biostat3.zip\ Save and extract this folder in your created working directory. You can also read the data files directly from the web from within Stata. For example,

. use http://www.biostat3.net/download/colon, clear
(Colon carcinoma, all stages, 1975-94, follow-up to 1995)

Installing Stata user-written commands {#installing_commands}

Download and installation of user-written commands is done within Stata. It is recommended that you change the Stata working directory to the course directory (e.g., cd c:\survival\) before issuing these commands.

How can I check if these commands are already installed?

You can use the which command to check if (and where) a Stata command is installed.

. which stpm2
c:\ado\plus\s\stpm2.ado
*! version 1.5.0 29Jul2014

Use the adoupdate command to update previously installed user-written commands (note that this is distinct from the update command that updates official Stata commands). Simply type adoupdate, update to update all user-written commands.

stpm2 - flexible parametric models

The stpm2 command, written by Paul Lambert and Patrick Royston, fits flexible parametric survival models (so called Royston-Parmar models). It is installed from within Stata using the following commands:

ssc install stpm2
ssc install rcsgen

rcsgen is a command for generating basis vectors for restricted cubic splines and is required by stpm2.

Estimating probability of death in a competing risks framework

The stcompet command estimates the cumulative incidence function (CIF) non-parametrically. The stcompadj command estimates the CIF using a competing risks analogue of the Cox model. The stpm2cm command estimates the crude probabilities of death (i.e. CIF) after fitting a relative survival model using stpm2. The stpm2cif command estimates the CIF through postestimation after fitting a cause-specific competing risks model using stpm2.

ssc install stcompet
ssc install stcompadj
ssc install stpm2cm
ssc install stpm2cif

Exercises

Hand calculation: Life table and Kaplan-Meier estimates of survival\

Using hand calculation (i.e., using a spreadsheet program or pen, paper, and a calculator) estimate the cause-specific survivor function for the sample of 35 patients diagnosed with colon carcinoma (see the table below) using both the Kaplan-Meier method (up to at least 30 months) and the actuarial method (at least the first 5 annual intervals). [byhand]

In the lectures we estimated the observed survivor function (i.e. all deaths were considered to be events) using the Kaplan-Meier and actuarial methods; your task is to estimate the cause-specific survivor function (only deaths due to colon carcinoma are considered events) using the same data. The next page includes some hints to help you get started.
```
              Age   Clinical      dx date
```
ID Sex at dx stage mmyy mm yy Status 1 male 72 Localised 2.89 2 0 Dead - other 2 female 82 Distant 12.91 2 0 Dead - cancer 3 male 73 Distant 11.93 3 0 Dead - cancer 4 male 63 Distant 6.88 5 0 Dead - cancer 5 male 67 Localised 5.89 7 0 Dead - cancer 6 male 74 Regional 7.92 8 0 Dead - cancer 7 female 56 Distant 1.86 9 0 Dead - cancer 8 female 52 Distant 5.86 11 0 Dead - cancer 9 male 64 Localised 11.94 13 1 Alive 10 female 70 Localised 10.94 14 1 Alive 11 female 83 Localised 7.90 19 1 Dead - other 12 male 64 Distant 8.89 22 1 Dead - cancer 13 female 79 Localised 11.93 25 2 Alive 14 female 70 Distant 6.88 27 2 Dead - cancer 15 male 70 Regional 9.93 27 2 Alive 16 female 68 Distant 9.91 28 2 Dead - cancer 17 male 58 Localised 11.90 32 2 Dead - cancer 18 male 54 Distant 4.90 32 2 Dead - cancer 19 female 86 Localised 4.93 32 2 Alive 20 male 31 Localised 1.90 33 2 Dead - cancer 21 female 75 Localised 1.93 35 2 Alive 22 female 85 Localised 11.92 37 3 Alive 23 female 68 Distant 7.86 43 3 Dead - cancer 24 male 54 Regional 6.85 46 3 Dead - cancer 25 male 80 Localised 6.91 54 4 Alive 26 female 52 Localised 7.89 77 6 Alive 27 male 52 Localised 6.89 78 6 Alive 28 male 65 Localised 1.89 83 6 Alive 29 male 60 Localised 11.88 85 7 Alive 30 female 71 Localised 11.87 97 8 Alive 31 male 58 Localised 8.87 100 8 Alive 32 female 80 Localised 5.87 102 8 Dead - cancer 33 male 66 Localised 1.86 103 8 Dead - other 34 male 67 Localised 3.87 105 8 Alive 35 female 56 Distant 12.86 108 9 Alive

We suggest you start with the actuarial approach. Your task is to construct a life table with the following structure.

@lrccrcccc time&$l$&$d$&$w$&$l'$&$p$& $S(t)$\ \ [0-1) & 35 & & & & &\ \ \ [1-2) & & & & & &\ \ \ [2-3) & & & & & &\ \ \ [3-4) & & & & & &\ \ \ [4-5) & & & & & &\ \ \ [5-6) & & & & & &\ \

We have already entered $l_1$ (number of people alive at the start of interval 1). The next step is to add the number who experienced the event ($d$) and the number censored ($w$) during the first year. From $l$, $d$, and $w$ you will then be able to calculate $l'$ (effective number at risk), followed by $p$ (conditional probability of surviving the interval) and finally $S(t)$, the cumulative probability of surviving from time zero until the end of the interval.

\ KAPLAN-MEIER APPROACH

To estimate survival using the Kaplan-Meier approach you will find it easiest to add a line to the table at each and every time there is an event or censoring. We should use time in months. The first time at which there is an event or censoring is time equal to 2 months. The trick is what to do when there are both events and censorings at the same time.

@ccccrccc time&# at risk&$d$&$w$&$p$& $S(t)$\ \ 2 & 35 & & & & &\ \ \ & & & & & &\ \ \ & & & & & &\ \ \ & & & & & &\ \
Using Stata to validate the hand calculations done in part 1 (a) We will now use Stata to reproduce the same analyses done by hand calculation in part 1 (a) although you can do this part without having done the hand calculations, since this question also serves as an introduction to survival analysis using Stata. Our aim is to estimate the cause-specific survivor function for the sample of 35 patients diagnosed with colon carcinoma using both the Kaplan-Meier method and the actuarial method. In the lectures we estimated the all-cause survivor function (i.e. all deaths were considered to be events) using the Kaplan-Meier and actuarial methods whereas we will now estimate the cause-specific survivor function (only deaths due to colon carcinoma are considered events).

After starting Stata, you will first have to specify the data set you wish to analyse, that is

. use colon_sample, clear

Stata will search for this file in the current working directory. The pwd command will return the name of the current working directory. If you need to change to another directory you can use, for example, cd c:\survival\. The describe command will return a summary of the data set structure (e.g., variable names) whereas the list command will display the values of variables.

In order to use the Stata ltable command (life table estimates of the survivor function) we must construct a new variable indicating whether the observation period ended with an event (the new variable is assigned code 1) or censoring (the new variable is assigned code 0). We will call this new variable csr_fail (cause-specific failure). The ltable command is not a standard Stata survival analysis (st) command and does not require that the data be stset.

. recode status (1=1) (nonmissing=0), gen(csr_fail)

There are many ways to create the new variable, the above approach is preferred because missing values of status will remain missing. Even though we don’t have any missing values, it is good programming practice to always write code that will handle missing values appropriately.

The following command will give the actuarial estimates

. ltable surv_yy csr_fail

Alternatively, we could use

. ltable surv_mm csr_fail, interval(12)

Before most Stata survival analysis commands can be used (ltable is an exception) we must first stset the data using the stset command (see Section [Statasurvival]).

. stset surv_mm, failure(status==1)

A listing of the Kaplan-Meier estimates is then obtained as follows

. sts list

To graph the Kaplan-Meier estimates

. sts graph

Note that we only have to stset the data once. You can also tell Stata to show the number at risk either on the curve or in a table.

. sts graph, atrisk\ . sts graph, risktable

Titles and axis labels can also be specified.

. sts graph, risktable ///\ title(Kaplan-Meier estimates of cause-specific survival) ///\ xtitle(Time since diagnosis in months)
Melanoma: Comparing survival proportions and mortality rates by stage for cause-specific and all-cause survival\

The purpose of this exercise is to study survival of the patients using two alternative measures - survival proportions and mortality rates. A second purpose is to study the difference between cause-specific and all-cause survival.

. use melanoma, clear\ . stset surv_mm, failure(status==1)
1. Plot estimates of the survivor function and hazard function by stage.
  
  . sts graph, by(stage)\ . sts graph, hazard by(stage)
  
  By default, the sts graph command plots Kaplan-Meier estimates of survival. If we add the hazard option it shows estimates of the hazard function. Does it appear that stage is associated with patient survival?
  
  Stata tip: You may have found that each time you produce a graph Stata overwrites the previous graph in the graph window. You can instruct Stata to open each graph in a separate window by naming the graphs. This will give you the possibility to compare graphs side by side.
  
  . sts graph, by(stage) name(survival)\ . sts graph, by(stage) name(hazard) hazard
  
  You can use set autotabgraphs to control whether multiple graphs are created as tabs within one window or as separate windows. Issue the following command to make Stata present graphs as tabs within a single window (and store the setting permanently).
  
  set autotabgraphs on, permanently
2. Estimate the mortality rates for each stage using, for example, the strate command.
  
  . strate stage
  
  What are the units of the estimated rates?
  
  [The strate command, as the name suggests, is used to estimates rates. Look at the help pages if you are not familiar with the command.]
3. If you haven’t already done so, estimate the mortality rates for each stage per 1000 person-years of follow-up.
  
  [HINT: consider the scale() option to stset and the per() option to strate.]
4. Study whether survival is different for males and females (both by plotting the survivor function and by tabulating mortality rates).
  
  . sts graph, by(sex)\ . sts graph, hazard by(sex)
  
  Is there a difference in survival between males and females? If yes, is the difference present throughout the follow up?
5. The plots you made above were based on cause-specific survival (i.e., only deaths due to cancer are counted as events, deaths due to other causes are censored). In the next part of this question we will estimate all-cause survival (i.e., any death is counted as an event). First, however, study the coding of vital status and tabulate vital status by age group.
  
  How many patients die of each cause? Does the distribution of cause of death depend on age?
  
  . codebook status\ . tab status agegrp
6. To get all-cause survival, specify all deaths (both cancer and other) as events in the stset command.
  
  . stset surv_mm, failure(status==1,2)
  
  Now plot the survivor proportion for all-cause survival by stage. We name the graph to be able to separate them in the graph window. Is the survivor proportion different compared to the cause-specific survival you estimated above? Why?
  
  . sts graph, by(stage) name(anydeath, replace)
7. It is more common to die from a cause other than cancer in older ages. How does this impact the survivor proportion for different stages? Compare cause-specific and all-cause survival by plotting the survivor proportion by stage for the oldest age group (75+ years) for both cause-specific and all-cause survival. We suggest you copy the code from the PDF file into the Stata do editor and run the code from there.
  
  . stset surv_mm, failure(status==1)\ . sts graph if agegrp==3, by(stage) ///\ name(cancerdeath_75, replace) subtitle(“Cancer”)\ . stset surv_mm, failure(status==1,2)\ . sts graph if agegrp==3, by(stage) ///\ name(anydeath_75, replace) subtitle(“All cause”)\ . graph combine cancerdeath_75 anydeath_75
8. Now estimate both cancer-specific and all-cause survival for each age group.
  
  . use melanoma, clear\ . stset surv_mm, failure(status==1,2)\ . sts graph, by(agegrp) name(anydeathbyage, replace) subtitle(“All cause”)\ \ . stset surv_mm, failure(status==1)\ . sts graph, by(agegrp) name(cancerdeathbyage, replace) subtitle(“Cancer”)\ \ . graph combine anydeathbyage cancerdeathbyage
  
  Are there bigger differences between the age groups for cause-specific or for all-cause survival?
Localised melanoma: Comparing estimates of cause-specific survival between periods; first graphically and then using the log rank test\

[melanomakm] We will now analyse the full data set of patients diagnosed with localised skin melanoma.

Use Stata to estimate the cause-specific survivor function, using the Kaplan-Meier method with survival time in months, separately for each of the two calendar periods 1975–1984 and 1985–1994. The following commands can be used

. use melanoma if stage == 1, clear\ . stset surv_mm, failure(status==1)\ . sts graph, by(year8594)

The variable year8594 takes the value 1 for patients diagnosed 1985–1994 and 0 for those diagnosed 1975–1984.
1. Without making reference to any formal statistical tests, does it appear that patient survival is superior during the most recent period?
2. The following commands can be used to plot the hazard function (instantaneous mortality rate):
```
. sts graph, hazard by(year8594)
```
  1. At what point in the follow-up is mortality highest?
  2. Does this pattern seem reasonable from a clinicial/biological perspective? [HINT: Consider the disease with which these patients were classified as being diagnosed along with the expected fatality of the disease as a function of time since diagnosis.]
3. [logrank] Use the log rank test to determine whether there is a statistically significant difference in patient survival between the two periods. The following command can be used:
  
  . sts test year8594
  
  What do you conclude?
  
  An alternative test is the generalised Wilcoxon, which can be obtained as follows
  
  . sts test year8594, wilcoxon
  
  Haven’t heard of the log rank (or Wilcoxon) test? It’s possible you may reach this exercise before we cover the details of these tests during lectures. You should nevertheless do the exercise and try and interpret the results. Both of these tests (the log rank and the generalised Wilcoxon) are used to test for differences between the survivor functions. The null hypothesis is that the survivor functions are equivalent for the two calendar periods (i.e., patient survival does not depend on calendar period of diagnosis).
4. Estimate cause-specific mortality rates for each age group, and graph Kaplan-Meier estimates of the cause-specific survivor function for each age group. Are there differences between the age groups? Is the interpretation consistent between the mortality rates and the survival proportions?
  
  . strate agegrp, per(1000)\ . sts graph, by(agegrp)
  
  What are the units of the estimated hazard rates? HINT: look at how you defined time when you stset the data.
5. Repeat some of the previous analyses after using the scale() option to stset to rescale time from months to years. This is equivalent to dividing the time variable by 12 so all analyses will be the same except the units of time will be different (e.g., the graphs will have different labels).
  
  . stset surv_mm, failure(status==1) scale(12)\ . sts graph, by(agegrp)\ . strate agegrp, per(1000)
6. Study whether there is evidence of a difference in patient survival between males and females. Estimate both the hazard and survival function and use the log rank test to test for a difference.
Localised melanoma: Comparing actuarial and Kaplan-Meier approaches with discrete time data\

The aim of this exercise is to examine the effect of heavily grouped data (i.e., data with lots of ties) on estimates of survival made using the Kaplan-Meier method and the actuarial method.

For the patients diagnosed with localised skin melanoma, use Stata to estimate the 10-year cause-specific survival proportion. Use both the Kaplan-Meier method and the actuarial method. Do this both with survival time recorded in completed years and survival time recorded in completed months. That is, you should obtain 4 separate estimates of the 10-year cause-specific survival proportion to complete the cells of the following table. The purpose of this exercise is to illustrate small differences between the two methods when there are large numbers of ties.

In order to reproduce the results in the printed solutions you’ll need to restrict to localised stage (stage==1) and estimate cause-specific survival (status==1 indicates an event). Look at the Stata code in the previous questions if you are unsure.
```
        Actuarial   Kaplan-Meier
```
```
 Years

Months
```
1. Of the two estimates (Kaplan-Meier and actuarial) made using time recorded in years, which do you think is the most appropriate and why?
  
  [HINT: Consider how each of the methods handle ties.]
2. Which of the two estimates (Kaplan-Meier or actuarial) changes most when using survival time in months rather than years? Why?
Diet data: tabulating incidence rates and modelling with Poisson regression [diet:poisson]\

Load the diet data and stset the data using time-on-study as the timescale.
```
. use diet, clear
. stset dox, id(id) fail(chd) origin(doe) scale(365.24)
```
1. Use the strate command to tabulate CHD incidence rates per 1000 person-years for each category of hieng. Calculate (by hand) the ratio of the two incidence rates.
2. Use the command poisson to find the incidence rate ratio for the high energy group compared to the low energy group and compare the estimate to the one you obtained in the previous question:
```
. poisson chd hieng, e(y) irr
```
  NOTE: Rates are calculated as events/person-time so when modelling rates we need to let Stata know both of these quantities. chd is the event indicator and y is the person time at risk for each individual. The irr option results in the estimates being presented as estimated incidence rate ratios rather than parameter estimates (log incidence rate ratios).
3. Grouping the values of total energy into just two groups does not tell us much about how the CHD rate changes with total energy. It is a useful exploratory device, but to look more closely we need to group the total energy into perhaps 3 or 4 groups. In this example we shall use the cut points $1500, 2500, 3000, 4500$. To check if these cutpoints seem reasonable, type:
```
. histogram energy, normal
. sum energy, detail
```
4. Use the commands
```
. egen eng3=cut(energy), at(1500, 2500, 3000, 4500)
. tabulate eng3
```
  to create a new variable eng3 coded 1500 for values of energy in the range 1500–2499, 2500 for values in the range 2500–2999, and 3000 for values in the range 3000–4500.
5. [e] To estimate and plot the rates for different levels of eng3 try
```
. strate eng3, per(1000) graph
```
  Calculate (by hand) the ratio of rates in the second and third levels to the first level.
6. Create your own indicator variables for the three levels of eng3 with
```
. tabulate eng3, gen(X)
```
7. Check the indicator variables with
```
. list energy eng3 X1 X2 X3 if eng3==1500
. list energy eng3 X1 X2 X3 if eng3==2500
. list energy eng3 X1 X2 X3 if eng3==3000
```
8. Use poisson to compare the second and third levels with the first, as follows:
```
. poisson chd X2 X3, e(y) irr
```
  Compare your estimates with those you obtained in part [e].
9. Use poisson to compare the first and third levels with the second.
10. Repeat the analysis comparing the second and third levels with the first but this time have Stata create the indicators automatically via the i. syntax. That is
```
. poisson chd i.eng3, e(y) irr
```
11. Without using st commands, calculate the total number of events during follow-up, person-time at risk, and the crude incidence rate (per 1000 person-years), for example with Stata commands for descriptive statistics (e.g., summarize). Confirm your answer using strate or stptime. [HINT: Remember that the total number of person-years is the number of persons at risk multiplied with the mean follow up time among those persons.]
Localised melanoma: model cause-specific mortality with Poisson regression [locmel:csr:poisson]\

In this exercise we model, using Poisson regression, cause-specific mortality of patients diagnosed with localised (stage==1) melanoma.

In exercise [locmel:csr:cox] we model cause-specific mortality using Cox regression and in exercise [locmel:csr:fpm] we use flexible parametric models. The aim is to illustrate that these three methods are very similar.

The aim of these exercises is to explore the similarities and differences to these three approaches to modelling. We will be comparing the results (and their interpretation) as we proceed through the exercises so you may wish to save your commands in a do file to facilitate comparison.

The following commands can be used to load and stset the data.

. use melanoma if stage==1, clear\ . stset surv_mm, failure(status==1) scale(12) id(id)
1. Plot Kaplan-Meier estimates of cause-specific survival as a function of calendar period of diagnosis.
  
  . sts graph, by(year8594)
  
  [localmel:rate:adjusted]
  1. During which calendar period (the early or the latter) is survival best?
  2. Now plot the estimated hazard function (cause-specific mortality rate) as a function of calendar period of diagnosis.
    
    . sts graph, by(year8594) hazard
    
    During which calendar period (the early or the latter) is mortality the lowest?
  3. Is the interpretation (with respect to how prognosis depends on period) based on the hazard consistent with the interpretation of the survival plot?
2. Use the strate command to estimate the cause-specific mortality rate for each calendar period.
  
  . strate year8594, per(1000)
  
  [localmel:rate:unadjusted] During which calendar period (the early or the latter) is mortality the lowest? Is this consistent with what you found earlier? If not, why the inconsistency?
3. The reason for the inconsistency between parts [localmel:rate:adjusted] and [localmel:rate:unadjusted] was confounding by time since diagnosis. The comparison in part [localmel:rate:adjusted] was adjusted for time since diagnosis (since we compare the differences between the curves at each point in time) whereas the comparison in part [localmel:rate:unadjusted] was not. Understanding this concept is central to the remainder of the exercise so please ask for help if you don’t follow.
  
  Two approaches for controlling for confounding are ‘restriction’ and ‘statistical adjustment’. We will first use restriction to control for confounding. That is we will stset the data again but use the exit(time 120) option to restrict the potential follow-up time to a maximum of 120 months. Individuals who survive more than 120 months are censored at 120 months
  
  . use melanoma if stage==1, clear\ . stset surv_mm, failure(status==1) scale(12) id(id) exit(time 120)
  1. Use the strate command to estimate the cause-specific mortality rate for each calendar period.
    
    . strate year8594, per(1000)
    
    During which calendar period (the early of the latter) is mortality the lowest? Is this consistent with what you found in part [localmel:rate:unadjusted]?
  2. Calculate by hand the ratio (85–94/75–84) of the two mortality rates (i.e., a mortality rate ratio) and interpret the estimate (i.e., during which period is mortality higher/lower and by how much).
  3. Now use Poisson regression to estimate the same mortality rate ratio.
    
    . streg year8594, dist(exp)
  NOTE: streg is one of several Stata commands for performing Poisson regression. The model could also be fitted using the poisson or glm commands.
  
  . gen risktime=_t-_t0\ . poisson _d year8594 if _st==1, exp(risktime) irr\ . glm _d year8594 if _st==1, family(poisson) eform lnoffset(risktime)
  
  However, if you have stset/stsplit the data it is recommended that you use streg since streg understands and respects the internal st variables (_st, _t, _t0, and _d). In particular, ‘trimmed’ person-time will be ignored by streg but not by the poisson command.
  
  Strictly speaking, streg fits parametric survival models. A parametric survival model assuming survival times are exponentially distributed (dist(exp)) implies a constant hazard and a Poisson process for the number of events (i.e., Poisson regression).
4. In order to adjust for time since diagnosis (i.e., adjust for the fact that we expect mortality to depend on time since diagnosis) we need to split the data by this timescale. We will restrict our analysis to mortality up to 10 years following diagnosis.
  
  . stsplit fu, at(0(1)10) trim
  
  NOTE: The trim option instructs Stata to ignore time-at-risk outside the interval [0,10] (i.e., after 10 years subsequent to diagnosis). Since we have already made this restriction using stset there should not be any time ‘trimmed’.
5. Now tabulate (and produce a graph of) the rates by follow-up time.
  
  . strate fu, per(1000) graph
  
  Mortality appears to be quite low during the first year of follow-up. Does this seem reasonable considering the disease with which these patients have been diagnosed?
6. Compare the plot of the estimated rates to a plot of the hazard rate as a function of continuous time.
  
  . sts graph, hazard
  
  Is the interpretation similar? Do you think it is sufficient to classify follow-up time into annual intervals or might it be preferable to use, for example, narrower intervals?
7. Use Poisson regression to estimate incidence rate ratios as a function of follow-up time.
  
  . streg i.fu, dist(exp)
  
  Does the pattern of estimated incident rate ratios mirror the pattern you observed in the plots?
8. Now estimate the effect of calendar period of diagnosis while adjusting for time since diagnosis. Before fitting this model, predict what you expect the estimated effect to be (i.e., will it be higher, lower, or similar to the value of $0.8831852$ we obtained in part c).
  
  . streg i.fu year8594, dist(exp)
  
  Is the estimated effect of calendar period of diagnosis consistent with what you expected? Add an interaction between follow-up and calendar period of diagnosis and interpret the results.
9. [localmel:poisson] Now control for age, sex, and calendar period.
  
  . streg i.fu i.agegrp year8594 sex, dist(exp)
  1. Interpret the estimated hazard ratio for the parameter labelled agegrp 2, including a comment on statistical significance.
  2. Is the effect of calendar period strongly confounded by age and sex? That is, does the inclusion of sex and age in the model change the estimate for the effect of calendar period?
  3. Perform a Wald test of the overall effect of age and interpret the results.
    
    . test 1.agegrp 2.agegrp 3.agegrp
10. [localmel:agesexinteraction] Is the effect of sex modified by calendar period (whilst adjusting for age and follow-up)? Fit an appropriate interaction term to test this hypothesis.
11. Based on the interaction model you fitted in exercise [localmel:agesexinteraction], estimate the hazard ratio for the effect of sex (with 95% confidence interval) for each calendar period.
  
  ADVANCED: Do this with each of the following methods and confirm that the results are the same:
  1. Using hand-calculation on the estimates from exercise [localmel:agesexinteraction].
  2. Using the estimates from exercise [localmel:agesexinteraction] and the lincom command.
    
    . lincom 2.sex + 1.year8594#2.sex, eform
  3. Creating appropriate dummy variables that represent the effects of sex for each calendar period.
    
    . gen sex_early=(sex==2)*(year8594==0)\ . gen sex_latter=(sex==2)*(year8594==1)\ . streg i.fu i.agegrp year8594 sex_early sex_latter, dist(exp)
  4. Using Stata 11 syntax to repeat the previous model.
    
    . streg i.fu i.agegrp i.year8594 year8594#sex, dist(exp)
12. Now fit a separate model for each calendar period in order to estimate the hazard ratio for the effect of sex (with 95% confidence interval) for each calendar period. Why do the estimates differ from those you obtained in the previous part?
  
  . streg i.fu i.agegrp sex if year8594==0, dist(exp)\ . streg i.fu i.agegrp sex if year8594==1, dist(exp)
  
  Can you fit a single model that reproduces the estimates you obtained from the stratified models? Try:
  
  . streg i.fu##year8594 i.agegrp##year8594 year8594##sex, dist(exp)
Diet data: Using Poisson regression to study the effect of energy intake adjusting for confounders on two different timescales\

Use Poisson regression to study the association between energy intake (hieng) and CHD adjusted for potential confounders (job, BMI). We know that people who expend a lot of energy (i.e., are physically active) require a higher energy intake. We do not have data on physical activity but we are hoping that occupation (job) will serve as a surrogate measure of work-time physical activity (conductors on London double-decker busses expend energy walking up and down the stairs all day).

Fit models both without adjusting for ‘time’ and by adjusting for attained age (you will need to split the data) and time-since-entry and compare the results.
1. Rates can be modelled on different timescales, e.g., attained age, time-since-entry, calendar time. Plot the CHD incidence rates both by attained age and by time-since-entry. Is there a difference? Do the same for CHD hazard by different energy intakes (hieng).
  
  . use diet, clear\ \ .* Timescale: Attained age\ . stset dox, id(id) fail(chd) origin(dob) enter(doe) scale(365.24)\ . sts graph, hazard\ . sts graph, by(hieng) hazard\ \ .* Timescale: Time-since-entry\ . stset dox, id(id) fail(chd) origin(doe) enter(doe) scale(365.24)\ . sts graph, hazard\ . sts graph, by(hieng) hazard
2. Model the rate using Poisson regression, without adjusting for any timescale. What is the effect of hieng on CHD? What assumption does this model make on the shape of the underlying incidence rate over time?
  
  . poisson chd hieng, e(y) irr
3. Adjust for BMI and job. Is there evidence that the effect of energy intake on CHD is confounded by BMI and job?
  
  . gen bmi=weight/(height/100*height/100)\ . poisson chd hieng job bmi, e(y) irr
4. Firstly, let’s adjust for the timescale attained age. To do this in Poisson regression you must split the data on timescale age. First use stset (with origin date of birth) and then use stsplit to generate agebands.
  
  . stset dox, id(id) fail(chd) origin(dob) enter(doe) scale(365.24)\ . stsplit ageband, at(30,50,60,72) trim\ . list id _t0 _t ageband y in 1/10
  
  As the poisson command is not an st command, you must keep track of the risktime yourself. Why is the y variable not correct anymore? Generate a new variable, risktime, which contains the risktime for each split record.
  
  . gen risktime=_t-_t0\ . list id _t0 _t ageband y risktime in 1/10
  
  You must also keep track of the event variable, as chd will not be valid after the split.
  
  . tab ageband chd, missing\ . tab ageband _d, missing
  
  Now fit the model for CHD, both without and with the adjustment for job and bmi. Is the effect of hieng on CHD confounded by age, BMI or job?
  
  . poisson _d hieng i.ageband, e(risktime) irr\ . poisson _d hieng i.job bmi i.ageband, e(risktime) irr
  
  What assumption is being made about the shape of the baseline hazard (HINT: the baseline hazard takes the shape of the timescale)?
5. Secondly, do the same analysis, but now adjust for the timescale time-since-entry. (You must read the data in again, as you now want to split on another timescale. This is strictly not necessary, but to avoid mistakes it is generally a good idea to start over again.)
  
  . use diet, clear\ . gen bmi=weight/(height/100*height/100)
  
  Specify time-since-entry as the timescale by specifying date of entry as the time origin.
  
  . stset dox, id(id) fail(chd) origin(doe) enter(doe) scale(365.24)\ \ . stsplit fuband, at(0,5,10,15,22) trim\ . list id _t0 _t fuband y in 1/10\ \ . gen risktime=_t-_t0\ . list id _t0 _t fuband y risktime in 1/10\ \ . tab fuband chd, missing\ . tab fuband _d, missing\ \ . poisson _d hieng i.fuband, e(risktime) irr\ . poisson _d hieng i.job bmi i.fuband, e(risktime) irr
  
  Compare the results with the analysis adjusted for attained age. Are there any differences? Why (or why not)? Go back to the graphs at the beginning of the exercise and look for explanations.
6. Repeat the exercise using streg. What is the advantage/disadvantage of using streg?
Localised melanoma: modelling cause-specific mortality using Cox regression [locmel:csr:cox]\

In exercise [locmel:csr:poisson] we modelled the cause-specific mortality of patients diagnosed with localised melanoma using Poisson regression. We will now model cause-specific mortality using Cox regression and compare the results to those we obtained using the Poisson regression model.

To fit a Cox proportional hazards model (for cause-specific survival) with calendar period as the only explanatory variable, the following commands can be used. Note that we are censoring all survival times at 120 months (10 years) in order to facilitate comparisons with the Poisson regression model in exercise [locmel:csr:poisson].

. use melanoma\ . keep if stage == 1\ . stset surv_mm, failure(status==1) exit(time 120)\ . stcox year8594
1. Interpret the estimated hazard ratio, including a comment on statistical significance.
2. (This part is more theoretical and is not required in order to understand the remaining parts.)
  
  Stata reports a Wald test of the null hypothesis that survival is independent of calendar period. The test statistic (and associated P-value) is reported in the table of parameter estimates (labelled z). Under the null hypothesis, the test statistic has a standard normal (Z) distribution, so the square of the test statistic will have a chi square distribution with one degree of freedom.
  
  Stata also reports a likelihood ratio test statistic of the null hypothesis that none of the parameters in the model are associated with survival (labelled LR chi2(1)). In general, this test statistic will have a chi-square distribution with degrees of freedom equal to the number of parameters in the model. For the current model, with only one parameter, the test statistic has a chi square distribution with one degree of freedom.
  
  Compare these two test statistics with each other and with the log rank test statistic (which also has a $\chi^2_1$ distribution) calculated in question [logrank] (you should, however, recalculate the log rank test since we have restricted follow-up to the first 10 years in this exercise). Would you expect these test statistics to be similar? Consider the null and alternative hypotheses of each test and the assumptions involved with each test.
3. [localmel:cox] Now include sex and age (in categories) in the model.
  
  . stcox sex year8594 i.agegrp
  1. Interpret the estimated hazard ratio for the parameter labelled agegrp 2, including a comment on statistical significance.
  2. Is the effect of calendar period strongly confounded by age and sex? That is, does the inclusion of sex and age in the model change the estimate for the effect of calendar period?
  3. Perform a Wald test of the overall effect of age and interpret the results.
    
    . test 1.agegrp 2.agegrp 3.agegrp
4. Perform a likelihood ratio test of the overall effect of age and interpret the results. The following commands can be used
  
  . stcox sex year8594 i.agegrp\ . est store A\ . stcox sex year8594\ . lrtest A
  
  Compare your findings to those obtained using the Wald test. Are the findings similar? Would you expect them to be similar?
5. The model estimated in question [localmel:cox] is similar to the model estimated in question [localmel:poisson].
  1. Both models adjust for sex, year8594, and i.agegrp but the Poisson regression model in question [localmel:poisson] appears to adjust for an additional variable (i.fu). Is the Poisson regression model adjusting for an additional factor? Explain.
  2. Would you expect the parameter estimate for sex, period, and age to be similar for the two models? Are they similar?
  3. Do both models assume proportional hazards? Explain.
  Following is some code for estimating and comparing the Cox and Poisson regression models (you might wish to copy this code from the PDF version of this document and paste it into the Stata do file editor or use compare_cox_poisson.do.
```
use melanoma if stage==1, clear
stset surv_mm, failure(status==1) id(id) exit(time 120)
stcox year8594 sex i.agegrp
est store Cox

/* split on time since diagnosis */
stsplit fu, at(0(12)120) trim

streg i.fu year8594 sex i.agegrp, dist(exp)
est store Poisson
est table Cox Poisson, eform equations(1)
```
6. ADVANCED: By splitting at each failure time we can estimate a Poisson regression model that is identical to the Cox model. Code is available in the file compare_cox_poisson.do on the course website. This model takes several minutes to estimate and you may need to reset the values of memory and matsize.
7. ADVANCED: Split the data finely (e.g., 3-month intervals) and model the effect of time using a restricted cubic spline.
```
use melanoma if stage==1, clear
stset surv_mm, failure(status==1) id(id) exit(time 120)
/* split on time since diagnosis (1-month intervals) */
stsplit fu, at(0(1)120) trim
/* Create basis for restricted cubic spline */
mkspline fu_rcs=fu, cubic
streg fu_rcs* year8594 sex i.agegrp, dist(exp)
predict xb, xb
twoway line xb fu if year8594==0 & sex==1 & agegrp==1, sort
```
Examining the proportional hazards hypothesis (localised melanoma)\
1. For the localised melanoma data with 10 years follow-up, plot the instantaneous cause-specific hazard for each calendar period. The following commands can be used
  
  . use melanoma if stage == 1, clear\ . stset surv_mm, failure(status==1) id(id) exit(time 120) scale(12)\ . sts graph, hazard by(year8594)
  
  Make a rough estimate of the hazard ratio for patients diagnosed 1985–94 to those diagnosed 1975–84. In part (d) you will fit a Cox model and check your estimate.
2. Now plot the instantaneous cause-specific hazard for each calendar period using a log scale for the y axis (use the option yscale(log)). What would you expect to see if a proportional hazards assumption were appropriate? Do you see it?
3. Another graphical way of checking the proportional hazards assumption is to plot the log cumulative cause specific hazard function for each calendar period. These plots were not given extensive coverage in the lectures, so attempt this if you like or continue to part (d). The command for plotting this function is
  
  . stphplot, by(year8594)
  
  What would you expect to see if a proportional hazards assumption were appropriate? Do you see it?
4. Compare your estimated hazard ratio from part (a) with the one from a fitted Cox model with calendar period as the only explanatory variable. Are they similar?
5. Now fit a more complex model and use graphical methods to explore the assumption of proportional hazards by calendar period. For example,
  
  . stcox sex i.year8594 i.agegrp\ . estat phtest, plot(1.year8594)
  
  What do you conclude?
6. Do part (a)–(e) but now for the variable agegrp. What are your conclusions regarding the assumption of proportional hazards?
7. Now formally test the assumption of proportional hazards using
  
  . stcox sex i.year8594 i.agegrp\ . estat phtest, detail
  
  Are your conclusions from the test coherent with your conclusions from the graphical assessments?
8. Estimate separate age effects for the first two years of follow-up (and separate estimates for the remainder of the follow-up) while controlling for sex and period. Do the estimates for the effect of age differ between the two periods of follow-up?
  
  There are two ways to fit time-varying effects: 1) the tvc option in stcox or 2) by splitting on time using stsplit.
  
  Using tvc:
  
  . tab(agegrp), gen(agegrp)\ . stcox sex year8594 agegrp2 agegrp3 agegrp4, ///\ tvc(agegrp2 agegrp3 agegrp4) texp(_t>=2)
  
  Using stsplit:
  
  . stsplit fuband, at(0,2)\ . list id _t0 _t fu in 1/10\ \ . stcox sex year8594 i.agegrp##i.fuband
  
  These are simply two alternative syntaxes for fitting the same model with the same parameterizations. They give the so-called default parameterizations for interaction effects. We see effects of age (i.e., the hazard ratios) for the period 0–2 years subsequent to diagnosis along with the interaction effects. An advantage of the default parameterisation is that one can easily test the statistical significance of the interaction effects. Before going further, test whether the age*follow-up interaction is statistically significant (using a Wald and/or LR test).
9. Often we wish to see the effects of exposure (age) for each level of the modifier (time since diagnosis). That is, we would like to complete the table below with relevant hazard ratios. To get the effects of age for the period 2+ years after diagnosis, using the default parametrization, we must multiply the hazard ratios for 0–2 years by the appropriate interaction effect. Now let’s reparameterise the model to directly estimate the effects of age for each level of time since diagnosis. This is easily done in Stata (version 11 or later) using single #’s
  
  . stcox sex year8594 i.fuband i.fuband#i.agegrp
```
         0–2 years   2+ years
```
```
Agegrp1    1.00        1.00
Agegrp2             
Agegrp3             
Agegrp4
```
  Fill in the table above. Does the effect of age appear different before and after 2 years?
10. ADVANCED: Fit an analogous Poisson regression model. Are the parameter estimates similar? HINT: You will need to split the data by time since diagnosis.
Cox regression with observed (all-cause) mortality as the outcome\

Now fit a model to the localised melanoma data where the outcome is observed survival (i.e. all deaths are considered to be events).

. stset surv_mm, failure(status==1,2) exit(time 120)\ . keep if stage==1\ . stcox sex year8594 i.agegrp
1. Interpret the estimated hazard ratio for the parameter labelled 2.agegrp, including a comment on statistical significance.
2. On comparing the estimates between the observed and cause-specific survival models it appears that only the parameters for age have changed substantially. Can you explain why the estimates for the effect of age would be expected to change more than the estimates of the effect of sex and period?
Cox model for cause-specific mortality for melanoma (all stages)\

Use Cox regression to model the cause-specific survival of patients with skin melanoma (including all stages).
1. First fit the model with sex as the only explanatory variable. Does there appear to be a difference in survival between males and females?
2. Is the effect of sex confounded by other factors (e.g. age, stage, subsite, period)? After controlling for potential confounders, does there still appear to a difference in survival between males and females?
3. Consider the hypothesis that there exists a class of melanomas where female sex hormones play a large role in the etiology. These hormone related cancers are diagnosed primarily in women and are, on average, less aggressive (i.e., prognosis is good). If such a hypothesis were true we might expect the effect of sex to be modified by age at diagnosis (e.g., pre versus post menopausal). Test whether this is the case.
4. Decide on a ‘most appropriate’ model for these data. Be sure to evaluate the proportional hazards assumption.
Modelling the diet data using Cox regression
1. Fit the following Poisson regression model to the diet data (we fitted this same model in question [diet:poisson]).
```
. use diet, clear
. poisson chd hieng, e(y) irr
```
  Now fit the following Cox model.
```
. stset dox, id(id) fail(chd) entry(doe) origin(doe) scale(365.24)
. stcox hieng
```
  1. On what scale are we measuring ‘time’? That is, what is the timescale?
  2. Is it correct to say that both of these models estimate the effect of high energy on CHD without controlling for any potential confounders? If not, how are these models conceptually different?
  3. Would you expect the parameter estimates for these two models to be very different? Is there a large difference?
2. stset the data with attained age as the timescale and refit the Cox model. Is the estimate of the effect of high energy different? Would we expect it to be different?
Estimating the effect of a time-varying exposure – the bereavement data\

These data were used to study a possible effect of marital bereavement (loss of husband or wife) on all–cause mortality in the elderly. The dataset was extracted from a larger follow-up study of an elderly population and concerns subjects whose husbands or wives were alive at entry to the study. Thus all subjects enter as not bereaved but may become bereaved at some point during follow–up. The variable dosp records the date of death of each subject’s spouse and takes the value 1/1/2000 where this has not yet happened.
1. Load the data with
```
. use brv, clear
. desc
```
  To see how the coding works for couples try
```
. list id sex doe dosp dox fail if couple==3
```
  for a couple, both of whom die during follow–up. Draw a picture showing the follow–up for both subjects, and mark the dates of entry exit and death of spouse on it. Try
```
. list id sex doe dosp dox fail if couple==4
```
  for a couple, one of whom dies during follow–up,
```
. list id sex doe dosp dox fail if couple==19
```
  for a couple, neither of whom die during follow–up, and
```
. list id sex doe dosp dox fail if couple==7
```
  for a couple where only data on one individual is available.
2. Set the st variables, calculate the mortality rate per 1000 years for men and for women, and find the rate ratio comparing women (coded 2) with men (coded 1), using
```
. stset dox, fail(fail) origin(dob) entry(doe) scale(365.24) id(id)
. strate sex, per(1000)
. streg sex, dist(exp)
```
  1. What dimension of time did we use as the timescale when we stset the data? Do you think this is a sensible choice?
  2. Which gender has the highest mortality? Is this expected?
  3. Could age be a potential confounder? Does age at entry differ between males and females? Later we will estimate the rate ratio while controlling for age.
3. Breaking records into pre and post bereavement\ In these data a subject changes exposure status from not bereaved to bereaved when his or her spouse dies. The first stage of the analysis therefore is to partition each follow–up into a record describing the period of follow-up pre–bereavement and (for subjects who were bereaved during the study) the period post–bereavement.
  
  This can be done using stsplit:
```
. stsplit brv, after(time=dosp) at(0)
. recode brv -1=0 0=1
```
  This syntax of stsplit splits the records at the death of spouse (or 1/1/2000 if the spouse is still alive). The variable brv takes the values $-1$ for the pre bereavement part and 0 for the post bereavment part and the recode command changes these to 0 and 1 respectively.
  
  To see the effect on couple 3
```
. list id sex doe dosp dox brv _t0 _t _d fail if couple==3
```
  We see that, of this couple, only the woman was bereaved during follow-up (it is impossible for both of a couple to contribute person-time to the bereaved category). This woman was classified as ‘not bereaved’ during age 83.87 and 84.41 and ‘bereaved’ during ages 84.41 and 84.82. Study the data for the other couples mentioned above.
4. Now find the (crude) effect of bereavement
```
. streg brv, dist(exp)
```
5. Since there is a strong possibility that the effect of bereavement is not the same for men as for women, use streg to estimate the effect of bereavement separately for men and women. Do this both by fitting separate models for males and females (e.g. streg brv if sex==1) as well as by using a single model with an interaction term (you may need to create dummy variables). Confirm that the estimates are identical for these two approaches.
6. Controlling for age There is strong confounding by age. Use stsplit to expand the data by 5 year age–bands, and check that the rate is increasing with age. Use streg to find the effect of bereavement controlled for age. If you wish to study the distribution of age then it is useful to know that age at entry and exit are stored in the variables _t0 and _t respectively.
7. Now estimate the effect of bereavement (controlled for age) separately for each sex.
8. We have assumed that any effect of bereavement is both immediate and permanent. This is not realistic and we might wish to improve the analysis by further subdividing the post–bereavement follow–up. How might you do this? (you are not expected to actually do it)
9. Analysis using Cox regression\ We can also model these data using Cox regression. Provided we have stset the data with attained age as the time scale and split the data (using stsplit) to obtain separate observations for the bereaved and non-bereaved person-time the following command will estimate the effect of bereavement adjusted for attained age.
```
. stcox brv
```
  That is, we do not have to split the data by attained age (although we can fit the model to data split by attained age and the results will be the same).
10. Use the Cox model to estimate the effect of bereavement separately for males and females and compare the estimates to those obtained using Poisson regression.
[smr] Calculating SMRs/SIRs\

The standardized mortality ratio (SMR) is the ratio of the observed number of deaths in the study population to the number that would be expected if the study population experienced the same mortality as the standard population. It is an indirectly standardized rate. When studying disease incidence the corresponding quantity is called a standardized incidence ratio (SIR). These measures are typically used when the entire study population is considered ‘exposed’. Rather than following-up both the exposed study population and an unexposed control population and comparing the two estimated rates we instead only estimate the rate (or number of events) in the study population and compare this to the expected rate (expected number of events) for the standard population. For example, we might study disease incidence or mortality among individuals with a certain occupation (farmers, painters, airline cabin crew) or cancer incidence in a cohort exposed to ionising radiation.

In the analysis of cancer patient survival we typically estimate excess mortality (observed - expected deaths). The SMR (observed/expected deaths) is a measure of relative mortality. The estimation of observed and expected numbers of deaths are performed in an identical manner for each measure but with the SMR we assume that the effect of exposure is multiplicative to the baseline rate whereas with excess mortality we assume it is additive. Which measure, relative mortality or excess mortality, do you think is more homogeneous across age?

The following example illustrates the approach to estimating SMRs/SIRs using Stata. Specifically, we will estimate SMRs for the melanoma data using the general population mortality rates stratified by age and calendar period (derived from popmort.dta) to estimate the expected number of deaths. The expected mortality rates depend on current age and current year so the approach is as follows
- Split follow-up into 1-year age bands
- Split the resulting data into 1-year calendar period bands
- For each age-period band, merge with popmort.dta to obtain the expected mortality rates
- Sum the observed and expected numbers of deaths and calculate the SMR (observed/expected) and a 95% CI
- Start by stsetting the data with age as the timescale and splitting the follow-up into 1 year age bands
```
use melanoma, clear
stset exit, fail(status == 1 2) origin(bdate) entry(dx) scale(365.24) id(id)
stsplit _age, at(0(1)110) trim
```
- Now split these new records into 1 year calendar period bands using
```
stsplit _year, after(time=d(1/1/1900)) at(70(1)100) trim
replace _year=1900+_year
list id _age _year in 1/15
```
  Note that we have used the second syntax for stsplit and set the origin for calendar period as 1/1/1900 for convenience in setting the breaks.
- Each subject’s follow–up is now divided into small pieces corresponding to the agebands and calendar periods the subject passes through. We can make tables of deaths and person-years by age and calendar period with
```
gen _y = _t - _t0 if _st==1
table _age _year, c(sum _d)
table _age _year, c(sum _y) format(%5.3f)
```
  As the data have been split in 1-year intervals on both time scales the table created above is not so informative. Grouped variables will provide a better overview.
```
egen ageband_10=cut(_age), at (0(10)110)
egen period_5=cut(_year), at(1970(5)2000)

table ageband_10 period_5, c(sum _d)
table ageband_10 period_5, c(sum _y) format(%4.1f)
```
- To make a table of rates by age and calendar period, try
```
gen obsrate=_d/_y
table ageband_10 period_5 [iw=_y] , c(mean obsrate) format(%5.3f)
```
- To calculate the expected cases for a cohort, using reference mortality rates classified by age and calendar period, it is first necessary to break the follow–up into parts which correspond to these age bands and calendar periods, as above.
  
  Before calculating the expected number of cases it is necessary to add the reference rates to the expanded data with
```
sort _year sex _age
merge _year sex _age using popmort
```
  This is a matched merge on age band and calendar period and will add the appropriate survival probability to each record. The system variable _merge takes the following values:
  
  1– record in the master file but no match in popmort 2– record in popmort but no match in the master file 3– record in the master file with a match in popmort
```
tab _merge
```
  should show mostly 3’s with some 2’s but no 1’s. You can now drop the records with no match in the master file and the system variable
```
drop if _merge==2
drop _merge
```
- The mortality rates for the standard population are derived by transforming the survival probabilities
```
gen mortrate=(-ln(prob))
```
  and to calculate the expected number of cases, multiply the follow-up time for each record by the reference rate for that record
```
gen e=_y*mortrate
list id e _d in 1/15
```
- The SMR is the ratio of the total observed cases to the total number expected. The total numbers are obtained through
```
egen obs=total(_d)
egen exp=total(e)
```
  from which the manually calculated SMR with corresponding 95% confidence interval are obtained (preserve and restore are used to speed up the processing)
```
preserve
keep in 1
gen SMR = obs/exp
gen LL = ( 0.5*invchi2(2*obs, 0.025)) / exp
gen UL = ( 0.5*invchi2(2*(obs+1), 0.975)) / exp

display "SMR(95%CI)=" round(SMR,.001) " ///
        (" round(LL,.001) ":" round(UL,.001) ")"
restore
```
  An easier approach is to let the strate command perform these calculations for us (after first splitting and merging in the standard rates).
```
strate, smr(mortrate)
```
- To calculate the SMR for the different stages, try
```
strate stage, smr(mortrate)
```
Summary\ The following commands can be used to calculate an SMR for the melanoma patients:
```
use melanoma, clear
stset exit, fail(status == 1 2) origin(bdate) entry(dx) scale(365.25) id(id)
stsplit _age, at(0(1)110) trim
stsplit _year, after(time=d(1/1/1900)) at(70(1)100) trim
replace _year=1900+_year
sort _year sex _age
merge _year sex _age using popmort
drop if _merge==2
gen mortrate=-ln(prob)
strate, smr(mortrate)
```
Localised melanoma: Generating and analysing a nested case-control study\

The data on patients diagnosd with localised melanoma were obtained from the Finnish cancer registry. We might be interested in studying additional prognostic factors not available in the cancer registry data. For example, we might wish to extract additional information from the medical records (e.g., treatment) or genotype archived tumour samples. A nested case control design will restrict the number of individuals for whom we need to collect additional information. We will first fit a model to the full cohort and then generate a nested case-control study and fit the same model.

We will first analyse the full cohort using a Cox model. We will only study the first 10 years subsequent to diagnosis.
```
use melanoma, clear
keep if stage == 1
stset surv_mm, failure(status==1) id(id) exit(time 120)

/* Cox model to full cohort */
stcox sex year8594 i.agegrp
```
1. How many individuals are in our study? That is, if we were collecting additional information, on how many individuals would we need to collect it?
2. How many experience the event (i.e., death due to cancer).
3. Now generate a nested case-control study with 1 control per case and fit a model with the same explanatory variables.
```
sttocc, n(1)
clogit _case sex year8594 i.agegrp, group(_set) or
```
4. How many unique individuals are in our study? That is, if we were collecting additional information, on how many individuals would we need to collect it?
5. Compare the estimated parameters and standard errors between the full cohort analysis and the nested case-control study. What is the relative efficiency (ratio of variances) of the nested case-control compared to the full cohort design?
6. If you are interested in exploring this further you can repeat the exercise. That is, generate another nested case-control study and analyse it. If you did this a large number of times, and plotted a histogram of the parameter estimates, you will find that the parameter estimates from the nested case-control study have a symmetric distribution with mean equal to the the full-cohort estimate. In the solutions you can find Stata code for looping.
Modelling cause-specific mortality using flexible parametric models [locmel:csr:fpm]\

We will now fit some models with the linear predictor on the log cumulative hazard scale using flexible parametric survival models (Royston-Parmar models).

Load the Melanoma data and refit the Cox model to use as a comparison.
```
. use melanoma, clear
. keep if stage == 1
. stset surv_mm, failure(status==1) exit(time 120) scale(12)
. stcox year8594,
```
1. Fit a flexible parametric survival model on the log cumulative hazard scale with 4 degrees of freedom for the baseline.
```
. stpm2 year8594, scale(hazard) df(4) eform
```
  Compare the estimated hazard ratio, 95% confidence interval and statistical significance to the Cox model.
2. Obtain predicted values of the survival and hazard functions and plot these functions by calendar period.
```
. predict s1, survival
. predict h1, hazard per(1000)

. twoway  (line s1 _t if year8594 == 0, sort) ///
          (line s1 _t if year8594 == 1, sort) ///
          , legend(order(1 "1975-1984" 2 "1985-1994") ring(0) pos(1) col(1)) ///
          xtitle("Time since diagnosis (years)")

. twoway  (line h1 _t if year8594 == 0, sort) ///
          (line h1 _t if year8594 == 1, sort) ///
          , legend(order(1 "1975-1984" 2 "1985-1994") ring(0) pos(1) col(1)) ///
          xtitle("Time since diagnosis (years)")
```
3. Add the option yscale(log) to the hazard plot to display the hazard function on the log scale. Why is the difference between the two lines constant over the time scale?
4. Note that there are 4 _rcs terms because of the df(4) option. We can investigate more or less degrees of freedom for the baseline. It is easiest to do this in a loop.
```
forvalues i = 1/6 {
  stpm2 year8594, scale(hazard) df(`i') eform
  estimates store df`i'
  predict h`i', hazard per(1000)
  predict s`i', survival
}
```
  Compare the hazard ratios, AIC and BIC from the different models. On the following page you can find a brief description of the AIC and BIC.
```
. estimates table df*, eq(1) keep(year8594) se stats(AIC BIC)
```
  According to the AIC and BIC how many degrees of freedom should be used for the baseline? Does it matter for the interpretation of the the estimated hazard ratio?
  
  About AIC and BIC
  
  AIC (Akaike information criterion) and BIC (Bayesian information criterion) are two popular measures for comparing the relative goodness-of-fit of statistical models. The AIC and BIC are defined as:\ $AIC = -2 \ln(\mbox{likelihood}) + 2 k$\ $BIC = -2 \ln(\mbox{likelihood}) + \ln(N) k$\ where $k$ = number of parameters estimated and $N$ = number of observations.
  
  Given a set of candidate models for the data, the preferred model is the one with the minimum AIC/BIC value. Hence, the measures not only reward goodness of fit, but also include a penalty that is an increasing function of the number of estimated parameters. AIC uses a fixed constant, 2, in the penalty term whereas the penalty in BIC is a function of the number of observations. It is not always obvious how ‘number of observations’ should be defined for time-to-event data, particularly for grouped or split data. Volinsky and Raftery (2000) suggest using the number of events for $N$ in the BIC penalty term for survival models. The estimates stats command contains an option n(#) for specifying $N$. Which measure – AIC or BIC – is best?
  
  In many circumstances both the AIC and BIC will suggest the same model. For population-based survival data, the number of observations is large so BIC will penalize models with additional parameters more strongly than AIC. We suggest using the AIC rather than BIC.
5. Compare the estimated baseline survival and hazard functions for the models with varying degrees of freedom.
6. Now include sex and age (in categories) in the model.
```
. stpm2 i.sex year8594 i.agegrp, df(4) scale(hazard) eform
. estimates store ph
```
  Compare the estimates to those obtained in question 9(c). Perform a Wald test (use the test command) for the overall effect of age and compare it to that obtained for the Cox model.
7. Explain why the estimates from the Cox model and the flexible parametric model are so similar.
8. We will now extend the model to allow the effect of age group to be time-dependent using 2 degrees of freedom for each age category. Note that you can’t use Stata’s factor variables in the tvc() option, so create your own dummy variables.
```
. tab agegrp, gen(agegrp)
. stpm2 i.sex year8594 agegrp2-agegrp4, df(4) scale(hazard) ///
        tvc(agegrp2 agegrp3 agegrp4) dftvc(2)
. estimates store nonph
```
  Perform a likelihood ratio test comparing the proportional hazards model with the non-proportional hazards (for age) model. Is there evidence of a non-proportional effect?
```
. lrtest ph nonph
```
9. Predict and plot the baseline hazard function for this model. For which group of patients (i.e., which covariate values) does the baseline hazard apply?
```
. predict h0, hazard zeros ci
. line h0 _t, sort
```
10. Obtain a prediction of the hazard ratio as a function of time for each age group.
```
. predict hr2, hrnumerator(agegrp2 1) ci
. predict hr3, hrnumerator(agegrp3 1) ci
. predict hr4, hrnumerator(agegrp4 1) ci
```
  Plot these hazard ratios versus follow-up time on the same graph. What happens to the hazard ratios as follow-up time increases? Also plot the hazard ratio for the oldest group with a 95% confidence interval. Explain why the hazard ratio for the largest age group is so high early on in the time-scale (hint: look at the baseline hazard)
11. Obtain and plot with 95% confidence intervals the difference in the hazard rates between the oldest and youngest agegroups for males in 1975-1984.
```
. predict hdiff4, hdiff1(agegrp4 1) ci per(1000)
. twoway (rarea hdiff4_lci hdiff4_uci _t, sort) ///
    (line hdiff4 _t, sort) ///
    ,legend(off)  ///
    xtitle("Time since diagnosis (years)") ///
    ytitle("Difference in hazard rate")
```
  Explain why the hazard difference is small early on in the time-scale, when the hazard ratio is at is greatest.
12. Obtain and plot with 95% confidence intervals the difference in the survival functions between the oldest and youngest agegroups for females diagnosed in 1985-1994.
```
. predict sdiff4, sdiff1(agegrp4 1 sex 2 year8594 1) ///
                sdiff2(agegrp4 0 sex 2 year8594 1) ci
. twoway (rarea sdiff4_lci sdiff4_uci _t, sort) ///
    (line sdiff4 _t, sort) ///
    ,legend(off)  ///
    xtitle("Time since diagnosis (years)") ///
    ytitle("Difference in survival functions")
```
13. Fit models with 1, 2 and 3 df for the time-dependent effect of age. Use the AIC and BIC to compare models. Compare the estimated time-dependent hazard ratio for the oldest age group compared to the youngest (also compare the 95% confidence intervals). You may want to exclude the first month from you plot.
Probability of death in a competing risks framework (cause-specific survival) [locmel:rsr:fpm]\

This question gives an introduction to some of the methods available for competing risks analyses. We will be estimating similar quantities to q34 and q38 but we are now working in a cause-specific survival framework. To carry out the exercises you will need to install some user-written commands from within Stata. The stcompet command estimates the cumulative incidence function (CIF) non-parametrically. The stcompadj command estimates the CIF using a competing risks analogue of the Cox model. Finally, the stpm2cif command estimates the CIF through postestimation after fitting a flexible parametric model.
1. Load the melanoma data. If you summarize status you will notice that there are deaths from both cancer and other causes. Plot the complement of the Kaplan-Meier estimate (i.e. 1 minus Kaplan-Meier survival estimate) for both cancer and other causes. Describe what you see.
  
  A common confusion when competing risks are present is to think that the probability of death from cancer can be obtained by taking the complement of the Kaplan-Meier estimate (1-KM). By doing this we treat deaths from other causes as censored. This then assumes that the patients dying from other causes would have been at no systematically higher or lower risk of dying from cancer (independent censoring). Patients that die due to competing causes can no longer die from cancer which essentially violates this independence assumption. Therefore, the resulting estimates would not be interpretable as the probability of death from cancer.
  
  The appropriate estimate for the “real world” probability of death from cancer when competing risks are present is the cumulative incidence function. That is the proportion of patients that have died from cancer at a certain time in the follow-up period taking into account competing causes of death.
2. Use the stcompet command to estimate the cumulative incidence function for both cancer and other causes. Plot the cumulative incidence functions along with the complements of the Kaplan-Meier estimates from part (a). What do you notice?
```
. stset surv_mm, failure(status==1) scale(12)
. stcompet CIF=ci, compet1(2)
. gen CIFcancer=CIF if status==1
. gen CIFother=CIF if status==2
```
3. We now want to consider a competing risks analysis that will take into account covariate effects. Use the stcompadj command to estimate the cumulative incidence function for cancer and other causes taking into account the effect of sex. First we need to generate a binary variable for females. We shall assume that the effect of females is the same for both cancer and other causes. You will need to run the command twice in order to evaluate the cumulative incidence functions for each sex. Plot the cumulative incidence functions for cancer and other causes for both males and females. Are there any differences between males and females?
```
. gen female=sex==2
. stset surv_mm, failure(status==1) scale(12)
. stcompadj female=0, compet(2) gen(CIFcancermale CIFothermale)
. stcompadj female=1, compet(2) gen(CIFcancerfemale CIFotherfemale)
```
4. We will now consider the same model but using the flexible parametric approach. In order to do this we will first need to expand the data set so that each patient has two rows of data - one for each cause of death.
  1. Expand the data and have a look at the new data set.
```
. expand 2
. bysort id: gen cause=_n
. gen cancer=(cause==1)
. gen other=(cause==2)
```
  2. Fit a flexible parametric model for cancer and other causes simultaneously. Include sex as a covariate assuming that the effect of sex is the same for both cancer and other causes. Interpret the effect of sex.
```
. stset surv_mm, failure(event) scale(12)
. stpm2 cancer other female, scale(hazard) ///
  rcsbaseoff dftvc(3) nocons tvc(cancer other) eform nolog
```
    By including the two cause indicators (cancer and other) as both main effects and time-dependent effects (using tvc option) we have fitted a stratified model with two separate baselines, one for each cause. For this reason we have used the rcsbaseoff option together with the nocons option which excludes the baseline hazard from the model.
  3. Use the stpm2cif postestimation command to obtain the cumulative incidence functions for cancer and other causes for each sex. You will need to run this command twice - once for each sex. Notice that a new time variable is generated called _newt. You will need to use this instead of _t in your plots. Do the results look the same as the ones from the Cox model approach?
```
. stpm2cif cancermale othermale, cause1(cancer 1) ///
  cause2(other 1)
. stpm2cif cancerfemale otherfemale, cause1(cancer 1 female 1) ///
  cause2(other 1 female 1)
```
  4. Think about an alternative way to present the results. Try stacking the cumulative incidence functions for cancer and other causes.
5. So far the model only included one covariate (male/female). We now want to adjust the model for age. However, we don’t believe that the effect of age is the same for both cancer and other causes of death.
  1. To allow the effect of age to vary for the two causes create interaction terms between age group and the causes of death.
```
. gen age0ca=(agegrp==0 & cancer==1)
. gen age1ca=(agegrp==1 & cancer==1)
. gen age2ca=(agegrp==2 & cancer==1)
. gen age3ca=(agegrp==3 & cancer==1)
. gen age0oth=(agegrp==0 & other==1)
. gen age1oth=(agegrp==1 & other==1)
. gen age2oth=(agegrp==2 & other==1)
. gen age3oth=(agegrp==3 & other==1)
```
  2. Fit a flexible parametric model including sex and the interaction terms between age group and cause.
```
. stpm2 cancer other female age1ca age2ca age3ca ///
        age1oth age2oth age3oth , scale(hazard) ///
        rcsbaseoff dftvc(3) nocons tvc(cancer other) eform nolog
```
  3. Now incorporate sex as a time-dependent effect in the model.
```
. stpm2 cancer other female age1ca age2ca age3ca ///
        age1oth age2oth age3oth , scale(hazard) ///
        rcsbaseoff dftvc(3) nocons tvc(cancer other female) eform nolog
```
  4. Estimate the cumulative incidence functions for each sex and for the age groups 0-44 and 75+. Plot the cumulative incidence functions against time.
```
. stpm2cif cancermaleage0 othermaleage0, ///
  cause1(cancer 1) cause2(other 1)

. stpm2cif cancermaleage3 othersex1age3, ///
  cause1(cancer 1 age3ca 1) cause2(other 1 age3oth 1)

. stpm2cif cancerfemaleage0 othefemaleage0, ///
  cause1(cancer 1 female 1) cause2(other 1 female 1)

. stpm2cif cancerfemaleage3 otherfemaleage3, ///
  cause1(cancer 1 age3ca 1 female 1) cause2(other 1 age3oth 1 female 1)
```