Load the main libraries.
needs::needs(danalyze, tidyverse)
The first step is to load some example survival data (in this case, the melanoma data from the 'timereg' package). Some stop times are fractional. To turn this into panel data with unique observations by subject ('id') and time ('start' and 'stop'), the stop time is simply set to the start time plus one. In practice, the data could be expanded to account for variation in the duration of each observation (a sketch of this follows the code below). Clustering the standard errors addresses problems with transforming the data in this way.
# get the survival data from an existing package
# data citation: https://www.rdocumentation.org/packages/timereg/versions/1.3-6/topics/mela.pop
data(mela.pop, package = "timereg")

# the data has fractional stop times, which won't work for our purposes, so just use start + 1
mela.pop$stop2 = mela.pop$start + 1

# set sex to 1/0
mela.pop$sex = if_else(mela.pop$sex == 2, 1, 0)
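As an aside, if one did want to expand the data rather than take the start + 1 shortcut, something like the following could work. This sketch is untested and only illustrative: it assumes that survival::survSplit accepts counting-process (start, stop] input in the installed version of the package, and 'mela.split' is a hypothetical name; the rest of this walkthrough sticks with the shortcut.

# a sketch (not used below): cut each record at integer time boundaries so that every
# row spans at most one time unit -- check the resulting column names before using
mela.split = survival::survSplit(survival::Surv(start, stop, status) ~ .,
                                 data = mela.pop,
                                 cut = seq(0, ceiling(max(mela.pop$stop))))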
Next we set the main formula that we will use in our test.
# set the formula making sure to use our new stop time -- we include rate, which is the
# population offset; this does not make sense as an IV but helps to understand time-varying hazards
f = survival::Surv(start, stop2, status) ~ sex + log1p(age) + rate
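To see the counting-process structure this formula implies, the Surv object can be inspected directly; each row prints as a '(start, stop]' interval, with a '+' marking censored observations.

# peek at the first few intervals implied by the formula
head(survival::Surv(mela.pop$start, mela.pop$stop2, mela.pop$status))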
As a first step, we can run a Cox proportional hazards model and show the results.
# run survival model, clustering on 'id'
m.surv = survival::coxph(formula = f, data = mela.pop, cluster = id, x = T)

# show a summary of the results
summary(m.surv)
The bootstrapped version using the 'danalyze' package is also run. This takes slightly longer (~3 seconds to generate the data replicates and ~10 seconds to run the bootstrap). As before, we cluster on 'id', using a block bootstrap.
# run package version
m.surv.boot = analysis(runs = 1000, formula = f, data = mela.pop, cluster = ~ id)
The results of the Cox model and the results of the 'danalyze' model should be quite similar. We can check both just to make sure this is the case.
# first show the results of the Cox model using the standard summary function
summary(m.surv)

# we can also use the 'danalyze' function which will produce the same results
# get_prediction_frequentist(m.surv)$coefficients
We can check these results against the bootstrap results. The magnitudes of the coefficients are very similar, as are the standard errors. The larger standard error for 'rate' suggests that there may be more heterogeneity in the effect across individuals. The bootstrap version also generates a coefficient for the '(Intercept)', which is used to capture the shape of the baseline hazard.
# then show the results of the bootstrapped model
results(m.surv.boot)$coefficients
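If a single side-by-side table is preferred, something like the sketch below could work. It is only illustrative: it assumes the 'danalyze' coefficient table has one row per term, with the '(Intercept)' first and point estimates in a column named 'c' (mirroring the 'c' columns used for contrasts later); check the actual output structure before relying on it.

# hypothetical side-by-side table -- verify the structure of the 'danalyze' output first
boot.coefs = results(m.surv.boot)$coefficients
tibble(
  term = names(coef(m.surv)),
  cox = unname(coef(m.surv)),
  boot = boot.coefs$c[-1] # drop the '(Intercept)' row, which coxph does not estimate
)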
We can now compare the results of the Cox survival model to the results of a series of linear models. To do this, we first create the new set of dependent variables for the linear models and then run them. Because we are comparing cumulative incidence, we create dependent variables out to 10 time units and generate contrasts for the effect of sex (the difference in cumulative incidence between male and female individuals). If the set of linear models is correct, the modeled contrast of the cumulative incidence curves should be similar across both approaches.
The first step is to set the prediction so that we can estimate the contrast of the cumulative incidence curve. As indicated, we will look first at sex (comparing male vs. female, i.e., 1 vs. 0 after the recoding above); the prediction itself is created below, after the data setup.
We then set up the data to allow the set of linear models to be run.
# a helper function to identify if an event has occurred within X time forward from the current observation
# t = 0 means the current observation, t = 1 means within one time unit forward of the current observation
within_time = function(x, t, outcome = 1) {
  maxl = length(x)
  sapply(1:maxl, function(z) {
    maxt = min(maxl, z + t)

    # get the outcome -- the ordering here matters -- first we check if it happened (return '1')
    # then we check to see if we know it did not happen (return 'NA' if we are not sure or '0' otherwise)
    if(any(na.omit(x[z:maxt]) == outcome)) return(outcome) # we know it happened
    if(z + t > maxt || any(is.na(x[1:maxt]))) return(NA) # these are outcomes we do not know so we set them to 'NA'
    return(0) # we know it did not happen
  })
}

# we will create 10 dependent variables that record the occurrence of the outcome within X time
# ('status_c0' is identical to 'status')
mela.pop = mela.pop %>%
  arrange(id, start, stop2) %>%
  group_by(id) %>%
  mutate(across(status, .fns = list(
    c0 = ~ within_time(.x, 0), c1 = ~ within_time(.x, 1),
    c2 = ~ within_time(.x, 2), c3 = ~ within_time(.x, 3),
    c4 = ~ within_time(.x, 4), c5 = ~ within_time(.x, 5),
    c6 = ~ within_time(.x, 6), c7 = ~ within_time(.x, 7),
    c8 = ~ within_time(.x, 8), c9 = ~ within_time(.x, 9)
  )))
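Because the look-ahead and censoring logic in 'within_time' is easy to get backwards, here is a quick toy illustration (worked through by hand): with t = 1, the event at position three is picked up by both positions two and three, and the last position becomes NA because its one-period window runs past the end of the observed data.

# toy example: event at position three, one-period look-ahead
within_time(c(0, 0, 1, 0, 0), t = 1)
# returns: 0 1 1 0 NA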
Next we run the set of linear models using our newly created dependent variables. Each model produces its own coefficient estimates, which are essentially the time-dependent hazard. To ease interpretation we also run a prediction using sex (1 vs. 0 after the recoding above). The predictions are generated in two formats: one that 'danalyze' can use easily and one that makes it easy to produce predictions from the built-in survival prediction function.
# prediction for danalyze -- very easy to see what is happening!
pr.boot = pr(sex = c(1, 0))

# prediction for built-in survival predict function -- set through 10 subsequent time units
pr.survival = bind_rows(
  tibble(sex = 1, age = mean(mela.pop$age), rate = mean(mela.pop$rate), start = 0, stop2 = 1:10, status = 0),
  tibble(sex = 0, age = mean(mela.pop$age), rate = mean(mela.pop$rate), start = 0, stop2 = 1:10, status = 0)
)
Run the linear models.
# run a linear model (on binomial data)
m.surv.linear = lapply(0:9, function(i) {
  # set the formula
  f.t = update(f, as.formula(paste0("status_c", i, " ~ .")))

  # run the model -- binomial outcome but we are running it using a gaussian family -- could
  # also just use lm but this makes it easier to check what happens when you swap family
  out.t = glm(formula = f.t, data = mela.pop, family = gaussian)

  # get prediction using the prediction we created above -- the SE is pretty large when using
  # just 'id', but that is likely the more accurate assessment
  # this uses a 'danalyze' function
  r.t = get_prediction_frequentist(out.t, cluster = ~ id, predictions = pr.boot)

  # save
  r.t
})
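Before stacking the results, it can help to peek at a single element of the list; each element carries a 'contrasts' slot, which is what gets extracted by name in the next step.

# peek at the contrast for the first horizon ('status_c0')
m.surv.linear[[1]]$contrasts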
We now have the models and the predictions from which we want to generate contrasts. Next we make tidy data.frames for plotting.
# get contrasts from the set of linear models
out.linear = map_dfr(m.surv.linear, "contrasts", .id = ".time")
out.linear$.time = as.numeric(out.linear$.time)

# get the results from the bootstrapped survival model
out.boot = results(object = m.surv.boot, predictions = pr.boot, times = 1:10)$contrasts

# for the sake of comparison we can also plot the predictions (but not easily the contrasts)
# from the standard survival model
out.surv = predict(m.surv, newdata = pr.survival, type = "survival", se.fit = T)

# turn it into cumulative incidence (the CIF common in interpreting competing risks models)
out.surv = tibble(
  .time = c(1:10, 1:10),
  c = 1 - out.surv$fit,
  c.low = c - qnorm(0.975) * out.surv$se.fit,
  c.high = c + qnorm(0.975) * out.surv$se.fit
)

# now combine the contrasts from the set of linear models and the bootstrapped survival model
# set name for easy comparison
out.linear$.type = "Linear Series"
out.boot$.type = "Survival"
out.all = bind_rows(
  out.linear %>% select(.type, .time, c:c.high),
  out.boot %>% select(.type, .time, c:c.high)
)
With the models run and contrasts generated, we can plot the results to visually inspect the similarity (or differences). The standard errors differ slightly and there is some deviation over time, but this is to be expected given that the Cox model assumes proportional hazards while the set of linear models does not. Overall, the results are very close.
# plot -- assuming that '2' meant male and '1' meant female in the original coding
# (so 1 = male, 0 = female after recoding)
ggplot(out.all, aes(x = .time, y = c, ymin = c.low, ymax = c.high)) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "grey80", size = 1.5) +
  geom_ribbon(aes(fill = .type), alpha = 0.4) +
  geom_line(aes(color = .type), size = 1) +
  scale_fill_manual("Model Type", values = c("orangered", "steelblue4")) +
  scale_color_manual("Model Type", values = c("orangered", "steelblue4")) +
  scale_x_continuous(breaks = c(1, 5, 10)) +
  scale_y_continuous(labels = scales::percent_format()) +
  theme_minimal() +
  labs(x = "Time", y = "Change in Cumulative Incidence, Male vs. Female")
We can check the impact of another variable to see if the results are still consistent. The next steps check the effect of 'rate' using this same process: create predictions, run the set of linear models, and then get contrasts.
# new prediction
# we use 'rate' but the same issue will occur for 'age', as the hazards vary with time
# (see below for this assessment) -- all that needs to be done is to change 'rate' to 'age'
pr.boot = pr(rate = create_values(mela.pop$rate))
# -- the 'age' prediction: pr(age = create_values(mela.pop$age))

# run a linear model (on binomial data)
m.surv.linear = lapply(0:9, function(i) {
  # set the formula
  f.t = update(f, as.formula(paste0("status_c", i, " ~ .")))

  # run the model -- binomial outcome but we are running it using a gaussian family
  # could also just use lm but this makes it easier to check what happens when you swap family
  out.t = glm(formula = f.t, data = mela.pop, family = gaussian)

  # this outcome model (run as an lm instead of a glm) can also be saved and used for
  # Blackwell's sensitivity analysis

  # get prediction using the prediction we created above
  # the SE is pretty large when using just 'id', but that is likely the more accurate assessment
  # this uses a 'danalyze' function
  r.t = get_prediction_frequentist(out.t, cluster = ~ id, predictions = pr.boot)

  # save
  r.t
})

# get contrasts from the set of linear models
out.linear = map_dfr(m.surv.linear, "contrasts", .id = ".time")
out.linear$.time = as.numeric(out.linear$.time)

# get the results from the bootstrapped survival model
out.boot = results(object = m.surv.boot, predictions = pr.boot, times = 1:10)$contrasts

# set name for easy comparison
out.linear$.type = "Linear Series"
out.boot$.type = "Survival"
out.all = bind_rows(
  out.linear %>% select(.type, .time, c:c.high),
  out.boot %>% select(.type, .time, c:c.high)
)

# plot again
ggplot(out.all, aes(x = .time, y = c, ymin = c.low, ymax = c.high)) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "grey80", size = 1.5) +
  geom_ribbon(aes(fill = .type), alpha = 0.4) +
  geom_line(aes(color = .type), size = 1) +
  scale_fill_manual("Model Type", values = c("orangered", "steelblue4")) +
  scale_color_manual("Model Type", values = c("orangered", "steelblue4")) +
  scale_x_continuous(breaks = c(1, 5, 10)) +
  scale_y_continuous(labels = scales::percent_format()) +
  theme_minimal() +
  labs(x = "Time", y = "Change in Cumulative Incidence, High vs. Low Rate")
The plot does not look very good now. The survival model produces a smooth, slowly increasing cumulative incidence curve, but the series of linear models produces a curve that increases rapidly until time five and then decreases rapidly. The two curves look little alike.
What happened? One possibility is that the proportional hazards assumption is violated: the survival model outputs a smooth curve because it uses the mean hazard value, whereas the true hazard may change over time. In other words, the series of linear models, which does not assume proportional hazards, is right while the survival model is wrong.
To check the proportional hazards assumption we run the 'cox.zph' function. A significant correlation between time and the scaled Schoenfeld residuals (a positive or negative slope) suggests that the assumption is violated while a horizontal slope indicates the opposite.
survival::cox.zph(m.surv)
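The same object can also be plotted; survival provides a plot method for 'cox.zph' objects that draws the scaled Schoenfeld residuals against time with a smoothed curve, and a roughly flat smooth is consistent with proportional hazards.

# plot the scaled Schoenfeld residuals over time (one panel per covariate)
plot(survival::cox.zph(m.surv))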
None of the P-values are significant ('rate' especially is highly insignificant), which suggests that there might not be a problem. This test is by no means foolproof, so we also use a more sophisticated approach to model possible time-varying hazards directly. Specifically, we use the 'timereg' package, which provides two tests for time-invariant effects: Kolmogorov-Smirnov and Cramér-von Mises. A significant P-value indicates that hazards are time-varying (violating standard assumptions) while an insignificant P-value indicates the opposite. We find that both tests suggest no time-varying effects for 'sex' but significant time-varying effects for 'rate' (and also for 'age'). This more sophisticated test seems to confirm our original intuition: proportional hazards hold for 'sex' but not for 'rate'.
# rerun our main model using 'timereg'
m.surv.timereg = timereg::timecox(formula = f, data = mela.pop, cluster = mela.pop$id)

# check proportional hazards
# this test says that 'sex' is fine but that 'rate' is not, which matches what we find with our linear model
summary(m.surv.timereg)
To make this easier to see, we can also plot the hazards generated by the 'timereg' model over time. This does not take into account the baseline hazard but does show the extreme rise and fall in the 'rate' variable. (It also shows the large increase in the 'age' variable that can be seen in the linear series if 'rate' is changed to 'age' in the prediction.) By contrast, the slope for 'sex' looks more like a straight line (you could stick a straight line through it and sufficiently capture the cumulative hazard).
# get the plot data
plot.timereg = m.surv.timereg$cum %>%
  as_tibble %>%
  mutate(time = as.integer(time)) %>%
  group_by(time) %>%
  summarize(across(everything(), mean)) %>%
  filter(time != 0)

# lengthen
plot.timereg = plot.timereg %>% pivot_longer(-time)

# normalize
plot.timereg = plot.timereg %>% group_by(name) %>% mutate(value = scale(value))

# plot
ggplot(plot.timereg %>% filter(name != "(Intercept)"), aes(x = time, y = value)) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "grey80", size = 1.5) +
  geom_line(aes(color = name), size = 1) +
  scale_color_manual("Variable Name", values = c("orangered", "steelblue4", "darkgreen")) +
  scale_x_continuous(breaks = c(1, 5, 10)) +
  theme_minimal() +
  labs(x = "Time", y = "Scaled Cumulative Coefficient Value")
The results of this more complicated survival model point to time-varying effects for 'age' and 'rate' but not for 'sex.' This matches what we find when using the sequence of linear models. In fact, as demonstrated, one advantage of the sequence of linear models is that it provides a robust platform for estimating cumulative incidence while taking into account time-varying hazards.
Given that the sequence of linear models produces reasonable estimates, it is straightforward to apply Blackwell's (2014) selection bias approach to sensitivity analysis. To do this we need two models: the outcome model and the treatment model. For ease, we will take 'sex' as our treatment and run the analysis for one dependent variable.
To apply this to the series of linear models, the outcome model needs to be rerun once per time period of interest; a sketch of such a loop follows the single-horizon example below.
# the causalsens function doesn't like transforms in the formula, so we just drop all those
f.cs = ~ sex + age + rate

# set outcome model
sens.out = lm(formula = update(f.cs, status_c0 ~ .), data = mela.pop)

# set treatment model (built from 'f.cs' rather than 'f' so the log1p transform is dropped here too)
sens.treat = glm(formula = update(f.cs, sex ~ . - sex), data = mela.pop, family = binomial)

# now run causalsens
sens.main = causalsens::causalsens(model.y = sens.out, model.t = sens.treat,
                                   cov.form = ~ age + rate, data = mela.pop,
                                   alpha = seq(-0.1, 0.1, by = 0.02))

# see what happened
sens.main$sens
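To carry this across the full series, the same two models can simply be refit for each horizon. The sketch below is untested and only illustrative: it drops rows whose horizon-specific outcome was censored to NA so that the outcome and treatment models are fit on the same observations, and 'sens.all' is a hypothetical name.

# a sketch: repeat the sensitivity analysis for each of the 10 horizons
sens.all = lapply(0:9, function(i) {
  # keep rows where the horizon-i outcome is known
  d.t = mela.pop[!is.na(mela.pop[[paste0("status_c", i)]]), ]

  # refit the outcome and treatment models on the same data
  s.out = lm(formula = update(f.cs, as.formula(paste0("status_c", i, " ~ ."))), data = d.t)
  s.treat = glm(formula = update(f.cs, sex ~ . - sex), data = d.t, family = binomial)

  # run the sensitivity analysis for this horizon
  causalsens::causalsens(model.y = s.out, model.t = s.treat, cov.form = ~ age + rate,
                         data = d.t, alpha = seq(-0.1, 0.1, by = 0.02))
})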
Plotting makes it easier to understand exactly what is happening. The existing variables have a max partial R^2 of 0.015, which corresponds to an alpha of about 0.06 (or -0.06, depending on the direction of confounding).
# plot the raw sensitivity results based on alpha
plot(sens.main, type = "raw", bty = "n")

# we could also plot based on r-squared
# plot(sens.main, type = "r.squared", bty = "n")