Validating published prognostic models

knitr::opts_chunk$set(echo = FALSE)

devtools::load_all()
library(tidyverse)
library(patchwork)
theme_set(cowplot::theme_cowplot())
data <- read_data_for_analysis()
at_admission <- data$patient_data %>%
  mutate(alive_12 = case_when(
    last_record >= 12 ~ TRUE,
    outcome == "Discharged" ~ TRUE,
    outcome == "Death" ~ FALSE,
    TRUE ~ NA),
    alive_30 = case_when(
    last_record >= 30 ~ TRUE,
    outcome == "Discharged" ~ TRUE,
    outcome == "Death" ~ FALSE,
    TRUE ~ NA)
  )

In all of the following we focus on the area under curve (AUC), also often called C-statistic. We compute AUC using the pROC package [@http://zotero.org/users/5567156/items/UN6BIC67], which also reports bootstrapped confidence intervals.

For all risk scores we look at 12-day mortality and 30-day mortality - simply because those are the time horizons reported in the studies we looked at. When needed or when it is uncertain how the outcome was coded in the original study, we can also look at "indefinite horizon" where we only include patients that were not censored and had a known outcome.

ACP

The ACP index [@http://zotero.org/users/5567156/items/EN4GICJ6] is a simple score that categorizes the patients into 3 categories based on age and C-reactive protein (CRP): grade 1 (age $<$ 60 years and CRP $<$ 34 mg/L), grade 2 (age $\geq$ 60 years and CRP $<$ 34 mg/L or age $<$ 60 years and CRP $\geq$ 34 mg/L), grade 3 (age $\geq$ 60 years and CRP $\geq$ 34 mg/L) and was evaluated primarily on 12-day mortality. Despite the score being developed on patients in Wuhan, China, the 12-day mortality in our dataset is in rough agreement with the values in the original manuscript as shown in Table \@ref(tab:acpresults).

acp_data <- data$marker_data_wide %>%  
  group_by(patient_id) %>%
  filter(any(!is.na(CRP))) %>%
  summarise(
    first_CRP_day = min(day[!is.na(CRP)]),
    first_CRP = CRP[day == first_CRP_day],
    worst_breathing_12 = max(breathing[day >= first_CRP_day & day <= first_CRP_day + 12]),
    .groups = "drop") %>% 
  inner_join(data$patient_data, by = c("patient_id")) %>%
  filter(outcome %in% c("Death","Discharged") | last_record - first_CRP_day >= 12) %>%
  mutate(    
    alive_12 = case_when(last_record >= first_CRP_day + 12 | outcome == "Discharged" ~ TRUE,
                         outcome == "Death" ~ FALSE,
                         TRUE ~ NA),
    alive_30 = case_when(last_record >= first_CRP_day + 30 | outcome == "Discharged" ~ TRUE,
                         outcome == "Death" ~ FALSE,
                         TRUE ~ NA),
    ACP_grade = case_when(
       age < 60 & first_CRP < 34 ~ 1,
       age >= 60 & first_CRP >= 34 ~ 3,
       TRUE ~ 2
    ) 
  )

acp_data %>%
  filter(!is.na(alive_12)) %>%
  group_by(ACP_grade) %>%
  summarise(`patients` = n(), `12-day mortality` = sum_percent(!alive_12), .groups = "drop") %>%
  arrange(ACP_grade) %>%
  mutate(`Li et al. 95% CI` = c("0%", "0 - 11.3%", "19.8 - 44.3%")) %>%
  rename(`ACP grade` = ACP_grade) %>%
  knitr::kable(booktabs = TRUE, caption = "Comparing the 12-day mortality based on ACP score (where available) with the 95\\% confidence intervals (CI) for 12-day mortality reported by Lu et al.")

# acp_data %>%
#   group_by(ACP_grade, worst_breathing_12) %>%
#   summarise(mean(alive_12))

The original authors do not report AUC (C-statistic), but they report the overall counts of alive and death patients, letting us compute it. The counts are:

auc_data_from_counts <- function(score_values, positive_counts, negative_counts) {
  res <- list()
  for(i in 1:length(score_values)) {
    res[[i]] <- rbind(
      data.frame(score = rep(score_values[i], positive_counts[i]), outcome = rep(TRUE, positive_counts[i])),
      data.frame(score = rep(score_values[i], negative_counts[i]), outcome = rep(FALSE, negative_counts[i]))
    )
  }
  do.call(rbind, res)
}

#Grade 1, 0% mortality,Grade 2, 5.6% mortality, Grade 3, 33.2% mortality
li_et_al_data <-  auc_data_from_counts(c(1,2,3), 
                                            positive_counts = c(194, 145-8, 95-32), 
                                            negative_counts = c(0, 8, 32)) %>%
                         transmute(alive_12 = outcome, ACP_grade = score)

li_et_al_data %>% group_by(ACP_grade, alive_12) %>% summarise(N = n(), .groups = "drop")

Giving us this AUC:

acp_auc_orig <- my_auc_summary(alive_12 ~ ACP_grade, li_et_al_data)
acp_auc_orig %>% transmute(AUC = format(auc, digits = 3), `95% CI` = paste0("(",format(auc_low, digits = 3), ", ", format(auc_high, digits = 3), ")")) %>% knitr::kable(booktabs = TRUE)

So we can now compare AUC computed on our data for both 12-day mortality (as in the original study) and 30-day mortality.

acp_auc <- rbind(
  my_auc_summary(alive_12 ~ ACP_grade, acp_data),
  my_auc_summary(alive_30 ~ ACP_grade, acp_data)
)
compare_auc_to_orig(acp_auc_orig, acp_auc)

We see that the score noticeably underperforms in the present dataset.

Chen and Liu

Chen and Liu [@http://zotero.org/users/5567156/items/4DDUVEKS] propose a score based on a logistic regression model as follows.

$$ Score = -4.211+0.013 \times CRP+0.059 \times Age +0.112 \times DD -1.984 \times LYM $$

Where CRP is C-reactive protein (mg/l) DD is D-dimer (the paper does not have units, but the associated web app is clear that it expects D-dimer as ug/ml FEU) and LYM is the Lympohocyte count (10^9 / L). The paper states:

patients who were with >10% missing values, stayed in the hospital < 7 days, afflicted by a severe disease before admission (e.g., cancer, aplastic anemia, and uremia), were unconscious at admission, and were directly admitted to the intensive care unit (ICU) were excluded.

Non-normally distributed continuous variables were transformed using a Box-Cox transformation.

However, reverse engineering the asociated web app (https://phenomics.fudan.edu.cn/risk_scores/), no transform appears to have been done to the variables that were finally selected. It is also slightly unclear how they computed the outcome (i.e. how they treated censoring).

The units in our dataset are:

data$marker_data %>% filter(marker %in% c("CRP", "d_dimer", "lymphocyte_count")) %>%
  select(marker, unit) %>% distinct()

Which matches the model except for D-dimer (we use ng/ml DDU, FEU value is double the DDU).

# More precise coefficient values and the CI range determined by reverse engineering the webapp

chen_liu_data <- data$marker_data_wide %>%  
  mutate(has_all_values = !is.na(CRP) & !is.na(d_dimer) & !is.na(lymphocyte_count),
         d_dimer_converted = d_dimer * 2 / 1000) %>%
  group_by(patient_id) %>%
  filter(any(has_all_values)) %>%
  summarise(
    first_day = min(day[has_all_values]),
    first_CRP = CRP[day == first_day],
    first_d_dimer_converted = d_dimer_converted[day == first_day],
    first_lymphocyte_count = lymphocyte_count[day == first_day],
    breathing_first = breathing[day == first_day],
    .groups = "drop") %>% 
  inner_join(at_admission, by = c("patient_id")) %>%
  mutate(    
    inclusion_criterium = breathing_first < "NIPPV" & (outcome %in% c("Discharged", "Death") | last_record >= first_day + 7),
    chen_liu = -4.21167 + 0.058699 * age + 0.013391 * first_CRP + 0.112055 * first_d_dimer_converted - 1.983733 * first_lymphocyte_count,
    risk = 1 / (1 + exp(-chen_liu)),
    risk_lower = 1 / (1 + exp(-chen_liu + 1.96*0.1721)),
    risk_upper = 1 / (1 + exp(-chen_liu - 1.96*0.1721)),
    alive_12_data = case_when(last_record >= first_day + 12 | outcome == "Discharged" ~ TRUE,
                         outcome == "Death" ~ FALSE,
                         TRUE ~ NA),
    alive_30_data = case_when(last_record >= first_day + 30 | outcome == "Discharged" ~ TRUE,
                         outcome == "Death" ~ FALSE,
                         TRUE ~ NA),
    alive_all = case_when(outcome == "Discharged" ~ TRUE,
                         outcome == "Death" ~ FALSE,
                         TRUE ~ NA)

  ) %>% select(age, starts_with("first"), chen_liu, starts_with("risk"), breathing_first, starts_with("alive"), inclusion_criterium)

A plot of the computed values for our dataset.

chen_liu_data %>% ggplot(aes(x = chen_liu, y = alive_30)) + geom_point()

We look at survival since first day when all of the values required for the score are available, or we ignore when the values were computed and look at survival since admission. Finally we also tried emulating the inclusion criteria from Chen & Lie

Chen & Liu report AUC = 0.881 for the validation dataset and don't provide enough information to let us recompute it with credible intervals.

chen_liu_data_inclusion <- chen_liu_data %>% filter(inclusion_criterium)
chen_liu_auc <- rbind(
  my_auc_summary(alive_12 ~ chen_liu, chen_liu_data),
  my_auc_summary(alive_30 ~ chen_liu, chen_liu_data),
  my_auc_summary(alive_12_data ~ chen_liu, chen_liu_data),
  my_auc_summary(alive_30_data ~ chen_liu, chen_liu_data),
  my_auc_summary(alive_all ~ chen_liu, chen_liu_data),
  my_auc_summary(alive_12 ~ chen_liu, chen_liu_data_inclusion, "incl"),
  my_auc_summary(alive_30 ~ chen_liu, chen_liu_data_inclusion, "incl"),
  my_auc_summary(alive_12_data ~ chen_liu, chen_liu_data_inclusion, "incl"),
  my_auc_summary(alive_30_data ~ chen_liu, chen_liu_data_inclusion, "incl"),
  my_auc_summary(alive_all ~ chen_liu, chen_liu_data_inclusion, "incl")
)

#plot_all_auc(chen_liu_auc)
chen_liu_auc_orig <- my_auc_summary_from_estimate(0.881, "alive", "chen_liu", "incl")
compare_auc_to_orig(chen_liu_auc_orig, chen_liu_auc) 

Shi et al.

Shi et al. [@http://zotero.org/users/5567156/items/X4GWJIP5] report a simple score: 1 point for Age >= 50, Male and Hypertension each. They try to predict "severe" illness, but it is not clear what "severe" means. Here, we test both "not ventilated" (or worse, including death) and "not needing supplementary oxygen" as proxy for "severe". They also create a risk score for both "severe" on admission and "severe" during hospitalization. As for other scores we also look at 12- and 30- day mortality.

shi_et_al_data <- data$marker_data_wide %>%  
  group_by(patient_id) %>%
  summarise(
    breathing_admission = max(breathing[!is.na(breathing) & day < 2]),
    not_ventilated_admission = breathing_admission < "NIPPV",
    not_oxygen_admission = breathing_admission < "Oxygen",
    .groups = "drop") %>% 
  inner_join(at_admission, by = c("patient_id")) %>%
  mutate(shi = (age >= 50) + (sex == "M") + has_hypertension_drugs,
         not_ventilated_all = case_when(
           outcome == "Death" ~ FALSE,
           outcome != "Discharged" ~ NA,
           worst_breathing < "NIPPV" ~ TRUE
         ),
         not_oxygen_all = case_when(
           outcome == "Death" ~ FALSE,
           outcome != "Discharged" ~ NA,
           worst_breathing < "Oxygen" ~ TRUE
         )
         )

shi_et_al_auc <- rbind(
  my_auc_summary(alive_12 ~ shi, shi_et_al_data),
  my_auc_summary(alive_30 ~ shi, shi_et_al_data),
  my_auc_summary(not_ventilated_all ~ shi, shi_et_al_data),
  my_auc_summary(not_oxygen_all ~ shi, shi_et_al_data),
  my_auc_summary(not_ventilated_admission ~ shi, shi_et_al_data, subgroup = " (admission)"),
  my_auc_summary(not_oxygen_admission ~ shi, shi_et_al_data, subgroup = " (admission)")
)

#plot_all_auc(shi_et_al_auc)

We extract the original data from Fig1 B,C using WebPlotDigitizer.

shi_et_al_reported <- data.frame(
  score = c(0,1,2,3), 
  shi_severe_admission = c(0,0.057,0.19,0.4), shi_severe_all = c(0.083, 0.138,0.389,0.429),
  mild_cases_admission = c(118, 199, 98, 21), all_cases_admission = c(118, 211, 121, 35),
  mild_cases_hospitalization = c(11, 25, 11, 4), all_cases_hospitalization = c(12, 29, 18, 7)
) 

shi_et_al_reported %>% select(-starts_with("shi_")) %>% knitr::kable(booktabs = TRUE)
# Check that the proportions match
# shi_et_al_reported %>% mutate(sev2 = (all_cases_admission - mild_cases_admission)/all_cases_admission) %>%
#   select(score, shi_severe_admission,sev2)
# 
# shi_et_al_reported %>% mutate(sev2 = (all_cases_hospitalization - mild_cases_hospitalization)/all_cases_hospitalization) %>%
#   select(score, shi_severe_all,sev2)
shie_et_al_auc_data_admission <- auc_data_from_counts(score_values = shi_et_al_reported$score,
                                                      positive_counts = shi_et_al_reported$mild_cases_admission,
                                                      negative_counts = shi_et_al_reported$all_cases_admission - shi_et_al_reported$mild_cases_admission) %>% 
  transmute(shi = score, severe_admission = outcome)

shi_et_al_auc_reported_admission <- my_auc_summary(severe_admission ~ shi, shie_et_al_auc_data_admission, subgroup = " (admission)")
#shi_et_al_auc_reported_admission$summary

shie_et_al_auc_data_all <- auc_data_from_counts(score_values = shi_et_al_reported$score,
                                                      positive_counts = shi_et_al_reported$mild_cases_hospitalization,
                                                      negative_counts = shi_et_al_reported$all_cases_hospitalization - shi_et_al_reported$mild_cases_hospitalization) %>% 
  transmute(shi = score, severe_hospitalized = outcome)

shi_et_al_auc_reported_all <- my_auc_summary(severe_hospitalized ~ shi, shie_et_al_auc_data_all)

Here is the comparison to original data.

compare_auc_to_orig(rbind(shi_et_al_auc_reported_admission, shi_et_al_auc_reported_all), shi_et_al_auc, nrow = 1)

We see that the score consistently underperforms, although some of the CIs are quite wide.

We can also compare the ratios of "severe" cases reported by Shi et al. to ratios in our dataset.

shi_et_al_compare <- shi_et_al_data %>%
  group_by(shi) %>%
  summarise(
    across(
      one_of(c("not_ventilated_admission", "not_ventilated_all", "not_oxygen_admission", "not_oxygen_all")), 
      ~ mean(.x, na.rm = TRUE), .names = "our_{col}"), 
    n_our_admission = sum(!is.na(not_ventilated_admission)), 
    n_our_all = sum(!is.na(not_ventilated_all)),
    .groups = "drop") %>%
  inner_join(shi_et_al_reported, by = c("shi" = "score"))

shi_et_al_compare %>%
  transmute(score = shi, `Shi et al. (admission)` = shi_severe_admission, `Ventilated (admission)` = 1 - our_not_ventilated_admission, `Oxygen (admission)` = 1 - our_not_oxygen_admission, N = n_our_admission) %>% knitr::kable(booktabs = TRUE)

shi_et_al_compare %>%
  transmute(score = shi, `Shi et al. (hospitalization)` = shi_severe_all, `Ventilated (hospitalization)` = 1 - our_not_ventilated_all, `Oxygen (hospitalization)` = 1 - our_not_oxygen_all, N = n_our_all) %>% knitr::kable(booktabs = TRUE)

Caramelo et al.

Caramelo et al. [@http://zotero.org/users/5567156/items/3A2RN4ZS] uses simulations to get patient-level characterstics from aggregate data and then fit a logistic regression model. We use the coefficients they report for the model (the model can't be used directly, because the intercept is not reported). The covariates in the model are age (by decades), sex, presence of hypertension, presence of diabetes, presence of cardiac disease, presence of chronic respiratory disease and presence of cancer. It is unclear whether NYHA > 1 would be considered "cardiac disease" and similarly whether asthma is chronic respiratory disease, so we compute two versions of the score. Our dataset also does not contain cancer, so we ignore it (cancer has one of the lowest reported OR in the model).

Caramelo et al. have not computed AUC in their paper and do not give us any mean to reconstruct it, so we only present our results for the risk score.

caramelo_data <- at_admission %>%
  mutate(
    caramelo_base = 0 + 
           case_when(age < 20 ~ 0,
                     age < 30 ~ 0.2017,
                     age < 40 ~ 0.3271,
                     age < 50 ~ 5.6030,
                     age < 60 ~ 6.7626,
                     age < 70 ~ 18.8161,
                     age < 80 ~ 43.7291,
                     TRUE ~ 86.8680) +
           if_else(sex == "M", 1.8518, 0) +
           if_else(has_hypertension_drugs, 7.4191, 0) +
           if_else(diabetes, 9.0329, 0),
    caramelo1 = caramelo_base +            
           if_else(heart_failure | ischemic_heart_disease | NYHA > 1, 12.8328, 0) +
           if_else(COPD | asthma, 7.7925, 0),
    caramelo2 = caramelo_base +            
           if_else(heart_failure | ischemic_heart_disease, 12.8328, 0) +
           if_else(COPD, 7.7925, 0)
         )

caramelo_auc <- rbind(
  my_auc_summary(alive_12 ~ caramelo1, caramelo_data),
  my_auc_summary(alive_30 ~ caramelo1, caramelo_data),
  my_auc_summary(alive_12 ~ caramelo2, caramelo_data, note = "v2"),
  my_auc_summary(alive_30 ~ caramelo2, caramelo_data, note = "v2"),
  my_auc_summary(alive_12 ~ caramelo_base, caramelo_data, note = "base"),
  my_auc_summary(alive_30 ~ caramelo_base, caramelo_data, note = "base")
)

compare_auc_to_orig(tibble(), caramelo_auc)

The results look somewhat encouraging for the score - the AUC values float around 0.8.

Bello-Chavolla et al.

The index given in [@http://zotero.org/users/5567156/items/INZVABQI] is built from following elements:

bello_chavolla_table <- rbind(
  data.frame(factor = "Age >= 65 years", score = 3),
  data.frame(factor = "Diabetes", score = 1),
  data.frame(factor = "Diabetes*Age < 40 years", score = 5),
  data.frame(factor = "Age < 40 years", score = -6),
  data.frame(factor = "Obesity", score = 1),
  data.frame(factor = "Pneumonia", score = 7),
  data.frame(factor = "Chronic Kidney Disease", score = 3),
  data.frame(factor = "COPD", score = 1),
  data.frame(factor = "Immunosuppression", score = 1)
)

bello_chavolla_table

Of those, our data does not contain Immunosuppresion and Pneumonia. We can however use taking antibiotics within 1 day of hospitalization as an imperfect proxy for pneumonia. We will ignore immunosuppresion in our test. Both of those choices could reduce the performance of the score.

The authors do not define obesity, so we will use BMI > 30 as threshold for obesity. Our data contains an indicator for "Renal disease" which might be classified slightly differently than "Chronic Kidney Disease".

The authors report C-statistic = 0.830 for mortality on validation dataset - it is unclear how they treat censoring, but they mention 30-day mortality in other part of the paper, so we will focus on 30-day mortality.

atb_first_day <- data$marker_data_wide %>% select(patient_id, day, took_antibiotics) %>% 
  group_by(patient_id) %>%
  summarise(atb0 = any(took_antibiotics[day <= 0]), 
            atb1 = any(took_antibiotics[day <= 1]), .groups = "drop")

bello_chavolla_data <- at_admission %>%
  inner_join(atb_first_day, by = "patient_id") %>%
  mutate(bc_base =  3 *  (age >= 65) +
           diabetes +
           5 * (diabetes & age < 40) +
           (-6) * (age < 40) +
           is_obese +
           3 * renal_disease +
           COPD,
         bc0 = bc_base + 7 * (atb0 & admitted_for_covid),
         bc1 = bc_base + 7 * (atb1 & admitted_for_covid)
  )
# bello_chavolla_data %>% ggplot(aes(x = bc0, y = as.integer(alive_30))) +
#   geom_smooth() + geom_jitter(width = 0.3, height = 0.3) 
# bello_chavolla_data %>% ggplot(aes(x = bc_base, y = as.integer(alive_30))) +
#   geom_smooth() + geom_jitter(width = 0.3, height = 0.3) 
bc_auc <- rbind(
  my_auc_summary(alive_30 ~ bc_base, bello_chavolla_data, note = "base"),
  my_auc_summary(alive_12 ~ bc_base, bello_chavolla_data, note = "base"),
  my_auc_summary(alive_30 ~ bc0, bello_chavolla_data, note = "day0"),
  my_auc_summary(alive_12 ~ bc0, bello_chavolla_data, note = "day0"),
  my_auc_summary(alive_30 ~ bc1, bello_chavolla_data, note = "day1"),
  my_auc_summary(alive_12 ~ bc1, bello_chavolla_data, note = "day1")
)

bc_auc_orig <- my_auc_summary_from_estimate(0.830, "alive_30", "bc0")

compare_auc_to_orig(bc_auc_orig, bc_auc)

The main surprise here is that not inferring pneumonia (shown as "base" in the figure) works best. This indicates that our data may not be well suited to the use of the score.

Age only

As a simple baseline we use age in years and the decade of age as the sole predictor and compute AUC for this.

age_only_data <- at_admission %>%
  mutate(age_decade = floor(age / 10))
age_auc <-rbind(
  my_auc_summary(alive_30 ~ age, age_only_data),
  my_auc_summary(alive_12 ~ age, age_only_data),
  my_auc_summary(alive_30 ~ age_decade, age_only_data, note = "decade"),
  my_auc_summary(alive_12 ~ age_decade, age_only_data, note = "decade")
)
age_auc %>% knitr::kable(booktabs = TRUE)

# plot_my_auc(alive_12 ~ age, at_admission)
plot_my_auc(alive_30 ~ age, at_admission)

Summary

all_auc <- rbind(acp_auc, chen_liu_auc, shi_et_al_auc, caramelo_auc, bc_auc, age_auc)
all_auc_orig <- rbind(acp_auc_orig, chen_liu_auc_orig, shi_et_al_auc_reported_admission,
                         shi_et_al_auc_reported_all, bc_auc_orig)

write_csv(all_auc, here::here("manuscript", "all_auc.csv"), na = "")
write_csv(all_auc_orig, here::here("manuscript", "all_auc_orig.csv"), na = "")

compare_auc_to_orig(all_auc_orig,
                    all_auc, show_outcome = FALSE)

We see that the "age only" baseline performs similarly to the best performing score (Caramelo et al.) and likely better than all of the other scores - there is however some uncertainty prohibiting us to be very sure about the ordering. This holds even if we only take the decade of age. Note that the score by (Caramelo et al.) is actually heavily based on the decade of age. Table of the results is shown below as the small differences are not well visible in the plot.

all_auc %>% mutate(score = score_caption[score]) %>%
  group_by(score, subgroup, note) %>% 
  summarise(best_auc = max(auc), .groups = "drop") %>%
  arrange(desc(best_auc)) %>%
  knitr::kable(booktabs = TRUE)


cas-bioinf/covid19retrospective documentation built on Sept. 7, 2021, 6:19 p.m.