vignettes/vignette_1_model.md

Vignette: Predictive Model Derivation and Validation

Overview

Predictive modelling is a fantastically useful statistical technique to estimate how likely a patient is to have an event, based on the characteristics of patients similar to them. There are three general scenarios where you’d want to do predictive modelling:

  1. Development of a novel prediction model.

  2. Validation of a novel prediction model.

  3. (External) Validation of a previous prediction model

Load data

Firstly let’s load our example data - we will be using the survival::colon dataset as an example.

data <- tibble::as_tibble(survival::colon) %>%
  dplyr::filter(etype==2) %>% # Outcome of interest is death
  dplyr::filter(rx!="Obs") %>%  # rx will be our binary treatment variable
  dplyr::select(-etype,-study, -status) %>% # Remove superfluous variables

  # Convert into numeric and factor variables
  dplyr::mutate_at(vars(obstruct, perfor, adhere, node4), function(x){factor(x, levels=c(0,1), labels = c("No", "Yes"))}) %>%
  dplyr::mutate(rx = factor(rx),
                mort365 = cut(time, breaks = c(-Inf, 365, Inf), labels = c("Yes", "No")),
                mort365 = factor(mort365, levels = c("No", "Yes")),
                sex = factor(sex, levels=c(0,1), labels = c("Female", "Male")),
                differ = factor(differ, levels = c(1,2,3), labels = c("Well", "Moderate", "Poor")),
                extent = factor(extent, levels = c(1,2,3, 4), labels = c("Submucosa", "Muscle", "Serosa", "Contiguous Structures")),
                surg = factor(surg, levels = c(0,1), labels = c("Short", "Long")))

head(data, 10) %>% knitr::kable()
id rx sex age obstruct perfor adhere nodes differ extent surg node4 time mort365 1 Lev+5FU Male 43 No No No 5 Moderate Serosa Short Yes 1521 No 2 Lev+5FU Male 63 No No No 1 Moderate Serosa Short No 3087 No 4 Lev+5FU Female 66 Yes No No 6 Moderate Serosa Long Yes 293 Yes 6 Lev+5FU Female 57 No No No 9 Moderate Serosa Short Yes 1767 No 7 Lev Male 77 No No No 5 Moderate Serosa Long Yes 420 No 9 Lev Male 46 No No Yes 2 Moderate Serosa Short No 3173 No 10 Lev+5FU Female 68 No No No 1 Moderate Serosa Long No 3308 No 11 Lev Female 47 No No Yes 1 Moderate Serosa Short No 2908 No 12 Lev+5FU Male 52 No No No 2 Poor Serosa Long No 3309 No 14 Lev Male 68 Yes No No 3 Moderate Serosa Short No 2910 No

Now let’s split this very simply into development and validation datasets - this should be done far more robustly in practice!

data_dev = data %>% head(0.5 * nrow(data))
data_val = data %>% tail(0.5 * nrow(data))

Predictive Modelling

1. Development of a novel prediction model

Now let’s say we want to create a new logistic regression model (fit) to predict our event (death at 1 year aka “mort365”) based on patient and operative factors.

fit <- finalfit::glmmulti(data_dev, dependent = "mort365", explanatory = c("rx", "sex","obstruct", "differ"))

summary(fit)

## 
## Call:
## glm(formula = ff_formula(dependent, explanatory), family = family, 
##     data = .data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9905  -0.4084  -0.3551  -0.2883   2.7557  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)   
## (Intercept)     -3.3466     1.0588  -3.161  0.00157 **
## rxLev+5FU       -0.2902     0.4397  -0.660  0.50930   
## sexMale         -0.4276     0.4396  -0.973  0.33079   
## obstructYes      0.6822     0.4922   1.386  0.16573   
## differModerate   0.9046     1.0529   0.859  0.39026   
## differPoor       2.4976     1.0862   2.299  0.02148 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 176.12  on 295  degrees of freedom
## Residual deviance: 161.08  on 290  degrees of freedom
##   (11 observations deleted due to missingness)
## AIC: 173.08
## 
## Number of Fisher Scoring iterations: 6

Now we want to get the patient-level predicted risk of the outcome based on the model - this is needed to evaluate the performance of the model you have derived (see the next vignettes). This isn’t necessarily difficult to do in R, but you need to know how to do this.

Using the traditional approach using tidyverse code, you might need to do something like this:

# Traditional approach
data %>%
      dplyr::select(all_of(c("mort365", c("rx", "sex","obstruct", "differ")))) %>%
      tidyr::drop_na() %>%
      dplyr::mutate(predict_raw = predict(fit, newdata  = ., ),
                    predict_prop = predict(fit, type = "response", newdata  = ., )) %>% 
  head(10) %>%
  knitr::kable() %>% kableExtra::scroll_box(width = 400)
mort365 rx sex obstruct differ predict\_raw predict\_prop No Lev+5FU Male No Moderate -3.159698 0.0407109 No Lev+5FU Male No Moderate -3.159698 0.0407109 Yes Lev+5FU Female Yes Moderate -2.049911 0.1140614 No Lev+5FU Female No Moderate -2.732142 0.0611032 No Lev Male No Moderate -2.869505 0.0536818 No Lev Male No Moderate -2.869505 0.0536818 No Lev+5FU Female No Moderate -2.732142 0.0611032 No Lev Female No Moderate -2.441949 0.0800293 No Lev+5FU Male No Poor -1.566667 0.1726920 No Lev Male Yes Moderate -2.187275 0.1008991

The exact same can easily be achieved using the predictr() function. You simply provide the data (data_dev), and the model (fit), and this will be done automatically.

# PredictR approach
predictr(data = data_dev, fit = fit) %>% 
  head(10) %>%
  knitr::kable() %>% kableExtra::scroll_box(width = 400)
rowid mort365 rx sex obstruct differ event predict\_raw predict\_prop 1 No Lev+5FU Male No Moderate No -3.159698 0.0407109 2 No Lev+5FU Male No Moderate No -3.159698 0.0407109 3 Yes Lev+5FU Female Yes Moderate Yes -2.049911 0.1140614 4 No Lev+5FU Female No Moderate No -2.732142 0.0611032 5 No Lev Male No Moderate No -2.869505 0.0536818 6 No Lev Male No Moderate No -2.869505 0.0536818 7 No Lev+5FU Female No Moderate No -2.732142 0.0611032 8 No Lev Female No Moderate No -2.441949 0.0800293 9 No Lev+5FU Male No Poor No -1.566667 0.1726920 10 No Lev Male Yes Moderate No -2.187275 0.1008991

2. Validation of a novel prediction model

While people often stop at deriving a new model, you should be testing whether the model derived (fit) is valid on new data.

With predictr() this can again be done simply by supplying the new data (data_val) to the function (keeping fit unchanged).

predictr(data = data_val, fit = fit) %>% 
  head(10) %>%
  knitr::kable() %>% kableExtra::scroll_box(width = 400)
rowid mort365 rx sex obstruct differ event predict\_raw predict\_prop 1 Yes Lev Male No Well Yes -3.7741169 0.0224421 2 No Lev+5FU Female Yes Moderate No -2.0499110 0.1140614 3 No Lev Female No Poor No -0.8489192 0.2996596 4 No Lev Male No Moderate No -2.8695053 0.0536818 5 No Lev Female Yes Moderate No -1.7597188 0.1468256 6 No Lev Female No Moderate No -2.4419494 0.0800293 7 Yes Lev+5FU Female Yes Poor Yes -0.4568807 0.3877261 8 No Lev Male No Moderate No -2.8695053 0.0536818 9 Yes Lev+5FU Male No Moderate Yes -3.1596975 0.0407109 10 No Lev Female No Moderate No -2.4419494 0.0800293

Once again, this output can be used for model evaluation using the functions described in subsequent vignettes.

3. (External) Validation of a previous prediction model

Now let’s say someone else has developed a model that you now want to validate (or vice versa). It’s unlikely the R fit object will be shared in this instance (they may not have used R, and even if they did the fit object contains a lot of potentially sensitive patient data).

You will often need to use the model coefficents and intercept provided in a paper to be able to reproduce.

Deriving coefficents from fit objects

The normal fit object has coefficients stored in it, but not necessarily in a massively useful / informative format:

fit$coefficients

##    (Intercept)      rxLev+5FU        sexMale    obstructYes 
##     -3.3465610     -0.2901922     -0.4275559      0.6822306 
## differModerate     differPoor 
##      0.9046116      2.4976418

Instead you can use the predictr::coefficient() function to get this information out in a useful and shareable format.

coefficient(fit) %>% knitr::kable()
label levels type value coefficient outcome rx Lev factor 0.0000000 beta mort365 rx Lev+5FU factor -0.2901922 beta mort365 sex Female factor 0.0000000 beta mort365 sex Male factor -0.4275559 beta mort365 obstruct No factor 0.0000000 beta mort365 obstruct Yes factor 0.6822306 beta mort365 differ Well factor 0.0000000 beta mort365 differ Moderate factor 0.9046116 beta mort365 differ Poor factor 2.4976418 beta mort365 intercept NA NA -3.3465610 beta mort365

This provides all the information required for subsequent external validation using the predictr() function, and can be used as an alternative to the fit parameter.

Prediction using model coefficients

For coefficents to be used within predictr(), this must be in the format provided by the coefficient() function (whether using coefficient(fit) or manually extracting from a publication to create the required table).

predictr(data = data_val,
         coefficient = coefficient(fit)) %>%
  head(10) %>% knitr::kable()
mort365 rx sex obstruct differ event predict\_raw predict\_prop Yes Lev Male No Well Yes -3.7741169 0.0224421 No Lev+5FU Female Yes Moderate No -2.0499110 0.1140614 No Lev Female No Poor No -0.8489192 0.2996596 No Lev Male No Moderate No -2.8695053 0.0536818 No Lev Female Yes Moderate No -1.7597188 0.1468256 No Lev Female No Moderate No -2.4419494 0.0800293 Yes Lev+5FU Female Yes Poor Yes -0.4568807 0.3877261 No Lev Male No Moderate No -2.8695053 0.0536818 Yes Lev+5FU Male No Moderate Yes -3.1596975 0.0407109 No Lev Female No Moderate No -2.4419494 0.0800293

The default approach in predictr() is to use beta-coefficents, however if the paper you have only supplies odds ratios (OR), you can specify this and predictr() will handle this internally to produce the appropriate predictions.

predictr(data = data_val,
         coefficient = coefficient(fit, coefficient = "or")) %>%
  head(10) %>% knitr::kable()
mort365 rx sex obstruct differ event predict\_raw predict\_prop Yes Lev Male No Well Yes -3.7741169 0.0224421 No Lev+5FU Female Yes Moderate No -2.0499110 0.1140614 No Lev Female No Poor No -0.8489192 0.2996596 No Lev Male No Moderate No -2.8695053 0.0536818 No Lev Female Yes Moderate No -1.7597188 0.1468256 No Lev Female No Moderate No -2.4419494 0.0800293 Yes Lev+5FU Female Yes Poor Yes -0.4568807 0.3877261 No Lev Male No Moderate No -2.8695053 0.0536818 Yes Lev+5FU Male No Moderate Yes -3.1596975 0.0407109 No Lev Female No Moderate No -2.4419494 0.0800293

kamclean/predictr documentation built on Aug. 14, 2022, 4:35 a.m.