Predictive modelling is a fantastically useful statistical technique to estimate how likely a patient is to have an event, based on the characteristics of patients similar to them. There are three general scenarios where you’d want to do predictive modelling:
Development of a novel prediction model.
Validation of a novel prediction model.
(External) Validation of a previous prediction model
Firstly let’s load our example data - we will be using the survival::colon dataset as an example.
data <- tibble::as_tibble(survival::colon) %>%
dplyr::filter(etype==2) %>% # Outcome of interest is death
dplyr::filter(rx!="Obs") %>% # rx will be our binary treatment variable
dplyr::select(-etype,-study, -status) %>% # Remove superfluous variables
# Convert into numeric and factor variables
dplyr::mutate_at(vars(obstruct, perfor, adhere, node4), function(x){factor(x, levels=c(0,1), labels = c("No", "Yes"))}) %>%
dplyr::mutate(rx = factor(rx),
mort365 = cut(time, breaks = c(-Inf, 365, Inf), labels = c("Yes", "No")),
mort365 = factor(mort365, levels = c("No", "Yes")),
sex = factor(sex, levels=c(0,1), labels = c("Female", "Male")),
differ = factor(differ, levels = c(1,2,3), labels = c("Well", "Moderate", "Poor")),
extent = factor(extent, levels = c(1,2,3, 4), labels = c("Submucosa", "Muscle", "Serosa", "Contiguous Structures")),
surg = factor(surg, levels = c(0,1), labels = c("Short", "Long")))
head(data, 10) %>% knitr::kable()
id
rx
sex
age
obstruct
perfor
adhere
nodes
differ
extent
surg
node4
time
mort365
1
Lev+5FU
Male
43
No
No
No
5
Moderate
Serosa
Short
Yes
1521
No
2
Lev+5FU
Male
63
No
No
No
1
Moderate
Serosa
Short
No
3087
No
4
Lev+5FU
Female
66
Yes
No
No
6
Moderate
Serosa
Long
Yes
293
Yes
6
Lev+5FU
Female
57
No
No
No
9
Moderate
Serosa
Short
Yes
1767
No
7
Lev
Male
77
No
No
No
5
Moderate
Serosa
Long
Yes
420
No
9
Lev
Male
46
No
No
Yes
2
Moderate
Serosa
Short
No
3173
No
10
Lev+5FU
Female
68
No
No
No
1
Moderate
Serosa
Long
No
3308
No
11
Lev
Female
47
No
No
Yes
1
Moderate
Serosa
Short
No
2908
No
12
Lev+5FU
Male
52
No
No
No
2
Poor
Serosa
Long
No
3309
No
14
Lev
Male
68
Yes
No
No
3
Moderate
Serosa
Short
No
2910
No
Now let’s split this very simply into development and validation datasets - this should be done far more robustly in practice!
data_dev = data %>% head(0.5 * nrow(data))
data_val = data %>% tail(0.5 * nrow(data))
Now let’s say we want to create a new logistic regression model (fit
)
to predict our event (death at 1 year aka “mort365”) based on patient
and operative factors.
fit <- finalfit::glmmulti(data_dev, dependent = "mort365", explanatory = c("rx", "sex","obstruct", "differ"))
summary(fit)
##
## Call:
## glm(formula = ff_formula(dependent, explanatory), family = family,
## data = .data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9905 -0.4084 -0.3551 -0.2883 2.7557
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.3466 1.0588 -3.161 0.00157 **
## rxLev+5FU -0.2902 0.4397 -0.660 0.50930
## sexMale -0.4276 0.4396 -0.973 0.33079
## obstructYes 0.6822 0.4922 1.386 0.16573
## differModerate 0.9046 1.0529 0.859 0.39026
## differPoor 2.4976 1.0862 2.299 0.02148 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 176.12 on 295 degrees of freedom
## Residual deviance: 161.08 on 290 degrees of freedom
## (11 observations deleted due to missingness)
## AIC: 173.08
##
## Number of Fisher Scoring iterations: 6
Now we want to get the patient-level predicted risk of the outcome based on the model - this is needed to evaluate the performance of the model you have derived (see the next vignettes). This isn’t necessarily difficult to do in R, but you need to know how to do this.
Using the traditional approach using tidyverse code, you might need to do something like this:
# Traditional approach
data %>%
dplyr::select(all_of(c("mort365", c("rx", "sex","obstruct", "differ")))) %>%
tidyr::drop_na() %>%
dplyr::mutate(predict_raw = predict(fit, newdata = ., ),
predict_prop = predict(fit, type = "response", newdata = ., )) %>%
head(10) %>%
knitr::kable() %>% kableExtra::scroll_box(width = 400)
mort365
rx
sex
obstruct
differ
predict\_raw
predict\_prop
No
Lev+5FU
Male
No
Moderate
-3.159698
0.0407109
No
Lev+5FU
Male
No
Moderate
-3.159698
0.0407109
Yes
Lev+5FU
Female
Yes
Moderate
-2.049911
0.1140614
No
Lev+5FU
Female
No
Moderate
-2.732142
0.0611032
No
Lev
Male
No
Moderate
-2.869505
0.0536818
No
Lev
Male
No
Moderate
-2.869505
0.0536818
No
Lev+5FU
Female
No
Moderate
-2.732142
0.0611032
No
Lev
Female
No
Moderate
-2.441949
0.0800293
No
Lev+5FU
Male
No
Poor
-1.566667
0.1726920
No
Lev
Male
Yes
Moderate
-2.187275
0.1008991
The exact same can easily be achieved using the predictr()
function.
You simply provide the data (data_dev
), and the model (fit
), and
this will be done automatically.
# PredictR approach
predictr(data = data_dev, fit = fit) %>%
head(10) %>%
knitr::kable() %>% kableExtra::scroll_box(width = 400)
rowid
mort365
rx
sex
obstruct
differ
event
predict\_raw
predict\_prop
1
No
Lev+5FU
Male
No
Moderate
No
-3.159698
0.0407109
2
No
Lev+5FU
Male
No
Moderate
No
-3.159698
0.0407109
3
Yes
Lev+5FU
Female
Yes
Moderate
Yes
-2.049911
0.1140614
4
No
Lev+5FU
Female
No
Moderate
No
-2.732142
0.0611032
5
No
Lev
Male
No
Moderate
No
-2.869505
0.0536818
6
No
Lev
Male
No
Moderate
No
-2.869505
0.0536818
7
No
Lev+5FU
Female
No
Moderate
No
-2.732142
0.0611032
8
No
Lev
Female
No
Moderate
No
-2.441949
0.0800293
9
No
Lev+5FU
Male
No
Poor
No
-1.566667
0.1726920
10
No
Lev
Male
Yes
Moderate
No
-2.187275
0.1008991
While people often stop at deriving a new model, you should be testing
whether the model derived (fit
) is valid on new data.
With predictr()
this can again be done simply by supplying the new
data (data_val
) to the function (keeping fit
unchanged).
predictr(data = data_val, fit = fit) %>%
head(10) %>%
knitr::kable() %>% kableExtra::scroll_box(width = 400)
rowid
mort365
rx
sex
obstruct
differ
event
predict\_raw
predict\_prop
1
Yes
Lev
Male
No
Well
Yes
-3.7741169
0.0224421
2
No
Lev+5FU
Female
Yes
Moderate
No
-2.0499110
0.1140614
3
No
Lev
Female
No
Poor
No
-0.8489192
0.2996596
4
No
Lev
Male
No
Moderate
No
-2.8695053
0.0536818
5
No
Lev
Female
Yes
Moderate
No
-1.7597188
0.1468256
6
No
Lev
Female
No
Moderate
No
-2.4419494
0.0800293
7
Yes
Lev+5FU
Female
Yes
Poor
Yes
-0.4568807
0.3877261
8
No
Lev
Male
No
Moderate
No
-2.8695053
0.0536818
9
Yes
Lev+5FU
Male
No
Moderate
Yes
-3.1596975
0.0407109
10
No
Lev
Female
No
Moderate
No
-2.4419494
0.0800293
Once again, this output can be used for model evaluation using the functions described in subsequent vignettes.
Now let’s say someone else has developed a model that you now want to validate (or vice versa). It’s unlikely the R fit object will be shared in this instance (they may not have used R, and even if they did the fit object contains a lot of potentially sensitive patient data).
You will often need to use the model coefficents and intercept provided in a paper to be able to reproduce.
The normal fit object has coefficients stored in it, but not necessarily in a massively useful / informative format:
fit$coefficients
## (Intercept) rxLev+5FU sexMale obstructYes
## -3.3465610 -0.2901922 -0.4275559 0.6822306
## differModerate differPoor
## 0.9046116 2.4976418
Instead you can use the predictr::coefficient()
function to get this
information out in a useful and shareable format.
coefficient(fit) %>% knitr::kable()
label
levels
type
value
coefficient
outcome
rx
Lev
factor
0.0000000
beta
mort365
rx
Lev+5FU
factor
-0.2901922
beta
mort365
sex
Female
factor
0.0000000
beta
mort365
sex
Male
factor
-0.4275559
beta
mort365
obstruct
No
factor
0.0000000
beta
mort365
obstruct
Yes
factor
0.6822306
beta
mort365
differ
Well
factor
0.0000000
beta
mort365
differ
Moderate
factor
0.9046116
beta
mort365
differ
Poor
factor
2.4976418
beta
mort365
intercept
NA
NA
-3.3465610
beta
mort365
This provides all the information required for subsequent external
validation using the predictr()
function, and can be used as an
alternative to the fit
parameter.
For coefficents to be used within predictr()
, this must be in the
format provided by the coefficient()
function (whether using
coefficient(fit)
or manually extracting from a publication to create
the required table).
predictr(data = data_val,
coefficient = coefficient(fit)) %>%
head(10) %>% knitr::kable()
mort365
rx
sex
obstruct
differ
event
predict\_raw
predict\_prop
Yes
Lev
Male
No
Well
Yes
-3.7741169
0.0224421
No
Lev+5FU
Female
Yes
Moderate
No
-2.0499110
0.1140614
No
Lev
Female
No
Poor
No
-0.8489192
0.2996596
No
Lev
Male
No
Moderate
No
-2.8695053
0.0536818
No
Lev
Female
Yes
Moderate
No
-1.7597188
0.1468256
No
Lev
Female
No
Moderate
No
-2.4419494
0.0800293
Yes
Lev+5FU
Female
Yes
Poor
Yes
-0.4568807
0.3877261
No
Lev
Male
No
Moderate
No
-2.8695053
0.0536818
Yes
Lev+5FU
Male
No
Moderate
Yes
-3.1596975
0.0407109
No
Lev
Female
No
Moderate
No
-2.4419494
0.0800293
The default approach in predictr()
is to use beta-coefficents, however
if the paper you have only supplies odds ratios (OR), you can specify
this and predictr()
will handle this internally to produce the
appropriate predictions.
predictr(data = data_val,
coefficient = coefficient(fit, coefficient = "or")) %>%
head(10) %>% knitr::kable()
mort365
rx
sex
obstruct
differ
event
predict\_raw
predict\_prop
Yes
Lev
Male
No
Well
Yes
-3.7741169
0.0224421
No
Lev+5FU
Female
Yes
Moderate
No
-2.0499110
0.1140614
No
Lev
Female
No
Poor
No
-0.8489192
0.2996596
No
Lev
Male
No
Moderate
No
-2.8695053
0.0536818
No
Lev
Female
Yes
Moderate
No
-1.7597188
0.1468256
No
Lev
Female
No
Moderate
No
-2.4419494
0.0800293
Yes
Lev+5FU
Female
Yes
Poor
Yes
-0.4568807
0.3877261
No
Lev
Male
No
Moderate
No
-2.8695053
0.0536818
Yes
Lev+5FU
Male
No
Moderate
Yes
-3.1596975
0.0407109
No
Lev
Female
No
Moderate
No
-2.4419494
0.0800293
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.