knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
This vignette will demonstrate:
First, setup the R environment and make some example dCVnet objects:
library(ggplot2) library(dplyr) library(dCVnet) data(prostate, package = "dCVnet") prostate$agecat <- factor(prostate$age > 65, levels = c(FALSE, TRUE), labels = c("le65", "over65")) set.seed(42) # Model 1 predicts SVI (seminal vesicle invasion): mod1 <- dCVnet::dCVnet(y = prostate$svi, data = subset(prostate, select = c(-train, -svi)), nrep_outer = 3, k_outer = 3, nrep_inner = 1, k_inner = 10) # Model 2 predicts whether Age > 65: mod2 <- dCVnet::dCVnet(y = prostate$agecat, data = subset(prostate, select = c(-train, -age, -agecat)), nrep_outer = 3, k_outer = 3, nrep_inner = 1, k_inner = 10)
By default, calling plot(my_dCVnet, type = "ROC")
on a dCVnet object
produces a plot of the cross-validated ROCs for the different outer-loop
repetitions (to show the variability).
Under the hood this function is composing three steps:
dCVnet::performance
)dCVnet::extract_rocdata
plot.rocdata
).p1 <- plot(mod1, type = "roc") # Note: # this is the same as: p1 <- plot(extract_rocdata(dCVnet::performance(mod1)))
To plot overall (average) cross-validated ROC taken over the repetitions of the
repeated k-fold outer-loop cross-validation, use dCVnet::average_rocdata()
on
the rocdata object.
p2 <- plot(average_rocdata(extract_rocdata(performance(mod1)))) # The above nested function is more readable if you use pipes (%>%): #> p2 <- mod1 %>% #> performance %>% #> extract_rocdata %>% #> average_rocdata %>% #> plot()
Perhaps we want to see both the average performance, and its variability over outerloop-cv repetitions. We can combine two rocdata objects with rbind (they are just data.frame format with a particular set of columns):
combined_roc_data <- rbind(extract_rocdata(performance(mod1)), average_rocdata(extract_rocdata(performance(mod1)))) p3 <- plot(combined_roc_data) # Or, as above, but (mostly) using pipes: #> p3 <- rbind(mod1 %>% #> performance() %>% #> extract_rocdata(), #> mod1 %>% #> performance() %>% #> extract_rocdata() %>% #> average_rocdata())
The 'production model' of dCVnet is the model we cross-validated, it would be used to make a prediction for new data. However its performance in the training data not crossvalidated, and so is typically an overestimate.
Sometimes we want to see this training set performance all the same - it forms an upper bound on cross-validated performance. However, care must always be taken not to interpret this as a cross-validated estimate.
To plot the uncrossvalidated performance of a production model:
plot(extract_rocdata(mod1$prod$performance))
Sometimes we want to display the results from two or more models
in the same ROC plot. Do this by rbind
-ing the separate rocdata objects.
However, we first need to give informative names in the run column:
d1 <- mod1 %>% performance() %>% extract_rocdata() %>% average_rocdata() d1$run <- "Model 1" # labels the data from this model. d2 <- mod2 %>% performance() %>% extract_rocdata() %>% average_rocdata() d2$run <- "Model 2" # labels the data from this model d3 <- mod1$prod$performance %>% extract_rocdata() %>% average_rocdata() d3$run <- "Model 1 (train)" # labels the data from this model. d4 <- mod2$prod$performance %>% extract_rocdata() %>% average_rocdata() d4$run <- "Model 2 (train)" # labels the data from this model. p4 <- plot(rbind(d1, d2, d3, d4))
A good deal of customisation can be done with ggplot (see below), but there are
a few options provided for plot customisation in plot.rocdata
. The legend can
be toggled off with legend = FALSE
, different threshold label points can be
chosen (alphalabel = c(0.05, 0.5, 0.95)
) or these can be omitted
(alphalabel = FALSE
). Finally alternative labels can be provided for
legend headings with the guide_labels
argument.
# Make a version of p4 without a legend or markers for threshold: p5 <- plot(rbind(d1, d2), alphalabel = FALSE, legend = FALSE) # And one with custom threshold markers and legend headings: p5 <- plot(rbind(d1, d2), alphalabel = c(0.5), guide_labels = list(group = "plotdata", threshold = "50% Threshold", refline = "line"))
dCVnet ROC plots are ggplot2 objects. As a result many manipulations can be done
without recalculating the plot. In the following example I will modify p4$plot
(the multiple-models plot) in order to...
Note: ggplot2 legends are determined by aesthetics which are mapped to the data, in this this plot Model is mapped to colour, Threshold markers to shape and the reference line to lty (linetype).
p4$plot + # add a title, change axis labels: labs(title = "Comparison of ROC Performance") + xlab("1 - Specificity") + ylab("Sensitivity") + # set manual colours: scale_colour_manual(values = c(`Model 2` = "hotpink4", `Model 2 (train)` = "hotpink", `Model 1` = "blue4", `Model 1 (train)` = "blue")) + # suppress part of the legend: guides(lty = guide_none()) + # use a different theme: theme_classic()
If you need more control over the display, or would prefer to plot using
base R, or different software entirely, the underlying data are
returned by the call to plot.rocdata
.
head(p4$data)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.