
Introduction

Challenges, such as those promoted by the DREAM community, are a popular mechanism to compare the performance of computational algorithms in predicting the values of a hidden test dataset.

The goal of this package is to provide the statistical infrastructure for conducting these challenges. Specifically, this package helps challenge organizers calculate scores for participant predictions, using the bootLadderBoot approach to prevent model overfitting.

bootLadderBoot algorithm

Two major pitfalls in running predictive modeling challenges are overfitting of models to the test set and test set data leakage. This is particularly applicable to challenges with small test datasets. A consequence of these pitfalls is models that are not generalizable, i.e. they cannot be meaningfully applied to the same prediction problem on different test datasets, reducing their real-world applicability.

One approach to prevent this, which can be implemented using this package, is the bootLadderBoot algorithm. A detailed description can be found here, but in brief, this method uses two approaches to prevent overfitting. First, this method reports an average of a small set of bootstrapped scores, rather than actual scores (the boot in bootLadderBoot). Second, this method only reports new scores if the participants have substantially improved over a previous prediction, based on a Bayes factor calculated from a large set of bootstrapped scores (the ladderBoot in bootLadderBoot).
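
To make these two ideas concrete, here is a minimal sketch of the logic using made-up simulated data and plain Spearman correlation; it is only an illustration of the concepts, not the package's implementation.

set.seed(1)
gold      <- rnorm(100)                 # hypothetical gold standard values
pred_prev <- gold + rnorm(100, sd = 2)  # hypothetical first (weaker) prediction
pred_new  <- gold + rnorm(100, sd = 1)  # hypothetical improved prediction

boot_scores <- function(gold, pred, n = 1000) {
  sapply(seq_len(n), function(i) {
    idx <- sample(seq_along(gold), replace = TRUE)  # resample observations with replacement
    cor(gold[idx], pred[idx], method = "spearman")  # score each bootstrap sample
  })
}

prev_scores <- boot_scores(gold, pred_prev)
new_scores  <- boot_scores(gold, pred_new)

## the "boot": report the average of a small set of bootstrapped scores
mean(new_scores[1:10])

## the "ladderBoot": compare new vs. previous predictions across the full set of
## bootstraps; a Bayes-factor-like ratio of wins to losses decides whether a new
## score is reported
sum(new_scores > prev_scores) / sum(new_scores <= prev_scores)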

In this package, the function bootLadderBoot() is used to provide this information.

Using bootLadderBoot()

Here is a basic example of using the bootLadderBoot function. We've provided some test data for use in these examples. Please note, for ease of rerunning these examples, we've set the bootstrapN parameter to 10. We would recommend a larger number of bootstraps for real-world applications. We typically perform 10000 (the default).

In this example, we'll pretend that we are running a challenge to develop algorithms that predict a patient's resting heart rate as measured immediately after 10 minutes of walking. We have "truth" or test data from 1000 patients (truth), where value is the heart rate.

library(challengescoring)
library(tidyverse)
head(truth)

The participants have some other training data upon which they are generating predictions, such as age, gender, height, weight, etc. For simplicity, and because this package is just about scoring predictions, we aren't concerned about those data here, so let's just assume the participants have already trained their models and are ready to submit predictions against a test dataset.

For this example, the best performing predictions will be determined using the Spearman correlation of predicted values to the test dataset.
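
For context, Spearman correlation is the rank correlation between predicted and true values. The package's spearman scoring function, used below, computes this for us; conceptually it amounts to something like the following sketch in base R.

## rank correlation between gold standard and predicted values
spearman_demo <- function(gold, pred) cor(gold, pred, method = "spearman")
spearman_demo(c(1, 2, 3, 4), c(2, 1, 4, 8))  # 0.8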

We've opened our challenge to our one participant who has submitted their first prediction, badSim, which isn't that great of a prediction.

head(badSim)

Let's score that prediction using the bootLadderBoot() function. badSim must have the same ID column (patient) as truth. The prediction and gold standard columns can have different names - we'll define those in the function. Since this is their first submission, we don't have to worry about the previous prediction arguments.

bootLadderBoot(predictions = badSim,
               predictionColname = "prediction",
               goldStandard = truth,
               goldStandardColname = "value",
               scoreFun = spearman,
               bootstrapN = 10,
               reportBootstrapN = 10) #how many bootstraps to base the returned score off of

OK - so they have a Spearman correlation of 0.2. They modify their predictions and submit again.

anotherBadSim <- badSim
anotherBadSim$prediction <- badSim$prediction+1

And we score it. This time, however, they have a previous "best" score, so that will become our reference for determining whether to return a new score or not (see the bootLadderBoot algorithm section for a brief explanation of why we do this).

We set prevPredictions as badSim, their first submission, and predictions as anotherBadSim, their second submission. We want our threshold for "improvement" to be a Bayes factor of 3 or more between the two predictions, so we set bayesThreshold = 3.

bootLadderBoot(predictions = anotherBadSim,
               predictionColname = "prediction",
               goldStandard = truth,
               goldStandardColname = "value",
               prevPredictions = badSim,
               scoreFun = spearman,
               bayesThreshold = 3,
               bootstrapN = 10,
               reportBootstrapN = 10)

This prediction was about as poor as the first and did not improve on it by a Bayes factor of 3 or more, so the function returns FALSE for metBayesCutoff, and they will receive a bootstrapped score based on their previous submission.

On the third try, however, they submit a much better prediction (goodSim)!

bootLadderBoot(predictions = goodSim,
               predictionColname = "prediction",
               goldStandard = truth,
               goldStandardColname = "value",
               prevPredictions = badSim,
               scoreFun = spearman,
               bayesThreshold = 3,
               bootstrapN = 10,
               reportBootstrapN = 10)

This submission met the Bayes threshold, so a new score was reported. This prediction file now becomes the reference file for "best" prediction. If you'd like to see what the Bayes factor actually was, you can run the function with the option verbose=T. This will provide progress information as well.
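
For example, rerunning the same call with verbose = T added will also print the Bayes factor along with progress information:

bootLadderBoot(predictions = goodSim,
               predictionColname = "prediction",
               goldStandard = truth,
               goldStandardColname = "value",
               prevPredictions = badSim,
               scoreFun = spearman,
               bayesThreshold = 3,
               bootstrapN = 10,
               reportBootstrapN = 10,
               verbose = T)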

Finally, let's see what happens if they submit a worse prediction after a previous one.

bootLadderBoot(predictions = badSim,
               predictionColname = "prediction",
               goldStandard = truth,
               goldStandardColname = "value",
               prevPredictions = goodSim,
               scoreFun = spearman,
               bayesThreshold = 3,
               bootstrapN = 10,
               reportBootstrapN = 10)

They no longer meet the cutoff, and a bootstrapped score of the previous best (prevPredictions) is reported instead.

Using other scoring functions

challengescoring ships with several scoring functions (for use in the scoreFun parameter of bootLadderBoot). However, you can also use any custom function of the form mymetric <- function(gold, pred) where gold is a numeric vector of gold standard values, and pred is a numeric vector of prediction values for one submission.

Let's define one now for normalized root mean squared error (NRMSE), where the error is normalized by the range of the gold standard data (max-min).

nrmse_max_min <- function(gold, pred){
    rmse <- sqrt(mean((gold - pred)^2))  # root mean squared error
    rmse / (max(gold) - min(gold))       # normalize by the range of the gold standard
}
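
As a quick sanity check, we can call it on a made-up pair of vectors (hypothetical values, purely illustrative):

gold_check <- c(60, 72, 80, 95)        # hypothetical gold standard heart rates
pred_check <- c(65, 70, 85, 90)        # hypothetical predictions
nrmse_max_min(gold_check, pred_check)  # roughly 0.13; lower is better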

Once nrmse_max_min is defined, we can simply plug it into bootLadderBoot (note, unlike the previous example, lower NRMSE is better, so largerIsBetter = F).

library(challengescoring)

bootLadderBoot(predictions = goodSim,
               predictionColname = "prediction",
               goldStandard = truth,
               goldStandardColname = "value",
               prevPredictions = badSim,
               scoreFun = nrmse_max_min,
               largerIsBetter = F,
               bayesThreshold = 3,
               bootstrapN = 10,
               reportBootstrapN = 10)

Analysis of survival data

To analyze survival predictions, you may need to preprocess your data before running it through bootLadderBoot because this package requires each prediction to be one numeric vector that can be sampled for bootstrapping purposes.

For example, you might have survival data like this:

library(survival)
library(Hmisc)

set.seed(1)
##prediction data
age_previous <- rnorm(400, 50, 10) 
age_new <- rnorm(400, 50, 10) 

##truth data
d.time <- rexp(400)
cens   <- runif(400,.5,2)
death  <- d.time <= cens
d.time <- pmin(d.time, cens)

Here, both the numeric vector d.time AND the boolean vector death are components of the truth data. The prediction data age_previous and age_new are simply numeric vectors of ages.

library(survival)
## bootLadderBoot() appears to join the prediction and gold standard data frames
## on their shared columns (omitting a shared ID produces an "Error: `by` required,
## because the data sources have no common variables"), so we add a common `id`
## column to each data frame
goldStandard <- Surv(d.time, death) %>% as.character() %>% as.data.frame() %>% magrittr::set_colnames(c("gold")) %>% dplyr::mutate(id = dplyr::row_number())
predictions <- age_new %>% as.data.frame() %>% magrittr::set_colnames(c("pred")) %>% dplyr::mutate(id = dplyr::row_number())
prevPredictions <- age_previous %>% as.data.frame() %>% magrittr::set_colnames(c("pred")) %>% dplyr::mutate(id = dplyr::row_number())

bootLadderBoot(predictions = predictions,
               predictionColname = "pred",
               prevPredictions = prevPredictions,
               goldStandard = goldStandard,
               goldStandardColname = "gold",
               scoreFun = c_statistic,
               bootstrapN = 10,
               reportBootstrapN = 10,
               verbose = T)

Here, the c_statistic scoring function is passed to scoreFun to determine the concordance index (Harrell's C-statistic) between the predictions and the survival gold standard.
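
As an illustrative cross-check (not part of the package), Harrell's C can also be computed directly from the raw survival object with Hmisc, which we loaded above. Because the simulated ages are unrelated to survival in this toy example, the C index should land near 0.5:

Hmisc::rcorr.cens(age_new, Surv(d.time, death))["C Index"]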


