```r
knitr::opts_chunk$set(echo = TRUE, fig.width = 7, collapse = TRUE, comment = "#>")
library(scoringutils)
library(magrittr)
library(data.table)
library(ggplot2)
library(knitr)
```
The scoringutils package provides a collection of metrics and proper scoring rules that make it simple to score probabilistic forecasts against the true observed values. The scoringutils package offers convenient automated forecast evaluation in a data.table format (using the function score()), but also provides experienced users with a set of reliable lower-level scoring metrics operating on vectors/matrices that they can build upon in other applications. In addition, it implements a wide range of flexible plots covering many use cases.
The goal of this package is to provide a tested and reliable collection of metrics that can be used for scoring probabilistic forecasts (forecasts with a full predictive distribution, rather than point forecasts). It has a much stronger focus on convenience than e.g. the scoringRules package, which provides a comprehensive collection of proper scoring rules (also used in scoringutils). In contrast to other packages, scoringutils offers functionality to automatically evaluate forecasts, to visualise scores and to obtain relative scores between models.
Predictions can be handled in various formats: scoringutils can handle probabilistic forecasts in either a sample-based or a quantile-based format. For more detail on the expected input formats, please see below. True values can be integer, continuous or binary.
Most of the time, the score() function will be able to do the entire evaluation for you. All you need to do is to pass in a data.frame with the appropriate columns. Which columns are required depends on the format the forecasts come in. The forecast format can either be based on quantiles (see example_quantile for the expected format), based on predictive samples (see example_continuous and example_integer for the expected format in each case) or in a binary format. The following table gives an overview (pairwise comparisons will be explained in more detail below):
```r
requirements <- data.table(
  "Format" = c(
    "quantile-based",
    "sample-based",
    "binary",
    "pairwise-comparisons"
  ),
  `Required columns` = c(
    "'true_value', 'prediction', 'quantile'",
    "'true_value', 'prediction', 'sample'",
    "'true_value', 'prediction'",
    "additionally a column 'model'"
  )
)
kable(requirements)
```
Additional columns may be present to indicate a grouping of forecasts. For example, we could have forecasts made by different models in various locations at different time points, each for several weeks into the future. It is important that only columns relevant for grouping forecasts are present: the combination of these columns should uniquely define the unit of a single forecast, meaning that a single forecast is identified by the values in those columns.
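To make this concrete, a minimal quantile-based input could look like the following sketch. The values are made up, and the grouping columns model, location and forecast_date are only illustrative; the required columns for this format are true_value, prediction and quantile, as shown in the table above.

```r
library(data.table)

# Each row is one quantile of one forecast; the forecast unit here is
# defined by the combination of model, location and forecast_date.
toy_forecast <- data.table(
  model = "example-model",              # made-up grouping columns
  location = "DE",
  forecast_date = as.Date("2021-06-28"),
  quantile = c(0.25, 0.5, 0.75),        # required columns
  prediction = c(10, 15, 22),
  true_value = 18
)
```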
The function check_forecasts() can be used to check the input data. It gives a summary of what scoringutils thinks you are trying to achieve. It infers the type of the prediction target, the prediction format and the unit of a single forecast, gives an overview of the number of unique values per column (helpful for spotting missing data) and returns warnings or errors.
```r
head(example_quantile)
check_forecasts(example_quantile)
```
If you are unsure what your input data should look like, have a look at the example_quantile, example_integer, example_continuous and example_binary data sets provided in the package.
The function avail_forecasts() may also be helpful to determine where forecasts are available. Using the by argument you can specify the level of summary. For example, to see how many forecasts there are per model and target_type, we can run:
```r
avail_forecasts(example_quantile, by = c("model", "target_type"))
```
We see that 'epiforecasts-EpiNow2' is missing some forecasts for the deaths target and that 'UMass-MechBayes' has no case forecasts at all.
This information can also be visualised using the plot_avail_forecasts() function:
```r
example_quantile %>%
  avail_forecasts(by = c("model", "forecast_date", "target_type")) %>%
  plot_avail_forecasts() +
  facet_wrap(~ target_type)
```
You can also visualise forecasts directly using the plot_predictions() function:
```r
example_quantile %>%
  make_NA(
    what = "truth",
    target_end_date >= "2021-07-15",
    target_end_date < "2021-05-22"
  ) %>%
  make_NA(
    what = "forecast",
    model != "EuroCOVIDhub-ensemble",
    forecast_date != "2021-06-28"
  ) %>%
  plot_predictions(x = "target_end_date", by = c("target_type", "location")) +
  facet_wrap(target_type ~ location, ncol = 4, scales = "free")
```
Forecasts can easily be scored using the score() function. This function returns unsummarised scores, which in most cases is not what the user wants. A second function, summarise_scores(), takes care of summarising these scores to the level specified by the user. If you like, you can also use summarise_scores() to round your outputs by specifying e.g. signif() as a summary function.
```r
score(example_quantile) %>% head()
```
```r
example_quantile %>%
  score() %>%
  summarise_scores(by = c("model", "target_type")) %>%
  summarise_scores(fun = signif, digits = 2) %>%
  kable()
```
The by argument can be used to define the level of summary. By default, by = NULL is set to the unit of a single forecast. For quantile-based forecasts, unsummarised scores are returned for every quantile individually. It can therefore make sense to run summarise_scores() even without any arguments provided:
```r
score(example_quantile) %>%
  summarise_scores()
```
Point forecasts can be scored by making use of the quantile-based format, but with a value of NA or "point" in the quantile column. Point forecasts can be scored alongside other quantile-based forecasts. As point forecasts will have values of NA for metrics designed for probabilistic forecasts, it is important to use na.rm = TRUE when summarising:
```r
score(example_point) %>%
  summarise_scores(by = "model", na.rm = TRUE)
```
For quantile-based forecasts we are often interested in specific coverage levels, for example, what percentage of true values fell inside all 50% or 90% prediction intervals. We can add this information using the function add_coverage(). This function also requires a by argument which defines the level of grouping for which the percentage of true values covered by certain prediction intervals is computed:
```r
score(example_quantile) %>%
  add_coverage(ranges = c(50, 90), by = c("model", "target_type")) %>%
  summarise_scores(by = c("model", "target_type")) %>%
  summarise_scores(fun = signif, digits = 2)
```
In order to better compare models against each other we can use relative scores, which are computed based on pairwise comparisons (see details below). Relative scores can be added to the evaluation using the function summarise_scores(). This requires a column called 'model' to be present. Pairwise comparisons are computed according to the grouping specified in by: essentially, the data.frame with all scores gets split into different data.frames according to the values specified in by and relative scores are computed for every individual group separately. The baseline argument allows us to specify a baseline that can be used to scale relative scores (all scores are divided by the baseline relative score). For example, to obtain relative scores separately for different forecast targets, we can run:
```r
score(example_quantile) %>%
  summarise_scores(
    by = c("model", "target_type"),
    relative_skill = TRUE,
    baseline = "EuroCOVIDhub-ensemble"
  )
```
A simple coloured table can be produced based on the scores:
```r
score(example_integer) %>%
  summarise_scores(by = c("model", "target_type"), na.rm = TRUE) %>%
  summarise_scores(fun = signif, digits = 2) %>%
  plot_score_table(by = "target_type") +
  facet_wrap(~ target_type, nrow = 1)
```
We can also summarise one particular metric across different categories using a simple heatmap:
```r
score(example_continuous) %>%
  summarise_scores(by = c("model", "location", "target_type")) %>%
  plot_heatmap(x = "location", metric = "bias") +
  facet_wrap(~ target_type)
```
The weighted interval score can be decomposed into three components: over-prediction, under-prediction and dispersion. These can be visualised separately in the following way:
```r
score(example_quantile) %>%
  summarise_scores(by = c("target_type", "model")) %>%
  plot_wis() +
  facet_wrap(~ target_type, scales = "free")
```
Calibration is a measure of statistical consistency between the forecasts and the observed values. The most common way of assessing calibration (more precisely: probabilistic calibration) are PIT histograms. The probability integral transform (PIT) is equal to the cumulative distribution function of a forecast evaluated at the true observed value. Ideally, PIT values should be uniformly distributed after the transformation.
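For intuition, the PIT of a sample-based forecast can be approximated by the empirical CDF of the predictive samples evaluated at the observation. The following is a minimal sketch with simulated values, not the package's internal code:

```r
set.seed(42)

# Predictive samples for one forecast and the observed value (made up)
samples <- rnorm(1000, mean = 10, sd = 2)
observed <- 11

# Empirical-CDF approximation of the PIT value:
# the fraction of samples at or below the observation
pit_value <- mean(samples <= observed)
```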
We can compute PIT values as follows:
```r
example_continuous %>%
  pit(by = "model")
```
And visualise the results as such:
```r
example_continuous %>%
  pit(by = c("model", "target_type")) %>%
  plot_pit() +
  facet_grid(model ~ target_type)
```
Similarly for quantile-based forecasts:
```r
example_quantile[quantile %in% seq(0.1, 0.9, 0.1), ] %>%
  pit(by = c("model", "target_type")) %>%
  plot_pit() +
  facet_grid(model ~ target_type)
```
Other ways to look at calibration are interval coverage and quantile coverage. Interval coverage is the percentage of true values that fall inside a given central prediction interval. Quantile coverage is the percentage of observed values that fall below a given quantile level.
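For intuition, interval coverage for a single 50% central prediction interval could be computed by hand like this (a sketch with made-up values, not the package implementation):

```r
# Made-up 25% and 75% quantiles for three forecasts, plus observations
lower <- c(8, 12, 5)     # 25% quantiles
upper <- c(14, 20, 9)    # 75% quantiles
true_values <- c(10, 22, 6)

# Proportion of observations that fall inside the 50% central interval
coverage_50 <- mean(true_values >= lower & true_values <= upper)
```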
In order to plot interval coverage, you need to include "range" in the by argument to summarise_scores(). The green area on the plot marks conservative behaviour, i.e. your empirical coverage is greater than it nominally needs to be (e.g. 55% of true values covered by all 50% central prediction intervals):
```r
example_quantile %>%
  score() %>%
  summarise_scores(by = c("model", "range")) %>%
  plot_interval_coverage()
```
To visualise quantile coverage, you need to include "quantile" in by. Again, the green area corresponds to conservative forecasts, where central prediction intervals cover more than needed:
```r
example_quantile %>%
  score() %>%
  summarise_scores(by = c("model", "quantile")) %>%
  plot_quantile_coverage()
```
Relative scores for different models can be computed using pairwise comparisons, a sort of pairwise tournament where all combinations of two models are compared against each other based on the overlapping set of available forecasts common to both models. Internally, a ratio of the mean scores of both models is computed. The relative score of a model is then the geometric mean of all mean score ratios which involve that model. When a baseline is provided, then that baseline is excluded from the relative scores for individual models (which therefore differ slightly from relative scores without a baseline) and all relative scores are scaled by (i.e. divided by) the relative score of the baseline model.
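As an illustration of these mechanics (not the package's internal implementation), the mean score ratios and the geometric mean for one model could be computed by hand like this, using made-up mean scores:

```r
# Made-up mean scores for three models on the same set of forecasts
mean_scores <- c(A = 20, B = 25, C = 40)

# Pairwise mean score ratios involving model A
ratios_A <- c(
  A_vs_B = mean_scores[["A"]] / mean_scores[["B"]],
  A_vs_C = mean_scores[["A"]] / mean_scores[["C"]]
)

# Relative score of model A: the geometric mean of its score ratios
relative_A <- exp(mean(log(ratios_A)))
```

Lower relative scores are better; a baseline model, if given, would additionally divide all relative scores by its own.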
In scoringutils, pairwise comparisons can be made in two ways: through the standalone function pairwise_comparison() or from within summarise_scores(), which simply adds relative scores to an existing set of scores.
```r
example_quantile %>%
  score() %>%
  pairwise_comparison(by = "model", baseline = "EuroCOVIDhub-baseline")
```
```r
example_quantile %>%
  score() %>%
  summarise_scores(
    by = "model",
    relative_skill = TRUE,
    baseline = "EuroCOVIDhub-baseline"
  )
```
If using the pairwise_comparison() function, we can also visualise pairwise comparisons by showing the mean score ratios between models. By default, smaller values are better and the model we care about is shown on the y-axis, while the model it is compared against is shown on the x-axis. In the example above, the EuroCOVIDhub-ensemble performs best (it only has values smaller than 1), while the EuroCOVIDhub-baseline performs worst (and only has values larger than 1). For cases, the UMass-MechBayes model is of course excluded as there are no case forecasts available and therefore the set of overlapping forecasts is empty.
```r
example_quantile %>%
  score() %>%
  pairwise_comparison(by = c("model", "target_type")) %>%
  plot_pairwise_comparison() +
  facet_wrap(~ target_type)
```
It may sometimes be interesting to see how different scores correlate with each other. We can examine this using the function correlation(). When dealing with quantile-based forecasts, it is important to call summarise_scores() before correlation() to summarise over quantiles before computing correlations:
```r
example_quantile %>%
  score() %>%
  summarise_scores() %>%
  correlation()
```
Visualising correlations:
```r
example_quantile %>%
  score() %>%
  summarise_scores() %>%
  correlation() %>%
  plot_correlation()
```
If you would like to see how different forecast interval ranges contribute to average scores, you can visualise scores by interval range:
```r
example_quantile %>%
  score() %>%
  summarise_scores(by = c("model", "range", "target_type")) %>%
  plot_ranges() +
  facet_wrap(~ target_type, scales = "free")
```
Different metrics are available for different forecasting formats. In some cases, you may for example have forecasts in a sample-based format, but wish to make use of some of the functionality only available to quantile-based forecasts. For example, you may want to use the decomposition of the weighted interval score, or may like to compute interval coverage values.
You can convert your forecasts into a quantile-based format using the function sample_to_quantile(). There is, however, one caveat: quantiles will be calculated from the predictive samples, which may introduce a bias if the number of available samples is small.
```r
example_integer %>%
  sample_to_quantile(
    quantiles = c(0.01, 0.025, seq(0.05, 0.95, 0.05), 0.975, 0.99)
  ) %>%
  score() %>%
  add_coverage(by = c("model", "target_type"))
```
An overview of available metrics can be found in the metrics data set that is included in the package:
```r
metrics
```
Most of these metrics are available as lower level functions and extensively documented. Have a look at the help files to understand these better.