knitr::opts_chunk$set(dpi=300, collapse = T, comment = "#>") options(tibble.print_min = 4L, tibble.print_max = 4L) knitr::opts_chunk$set(fig.align="center")
library(MagmaClustR) library(dplyr) library(ggplot2)
The overall pipeline for standard GP regression in MagmaClustR can be decomposed in 3 main steps: training, prediction and plotting of results. The corresponding functions are:
Before using train_gp()
, our dataset should present a particular format. It must contains those 2 columns:
Input
: numeric
,Output
: numeric
. If we fail to ensure that the dataset satisfies those conditions, the algorithm will return an error.
The data frame can also provide as many covariates (i.e. additional columns) as desired, with no constraints on the column names (except the name 'Reference' that should always be avoided, as it is used inside internal functions of the algorithm). These covariates are treated as additional input dimensions.
To explore the features of standard GP in MagmaClustR, we use the swimmers
dataset provided by the French Swimming Federation (available here, and studied more thoroughly here and there).
Our goal is to model the progression curves of swimmers in order to forecast their future performances. Thus, we randomly select a female swimmer from the dataset; let's give her the fictive name Michaela for the sake of illustration.
The swimmers
dataset contains 4 columns: ID
, Age
, Performance
and Gender
.
Therefore, we first need to change the name and type of the columns, and remove Gender
before using train_gp()
.
Michaela <- swimmers %>% filter(ID == 1718) %>% select(-Gender) %>% rename(Input = Age, Output = Performance)
We display Michaela's performances according to her age to visualise her progression from raw data.
ggplot() + geom_point(data = Michaela, mapping = aes(x=Input,y=Output), size = 3, colour = "black") + theme_classic()
To obtain a GP that best fits our data, we must specify some parameters:
prior_mean
: if we assume no prior knowledge about the 100m freestyle, we can decide to leave the default value for this parameter (i.e. zero). However, if we want to take expert advice into account, we can modify the value of prior_mean
accordingly.
kern
: the relationship between observed data and prediction targets can be control through the covariance kernel.
Therefore, in order to correctly fit our data, we need to choose a suitable covariance kernel.
In the case of swimmers, we want a smooth progression curve for Michaela; therefore, we specify kern = "SE"
.
The most commonly used kernels and their properties are covered in the kernel cookbook.
Details of available kernels and how to combine them in MagmaClustR are available in help(train_gp)
.
set.seed(2) model_gp <- train_gp(data = Michaela, kern = "SE", prior_mean = 0)
Thanks to train_gp()
, we learn the model hyper-parameters from data, which can then be used to make prediction for Michaela's performances.
The arguments kern
and mean
remain the same as above. We now need to specify:
train_gp()
in hp
;grid_inputs
. Here, we want to predict Michaela's performances until 15/16 years old, so we set grid_inputs = seq(11, 15.5, 0.1)
.See help(pred_gp)
to get information about the other optional arguments.
pred_gp <- pred_gp(data = Michaela, kern = "SE", hp = model_gp, grid_inputs = seq(11,15.5,0.1), plot = FALSE)
Let us note that we could have called the pred_gp()
function directly on data.
If we didn't use the train_gp()
function beforehand, it is possible to omit the hp
parameter; pred_gp()
would automatically learn hyperparameters with default settings and us them to make predictions afterwards.
Finally, we display Michaela's prediction curve and its 95% associated credibility interval thanks to the plot_gp()
function. If we want to get a prettier graphic, we could specify heatmap = TRUE
to create a heatmap of probabilities instead (note that it could be longer to display when evaluating on fine grids).
plot_gp(pred_gp = pred_gp, data = Michaela)
Close to Michaela’s observed data ($t \in [ 10, 14 ]$), the standard GP prediction behaves as expected: it accurately fits data points and the confidence interval is narrow. However, as soon as we move away from observed points, the GP drifts to the prior mean and uncertainty increases significantly.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.