knitr::opts_chunk$set(dpi=300, collapse = T, comment = "#>") options(tibble.print_min = 6L, tibble.print_max = 6L) knitr::opts_chunk$set(fig.align="center")
library(MagmaClustR) library(dplyr) library(ggplot2)
In Magma and MagmaClust vignettes, it has been emphasised that, to be processed within MagmaClustR
, a dataset must contain at least 3 mandatory columns: ID
, Input
and Output
. However, let us point out that both Magma and MagmaClust can also handle multi-dimensional inputs. Notice that there are no constraints on the names of the additional input columns (except for the name Reference that is used internally in the code and should be strictly avoided).
Throughout the following synthetic example, the Magma algorithm is applied to tackle a 2-dimensional forecasting problem. More specifically, the model is trained on a dataset that contains 2 input variables, namely Input
and Covariate
.
To explore the features of Magma in 2D, we simulate a synthetic dataset thanks to the simu_db()
function. To customise this dataset, we can specify several parameters, such as:
the number K of underlying clusters in our dataset. Since we do not assume that there are group structures in our data, we set K = 1
(default).
the number M of individuals per cluster. Even if Magma performances improve with respect to the number of training individuals, 30 are more than enough to get an idea of how this 2D version works. Thus, we specify M = 31
: 30 individuals for the training, 1 for the prediction.
the number N of observations per individual. Here, we set N = 10
data points.
the presence / absence of an additional input named Covariate
. As we want to handle multi-dimensional inputs, we specify covariate = TRUE
.
the fact that all individuals share common inputs (regular or irregular measurements among individuals). Here, we set common_input = FALSE
to define a grid that covers various regions of the input space (Input
and Covariate
).
Many additional arguments are available, see simu_db()
for details.
set.seed(3) data_dim2 <- simu_db(M = 31, N = 10, K = 1, covariate = TRUE, common_input = FALSE) knitr::kable(head(data_dim2))
To provide a visual intuition of the values contained in training set, we can display raw data (using the signature gradient colour of MagmaClustR
;) ) with the following code:
ggplot(data_dim2) + geom_tile(aes(x = Input, y = Covariate, fill = Output)) + theme_classic() + scale_fill_gradientn(colours = c( "white", "#FDE0DD", "#FCC5C0", "#FA9FB5", "#F768A1", "#DD3497", "#AE017E", "#7A0177"))
Let us mention that any dataset in MagmaClustR
can be regularised if necessary (in the sense that we can modify the input grid to match a specific format). To this end, the regularize_data()
function is proposed, allowing us to constrain the number of data points per input, or specify 'by hand' a surrogate grid of inputs on which to project our data. This function takes the arguments:
data
,corresponding to database we want to regularise;size_grid
, indicating how many points each axis of the grid must contain;grid_inputs
, the grid on which we want to project our data. If NULL (default), a dedicated grid of inputs is defined: for each input column, a regular sequence is created from the min input values to the max, with a number of equispaced points equal to the 'size_grid' argument. We could also project our data on a specific grid_inputs
(which may come from the dedicated expand_grid_inputs()
function; see Customise graphs for details). This grid does not necessarily have the same number of points along all axes.summarise_fct
, a function used to summarise data points if several similar inputs are associated with different outputs. For instance, if we would want to evaluate our synthetic dataset on a $5 \times 5$ grid of inputs summarise the projected outputs according through their mean, we could call:
data_dim2_reg <- regularize_data(data = data_dim2, size_grid = 5, grid_inputs = NULL, summarise_fct = "mean") knitr::kable(head(data_dim2_reg))
Finally, we split our individuals into training and prediction sets.
dim2_train <- data_dim2 %>% filter(ID %in% 1:30) dim2_pred <- data_dim2 %>% filter(ID == 31)
The overall process of Magma remains identical as in the 1-D case, and can be decomposed in 3 main steps: training, prediction and plotting of results. We refer to the Magma vignette for the complete description of the classical pipeline:
- we call the train_magma()
function to train our model:
set.seed(3) model_dim2 <- train_magma(data = dim2_train, kern_0 = "SE", kern_i = "SE", common_hp = TRUE)
pred_magma()
. Here, we do not specify the grid_inputs
argument; in this case, a default grid is automatically generated by pred_magma()
, ranging between the min and max Input
and Covariate
values of our dataset. However, if we want to perform prediction on a specific 2-D area, we could provide a specific grid_inputs
; see Even prettier graphics section for details. pred_dim2 <- pred_magma(data = dim2_pred, trained_model = model_dim2, plot = FALSE)
plot_gp()
. As this step constitutes the main evolution of the one-dimensional version, we devote an entire section to discuss it below.With the plot_gp()
function, we can display the predicted posterior mean values of ID = 31
. The prediction is represented as a 2-D heatmap of probabilities where:
Input
column of our dataset; the y-axis, to the Covariate
column;Unfortunately, 2-D inputs inevitably lead to constraints when it comes to visualisation, preventing us to provide as much information as for 1-D. In particular, the graph below does not display the mean process nor the training data, as we were not able to find appropriate representations (that would also not surcharge the graph).
plot_gp(pred_gp = pred_dim2, data = dim2_pred)
As with the unidimensional version, we can create our own grid of inputs if we want to perform prediction on a specific 2D area (potentially wider than the dataset one). Contrary to the unidimensional version, the grid_inputs
can no longer be a simple sequence of numbers for which the prediction must be performed, but a 2D grid containing :
Input
for which we want to perform prediction ; Covariate
for which we want to perform prediction.To create our grid, we use the expand_grid_inputs()
function. We only have to specify a sequence of Input
and as many covariates as we want. However, we must also ensure that we do not generate too much data points ; we recall that Magma has a cubic complexity, so the execution can be extremely long depending on the number and length of sequences. Therefore, we advise to reduce the length of the Inputs sequences if we want to perform a high dimensional prediction. Moreover, each Input must have the same name as in the data base to avoid errors during the prediction step.
grid_inputs_dim2 <- expand_grid_inputs(Input = seq(0,10,0.5), Covariate = seq(0,10,0.5)) pred_dim2 <- pred_magma(data = dim2_pred, trained_model = model_dim2, grid_inputs = grid_inputs_dim2, plot = TRUE)
Here, we perform prediction on a grid_inputs
wider than the one generated automatically by pred_magma()
. For Input
values higher than the dataset ones, no observations are available, neither from ID = 31
nor from the training dataset. Magma behaves as expected, with a slow drifting to the prior mean (here, zero) and an highly increasing variance.
As with 1-D Magma, it is possible to create animated representations thanks to pred_gif()
and plot_gif()
.
These functions offer dynamic plots by generating GIFs thanks to gganimate
package. Then, we can observe how the prediction evolves as we add more data points to our prediction dataset.
pred_gif()
and plot_gif()
functions work the same as pred_magma()
and plot_gp()
in the 2D version. Some extra arguments can be customized in plot_gif()
, like adjusting the GIF speed, saving the plotted GIF, etc...
The MagmaClustR package also provides a prediction with classic GPs in 2D; thus, we can compare Magma and classic GPs predictions.
The graphs below correspond to the 2D dynamic prediction with Magma (first one) and classic GP (second one).
model_gp_2D <- train_gp(data = dim2_pred, kern ="SE") pred_gp_2D <- pred_gp(data = dim2_pred, grid_inputs = NULL, hp = model_gp_2D, kern = "SE", plot = FALSE)
pred_gif_magma <- pred_gif(data = dim2_pred, trained_model = model_dim2, grid_inputs = grid_inputs_dim2) pred_gif_GP <- pred_gif(data = dim2_pred, grid_inputs = grid_inputs_dim2)
plot_gif(pred_gp = pred_gif_magma, data = dim2_pred)
plot_gif(pred_gp = pred_gif_GP, data = dim2_pred)
This example highlights the advantage of sharing information across individuals: even before seeing any data, Magma provides an already accurate mean process. As we add more data for the individual of interest, the uncertainty of predictions slightly decreases, although the main trend remains mainly similar.
On the other hand, the classic GP struggles to find the shape of the underlying process. On unobserved regions of the graph, the prediction remains far from the original process. As expected, though, the more data we add, the better becomes the prediction with standard GP.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.