Compute an aggregation rule

Share:

Description

The function mixture builds an aggregation rule chosen by the user. It can then be used to predict new observations Y sequentially. If observations Y and expert advice experts are provided, mixture is trained by predicting the observations in Y sequentially with the help of the expert advice in experts. At each time instance t=1,2,…,T, the mixture forms a prediction of Y[t,] by assigning a weight to each expert and by combining the expert advice.

Print an aggregation procedure

Summary of an aggregation procedure

Usage

1
2
3
4
5
6
7
8
9
mixture(Y = NULL, experts = NULL, model = "MLpol", loss.type = "square",
  loss.gradient = TRUE, coefficients = "Uniform", awake = NULL,
  parameters = list())

## S3 method for class 'mixture'
print(x, ...)

## S3 method for class 'mixture'
summary(object, ...)

Arguments

Y

A matrix with T rows and d columns. Each row Y[t,] contains a d-dimensional observation to be predicted sequentially.

experts

An array of dimension c(T,d,K), where T is the length of the data-set, d the dimension of the observations, and K is the number of experts. It contains the expert forecasts. Each vector experts[t,,k] corresponds to the d-dimensional prediction of Y[t,] proposed by expert k at time t=1,…,T. In the case of real prediction (i.e., d = 1), experts is a matrix with T rows and K columns.

model

A character string specifying the aggregation rule to use. Currently available aggregation rules are:

'EWA'

Exponentially weighted average aggregation rule. A positive learning rate eta can be chosen by the user. The bigger it is the faster the aggregation rule will learn from observations and experts performances. However, too hight values lead to unstable weight vectors and thus unstable predictions. If it is not specified, the learning rate is calibrated online. A finite grid of potential learning rates to be optimized online can be specified with grid.eta.

'FS'

Fixed-share aggregation rule. As for ewa, a learning rate eta can be chosen by the user or calibrated online. The main difference with ewa aggregation rule rely in the mixing rate alpha\in [0,1] wich considers at each instance a small probability alpha to have a rupture in the sequence and that the best expert may change. Fixed-share aggregation rule can thus compete with the best sequence of experts that can change a few times (see oracle), while ewa can only compete with the best fixed expert. The mixing rate alpha is either chosen by the user either calibrated online. Finite grids of learning rates and mixing rates to be optimized can be specified with parameters grid.eta and grid.alpha.

'Ridge'

Ridge regression. It minimizes at each instance a penalized criterion. It forms at each instance linear combination of the experts' forecasts and can assign negative weights that not necessarily sum to one. It is useful if the experts are biased or correlated. It cannot be used with specialized experts. A positive regularization coefficient lambda can either be chosen by the user or calibrated online. A finite grid of coefficient to be optimized can be specified with a parameter grid.lambda.

'MLpol'

Polynomial Potential aggregation rule with different learning rates for each expert. The learning rates are calibrated using theoretical values. There are similar aggregation rules like 'BOA' (Bernstein online Aggregation see [Wintenberger, 2014], 'MLewa', and 'MLprod' (see [Gaillard, Erven, and Stoltz, 2014])

'OGD'

Online Gradient descent (see Zinkevich, 2003). The optimization is performed with a time-varying learning rate. At time step t ≥q 1, the learning rate is chosen to be t^{-α}, where α is provided by alpha in the parameters argument. The algorithm may or not perform a projection step into the simplex space (non-negative weights that sum to one) according to the value of the parameter 'simplex' provided by the user.

loss.type

A string or a list with a component 'name' specifying the loss function considered to evaluate the performance. It can be 'square', 'absolute', 'percentage', or 'pinball'. In the case of the pinball loss, the quantile can be provided by assigning to loss.type a list of two elements:

name

A string defining the name of the loss function (i.e., 'pinball')

tau

A number in [0,1] defining the quantile to be predicted. The default value is 0.5 to predict the median.

'Ridge' aggregation rule is restricted to square loss.

loss.gradient

A boolean. If TRUE (default) the aggregation rule will not be directly applied to the loss function at hand but to a gradient version of it. The aggregation rule is then similar to gradient descent aggregation rule.

coefficients

A probability vector of length K containing the prior weights of the experts (not possible for 'MLpol'). The weights must be non-negative and sum to 1.

awake

A matrix specifying the activation coefficients of the experts. Its entries lie in [0,1]. Possible if some experts are specialists and do not always form and suggest prediction. If the expert number k at instance t does not form any prediction of observation Y_t, we can put awake[t,k]=0 so that the mixture does not consider expert k in the mixture to predict Y_t.

parameters

A list that contains optional parameters for the aggregation rule. If no parameters are provided, the aggregation rule is fully calibrated online. Possible parameters are:

eta

A positive number defining the learning rate. Possible if model is either 'EWA' or 'FS'

grid.eta

A vector of positive numbers defining potential learning rates for 'EWA' of 'FS'. The learning rate is then calibrated by sequentially optimizing the parameter in the grid. The grid may be extended online if needed by the aggregation rule.

gamma

A positive number defining the exponential step of extension of grid.eta when it is needed. The default value is 2.

alpha

A number in [0,1]. If the model is 'FS', it defines the mixing rate. If the model is 'OGD', it defines the order of the learning rate: η_t = t^{-α}.

grid.alpha

A vector of numbers in [0,1] defining potential mixing rates for 'FS' to be optimized online. The grid is fixed over time. The default value is [0.0001,0.001,0.01,0.1].

lambda

A positive number defining the smoothing parameter of 'Ridge' aggregation rule.

grid.lambda

Similar to grid.eta for the parameter lambda.

simplex

A boolean that specifies if 'OGD' does a project on the simplex. In other words, if TRUE (default) the online gradient descent will be under the constraint that the weights sum to 1 and are non-negative. If FALSE, 'OGD' performs an online gradient descent on K dimensional real space. without any projection step.

averaged

A boolean (default is FALSE). If TRUE the coefficients and the weights returned (and used to form the predictions) are averaged over the past. It leads to more stability on the time evolution of the weights but needs more regularity assumption on the underlying process genearting the data (i.i.d. for instance).

x

An object of class mixture

...

Additional parameters

object

An object of class mixture

Value

An object of class mixture that can be used to perform new predictions. It contains the parameters model, loss.type, loss.gradient, experts, Y, awake, and the fields

coefficients

A vector of coefficients assigned to each expert to perform the next prediction.

weights

A matrix of dimension c(T,K), with T the number of instances to be predicted and K the number of experts. Each row contains the convex combination to form the predictions

prediction

A matrix with T rows and d columns that contains the predictions outputted by the aggregation rule.

loss

The average loss (as stated by parameter loss.type) suffered by the aggregation rule.

parameters

The learning parameters chosen by the aggregation rule or by the user.

training

A list that contains useful temporary information of the aggregation rule to be updated and to perform predictions.

Methods (by class)

  • mixture: print

  • mixture: summary

Author(s)

Pierre Gaillard <pierre@gaillard.me>

See Also

See opera-package and opera-vignette for a brief example about how to use the package.

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
#' 
library('opera')  # load the package
set.seed(1)

# Example: find the best one week ahead forecasting strategy (weekly data)
# packages
library(mgcv)

# import data
data(electric_load)
idx_data_test <- 620:nrow(electric_load)
data_train <- electric_load[-idx_data_test, ]
data_test <- electric_load[idx_data_test, ]

# Build the expert forecasts 
# ##########################

# 1) A generalized additive model
gam.fit <- gam(Load ~ s(IPI) + s(Temp) + s(Time, k=3) + 
                s(Load1) + as.factor(NumWeek), data = data_train)
gam.forecast <- predict(gam.fit, newdata = data_test)

# 2) An online autoregressive model on the residuals of a medium term model

# Medium term model to remove trend and seasonality (using generalized additive model)
detrend.fit <- gam(Load ~ s(Time,k=3) + s(NumWeek) + s(Temp) + s(IPI), data = data_train)
electric_load$Trend <- c(predict(detrend.fit), predict(detrend.fit,newdata = data_test))
electric_load$Load.detrend <- electric_load$Load - electric_load$Trend

# Residual analysis
ar.forecast <- numeric(length(idx_data_test))
for (i in seq(idx_data_test)) {
 ar.fit <- ar(electric_load$Load.detrend[1:(idx_data_test[i] - 1)])
 ar.forecast[i] <- as.numeric(predict(ar.fit)$pred) + electric_load$Trend[idx_data_test[i]]
}

# Aggregation of experts
###########################

X <- cbind(gam.forecast, ar.forecast)
colnames(X) <- c('gam', 'ar')
Y <- data_test$Load

matplot(cbind(Y, X), type = 'l', col = 1:6, ylab = 'Weekly load', xlab = 'Week')


# How good are the expert? Look at the oracles
oracle.convex <- oracle(Y = Y, experts = X, loss.type = 'square', model = 'convex')
plot(oracle.convex)
oracle.convex

# Is a single expert the best over time ? Are there breaks ?
oracle.shift <- oracle(Y = Y, experts = X, loss.type = 'percentage', model = 'shifting')
plot(oracle.shift)
oracle.shift

# Online aggregation of the experts with BOA
#############################################

# Initialize the aggregation rule
m0.BOA <- mixture(model = 'BOA', loss.type = 'square')

# Perform online prediction using BOA There are 3 equivalent possibilities 1)
# start with an empty model and update the model sequentially
m1.BOA <- m0.BOA
for (i in 1:length(Y)) {
 m1.BOA <- predict(m1.BOA, newexperts = X[i, ], newY = Y[i])
}

# 2) perform online prediction directly from the empty model
m2.BOA <- predict(m0.BOA, newexpert = X, newY = Y, online = TRUE)

# 3) perform the online aggregation directly
m3.BOA <- mixture(Y = Y, experts = X, model = 'BOA', loss.type = 'square')

# These predictions are equivalent:
identical(m1.BOA, m2.BOA)  # TRUE
identical(m1.BOA, m3.BOA)  # TRUE

# Display the results
summary(m3.BOA)
plot(m1.BOA)