Description Details Usage Arguments Methods References See Also
GittinsBrezziLaiPolicy
Algorithm based on Brezzi and Lai (2002)
"Optimal learning and experimentation in bandit problems."
The algorithm provides an approximation of the Gittins index, by specifying a closed-form expression, which is a function of the discount factor, and the number of successes and failures associated with each arm.
1 | policy <- GittinsBrezziLaiPolicy$new(discount=0.95, prior=NULL)
|
discount
numeric; discount factor
prior
numeric matrix; prior beliefs over Bernoulli parameters governing each arm. Beliefs are specified by Beta distribution with two parameters (alpha,beta) where alpha = number of success, beta = number of failures. Matrix is of arms times two (alpha / beta) dimensions
new(discount=0.95, prior=NULL)
Generates and initializes a new Policy
object.
get_action(t, context)
arguments:
t
: integer, time step t
.
context
: list, containing the current context$X
(d x k context matrix),
context$k
(number of arms) and context$d
(number of context features)
computes which arm to play based on the current values in named list theta
and the current context
. Returns a named list containing
action$choice
, which holds the index of the arm to play.
set_reward(t, context, action, reward)
arguments:
t
: integer, time step t
.
context
: list, containing the current context$X
(d x k context matrix),
context$k
(number of arms) and context$d
(number of context features)
(as set by bandit
).
action
: list, containing action$choice
(as set by policy
).
reward
: list, containing reward$reward
and, if available,
reward$optimal
(as set by bandit
).
utilizes the above arguments to update and return the set of parameters in list theta
.
set_parameters()
Helper function, called during a Policy's initialisation, assigns the values
it finds in list self$theta_to_arms
to each of the Policy's k arms.
The parameters defined here can then be accessed by arm index in the following way:
theta[[index_of_arm]]$parameter_name
.
Brezzi, M., & Lai, T. L. (2002). Optimal learning and experimentation in bandit problems. Journal of Economic Dynamics and Control, 27(1), 87-108.
Implementation follows https://github.com/elarry/bandit-algorithms-simulated
Core contextual classes: Bandit
, Policy
, Simulator
,
Agent
, History
, Plot
Bandit subclass examples: BasicBernoulliBandit
, ContextualLogitBandit
,
OfflineReplayEvaluatorBandit
Policy subclass examples: EpsilonGreedyPolicy
, ContextualLinTSPolicy
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.