Parent or superclass of all {contextual} Policy subclasses.
At every time step t = {1, ..., T}, a policy receives a d-dimensional feature vector or a
d x k dimensional matrix in
context$X*, the current number of Bandit arms in context$k, and the current
number of contextual features in context$d.
To make sure a policy supports both contextual feature vectors and matrices in context$X, it is
suggested that any contextual policy use contextual's get_arm_context(context, arm)
utility function to obtain the current context for a particular arm, and get_full_context(context)
where the policy makes direct use of a d x k context matrix.
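The difference between the two shapes of context$X can be illustrated with a small base R sketch. The function arm_context_sketch below is a hypothetical stand-in written for this illustration; it is not the package's get_arm_context, and it assumes that the columns of a d x k matrix correspond to arms.

```r
# Hypothetical stand-in for a per-arm context lookup (not the package's own code).
arm_context_sketch <- function(context, arm) {
  X <- context$X
  if (is.matrix(X)) {
    # d x k matrix: column `arm` holds the features of that arm (assumed layout)
    X[, arm]
  } else {
    # d-dimensional vector: every arm shares the same features
    X
  }
}

shared  <- list(X = c(1, 0, 1), k = 2, d = 3)
per_arm <- list(X = matrix(1:6, nrow = 3, ncol = 2), k = 2, d = 3)

arm_context_sketch(shared, 2)   # 1 0 1
arm_context_sketch(per_arm, 2)  # 4 5 6
```

A policy written against such a helper does not need to know which of the two shapes a Bandit supplies.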
The policy then has to compute which of the k
Bandit arms to pull by taking into account this contextual information plus the policy's
current parameter values stored in the named list theta. On selecting an arm, the policy
returns its index as action$choice.
On pulling a Bandit arm, the policy receives a Bandit reward through
reward$reward. In combination with the current context$X* and action$choice,
this reward can then be used to update the policy's parameters as stored in list theta.
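The select, pull, update cycle described above can be sketched in plain R without the package. Everything here is an assumption made for illustration: the epsilon-greedy rule, the Bernoulli reward probabilities in true_p, the parameter names n and mean, and the fact that theta is passed around as an explicit argument (in the package it is a field of the Policy object).

```r
set.seed(42)

k       <- 3                     # number of arms
horizon <- 200                   # T, number of time steps
true_p  <- c(0.2, 0.5, 0.8)      # made-up Bernoulli reward probabilities
epsilon <- 0.1                   # exploration rate (assumed)

# One parameter list per arm, accessed as theta[[arm]]$parameter_name
theta <- rep(list(list(n = 0, mean = 0)), k)

get_action <- function(t, context, theta) {
  if (runif(1) < epsilon) {
    choice <- sample.int(context$k, 1)                      # explore
  } else {
    means  <- vapply(theta, function(p) p$mean, numeric(1))
    choice <- which.max(means)                              # exploit
  }
  list(choice = choice)
}

set_reward <- function(t, context, action, reward, theta) {
  arm <- action$choice
  theta[[arm]]$n    <- theta[[arm]]$n + 1
  # incremental running mean of the rewards observed for this arm
  theta[[arm]]$mean <- theta[[arm]]$mean +
    (reward$reward - theta[[arm]]$mean) / theta[[arm]]$n
  theta
}

for (t in seq_len(horizon)) {
  context <- list(k = k)                                    # context-free: X omitted
  action  <- get_action(t, context, theta)
  reward  <- list(reward = rbinom(1, 1, true_p[action$choice]))
  theta   <- set_reward(t, context, action, reward, theta)
}

pulls <- vapply(theta, function(p) p$n, numeric(1))
means <- vapply(theta, function(p) p$mean, numeric(1))
```

After the loop, pulls holds how often each arm was played and means the estimated reward per arm.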
* Note: in context-free scenarios, context$X can be omitted.
new()
Generates and initializes a new Policy object.
get_action(t, context)
arguments:
  t: integer, time step t.
  context: list, containing the current context$X (d x k context matrix),
    context$k (number of arms) and context$d (number of context features).
Computes which arm to play based on the current values in named list theta
and the current context. Returns a named list containing
action$choice, which holds the index of the arm to play.
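As one possible reading of get_action for a contextual policy, the sketch below scores each arm with a linear rule and returns the best one. The function name get_action_sketch, the parameter name beta, and the explicit theta argument are assumptions for this example, not part of the documented interface.

```r
# Hypothetical linear scoring rule: score(arm) = sum(beta_arm * x_arm).
get_action_sketch <- function(t, context, theta) {
  X <- context$X
  scores <- vapply(seq_len(context$k), function(arm) {
    # accept either a d x k matrix or a shared d-dimensional vector
    x <- if (is.matrix(X)) X[, arm] else X
    sum(theta[[arm]]$beta * x)
  }, numeric(1))
  list(choice = which.max(scores))
}

# d = 2 features, k = 3 arms; the matrix is filled column by column
context <- list(X = matrix(c(1, 0,
                             0, 1,
                             1, 1), nrow = 2, ncol = 3), k = 3, d = 2)
theta <- list(list(beta = c(0.5, 0.1)),
              list(beta = c(0.2, 0.9)),
              list(beta = c(0.3, 0.3)))

action <- get_action_sketch(1, context, theta)
action$choice   # 2, since the scores are 0.5, 0.9 and 0.6
```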
set_reward(t, context, action, reward)
arguments:
  t: integer, time step t.
  context: list, containing the current context$X (d x k context matrix),
    context$k (number of arms) and context$d (number of context features)
    (as set by bandit).
  action: list, containing action$choice (as set by policy).
  reward: list, containing reward$reward and, if available,
    reward$optimal (as set by bandit).
Uses the above arguments to update and return the set of parameters in list theta.
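A parameter update inside set_reward might look like the following sketch, which accumulates the per-arm statistics commonly used by linear policies (A <- A + x x^T, b <- b + r x). The names set_reward_sketch, A and b are hypothetical, and theta is again passed explicitly instead of being read from the Policy object.

```r
# Hypothetical per-arm update for a linear model of the reward.
set_reward_sketch <- function(t, context, action, reward, theta) {
  arm <- action$choice
  X   <- context$X
  x   <- if (is.matrix(X)) X[, arm] else X
  theta[[arm]]$A <- theta[[arm]]$A + outer(x, x)          # A <- A + x x^T
  theta[[arm]]$b <- theta[[arm]]$b + reward$reward * x    # b <- b + r x
  theta                                                   # return updated parameters
}

d <- 2
k <- 2
theta   <- rep(list(list(A = diag(d), b = rep(0, d))), k)
context <- list(X = matrix(c(1, 2, 3, 4), nrow = d, ncol = k), k = k, d = d)

theta <- set_reward_sketch(1, context, list(choice = 2), list(reward = 1), theta)

# Ridge-style coefficient estimate for the arm that was just updated
beta_hat <- solve(theta[[2]]$A, theta[[2]]$b)
```

Only the parameters of the chosen arm change; those of the other arms are returned untouched.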
post_initialization()
Post-initialization happens after cloning the Policy instance number_of_simulations times.
Do simulation-level random generation here.
set_parameters()
Helper function, called during a Policy's initialisation, that assigns the values
it finds in list self$theta_to_arms to each of the Policy's k arms.
The parameters defined here can then be accessed by arm index in the following way:
theta[[index_of_arm]]$parameter_name.
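The effect described for set_parameters can be mimicked in base R as below. This is only a sketch of the access pattern stated above, using made-up parameter names n and mean; how the package itself lays out theta internally may differ between versions.

```r
# Template of per-arm parameters, standing in for self$theta_to_arms (assumed content)
theta_to_arms <- list(n = 0, mean = 0)
k <- 3

# Give each of the k arms its own copy of the template
theta <- rep(list(theta_to_arms), k)

# Access and update by arm index: theta[[index_of_arm]]$parameter_name
theta[[2]]$mean <- 0.7
theta[[2]]$mean   # 0.7
theta[[1]]$mean   # 0, other arms are unaffected
```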
Core contextual classes: Bandit, Policy, Simulator,
Agent, History, Plot
Bandit subclass examples: BasicBernoulliBandit, ContextualLogitBandit,
OfflineReplayEvaluatorBandit
Policy subclass examples: EpsilonGreedyPolicy, ContextualLinTSPolicy