GittinsBrezziLaiPolicy: Policy: Gittins Approximation algorithm for choosing arms in...


Description

GittinsBrezziLaiPolicy implements a Gittins index approximation algorithm based on Brezzi and Lai (2002), "Optimal learning and experimentation in bandit problems."

Details

The algorithm approximates the Gittins index with a closed-form expression that is a function of the discount factor and of the number of successes and failures recorded for each arm.
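
For illustration, the sketch below (not the package's internal code) shows such a closed-form computation for a single arm with a Beta(alpha, beta) posterior: the index is the posterior mean plus an exploration bonus scaled by a piecewise boundary function of the effective sample size and the discount rate. The helper names and the piecewise constants are assumptions based on the form commonly quoted for Brezzi and Lai (2002); verify them against the paper before reuse.

  # Sketch only -- not the package's internal implementation.
  # For an arm with Beta(alpha, beta) posterior and discount factor gamma,
  # the approximate index is mu + sqrt(mu * (1 - mu) / n) * psi(1 / (n * c)),
  # with n = alpha + beta, mu = alpha / n and c = -log(gamma).
  psi <- function(s) {
    # Piecewise boundary function; constants as commonly quoted for
    # Brezzi and Lai (2002) -- treat them as assumptions to be checked.
    if (s <= 0.2) {
      sqrt(s / 2)
    } else if (s <= 1) {
      0.49 - 0.11 / sqrt(s)
    } else if (s <= 5) {
      0.63 - 0.26 / sqrt(s)
    } else if (s <= 15) {
      0.77 - 0.58 / sqrt(s)
    } else {
      sqrt(2 * log(s) - log(log(s)) - log(16 * pi))
    }
  }

  gittins_brezzi_lai <- function(alpha, beta, discount = 0.95) {
    n  <- alpha + beta        # prior pseudo-counts plus observations
    mu <- alpha / n           # posterior mean of the Bernoulli parameter
    c  <- -log(discount)      # discount rate
    mu + sqrt(mu * (1 - mu) / n) * psi(1 / (n * c))
  }

  gittins_brezzi_lai(alpha = 3, beta = 2, discount = 0.95)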

Usage

  policy <- GittinsBrezziLaiPolicy$new(discount=0.95, prior=NULL)

Arguments

discount

numeric; discount factor

prior

numeric matrix; prior beliefs over the Bernoulli parameters governing each arm. Beliefs are specified by a Beta distribution with two parameters (alpha, beta), where alpha is the number of successes and beta the number of failures. The matrix has one row per arm and two columns (alpha, beta).
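
For example, a uniform Beta(1, 1) prior over three arms could be specified as follows (the number of arms and the counts are illustrative):

  prior  <- matrix(1, nrow = 3, ncol = 2)   # 3 arms x (alpha, beta), all Beta(1, 1)
  policy <- GittinsBrezziLaiPolicy$new(discount = 0.95, prior = prior)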

Methods

new(discount=0.95, prior=NULL)

Generates and initializes a new Policy object.

get_action(t, context)

arguments:

  • t: integer, time step t.

  • context: list, containing the current context$X (d x k context matrix), context$k (number of arms) and context$d (number of context features)

computes which arm to play based on the current values in named list theta and the current context. Returns a named list containing action$choice, which holds the index of the arm to play.
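
Illustrative only (not the package's internal code): selecting the arm with the highest approximate index, reusing the hypothetical gittins_brezzi_lai() helper sketched under Details:

  theta   <- list(list(alpha = 3,  beta = 1),   # per-arm success/failure counts
                  list(alpha = 10, beta = 9),
                  list(alpha = 1,  beta = 1))
  indices <- sapply(theta, function(arm) gittins_brezzi_lai(arm$alpha, arm$beta, 0.95))
  action  <- list(choice = which.max(indices))  # index of the arm to play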

set_reward(t, context, action, reward)

arguments:

  • t: integer, time step t.

  • context: list, containing the current context$X (d x k context matrix), context$k (number of arms) and context$d (number of context features) (as set by bandit).

  • action: list, containing action$choice (as set by policy).

  • reward: list, containing reward$reward and, if available, reward$optimal (as set by bandit).

utilizes the above arguments to update and return the set of parameters in list theta.
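
Illustrative only (not the package's internal code): the Bernoulli/Beta update implied above increments the chosen arm's alpha count on a reward of 1 and its beta count on a reward of 0:

  theta  <- list(list(alpha = 1, beta = 1), list(alpha = 1, beta = 1))
  action <- list(choice = 2)            # as returned by get_action()
  reward <- list(reward = 1)            # as returned by the bandit
  arm    <- action$choice
  theta[[arm]]$alpha <- theta[[arm]]$alpha + reward$reward
  theta[[arm]]$beta  <- theta[[arm]]$beta  + (1 - reward$reward)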

set_parameters()

Helper function, called during a Policy's initialisation, that assigns the values in list self$theta_to_arms to each of the Policy's k arms. The parameters defined here can then be accessed by arm index in the following way: theta[[index_of_arm]]$parameter_name.
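
A hypothetical plain-R illustration of this expansion (the actual per-arm assignment is performed by set_parameters() itself):

  theta_to_arms <- list(alpha = 1, beta = 1)   # per-arm parameter template
  k     <- 3                                   # number of arms
  theta <- rep(list(theta_to_arms), k)         # one copy of the template per arm
  theta[[2]]$alpha                             # alpha value of arm 2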

References

Brezzi, M., & Lai, T. L. (2002). Optimal learning and experimentation in bandit problems. Journal of Economic Dynamics and Control, 27(1), 87-108.

Implementation follows https://github.com/elarry/bandit-algorithms-simulated

See Also

Core contextual classes: Bandit, Policy, Simulator, Agent, History, Plot

Bandit subclass examples: BasicBernoulliBandit, ContextualLogitBandit, OfflineReplayEvaluatorBandit

Policy subclass examples: EpsilonGreedyPolicy, ContextualLinTSPolicy
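
Examples

A minimal end-to-end simulation sketch following the standard contextual workflow of Bandit, Policy, Agent and Simulator; the arm weights, horizon, number of simulations and plot options below are illustrative:

  library(contextual)

  bandit    <- BasicBernoulliBandit$new(weights = c(0.9, 0.1, 0.1))
  policy    <- GittinsBrezziLaiPolicy$new(discount = 0.95)
  agent     <- Agent$new(policy, bandit)

  simulator <- Simulator$new(list(agent), horizon = 100, simulations = 100)
  history   <- simulator$run()

  plot(history, type = "cumulative")
  summary(history)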
