Bandit: ContextualWheelBandit


Description

Samples from the Wheel bandit game.

Details

The Wheel bandit game offers an artificial problem in which the need for exploration is smoothly parameterized through the exploration parameter delta.

In the game, contexts are sampled uniformly at random from a unit circle that is divided into one central and four edge areas, for a total of k = 5 possible actions. The central area offers a normally distributed reward that is independent of the context, in contrast to the outer areas, which offer normally distributed rewards that depend on a d = 2 dimensional context.

For more information, see https://arxiv.org/abs/1802.09127.
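As an illustration of this reward structure, the R sketch below samples a single context and draws one reward per arm. It follows the description above rather than the package's internal code; the parameter values and the mapping of the four outer arms to the quadrants of the unit circle are assumptions made for the example.

  # Illustrative sketch of the Wheel bandit reward structure (not the
  # package's internal implementation). The quadrant-to-arm mapping below
  # is an assumption; arm 5 plays the role of the central arm.
  set.seed(1)
  delta     <- 0.95
  mean_v    <- c(1.0, 1.0, 1.0, 1.0, 1.2)
  std_v     <- rep(0.05, 5)
  mu_large  <- 50
  std_large <- 0.01

  # sample a context uniformly at random from the unit circle
  repeat {
    X <- runif(2, min = -1, max = 1)
    if (sum(X^2) <= 1) break
  }

  means <- mean_v
  sds   <- std_v
  if (sqrt(sum(X^2)) >= delta) {
    # context falls in one of the four outer areas: the arm matching the
    # context's quadrant becomes the high-reward optimal arm
    optimal_arm        <- if (X[1] > 0) ifelse(X[2] > 0, 1, 2) else ifelse(X[2] > 0, 3, 4)
    means[optimal_arm] <- mu_large
    sds[optimal_arm]   <- std_large
  }

  rewards <- rnorm(5, mean = means, sd = sds)  # one normal draw per arm
  rewards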

Usage

  bandit <- ContextualWheelBandit$new(delta, mean_v, std_v, mu_large, std_large)

Arguments

delta

numeric; exploration parameter: high reward in one region if norm above delta.

mean_v

numeric vector; mean reward for each action if context norm is below delta.

std_v

numeric vector; Gaussian reward standard deviation for each action if context norm is below delta.

mu_large

numeric; mean reward for optimal action if context norm is above delta.

std_large

numeric; standard deviation of the reward for optimal action if context norm is above delta.

Methods

new(delta, mean_v, std_v, mu_large, std_large)

generates and instantiates a new ContextualWheelBandit instance.

get_context(t)

argument:

  • t: integer, time step t.

returns a named list containing the current d x k dimensional matrix context$X, the number of arms context$k and the number of features context$d.
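A minimal sketch of retrieving and inspecting a context directly from the bandit (outside a Simulator, which would normally drive these calls); the parameter values are example choices:

  bandit  <- ContextualWheelBandit$new(delta = 0.95,
                                       mean_v = c(1.0, 1.0, 1.0, 1.0, 1.2),
                                       std_v = rep(0.05, 5),
                                       mu_large = 50,
                                       std_large = 0.01)
  context <- bandit$get_context(t = 1)
  context$d        # number of context features (2)
  context$k        # number of arms (5)
  dim(context$X)   # d x k context matrix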

get_reward(t, context, action)

arguments:

  • t: integer, time step t.

  • context: list, containing the current context$X (d x k context matrix), context$k (number of arms) and context$d (number of context features) (as set by bandit).

  • action: list, containing action$choice (as set by policy).

returns a named list containing reward$reward and, where computable, reward$optimal (used by "oracle" policies and to calculate regret).
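Continuing the sketch above, a reward for a chosen arm might be retrieved as follows; in normal use action$choice is set by a Policy, so choosing arm 1 here is purely illustrative:

  action <- list(choice = 1)
  reward <- bandit$get_reward(t = 1, context, action)
  reward$reward    # sampled reward for the chosen arm
  reward$optimal   # optimal reward, where computable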

References

Riquelme, C., Tucker, G., & Snoek, J. (2018). Deep Bayesian Bandits Showdown: An Empirical Comparison of Bayesian Deep Networks for Thompson Sampling. arXiv preprint arXiv:1802.09127.

Implementation follows https://github.com/tensorflow/models/tree/master/research/deep_contextual_bandits

See Also

Core contextual classes: Bandit, Policy, Simulator, Agent, History, Plot

Bandit subclass examples: BasicBernoulliBandit, ContextualLogitBandit, OfflineReplayEvaluatorBandit

Policy subclass examples: EpsilonGreedyPolicy, ContextualLinTSPolicy

Examples

## Not run: 

library(contextual)

# Simulation settings
horizon       <- 1000L     # number of time steps per simulation
simulations   <- 10L       # number of repeated simulations

# Wheel bandit parameters
delta         <- 0.95      # exploration parameter: norm threshold for the high-reward region
num_actions   <- 5
context_dim   <- 2
mean_v        <- c(1.0, 1.0, 1.0, 1.0, 1.2)       # per-action means when context norm is below delta
std_v         <- c(0.05, 0.05, 0.05, 0.05, 0.05)  # per-action standard deviations below delta
mu_large      <- 50        # mean of the optimal action when context norm is above delta
std_large     <- 0.01      # standard deviation of the optimal action above delta

bandit        <- ContextualWheelBandit$new(delta, mean_v, std_v, mu_large, std_large)

# Compare a context-free policy (UCB1) with a contextual policy (disjoint LinUCB)
agents        <- list(Agent$new(UCB1Policy$new(), bandit),
                      Agent$new(LinUCBDisjointOptimizedPolicy$new(0.6), bandit))

simulation     <- Simulator$new(agents, horizon, simulations)
history        <- simulation$run()

# Plot the cumulative reward rate of both agents
plot(history, type = "cumulative", regret = FALSE, rate = TRUE, legend_position = "bottomright")

## End(Not run)
