SimulateMultiplePeriods: Simulate strategies for Multi-Armed Bandit in multiple periods

Description Usage Arguments Details Value Examples

Description

This function simulates data and runs Multi-Armed Bandit strategies over a sequence of periods. Weight and regret plots are returned if requested. Each period may contain multiple pulls, and each method is applied once per period. By default each period has a single pull, corresponding to continuous updating.

Usage

SimulateMultiplePeriods(method = "Thompson-Sampling",
  method.par = list(ndraws.TS = 1000), nburnin, nperiod, reward.family,
  mean.reward, sd.reward = NULL, npulls.per.period = 1,
  weight.plot = FALSE, regret.plot = FALSE)

Arguments

method

A character string chosen from "Epsilon-Greedy", "Epsilon-Decreasing", "Thompson-Sampling", "EXP3", "UCB", "Bayes-Poisson-TS", "Greedy-Thompson-Sampling", "EXP3-Thompson-Sampling", "Greedy-Bayes-Poisson-TS", "EXP3-Bayes-Poisson-TS" and "HyperTS". For descriptions of these methods, see the Details section below. Default is "Thompson-Sampling".

method.par

A list of parameters needed for different methods:

epsilon: A real number between 0 and 1; needed for "Epsilon-Greedy", "Epsilon-Decreasing", "Greedy-Thompson-Sampling" and "Greedy-Bayes-Poisson-TS".

ndraws.TS: A positive integer specifying the number of random draws from the posterior; needed for "Thompson-Sampling", "Greedy-Thompson-Sampling" and "EXP3-Thompson-Sampling". Default is 1000.

EXP3: A list consisting of two real numbers eta and gamma; eta > 0 and 0 <= gamma < 1; needed for "EXP3", "EXP3-Thompson-Sampling" and "EXP3-Bayes-Poisson-TS".

BP: A list consisting of three positive integers iter.BP, ndraws.BP and interval.BP; needed for "Bayes-Poisson-TS", "Greedy-Bayes-Poisson-TS" and "EXP3-Bayes-Poisson-TS". iter.BP specifies the number of iterations used to compute the posterior; ndraws.BP specifies the number of samples drawn from the posterior distribution; interval.BP is the thinning interval, i.e. each posterior sample is drawn from a sample sequence of length interval.BP.

HyperTS: A list consisting of a vector method.list; needed for "HyperTS". method.list is a vector of character strings chosen from "Epsilon-Greedy", "Epsilon-Decreasing", "Thompson-Sampling", "EXP3", "UCB", "Bayes-Poisson-TS", "Greedy-Thompson-Sampling", "EXP3-Thompson-Sampling", "Greedy-Bayes-Poisson-TS" and "EXP3-Bayes-Poisson-TS". "HyperTS" constructs an ensemble consisting of all the methods in method.list.

nburnin

A positive integer specifying the number of burn-in periods, during which each arm receives equal traffic before any strategy is applied.

nperiod

A positive integer specifying the number of periods to apply the strategy.

reward.family

A character string specifying the distribution family of the reward. Available distributions are "Bernoulli", "Poisson" and "Gaussian". If "Gaussian" is chosen as the reward distribution, a vector of standard deviations must be provided in sd.reward.

mean.reward

A vector or a matrix of real numbers specifying the mean reward of each arm. If mean.reward is a vector, each element is the mean reward of one arm, and the mean rewards stay unchanged throughout all periods (the stationary Multi-Armed Bandit). If mean.reward is a matrix, it must have (nburnin + nperiod) rows; each row is the mean-reward vector for one period, so arm means can change over time (the nonstationary or adversarial Multi-Armed Bandit).

sd.reward

A vector of non-negative numbers specifying the standard deviation of each arm's reward distribution when "Gaussian" is chosen as the reward distribution. Defaults to NULL. See reward.family.

npulls.per.period

A positive integer or a vector of positive integers. Default is 1. If npulls.per.period is a single positive integer, every period has npulls.per.period pulls. If npulls.per.period is a vector, each element is the number of pulls for one period; its length must equal nburnin + nperiod.

weight.plot

A logical value, FALSE by default. If TRUE, a plot object showing the weight of each arm over time is returned.

regret.plot

A logical value, FALSE by default. If TRUE, a plot object showing the relative regret is returned.
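As a sketch of the mean.reward and npulls.per.period formats described above (the numbers here are made-up illustrations, not defaults):

```r
# Nonstationary mean rewards: one row per period, one column per arm.
nburnin <- 30
nperiod <- 180
mean.reward <- cbind(
  seq(0.02, 0.08, length.out = nburnin + nperiod),  # arm 1 improves over time
  rep(0.05, nburnin + nperiod),                     # arm 2 stays constant
  seq(0.08, 0.02, length.out = nburnin + nperiod)   # arm 3 degrades over time
)
# Varying traffic: one pull count per period.
npulls.per.period <- rep(c(5, 10), length.out = nburnin + nperiod)
stopifnot(nrow(mean.reward) == nburnin + nperiod,
          length(npulls.per.period) == nburnin + nperiod)
```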

Details

Various methods have been implemented. "Epsilon-Greedy" and "Epsilon-Decreasing" allocate 1 - epsilon of the traffic to the arm with the largest average reward and distribute the remaining traffic equally among the other arms. For "Epsilon-Greedy", epsilon in method.par serves as a constant exploration rate. For "Epsilon-Decreasing", epsilon in method.par is the exploration rate at period 1, and at period t the exploration rate is epsilon / t. See https://en.wikipedia.org/wiki/Multi-armed_bandit#Approximate_solutions for more details about these strategies.
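The allocation rule just described can be sketched as follows (an illustration only, not the package's internal implementation; the helper name is made up):

```r
# Epsilon-Greedy weights: 1 - epsilon to the arm with the largest average
# reward so far, epsilon split equally among the remaining arms.
EpsilonGreedyWeights <- function(avg.reward, epsilon) {
  k <- length(avg.reward)
  w <- rep(epsilon / (k - 1), k)   # equal share of epsilon for non-best arms
  w[which.max(avg.reward)] <- 1 - epsilon
  w
}
EpsilonGreedyWeights(c(0.03, 0.07, 0.05), epsilon = 0.1)  # 0.05 0.90 0.05
# Epsilon-Decreasing at period t uses epsilon / t in place of epsilon.
```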

"Thompson-Sampling" refers to Beta-Binomial Thompson Sampling using Beta(1, 1) as the prior. "Bayes-Poisson-TS" refers to Poisson-Gamma Thompson Sampling using a Bayesian Generalized Linear Mixed Effects Model to compute weights. "Bayes-Poisson-TS", "Greedy-Bayes-Poisson-TS" and "EXP3-Bayes-Poisson-TS" depend on the package "emre" to compute the posterior distribution. For algorithm details, see the paper https://arxiv.org/abs/1602.00047.

UCB (Upper Confidence Bound) is a classical method for Multi-Armed Bandit. For algorithm details, see the paper http://personal.unileoben.ac.at/rortner/Pubs/UCBRev.pdf. EXP3 is a method which needs to specify exploration rate gamma and exploitation rate eta. For algorithm details, see the paper https://cseweb.ucsd.edu/~yfreund/papers/bandits.pdf.

Ensemble methods are also implemented. "Greedy-Thompson-Sampling" and "Greedy-Bayes-Poisson-TS" allocate 1 - epsilon of the traffic to the arm with the largest Thompson sampling weight and distribute the remaining epsilon traffic according to the Thompson sampling weights. Instead of using the average reward of each period to update weights as in "EXP3", "EXP3-Thompson-Sampling" and "EXP3-Bayes-Poisson-TS" use Thompson sampling weights in the "EXP3" updating formula. "HyperTS" is an ensemble that applies Thompson Sampling to select the best method in each period based on past performance. For algorithm details, see the paper http://yxjiang.github.io/paper/RecSys2014-ensemble-bandit.pdf.
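The Greedy-Thompson-Sampling mixing rule described above can be sketched like this (an illustration only; the helper name is made up):

```r
# Greedy-TS weights: epsilon of the traffic follows the Thompson sampling
# weights, and the extra mass 1 - epsilon goes to the arm with the largest
# Thompson sampling weight.
GreedyTSWeights <- function(ts.weight, epsilon) {
  w <- epsilon * ts.weight
  best <- which.max(ts.weight)
  w[best] <- w[best] + (1 - epsilon)
  w
}
GreedyTSWeights(c(0.2, 0.5, 0.3), epsilon = 0.1)  # 0.02 0.95 0.03
```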

To measure performance, regret is computed by summing, over arms and periods, the product of the number of pulls on an arm in a period and the gap between that arm's mean reward and the largest mean reward in that period. Relative regret is the regret of a method divided by the regret of the benchmark method that allocates equal weight to each arm throughout all periods.
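The regret definitions above amount to the following computation (a sketch with made-up inputs, not the package's internal code):

```r
# Regret: sum over periods and arms of
#   pulls[period, arm] * (largest mean reward in period - arm's mean reward)
ComputeRegret <- function(npulls, mean.reward) {
  gap <- apply(mean.reward, 1, max) - mean.reward  # per-period reward gaps
  sum(npulls * gap)
}
mean.reward <- rbind(c(0.05, 0.08),   # period 1: arm 2 is best
                     c(0.06, 0.08))   # period 2: arm 2 is best
method.pulls  <- rbind(c(1, 9), c(0, 10))  # pulls under some strategy
uniform.pulls <- rbind(c(5, 5), c(5, 5))   # equal-weight benchmark
relative.regret <- ComputeRegret(method.pulls, mean.reward) /
  ComputeRegret(uniform.pulls, mean.reward)
```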

Value

A list consisting of:

weight

A matrix of allocated weights, with one element per arm and period. Each row represents one arm and each column represents one period.

regret

A vector of relative regret, with one element per period. For the definition of relative regret, see Details.

weight.plot.object

If weight.plot = TRUE, a ggplot object is returned.

regret.plot.object

If regret.plot = TRUE, a ggplot object is returned.

Examples

### Simulate Thompson-Sampling
set.seed(100)
res <- SimulateMultiplePeriods(method = "Thompson-Sampling",
                               method.par = list(ndraws.TS = 1000),
                               nburnin = 30,
                               nperiod = 180,
                               npulls.per.period = 5,
                               reward.family = "Bernoulli",
                               mean.reward = runif(3, 0, 0.1),
                               weight.plot = TRUE)
res$weight.plot.object
### Simulate EXP3-Thompson-Sampling
set.seed(100)
res <- SimulateMultiplePeriods(
           method = "EXP3-Thompson-Sampling",
           method.par = list(ndraws.TS = 1000,
                             EXP3 = list(gamma = 0, eta = 0.1)),
           nburnin = 30,
           nperiod = 180,
           npulls.per.period = 5,
           reward.family = "Bernoulli",
           mean.reward = runif(3, 0, 0.1),
           weight.plot = TRUE)
res$weight.plot.object
### Simulate ensemble method HyperTS given "Thompson-Sampling", "Epsilon-Greedy" and "Epsilon-Decreasing"
set.seed(100)
res <- SimulateMultiplePeriods(
           method = "HyperTS",
           method.par = list(
               ndraws.TS = 1000,
               epsilon = 0.1,
               HyperTS = list(method.list = c("Thompson-Sampling",
                                              "Epsilon-Greedy",
                                              "Epsilon-Decreasing"))),
           nburnin = 30,
           nperiod = 180,
           npulls.per.period = 5,
           reward.family = "Poisson",
           mean.reward = runif(3, 0, 0.1),
           weight.plot = TRUE)
res$weight.plot.object

google/MAB documentation built on May 23, 2019, 9:01 p.m.