A multi-armed bandit is a sequential experiment with the goal of achieving the largest possible reward from a payoff distribution with unknown parameters. At each stage, the experimenter must decide which arm of the experiment to observe next. The choice involves a fundamental trade-off between the utility gain from exploiting arms that appear to be doing well (based on limited sample information) vs exploring arms that might potentially be optimal, but which appear to be inferior because of sampling variability. [@Scott2010]
A bandit algorithm is a learning algorithm that approaches this problem in a principled way. It selects selects experimental arms to optimally balance between exploration and exploitation and maximize the reward in the long run.
The banditr
package implements several popular algorithms that solve the multi-armed bandit problem. Currently, the package supports two algorithms:
The banditr
package provides a comprehensive solution that handles the entire experimental sequence in a single object. Because such sequential experiments may involve large amounts of data, the package allows storing its data in memory, as a self-contained R object, but also remotely, using the host's filesystem and a database management system (DBMS). Currently, the package supports one DBMS: Microsoft SQL Server.
This vignette gives an overview of how banditr
handles bandit
objects, using a LinUCB bandit as an example. Another vignette details how to use banditr
with a DBMS.
In the multi-armed bandit problem, the experimenter starts with $M_0$ completed experiments for which she observes the vector of outcomes $Y_0$, and the $M_0 \times K$ matrix $X_0$ of features. She has an $N_0 \times K$ matrix $Z$ of characteristics for treatment arms for which she does not observe the outcome. At each time period $t$, she selects a treatment arm $i$ for which she observes the outcome $y_i$. Her goal is to solve the following maximization problem: $$ \max_{i \in {1,...,R_t}} y_i $$
Bandit algorithms implement mechanisms to solve this maximization problem. Suppose that the true model is $Y = X \beta$. The LinUCB and Thompson sampling algorithms perform a similar task at each time period. They update the model using completed experiments, and provide a rule $f(\beta_t, Z_t): (\mathbb{R}^{K}, \mathbb{R}^{(N_t \times K)}) \rightarrow {1,...,N_t}$ to select the next next treatment arm:
Let $\hat{y}_{it}$ be the predicted outcome for treatment arm $i$. At each time period $t$, the LinUCB algorithm selects the experimental arm $i$ with the highest upper confidence bound $p(i,t) = \hat{y}_i^T \hat{\beta}_t + \alpha U_t(i)$, where $U_t(i) = \sqrt{z_i^T\left(X_t^T X_t + \lambda I_K \right)^{-1} z_i}$ is the uncertainty around $\hat{y}_i$. The parameter $\alpha \geq 0$ is a tuning parameter that arbitrates between exploitation (selecting treatment arms with a high predicted reward $\hat{y}_i$), and exploration (selecting treatment arms with high uncertainty $U_t(i)$). The parameter $\lambda > 0$ introduces ridge regularization.
For modelling, the package currently supports linear, and logistic regressions with ridge regularization, and allows for a first stage of variable selection using the LASSO, as well as the automatic selection of tuning parameters using K-fold cross-validation. All these functionalities are implemented using the glmnet
package, wrapped into a new banditGlmnet
class, that supports the formula interface.
The Thomspon sampling algorithm operate in a Bayesian framework. That is, model parameters follow a posterior distribution $\Pr(\beta_t | X_t, Y_t)$ At each time period, the algorithm selects the next experimental arm at random, with a weight proportional to the posterior expected predicted probability of maximizing the reward. Let $\hat{y}{it}(\beta)$ be the predicted outcome for treatment arm $i$ at time $t$ for parameters $\beta$, and define $I{it}(\beta) = 1$ if treatment arm $i$ solves $\max_{i \in (1, ..., Z_t)} \hat{y}{it}(\beta)$, and $I{it}(\beta) = 0$ otherwise. Then the Thompson sampling weights for arm $i$ at time $t$ can be estimated with the following: $$ w_{it} = \frac{1}{M} \sum_{m=1}^M I_{it}(\beta_t^{(m)}), $$ where $\beta^{(m)}$ is the $m$-th draw from the posterior distribution at time $t$. The equation basically defines weights as the proportion of posterior draws for which $i$ is maximal. One can set the amount of exploration by sampling according to some weight $w_{it}^\gamma$, where the algorithm will tend to explore more for $\gamma < 1$, and exploit more for $\gamma > 1$.
For modelling, the package currently supports linear, and logistic regressions, as well as random effect models. Estimation is conducted using the rstanarm
package.
First, let's simulate data according to the following payoff distribution: $$ \Pr(y_i = 1 | x_i) = \text{logit}\left(\sum_{j=-4}^{5} \beta_j x_{ij} \right), $$ where $\text{logit}$ is the logistic function $\text{logit}(x) = \frac{1}{1+\exp(-x)}$.
set.seed(999) x <- matrix(rnorm(10e3), 1e3, 10) beta <- -4:5 y <- as.numeric(plogis(x %*% beta)) y <- sapply(y, rbinom, n = 1, size = 1) colnames(x) <- paste0("v", 1:10) df <- as.data.frame(x) df <- cbind(y, df)
The package uses Reference Classes (RC) to store bandit objects. To handle future samples in the experimental sequence, the data needs to contain an id
column, that serves as a primary key, and a column named y
that contains the ouctome.
df <- cbind(id = 1:1000, df) # add a primary key to the data
Let's create a LinUCB bandit. By default, the package uses a value of $\alpha = 1$ as a tuning parameter. Let's create a binomial LinUCB bandit and inspect it.
library(banditr) # initialize the bandit using the first 100 samples start <- df[1:100,] bdt_ucb <- bandit_ucb(formula = y ~ -1 + v1 + v2 + v3 + v4 + v5 + v6 + v7 + v8 + v9 + v10, family = "binomial", data = start) summary(bdt_ucb)
Thomspon sampling bandits are created in a similar way, and use a default value $\gamma = 1$ for the tuning parameter:
bdt_thom <- bandit_stan_glm(formula = y ~ -1 + v1 + v2 + v3 + v4 + v5 + v6 + v7 + v8 + v9 + v10, family = "binomial", data = start)
The user manages the experiment by performing jobs on the bandit. First, let's update the model using our current training set:
bdt_ucb$train() bdt_thom$train(chains = 1) # use only one Markov chain for demonstration purposes summary(bdt_ucb)
We then add experimental arms to the bandit, and obtain their score (for the LinUCB algorithm), and their sampling weight (for the Thomspon sampling algorithm):
add <- df[101:1000,] add$y <- NA # let's pretend we haven't observed the reward yet. bdt_ucb$addSamples(add) bdt_thom$addSamples(add) pr_ucb <- predict(bdt_ucb) pr_thom <- predict(bdt_thom)
For LinUCB bandits, the predict
method returns a data.frame
with the predicted response of all experimental arms whose outcome is missing, and their score. The LinUCB algorithm requires observing the outcome of the experimental arm with the highest score.
head(pr_ucb) id_ucb <- rownames(pr_ucb)[which.max(pr_ucb$score)] y_ucb <- df$y[df$id == id_ucb] y_ucb
The Thompson sampling algorithm requires sampling the next observation with sampling weights.
head(pr_thom) id_thom <- rownames(pr_thom)[sample.int(nrow(pr_thom), 1, prob = pr_thom$weight)] y_thom <- df$y[df$id == id_thom] y_thom
Let's add this newly observed outcome to the bandit, and update the model with this new information. Outcomes must be added as a named vector, with names pointing to the ids of the corresponding arms.
names(y_ucb) <- id_ucb bdt_ucb$addOutcomes(y_ucb) # add the new outcome to the bandit bdt_ucb$train() # update it names(y_thom) <- id_thom # likewise, for the Thompson bandit bdt_thom$addOutcomes(y_thom) bdt_thom$train(chains = 1)
Note that the LinUCB bandit supports ridge regularization and variable selection with the LASSO. By default, the model uses neither. Both tuning parameters can be set manually, or using K-fold validation (with glmnet::cv.glmnet
). Let's introduce some ridge regularization, and a stage of variable selection.
# set the tuning parameter for ridge regularization to 1 bdt_ucb$tune(param = 'lambdaRidge', value = 1) # select the tuning parameter for the LASSO using 10-fold cross validation bdt_ucb$tune(param = 'lambdaLasso', value = 'auto', lambdaAuto = 'lambda.1se')
The banditr
package provides a few built-in diagnostic plots, and allows for convenient extraction of relevant objects for further exploration. First, let's extend the experiment for a few more iterations:
for (i in 1:20) { pr_ucb <- predict(bdt_ucb) id <- rownames(pr_ucb)[which.max(pr_ucb$score)] y <- df$y[df$id == id] names(y) <- id bdt_ucb$addOutcomes(y) bdt_ucb$train() }
The package provides two diagnostics plots:
plot(bdt_ucb) # the default diagnostic: cumulative reward over time
plot(bdt_ucb, what = "coef") # parameter values over time
Besides the standard coef
method to extract current model coefficients, other extraction methods are prefixed with get
.
coef(bdt_ucb) mod <- getModel(bdt_ucb) # extract the last model mf <- getSamples(bdt_ucb) # extract complete samples jobs <- getJobs(bdt_ucb) # extract a summary of all jobs
Bandits can either be stored in memory, as a self-contained R object, or store the data remotely, using a DBMS and the host's filesystem. Currently, only Microsoft SQL Server is supported, through the RODBC
package.
Remote bandits are created the same way as local bandit, but take two extra arguments: db
, which is a list passed to RODBC::odbcDriverConnect
to establish the connection with the database, and path
, which is the folder in which models will be stored.
connectionString <- "driver={SQL Server};server=localhost;database=bandit" path <- "banditFolder/" bdt_db <- bandit_ucb(y ~ -1 + v1 + v2 + v3 + v4 + v5 + v6 + v7 + v8 + v9 + v10, family = "binomial", data = start, db = list(connection = connectionString), path = path)
The tables created by banditr
use the following schema:
jobs
: completed jobs. job
: primary keysamples
: samples, and their (possibly NULL) outcomes; added to the bandit using addSamples()
and addOutcomes()
respectively. id
: primary keyjobSamples
: foreign key on jobs.job
jobOutcome
: foreign key on jobs.job
stats
: statistics on samples, computed when using addOutcomes()
id
: foreign key on samples.id
coef
: value of model coefficients after each training job. jobTrain
: foreign key on jobs.job
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.