# Set knitr chunk options for the vignette output
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
# Load packages used throughout the vignette
library(magrittr)
library(rlsims)
library(ggplot2)

Introduction to K-Armed Bandits

Multi-armed bandits, also called $K$-armed bandits, describe a type of reinforcement learning problem in which an agent selects one of $K$ possible arms on each of some number of trials. After each selection, the agent experiences a reinforcement that influences its action on the next trial. Multi-armed bandits can formalize a number of real-world problems. Consider a developer who wants to try different application icons and find which one is most appealing to customers and maximizes the number of downloads. They might use a multi-armed bandit to learn which of the $K$ icons maximizes positive reinforcement (the number of downloads).
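To make this concrete, below is a minimal base-R sketch of a three-armed Bernoulli bandit. It is illustrative only and does not use the rlsims API; the names arm_probs and pull_arm, and the payout probabilities, are hypothetical. Each arm pays 1 with a fixed probability, analogous to a download following an icon impression.

# Hypothetical three-armed Bernoulli bandit (illustrative; not the rlsims API)
set.seed(42)
arm_probs <- c(0.2, 0.5, 0.8)  # hypothetical payout probability of each arm
pull_arm <- function(k) rbinom(1, size = 1, prob = arm_probs[k])

# An agent that pulls arms uniformly at random earns, on average,
# the mean payout across arms rather than the best arm's payout.
n_trials <- 1000
random_choices <- sample(seq_along(arm_probs), n_trials, replace = TRUE)
random_rewards <- sapply(random_choices, pull_arm)
mean(random_rewards)  # roughly mean(arm_probs) = 0.5

A learning agent should do better than this random baseline by concentrating its pulls on the highest-paying arm.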

Similar ideas are explained in greater detail in Aleksandrs Slivkins' textbook, Introduction to Multi-Armed Bandits. In the introduction, Slivkins describes the key points of bandit algorithms, highlighting the tradeoff between "exploration and exploitation: making optimal near-term decisions based on the available information." Exploration involves selecting new arms; exploitation involves selecting the arm that previously led to the best reinforcement.

This tradeoff is often formalized with a decision-making policy, which describes how an agent makes a choice. In this package, we have implemented three widely used policies: greedy, which always selects the arm with the highest estimated value; epsilon-greedy, which selects a random arm with probability $\epsilon$ and otherwise selects the highest-valued arm; and softmax, which selects each arm with probability proportional to the exponential of its estimated value.
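As a sketch of how one such policy might work, the following implements epsilon-greedy selection with incremental sample-average value estimates. It builds on the hypothetical base-R setup above (arm_probs, pull_arm, n_trials) and is not the rlsims implementation.

# Epsilon-greedy sketch (illustrative; not the rlsims implementation)
epsilon <- 0.1
Q <- rep(0, length(arm_probs))  # estimated value of each arm
N <- rep(0, length(arm_probs))  # number of pulls of each arm

for (trial in seq_len(n_trials)) {
  if (runif(1) < epsilon) {
    k <- sample(seq_along(Q), 1)  # explore: choose a random arm
  } else {
    k <- which.max(Q)             # exploit: choose the best estimate so far
  }
  reward <- pull_arm(k)
  N[k] <- N[k] + 1
  Q[k] <- Q[k] + (reward - Q[k]) / N[k]  # incremental sample-average update
}
Q  # estimates should approach arm_probs, with most pulls on the best arm

A softmax policy would instead sample each arm with probability proportional to exp(Q / tau), where tau is a temperature parameter controlling how strongly higher-valued arms are favored:

# Softmax arm selection from the value estimates above (illustrative)
tau <- 0.25
softmax_probs <- exp(Q / tau) / sum(exp(Q / tau))
k <- sample(seq_along(Q), 1, prob = softmax_probs)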

Examples

Under construction!


