```r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(cramR)
library(data.table)
```
In order to use `cram_bandit()`, users must supply a matrix of action selection probabilities ( \pi_t(X_j, A_j) ) for each combination of policy update ( t ) and context ( j ) in the historical dataset.
While some environments log these probabilities directly, many contextual bandit libraries (such as `contextual`) only store policy parameters (e.g., regression coefficients) without explicit probability tracking.
This article explains how Cram Bandit Helpers reconstruct ( \pi_t(X_j, A_j) ) from these parameters for common policies:
| Policy Type | Class Name |
|------------------|------------------------------------------|
| Epsilon-Greedy | `BatchContextualEpsilonGreedyPolicy` |
| LinUCB Disjoint with ( \varepsilon )-greedy exploration | `BatchLinUCBDisjointPolicyEpsilon` |
| Thompson Sampling | `BatchContextualLinTSPolicy` |
Both theoretical formulas and practical code snippets are provided.
When using linear bandit algorithms like Epsilon-Greedy, LinUCB, or Thompson Sampling, each arm (k) maintains summary statistics (parameters) to estimate the expected reward:
( A_k ) is the Gram matrix:
[ A_k = X_k^T X_k ]
where (X_k) is the matrix of feature vectors (contexts) for all rounds where arm (k) was selected.
Interpretation: (A_k) captures the amount of information (and correlation structure) about the features for arm (k). It plays the role of a "feature covariance matrix."
( b_k ) is the response vector:
[ b_k = X_k^T y_k ]
where (y_k) are the observed rewards for arm (k).
Interpretation: (b_k) captures the relationship between the observed rewards and the contexts for arm (k).
These sufficient statistics allow the policy to compute the Least Squares estimate for the reward model:
[ \theta_k = A_k^{-1} b_k ]
where ( \theta_k ) is the estimated coefficient vector of the linear reward model for arm ( k ).
The policy selects an action based on the ( \theta_k ) of each arm ( k ), observes the reward associated with this choice, and uses it to update the parameters (A_k) and (b_k) of the policy.
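As a concrete illustration, here is a minimal R sketch of this bookkeeping for a single arm. Initialising ( A_k ) to the identity (a ridge-style prior that keeps the matrix invertible) is an assumption of the sketch, not part of the definitions above:

```r
# Minimal sketch: update one arm's sufficient statistics after observing
# a context x_t and a reward r_t, then recover theta_k = A_k^{-1} b_k.
d   <- 3                      # number of context features
A_k <- diag(d)                # Gram matrix (identity initialisation, assumed)
b_k <- rep(0, d)              # response vector

x_t <- c(0.2, -1.0, 0.5)      # context observed when arm k was played
r_t <- 1                      # reward observed for that choice

A_k <- A_k + tcrossprod(x_t)  # A_k <- A_k + x_t x_t^T
b_k <- b_k + r_t * x_t        # b_k <- b_k + r_t x_t

theta_k <- solve(A_k, b_k)    # least squares estimate for arm k
theta_k
```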
In Epsilon-Greedy, with exploration rate ( \varepsilon ), the probability of selecting one of the best arms is:
[ P(A_t | X_t) = (1 - \varepsilon) \times \frac{1}{\# \text{best arms}} + \varepsilon \times \frac{1}{K} ]
while the probability of selecting an arm that is not among the best arms is:
[ P(A_t | X_t) = \varepsilon \times \frac{1}{K} ]
where ( K ) is the total number of arms and ( \varepsilon ) is the exploration rate.
We define the least squares estimate as:
[ \theta_k = A_k^{-1} b_k \quad \text{(Least Squares estimate)} ]
where ( A_k ) and ( b_k ) are the sufficient statistics maintained for arm ( k ).
Best arms are identified via the estimated expected reward:
[ \text{Expected reward} = X_t^T \theta_k ]
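The following is a minimal R sketch (not the package implementation) of this selection rule for a single context `x` and a given chosen arm, assuming `A_list` and `b_list` are lists holding each arm's ( A_k ) matrix and ( b_k ) vector:

```r
# Sketch: epsilon-greedy probability that `chosen_arm` is selected at context x.
eps_greedy_prob <- function(eps, A_list, b_list, x, chosen_arm) {
  K <- length(A_list)
  # Estimated expected reward x^T theta_k for each arm
  scores <- sapply(seq_len(K), function(k) {
    theta_k <- solve(A_list[[k]], b_list[[k]])
    as.numeric(crossprod(x, theta_k))
  })
  best <- which(scores == max(scores))  # "best arms" (ties kept)
  if (chosen_arm %in% best) {
    (1 - eps) / length(best) + eps / K  # best arm: exploit share + uniform exploration
  } else {
    eps / K                             # non-best arm: uniform exploration only
  }
}
```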
In `cramR`, this is implemented by `get_proba_c_eps_greedy(eps, A_list, b_list, contexts, chosen_arms)`.
This function:

- computes ( \theta_k = A_k^{-1} b_k ) for each arm from the stored parameters,
- identifies the best arm(s) for each context via ( X_t^T \theta_k ),
- returns the epsilon-greedy probability of each chosen arm, following the rule above.
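Since `cram_bandit()` needs ( \pi_t(X_j, A_j) ) for every policy update ( t ) and context ( j ), the quantity to assemble is a full matrix of probabilities. The hypothetical wrapper below (reusing `eps_greedy_prob()` from the sketch above) shows what such a matrix looks like; the structure of `params_by_update`, a list with one `A_list`/`b_list` pair per policy update, and the row/column orientation are assumptions for illustration only:

```r
# Hypothetical stand-in, not the package helper:
# pi_mat[t, j] = probability that the policy after update t selects the arm
# actually chosen in context j.
build_pi_eps_greedy <- function(eps, params_by_update, contexts, chosen_arms) {
  T_updates <- length(params_by_update)
  n <- nrow(contexts)
  pi_mat <- matrix(NA_real_, nrow = T_updates, ncol = n)
  for (t in seq_len(T_updates)) {
    p <- params_by_update[[t]]          # that update's A_list and b_list
    for (j in seq_len(n)) {
      pi_mat[t, j] <- eps_greedy_prob(eps, p$A_list, p$b_list,
                                      contexts[j, ], chosen_arms[j])
    }
  }
  pi_mat
}
```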
LinUCB selects arms based on Upper Confidence Bounds (UCBs):
[ \text{UCB}_k(X_t) = \mu_k(X_t) + \alpha \sigma_k(X_t) ]
where ( \mu_k(X_t) = X_t^T \theta_k ) is the estimated expected reward, ( \sigma_k(X_t) = \sqrt{X_t^T A_k^{-1} X_t} ) is its uncertainty, and ( \alpha ) controls the width of the confidence bound.
The action probabilities follow the same structure as Epsilon-Greedy, but with UCB scores instead of plain expected rewards, i.e., the probability of selecting one of the best arms is:
[ P(A_t | X_t) = (1 - \varepsilon) \times \frac{1}{\# \text{best arms}} + \varepsilon \times \frac{1}{K} ]
while the probability of selecting an arm that is not among the best arms is:
[ P(A_t | X_t) = \varepsilon \times \frac{1}{K} ]
where "best arms" are those with highest UCB scores.
In `cramR`, this is implemented by `get_proba_ucb_disjoint(alpha, eps, A_list, b_list, contexts, chosen_arms)`.
This function:

- computes the UCB score of each arm from the stored parameters,
- identifies the arm(s) with the highest UCB score for each context,
- returns the epsilon-greedy probability of each chosen arm based on these scores.
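For illustration, the single-context rule then mirrors the epsilon-greedy sketch, simply ranking arms by `ucb_score()` instead of the plain expected reward (again a simplified stand-in, not the package code):

```r
# Sketch: probability that `chosen_arm` is selected when arms are ranked by UCB.
ucb_eps_prob <- function(alpha, eps, A_list, b_list, x, chosen_arm) {
  K <- length(A_list)
  scores <- sapply(seq_len(K), function(k) ucb_score(A_list[[k]], b_list[[k]], x, alpha))
  best <- which(scores == max(scores))
  if (chosen_arm %in% best) (1 - eps) / length(best) + eps / K else eps / K
}
```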
In Thompson Sampling, a parameter vector is drawn from each arm's posterior and the arm with the highest sampled value is chosen. The probability of selecting arm ( A_t ) is therefore:
[ P(A_t | X_t) = \mathbb{P}\left( \theta_{A_t}^T X_t > \theta_{k}^T X_t \quad \forall k \neq A_t \right) ]
where ( \theta_k \sim \mathcal{N}(A_k^{-1} b_k, \sigma^2 A_k^{-1}) ).
This requires computing a multivariate probability, which we approximate via adaptive numerical integration.
In `cramR`, this is implemented by `get_proba_thompson(sigma, A_list, b_list, contexts, chosen_arms)`.
This function:

- forms the posterior ( \mathcal{N}(A_k^{-1} b_k, \sigma^2 A_k^{-1}) ) of each arm from the stored parameters,
- evaluates the probability that the chosen arm attains the highest sampled score ( \theta_k^T X_t ),
- approximates this multivariate probability via adaptive numerical integration.
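To build intuition for the quantity being computed, the Monte Carlo sketch below approximates the same probability by simulation; this is only an illustration (and requires the MASS package for multivariate normal draws), not the adaptive integration used in the package:

```r
# Sketch: Monte Carlo approximation of P(chosen arm has the highest sampled score).
thompson_prob_mc <- function(sigma, A_list, b_list, x, chosen_arm, n_draws = 10000) {
  K <- length(A_list)
  # n_draws x K matrix of sampled scores x^T theta_k, with
  # theta_k ~ N(A_k^{-1} b_k, sigma^2 A_k^{-1})
  scores <- sapply(seq_len(K), function(k) {
    A_inv   <- solve(A_list[[k]])
    theta_h <- as.numeric(A_inv %*% b_list[[k]])
    draws   <- MASS::mvrnorm(n_draws, mu = theta_h, Sigma = sigma^2 * A_inv)
    as.numeric(draws %*% x)
  })
  mean(max.col(scores) == chosen_arm)  # fraction of draws in which the chosen arm wins
}
```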
When using your bandit policy in practice:

- Reconstruct the matrix of action selection probabilities `pi` with the helpers above (or log the probabilities directly if your environment supports it).
- Plug `pi`, `arm`, and `reward` into `cram_bandit()` for the evaluation of your policy, as sketched below.
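A purely illustrative call, assuming `cram_bandit()` accepts these three inputs by name as described above; consult `?cram_bandit` for the exact interface and the expected shape of the probability input. Here `pi_mat`, `chosen_arms`, and `rewards` stand for the reconstructed probabilities, logged arms, and logged rewards:

```r
# Illustrative only: argument names follow the text above; check ?cram_bandit
# for the exact interface and the expected shape of the probability matrix.
result <- cram_bandit(pi = pi_mat, arm = chosen_arms, reward = rewards)
result
```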
The following only concerns `cram_bandit_sim()`, the simulation framework we implemented for benchmarking purposes.
Once the policies are reconstructed, we compute their true expected value, referred to as the estimand, by applying the learned policy to independent contexts and evaluating it against the known reward function used in the simulation.
This is done via `compute_estimand(data_group, list_betas, policy, policy_name, batch_size, bandit)`.
Accurately computing the estimand is critical for properly assessing the bias and confidence interval coverage of the Cram estimate in our simulations.
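`compute_estimand()` operates on objects from the `contextual` simulation; conceptually, the computation looks like the hypothetical sketch below, where the learned policy is taken to be greedy in ( \theta_k ) and `true_mean_reward(x, k)` is an assumed function returning the simulator's known expected reward:

```r
# Conceptual sketch only (not the package implementation): average the true
# expected reward of the actions the learned greedy policy takes on fresh contexts.
estimand_value <- function(theta_list, new_contexts, true_mean_reward) {
  vals <- apply(new_contexts, 1, function(x) {
    scores <- sapply(seq_along(theta_list), function(k) sum(x * theta_list[[k]]))
    true_mean_reward(x, which.max(scores))  # true expected reward of the arm the policy picks
  })
  mean(vals)
}
```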
See also:

- `contextual` package: the original bandit simulation framework
- `cram_bandit()`: Cram evaluation for contextual bandits
- `cram_bandit_sim()`: full simulation engine with automatic `pi` estimation

These helper functions were designed to faithfully reconstruct action probabilities for the policies implemented in `contextual`, while enabling reproducible Cram-based evaluation.
```r
# Clean up autograph-generated temporary Python files left in tempdir()
autograph_files <- list.files(tempdir(),
                              pattern = "^__autograph_generated_file.*\\.py$",
                              full.names = TRUE)
if (length(autograph_files) > 0) {
  try(unlink(autograph_files, recursive = TRUE, force = TRUE), silent = TRUE)
}
```