Demo: Bandits, Propensity Weighting & Simpson's Paradox in R

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Imagine a website with Sport- and Movie-related articles, where the "actual" preferences of men and women for Sport and Movie articles are the following:

Contexts   | Sport (arm) |  Movie (arm)
-----------------------------------------
Male       | 0.4         |  0.3
Female     | 0.8         |  0.7

In other words, both Male and Female visitors actually prefer Sport articles over Movie articles. When visitors are randomly assigned to the two article categories, the overall CTR per category reflects this:

Contexts   | Sport (arm) |  Movie (arm)
-----------------------------------------
Male       | 0.4 x 0.5   |  0.3 x 0.5
Female     | 0.8 x 0.5   |  0.7 x 0.5
-----------------------------------------
CTR total  | 0.6         |  0.5

Now suppose the site's editor just "knows" that men like Sport, and women like Movie-related articles. So the editor has some business logic implemented, assigning Movie-related articles, on average, to 75% of Female visitors, and Sport-related articles, on average, to 75% of Male visitors:

Contexts   | Sport (arm) |  Movie (arm)
-----------------------------------------
Male       | 0.4 x 0.75  |  0.3 x 0.25
Female     | 0.8 x 0.25  |  0.7 x 0.75
-----------------------------------------
CTR total  | 0.5         |  0.6

This results in a higher CTR for Movie than for Sport-related articles, even though these CTRs do not reflect the actual overall preferences of the website's visitors, but rather the editor's prejudice.

A clear example of Simpson's Paradox!
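The arithmetic behind the two tables can be verified in a few lines of R. This is just a sanity check of the numbers above, assuming an equal 50/50 split of Male and Female visitors (the object and function names below are purely illustrative):

# Actual click probabilities per context (rows) and arm (columns)
pref           <- matrix(c(0.4, 0.3,
                           0.8, 0.7), nrow = 2, byrow = TRUE,
                         dimnames = list(c("Male", "Female"), c("Sport", "Movie")))

# Probability of serving each arm, per context
random_serve   <- matrix(c(0.50, 0.50,
                           0.50, 0.50), nrow = 2, byrow = TRUE)
biased_serve   <- matrix(c(0.75, 0.25,
                           0.25, 0.75), nrow = 2, byrow = TRUE)

# Marginal CTR per arm: clicks on the arm divided by impressions of the arm,
# with contexts weighted 50/50.
marginal_ctr   <- function(serve_prob, pref, context_share = c(0.5, 0.5)) {
  colSums(context_share * serve_prob * pref) / colSums(context_share * serve_prob)
}

marginal_ctr(random_serve, pref)   # Sport 0.6, Movie 0.5
marginal_ctr(biased_serve, pref)   # Sport 0.5, Movie 0.6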

Below is an R-based illustration (making use of our "contextual" bandit package) of how Simpson's Paradox can give rise to biased logged data, resulting in biased offline evaluations of bandit policies. Next, we demonstrate how inverse propensity weighting can make such data usable for offline evaluation after all.

Original bandit weights

Set up simulation bandit weights representing Male and Female visitors' actual preferences for Sport and Movie articles:

library(contextual)

horizon                           <- 10000L
simulations                       <- 1L

#                    S----M------------> Arm 1:   Sport
#                    |    |              Arm 2:   Movie
#                    |    |
weights <- matrix( c(0.4, 0.3,    #-----> Context: Male
                     0.8, 0.7),   #-----> Context: Female
                     nrow = 2, ncol = 2, byrow = TRUE)

These weights will be fed to contextual's ContextualBernoulliBandit, so it can simulate clicks by Male and Female visitors according to their preferences per category.
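Conceptually, for a visitor of a given context who is shown a given arm, a click is then a single Bernoulli draw with the matching weight as its success probability. A minimal sketch of that idea (illustration only, not the package's actual implementation):

# Illustration only: one simulated impression for a Female visitor (row 2) shown Sport (column 1)
set.seed(42)
rbinom(1, size = 1, prob = weights[2, 1])   # clicks with probability 0.8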

Generate data by running a fully random online policy

Let's first run contextual's basic RandomPolicy against the bandit that models the actual visitors' preferences. This random policy assigns Males and Females fully at random to either Sport or Movie articles:

policy                            <- RandomPolicy$new()
bandit                            <- ContextualBernoulliBandit$new(weights = weights)
agent                             <- Agent$new(policy, bandit, "Random")

simulation                        <- Simulator$new(agent, horizon, simulations, 
                                                   save_context = TRUE, do_parallel = F)

history                           <- simulation$run()

Simulation horizon: 10000
Number of simulations: 1
Number of batches: 1
Starting main loop.
Completed simulation in 0:00:01.781

u_dt                              <- history$get_data_table()

print("1a. Unbiased data generation.")
[1] "1a. Unbiased data generation."

print(paste("Sport:",sum(u_dt[choice==1]$reward)/nrow(u_dt[choice==1]))) 
[1] "Sport: 0.603323988786544"   # 0.6 CTR Sport - equals preferences!

print(paste("Movie:",sum(u_dt[choice==2]$reward)/nrow(u_dt[choice==2]))) 
[1] "Movie: 0.501997602876548"   # 0.5 CTR Movie - equals preferences!

The results are clear: when running the random policy, the logged data accurately represents visitors' preferences.
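Because the simulator was run with save_context = TRUE, the logged data.table also contains the context features, logged as X.1 and X.2 (the same columns the formula below refers to; we assume X.1 flags Male and X.2 Female visitors, matching the weight matrix). That makes it easy to check the per-context CTRs too:

# Per-context CTRs in the randomly logged data; these should be close to the weights matrix
u_dt[X.1 == 1, .(ctr = mean(reward)), by = choice]   # Male:   Sport ~0.4, Movie ~0.3
u_dt[X.2 == 1, .(ctr = mean(reward)), by = choice]   # Female: Sport ~0.8, Movie ~0.7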

Using the random policy's offline logged data to evaluate another policy

The previous simulation produced a data.table with fully randomised historical data. Let's use this data to evaluate some other policy:

f                                 <- formula("reward ~ choice | X.1 + X.2")
bandit                            <- OfflineReplayEvaluatorBandit$new(formula = f,
                                                                      data = u_dt, 
                                                                      k = 2 , d = 2)

policy                            <- EpsilonGreedyPolicy$new(0.1)
agent                             <- Agent$new(policy, bandit, "OfflineEpsilonGreedy")

simulation                        <- Simulator$new(agent, horizon, simulations, do_parallel = F)
history                           <- simulation$run()

Simulation horizon: 10000
Number of simulations: 1
Number of batches: 1
Starting main loop.
Completed simulation in 0:00:01.606

ru_dt                             <- history$get_data_table()

print("1b. Offline unbiased policy evaluation.")
[1] "1b. Offline unbiased policy evaluation."

print(paste("Sport:",sum(ru_dt[choice==1]$reward)/nrow(ru_dt[choice==1])))
[1] "Sport: 0.602566799915843"   # 0.6 CTR Sport - equals preferences!

print(paste("Movie:",sum(ru_dt[choice==2]$reward)/nrow(ru_dt[choice==2]))) 
[1] "Movie: 0.493589743589744"   # 0.5 CTR Movie - equals preferences!

Accurate numbers again: clearly, the logged data from a randomizing policy can be used to test other 'off-policy' algorithms.
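The replay method behind OfflineReplayEvaluatorBandit is simple in spirit: step through the logged interactions, ask the candidate policy which arm it would have chosen, and only keep (and learn from) the rows where that choice matches the arm that was actually logged. A stripped-down sketch of that idea (illustration only, not the package's implementation; choose_arm is a hypothetical function mapping a context vector to an arm index, 1 = Sport, 2 = Movie):

# Illustration only: the replay idea in a few lines
replay_ctr <- function(log_dt, choose_arm) {
  keep <- logical(nrow(log_dt))
  for (i in seq_len(nrow(log_dt))) {
    x       <- c(log_dt$X.1[i], log_dt$X.2[i])
    keep[i] <- choose_arm(x) == log_dt$choice[i]   # keep only rows where choices match
  }
  mean(log_dt$reward[keep])                        # CTR estimate over the matched rows
}

replay_ctr(u_dt, function(x) 1)                    # value of always showing Sport, ~0.6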

Generate data by running a biased online policy

Now suppose some editor just "knows" that men like Sport, and women like Movie. So some business logic was added to the site, assigning Movie-related articles, on average, to 75% of Female visitors, and Sport-related articles, on average, to 75% of Male visitors.

This business logic might be implemented through the following policy:

BiasedPolicy                      <- R6::R6Class(
  portable = FALSE,
  class = FALSE,
  inherit = RandomPolicy,
  public = list(
    class_name = "BiasedPolicy",
    get_action = function(t, context) {
      if (context$X[1] == 1) {                      # 1: Male || 0: Female.
        prob                      <- c(0.75, 0.25)  # Editor thinks men like Sport articles more.
      } else {
        prob                      <- c(0.25, 0.75)  # Editor thinks women like Movie articles more.
      }
      action$choice               <- sample.int(context$k, 1, replace = TRUE, prob = prob)
      # Store the propensity score for the current action too:
      action$propensity           <- prob[action$choice]
      action
    }
  )
)

Now run this policy against the Bandit modeling actual visitor preferences:

policy                            <- BiasedPolicy$new()
bandit                            <- ContextualBernoulliBandit$new(weights = weights)
agent                             <- Agent$new(policy, bandit, "Biased")

simulation                        <- Simulator$new(agent, horizon, simulations, 
                                                   save_context = TRUE, do_parallel = F)
history                           <- simulation$run()

Simulation horizon: 10000
Number of simulations: 1
Number of batches: 1
Starting main loop.
Completed simulation in 0:00:01.954

b_dt                              <- history$get_data_table()

print("2a. Biased data generation.")
[1] "2a. Biased data generation."

print(paste("Sport:",sum(b_dt[choice==1]$reward)/nrow(b_dt[choice==1]))) 
[1] "Sport: 0.506446414182111"  # 0.5 CTR Sport - Simpson's paradox at work

print(paste("Movie:",sum(b_dt[choice==2]$reward)/nrow(b_dt[choice==2]))) 
[1] "Movie: 0.600675138999206"  # 0.6 CTR Movie - Simpson's..

Clearly, the BiasedPolicy gives rise to, well, biased results! If you could only look at the data, without knowing about the biased business logic, you would falsely conclude that Movie articles are more popular than Sport articles, overall.
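Note that within each context the biased log still reflects the true preferences; only the marginal CTRs flip. That is Simpson's Paradox in a nutshell, and it is easy to verify directly on the logged data (same X.1/X.2 assumption as before):

# Within each context, Sport still beats Movie - the reversal only shows up in the totals
b_dt[X.1 == 1, .(ctr = mean(reward)), by = choice]   # Male:   Sport ~0.4 > Movie ~0.3
b_dt[X.2 == 1, .(ctr = mean(reward)), by = choice]   # Female: Sport ~0.8 > Movie ~0.7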

Using the biased policy's offline logged data to evaluate another policy

This time, the simulation generated a data.table with biased data. Let's see what happens if we use this data to evaluate some other policy:

f                                 <- formula("reward ~ choice | X.1 + X.2")
bandit                            <- OfflineReplayEvaluatorBandit$new(formula = f, 
                                                                      data = b_dt, 
                                                                      k = 2 , d = 2)
policy                            <- EpsilonGreedyPolicy$new(0.1)
agent                             <- Agent$new(policy, bandit, "rb")

simulation                        <- Simulator$new(agent, horizon, simulations, do_parallel = F)
history                           <- simulation$run()

Simulation horizon: 10000
Number of simulations: 1
Number of batches: 1
Starting main loop.
Completed simulation in 0:00:01.478

rb_dt                             <- history$get_data_table()

print("2b. Offline biased policy evaluation.")
[1] "2b. Offline biased policy evaluation."

print(paste("Sport:",sum(rb_dt[choice==1]$reward)/nrow(rb_dt[choice==1]))) 
[1] "Sport: 0.5"  # 0.5 CTR Sport - Simpson's paradox, again!

print(paste("Movie:",sum(rb_dt[choice==2]$reward)/nrow(rb_dt[choice==2]))) 
[1] "Movie: 0.602175277138674" # 0.6 CTR Sport - Simpson's paradox, again!

The bias has propagated itself! So, does that mean it is impossible to use the "biased" data to evaluate other policies offline? That would severely limit the number of data sets available for offline evaluation.

Repairing the biased policy's logged data with inverse probability weights

Luckily, inverse propensity score weighting enables us to use propensity scores to obtain unbiased estimates of the original preferences of Male and Female visitors. That is, since our biased policy saved the propensity ("the probability of a unit being assigned to a particular treatment or category") with which each category was chosen, we can correct for this bias while "replaying" the data. In "contextual", several types of offline bandits are able to use either such pre-saved propensities, or to estimate propensities from properties of the dataset. Here, we use its basic "OfflinePropensityWeightingBandit":

f                                 <- formula("reward ~ choice | X.1 + X.2 | propensity")
bandit                            <- OfflinePropensityWeightingBandit$new(formula = f, data = b_dt,
                                                                          k = 2 , d = 2)
policy                            <- EpsilonGreedyPolicy$new(0.1)
agent                             <- Agent$new(policy, bandit, "prop")

simulation                        <- Simulator$new(agent, horizon, simulations, do_parallel = F)
history                           <- simulation$run()

Simulation horizon: 10000
Number of simulations: 1
Number of batches: 1
Starting main loop.
Completed simulation in 0:00:01.257

prop_dt                           <- history$get_data_table()

print("2c. Offline biased policy evaluation, inverse propensity scores.")
[1] "2c. Offline biased policy evaluation, inverse propensity scores."

print(paste("Sport:",sum(prop_dt[choice==1]$reward)/nrow(prop_dt[choice==1])))
[1] "Sport: 0.618266176609179"  # 0.6 CTR Sport, representing actual preferences - yay!

print(paste("Movie:",sum(prop_dt[choice==2]$reward)/nrow(prop_dt[choice==2]))) 
[1] "Movie: 0.496500591177808" # 0.5 CTR Movie, again, representing actual preferences..

Hurray - inverse propensity score weighting has removed the bias! In other words: if and where possible, save propensity scores to your log files when experimenting with online policies. You will thank yourself later!
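For intuition, the same correction can also be computed directly from the biased log with a self-normalized inverse propensity estimate, since b_dt contains the propensity that BiasedPolicy saved for every chosen arm (ips_ctr below is just an illustrative helper, not part of the package):

# Illustration only: self-normalized inverse propensity estimate of each arm's CTR,
# computed straight from the biased log and its saved propensity column.
ips_ctr <- function(dt, arm) {
  w <- (dt$choice == arm) / dt$propensity   # inverse propensity weights
  sum(w * dt$reward) / sum(w)
}

ips_ctr(b_dt, 1)   # Sport: ~0.6, the actual preference
ips_ctr(b_dt, 2)   # Movie: ~0.5, the actual preference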


