# require(devtools); install_github("metamaden/montyhall")
# library(montyhall)
knitr::opts_chunk$set(eval = TRUE, echo = TRUE,
                      warning = FALSE, message = FALSE)
library(montyhall)

On a gameshow stage before you wait three closed doors; behind two of them sit goats, and behind the third sits a prize. You are called on to pick a door to open, revealing either a goat or a prize. Monty Hall, the gameshow's host, proceeds to reveal a goat behind one of the two unpicked doors. You must then decide whether to stick with your original choice or switch to the final unpicked door. What do you do?

This is the Monty Hall Problem, a kind of logic puzzle involving conditional probability. Assuming you value prizes over goats, and lack prior knowledge about which door has the prize, it can be readily shown that always switching doors increases your win probability: your first pick is right only 1 out of 3 times, and switching wins exactly when your first pick was wrong. If you stick with your first choice, your success probability stays at 1 in 3, while switching increases your probability to 2 in 3, a pretty substantial increase!

It's telling that the Monty Hall Problem still serves as a good brain teaser to this day. Given its simple rules and decision parameters, it's a problem that lends itself to programmatic simulation. In this post, I wanted to show how I wrote a simulation function that captures the basic or "classical" rules of the Monty Hall Problem while allowing for exploration of how outcomes change as conditions and rule definitions change. Hopefully I can inspire you to think generally about opportunities to conceptualize problems as simulations with useful code.

I've deployed the code with a strictly reproducible vignette in the montyhall R package, available on GitHub (metamaden/montyhall). Deploying work as an R package can be extremely worthwhile in production-level data science projects. Note that I'm knowingly omitting a few "best practices" in favor of more clearly written code, but there are many great places to learn about common package standards. In this package, I've included my simulation function, mhsim(), and a wrapper function getfw() that returns win frequencies across simulations. These functions make use only of base R, without added dependencies. I've also included several visualization utility functions that use some stellar R packages as dependencies, including ggplot2. We'll take a look at the functions and some scripts as we explore Monty's doors below.

Formulating the game

We can call the typical formulation of the game, as above, the "canonical" or "classical" formulation. This has the following attributes or rules:

  1. Three doors total, behind which 1 has a prize, and the remaining 2 have goats.
  2. The player picks a door.
  3. Monty reveals one of the two remaining doors to be a goat.
  4. The player decides whether to stick with their initial choice (step 2) or switch to the last unpicked door.
  5. The final player-selected door is revealed to be either a goat or the prize.

Note a few characteristics of the above. First, we've framed the player's second decision as "whether to switch," though the natural naive interpretation is that we're merely picking doors twice, more or less at random. Second, there's randomness implied in steps 2 and 3: the player picks an initial door at random, and 1/3rd of the time (when the player's first pick is the prize door) Monty chooses at random between the two remaining goats to reveal; otherwise his reveal is forced.

In terms of semantics, I'll consider certain parameters (e.g. 3 total doors, 1 prize, etc.) as the set of "classical" game parameters or rules. Also, I've chosen to assess outcomes using game win frequencies, though I could just as easily have assessed game loss frequencies. The two are complementary in this binary win-or-lose problem, where knowing one outcome determines the other (e.g. a win frequency of 0.66 implies a loss frequency of 1 - 0.66 = 0.34).

To simulate the problem in code, pseudocode can be a helpful tool for visualizing the objective. Pseudocode is simply a programming-language-agnostic abstraction for planning a program. For the Monty Hall simulation, the initial pseudocode might be something like:
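
# pseudocode sketch (names here are illustrative)

# simulate a single game
# play_game:
#   randomly assign the prize to one of the doors
#   the player picks a door at random
#   Monty reveals a goat behind an unpicked door
#   the player switches doors or stays
#   return "win" if the final pick hides the prize, else "loss"

# run a series of games
# run_simulation:
#   set the randomization seed
#   repeat play_game for niter game iterations
#   return the vector of game outcomes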

I've shown two functions above. In programming, it's often preferred to break a problem down into smaller, discrete units that achieve distinct objectives. This can make debugging and unit testing (a lot) easier. Here, I've written a function to simulate a single game, and another to carry out a series of game iterations. Ultimately, I also wrapped the second function into a third getfw() function to derive win frequencies across simulations. Let's look at the final simulation function in greater detail below.

The simulation function

I've written 3 functions that complete Monty Hall Problem simulations in reasonable time and allow for changing various underlying rules. First, the mhgame() function runs a single game or "game iteration." Second, mhsim() executes a series of game iterations that defines a simulation run, up to N = niter total game iterations, where each series uses a single seed for reproducible randomization (more on that below). Finally, getfw() takes the output from mhsim(), a list of game outcome vectors (entries of either "win" or "loss"), and returns a single vector of win fractions.

Importantly, mhsim() vectorizes game simulations with lapply(). Vectorization is a great way to speed up repetitive tasks in coding, and can be crucial for especially memory-taxing or large problems. The lapply function is a member of the apply family of R functions, which are specialized for different varieties of vectorization tasks. Some other useful ways of speeding up your code include parallelization of tasks with multithreading. However, note that some parallelization solutions aren't strictly replicable and often require one or several additional dependencies. It's important to tailor the complexity of a solution to that of its problem and avoid premature optimization or over-optimization. For our purposes, running thousands of Monty Hall simulations isn't memory intensive, and our operations below will all complete in about a minute or less.
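
As a minimal sketch of the pattern (an illustration, not the package code itself), lapply() applies a function over a sequence of game indices and collects the results in a list:

lgame <- lapply(seq_len(5), function(i) {
  # placeholder for one game iteration's outcome
  sample(c("win", "loss"), 1)
})
unlist(lgame) # flatten the list to an outcomes vector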

The arguments niter and seed in mhsim() should be considered for each run. The niter argument specifies the number of game iterations to simulate, while seed specifies the value passed to set.seed(). Setting the seed allows for exact replication of run results with the same seed, an important component for operations implementing randomness. As already mentioned, the simulation assumes random player selection in step 2 and random Monty selection in step 3, with other possibilities for randomness including the player door switch frequency in step 4. I implemented randomization with R's sample() function.
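
As a quick illustration of seed behavior in base R (separate from the package functions), re-setting the same seed reproduces the same draw:

set.seed(404)
sample(seq(3), 1) # e.g. the player's random first pick among 3 doors
set.seed(404)
sample(seq(3), 1) # same seed, so the same pick is drawn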

To show how mhgame() completes the above pseudocode tasks, let's break down the steps of each game iteration. First, the index of the prize door is specified.

which.prize <- sample(doorseq, nprize)

Then the player's first decision is simulated.

dec1select <- sample(doorseq, ndec1)

Next, Monty reveals a goat. This entails either selecting 1 of 2 doors at random (if the player already picked the prize door, a 1-in-3 event), or simply picking the last remaining goat door (if the player already picked a goat door, a 2-in-3 event).

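# the doors Monty may reveal: unpicked doors without the prize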
mdooroptions <- doorremain1[!doorremain1 %in% which.prize]
if(length(mdooroptions) < 2){
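  # only one valid goat door remains, so Monty's reveal is forced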
  mselect <- mdooroptions
} else{
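  # multiple goat doors remain: sample nr of them to reveal at random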
  mselect <- sample(mdooroptions, nr)
}

We check that the number of remaining "non-revealed" doors specified by nrevealdif is valid for the problem, then proceed to either specify a random set of door indices to reveal or the only remaining valid door.

Next, the player either switches or stays on their initial door selection. Note how the switch frequency set by doorswitch determines how likely ssvar is to be "switch" rather than "stay".

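# ssvar holds the player's drawn decision, either "switch" or "stay"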
if(ssvar == "switch"){
  if(length(doorremain2) > 1){
    dec2select <- sample(doorremain2, ndec2)
  } else{
    dec2select <- doorremain2
  }
}

The function then returns a vector of game outcomes (either "win" or "loss") of length equal to niter. I've also included a verbose.results option that stores the granular game details for each iteration alongside the outcome. I mainly included this for bug squashing.

Finally, note that I've wrapped the mhsim() function into a new function, getfw(). This is because mhsim() runs a single simulation consisting of niter games, and we'd like to make it easy to run many simulations automatically. Thus getfw() returns the win fraction per simulation across nsimulations total simulations, where certain arguments like ndoors can be passed through to mhsim() when we want to change simulation parameters.

Simulating the canonical/classic problem

Let's study the impact of varying the number of simulations and iterations per simulation on the distribution of win frequencies across simulations. We'll start small with just 5 simulations of 2 games each, and increase this to 100 and then 1,000 simulations and iterations, respectively. We can use the wrapper function getfw() to generate and store the reproducible simulation data.

Let's generate the simulation data with the 3 parameter sets, and time it. Note we'll use a for loop to iterate over vectors of 3 values for simulations and iterations, using the given index to point to the parameters for each run.

# parameter sets
simv <- c(5, 100, 1000)
iterv <- c(2, 100, 1000)
lr <- list()
t1 <- Sys.time()
for(s in seq_along(simv)){
  runname <- paste0(simv[s], ";", iterv[s])
  lr[[runname]] <- getfw(nsimulations = simv[s], niterations = iterv[s])
}
tdif <- Sys.time() - t1

The 3 runs completed in `r round(tdif, 0)` seconds. With so few iterations and simulations in the first run, there's huge variance in the win fraction (standard deviation of `r round(sd(lr[[1]]), 2)`). Increasing iterations and simulations each to 100 already shows the distribution converging on the expected win frequency of 0.66. A further increase to 1,000 simulations and iterations results in a more clearly normal distribution with a much tighter standard deviation of `r round(sd(lr[[3]]), 2)`.

Let's now show the composite plot of win frequency distributions across the 3 runs. Note I've stored the run info (number of simulations and iterations per run) in the list names, and we can unpack these with regular expressions using gsub() for the respective plot titles. We'll use par() to manage the plot output and formatting, where mfrow = c(1, 3) specifies that the plot output conforms to a matrix of 1 row and 3 columns, and oma = c(3, 3, 3, 1) adds outer margin whitespace for axis labels. We'll remove redundant axis labels for each plot and add these back with mtext().

# pdf("mh_3runs.pdf", 10, 4)
png("mh_3runs.png", width = 10, height = 4, units = "in", res = 400)
# format image output
par(mfrow = c(1, 3), oma = c(3, 3, 3, 1))
for(r in seq_along(lr)){
  rdat <- lr[[r]]
  # get plot title info
  rname <- names(lr)[r]
  simr <- gsub(";.*", "", rname)
  iterr <- gsub(".*;", "", rname)
  pmain <- paste0(simr," simulations\n", iterr, " iterations")
  # add run histogram to image output
  hist(rdat, main = pmain, xlab = "", ylab = "")
}
# add outer axis labels
mtext("Win Frequency", side = 1, outer = T)
mtext("Number of Simulations", side = 2, outer = T)
dev.off()

If you prefer to be more precise about the increase in normality, we can quantify it with the Shapiro-Wilk normality test via shapiro.test(), which tests the null hypothesis that the data were drawn from a normal distribution.

# run normality tests
st1 <- shapiro.test(lr[["5;2"]])$p.value
st2 <- shapiro.test(lr[["100;100"]])$p.value
st3 <- shapiro.test(lr[["1000;1000"]])$p.value

With increased simulations and iterations, our p-values increase from `r round(st1, 3)` in the first and smallest run to `r round(st3, 3)` in the third and largest run. Practically, this means the evidence to reject the null hypothesis (i.e. evidence of non-normality) decreases as the underlying simulation win fractions converge on an approximately normal distribution.

Bending the rules

I've written mhsim() and its getfw() wrapper to allow us to modify the game rules in a few ways. Exploring how bending the rules impacts simulated win frequencies can lead us to better understand the roles of key game conditions. Showing quantitatively how win frequencies change in light of different conditions can help us to better characterize the problem and inform the notion that it's always better to switch doors under the classical game rules.

The first condition we'll explore is the number of doors. I've allowed for the door quantity to be changed with the ndoors argument. Practically, this just changes the game setup for an iteration by defining a vector of sequential door indices of length ndoors. Thus increasing ndoors from 3 still preserves the other classical game parameters by default.

I've also allowed for changing the frequency with which the player switches doors with doorswitch. The default value of 1 means the player switches 100% of the time, and setting this to a lower value between 0 and 1 decreases the switch frequency. I did this by implementing sample() to randomly select from what's essentially a weighted binomial distribution (e.g. possible outcomes are binary, but each outcome has a distinct weight). If doorswitch = 0.2, we draw the player's decision by sampling from a distribution where 20% of options are "switch" and the remaining 80% are "stay".
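
One way to draw such a weighted binary outcome is with the prob argument of sample(); here's a minimal sketch (the package's internals may differ):

doorswitch <- 0.2 # the player's switch frequency
ssvar <- sample(c("switch", "stay"), 1,
                prob = c(doorswitch, 1 - doorswitch))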

Many doors as an extrapolation mnemonic

One of the more useful mnemonic devices I've encountered for intuiting the answer to the Monty Hall Problem is to increase the number of doors. Maybe we're unsure if switching doors will increase our odds when there are just 3 doors. But if there are 100 doors, and Monty reveals goats behind 98 of them, it's much clearer that switching will increase our chances of winning. We can quantitatively visualize this intuitively useful device with the simulation code.

Let's now generate and time the results from running 100 simulations of 100 iterations each, varying ndoors from 3 to 103 by 10, with otherwise classical rule parameters.

# get win frequencies from varying ndoors
simi = 100; iteri = 100
ndoorl <- seq(3, 103, 10)
seedl <- seq(1, 100, 1)
lnd <- list()
t1 <- Sys.time()
for(nd in ndoorl){
  fw <- getfw(simi, iteri, nd)
  lnd[[paste0(nd)]] <- fw
}
tdif <- Sys.time() - t1
# store the reference plot
pref <- getlineplot(lnd, ptitle = "Canonical rules, varying doors")

All runs completed in `r round(as.numeric(tdif), 0)` seconds. Let's visualize the results in a few different ways. First, we'll generate violin plots, a powerful way of visualizing data in a relatively distribution-agnostic manner (and thus typically better than boxplots). Next, we'll use overlapping density plots, variously called "ridge plots" or "joyplots" after their use in the iconic visualization of the pulsar CP 1919's radio waves on the cover of Joy Division's Unknown Pleasures record (awesome!). Finally, we'll show line plots of run means with confidence boundaries. While the first two plot types show the exact data distributions, we'll focus on the line plots for their economy of space and meaningful reflection of simulation distribution properties.

To generate these visualizations, I've wrapped the code into several plotting utility functions: getggdat() (format data for violin and ridge plots), getlinedat() (format data for line plots), getlineplot() (generate line plots), and getprettyplots() (generate a composite of the 3 ggplot2 plot types). I won't exhaustively describe these, but will note the code may be generally useful if you're looking for a generalizable way of plotting data in your own data science project. This code makes use of the supremely awesome R packages ggplot2 (powerful plotting functions and meta syntax), ggridges (ridge plot options), and gridExtra (managing plot outputs and composite plotting).

# pdf("mh_ndoors_3plots.pdf", 10, 4)
png("mh_ndoors_3plots.png", width = 10, height = 4, units = "in", res = 400)
getprettyplots(lnd, "Varying door count")
dev.off()

This quantitatively shows the magnitude of the win likelihood increase as ndoors increases, reinforcing our intuition about the mnemonic device. It's also interesting to note how the standard deviations converge after the means do in runs with higher door counts, as the win frequency becomes both higher and more certain.

We'll focus on the line plot visualizations below, but I've allowed for two plot types with the ribbontype argument. This defines whether the gray confidence ribbon shows the standard deviation of the run distribution ("sd", the default) or the minimum and maximum win frequencies observed ("minmax"). Let's show these side-by-side to illustrate the difference.

pclassic1 <- getlineplot(lnd, ptitle = "Std. dev. overlay", 
                         ribbontype = "sd", ylim = c(0.5, 1), xlim = c(0, 105), ylab = "", xlab = "")
pclassic2 <- getlineplot(lnd, ptitle = "Min. max. overlay", 
                         ribbontype = "minmax", ylim = c(0.5, 1), 
                         xlim = c(0, 105), 
                         ylab = "", xlab = "")

# pdf("mh_2lineplots.pdf", 5, 3)
png("mh_2lineplots.png", width = 5, height = 3, units = "in", res = 400)
grid.arrange(pclassic1, pclassic2, top = "Ribbon overlay comparison",
             bottom = "Number of Doors", left = "Win Fraction",
             ncol = 2)
dev.off()

Below, we'll lean on these line plot representations, using distribution standard deviations to calculate the overlaid ribbons.

What if the player doesn't always switch?

Let's now observe the impact of player switch frequency, or how often the player switches from their initial door selection. As mentioned, this is set by passing the decimal switch frequency to the doorswitch argument, which then draws the player's choice for each iteration from a weighted binary distribution.

Let's vary the switch frequency from 0% to 100% in increments of 10%, repeating the varying-doors runs at each frequency. I'll store the results data in the ldat object and the plots in the plist object, then make a composite plot of the results. Note I've also set the x- and y-axis min and max values to be the same in getlineplot() so that visual comparison is easier.

# get fwin dist across ndoors
plist <- list()
sfreq <- seq(0, 1, 0.1)
t1 <- Sys.time()
ldat <- list()
for(s in sfreq){
  simi = 100; iteri = 100
  ndoorl <- seq(3, 103, 10)
  seedl <- seq(1, 100, 1)
  lnd <- list()
  for(nd in ndoorl){
    fw <- getfw(simi, iteri, nd, doorswitch = s)
    lnd[[paste0(nd)]] <- fw
  }
  plist[[paste0(s)]] <- getlineplot(lnd, ptitle = paste0("S.F. = ", s),
                                    xlim = c(0, 100), ylim = c(0, 1),
                                    xlab = "", ylab = "")
  ldat[[paste0(s)]] <- lnd
  # message(s)
}
tdif <- Sys.time() - t1

All runs completed in `r round(as.numeric(tdif), 2)` minutes. The composite plot is then generated from the plist plots list as follows.

# pdf("mh_switchfreq.pdf", 10, 6)
png("mh_switchfreq.png", width = 10, height = 6, units = "in", res = 400)
grid.arrange(plist[[1]], plist[[2]], plist[[3]],
             plist[[4]], plist[[5]], plist[[6]],
             plist[[7]], plist[[8]], plist[[9]],
             plist[[10]],
             ncol = 5, 
             bottom = "Number of Doors", left = "Win Fraction")
dev.off()

Across run sets of each door switch frequency, there's a clear transition from an approximate negative power function (e.g. x^(-1), top leftmost plot) to something approaching a fractional power function (e.g. x^(1/2), bottom rightmost plot).

Increasing run switch frequency incrementally under classical rules should show a progressive increase in win fraction distributions. Let's generate and visualize the simulation results for this. Note, I'll repurpose my getlineplot() function for this as written, but in some applications it can be better to add code that explicitly handles specific axis variables like ndoors and doorswitch. The resulting plot shows a clear linear win fraction increase with switch frequency, maxing out at the now-familiar 2/3rds fraction.

sfreq <- seq(0, 1, 0.1)
lnd <- list()
for(s in sfreq){
  simi = 100; iteri = 100
  seedl <- seq(1, 100, 1)
  fw <- getfw(simi, iteri, doorswitch = s)
  lnd[[paste0(s)]] <- fw
}

# pdf("mh_switchfreq_classicrules.pdf", 4, 4)
png("mh_switchfreq_classicrules.png", width = 4, height = 4, res = 400, units = "in")
getlineplot(lnd, ptitle = "Win Freq. by Switch Freq.", 
            xlim = c(0, 1), ylim = c(0, 1), 
            xlab = "Switch frequency")
dev.off()

Conclusions and analysis extensions

We've explored simulations of the Monty Hall Problem using a brute force approach. By exploring changes in win frequency across varying problem conditions, we've shown that always switching doors increases player win frequency. We've also quantitatively shown how the improved win frequency converges as the number of doors is increased. Finally, we explored the role of certain conditions in the problem itself. Unsurprisingly, as player switch frequency increases, so too does win frequency. In so doing, we showed how switch frequency increases lead to different win frequency improvements across door counts.

This brute force simulation approach is one of many possible ways of solving and exploring the Monty Hall Problem, and alternate approaches implementing Bayesian models could lead to further interesting insights. There are several other game conditions that could also be explored, including changing the total number of prize doors for games of at least 4 doors. Ultimately, I hope this investigation provided some useful code that equips you with a framework for investigating new problems through simulation.

Bonus plot animations

In data science, more tools in our toolkit means more options for tackling future problems. For this reason, I'll briefly show how to turn some of the naturally sequential plots (e.g. win frequency distributions across ndoors) into animations.

I've written the function getprettygif(), which uses the R packages gganimate and magick and leans heavily on the helpful code provided here for the composite gif. Passing options to plottype results in either a composite plot of the ndoors data (violin and line plots), or a line plot showing how doorswitch increases impact win fraction across games varying ndoors.

Let's use the plot gif function to generate the gif files. Note all the data is contained in the ldat object, where the final item shows win fraction in relation to door count when the player always switches.

getprettygif(ldat[[11]], plottype = "composite_ndoors", gifname = "mh_ndoors.gif")
getprettygif(ldat, plottype = "lineplots_doorswitch", gifname = "mh_switchfreq.gif")

