knitr::opts_chunk$set(collapse = TRUE, comment = "#>")

It's Selection Sunday! The full field of 68 teams for the NCAA women's basketball tournament has been announced, and now that colleague of yours who cares way too much about sports is inviting you to submit a bracket to the office pool. You have a few days to choose one of the 9,223,372,036,854,775,808 possible brackets to call your own. But which one?
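Where does a number that large come from? Once the First Four games are settled, 64 teams play a single-elimination tournament. That takes 63 games, and each game has two possible winners, so the number of distinct brackets is

$$\underbrace{2 \times 2 \times \cdots \times 2}_{63 \text{ games}} = 2^{63} \approx 9.2 \times 10^{18}.$$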

Scraping data

Is this the first day you've paid any attention to college basketball all year? We've got you covered, with scraping functions that will tell you everything that went down this season. You can scrape all of the game scores from the 2023-2024 season using the package's game-results scraping function. In fact, you can use this same function to scrape game results for any other season, past or future, as long as the data are available online. Later, we'll use the data from the current year to predict matchup results.

games.2024 =, league = "women")

It also helps to know who everybody else is picking to advance in each round of the tournament, so that you know what you need to beat. Another function, scrape.population.distribution, grabs these data for you, based on the population distribution of picks according to ESPN. Note that these data are currently available online only for the two most recent tournaments.

pred.pop.2024 = scrape.population.distribution(league = "women")

We've made life even easier for you by pre-scraping these data. The results, including games.women.2024 and pred.pop.women.2024, are all available as datasets in the package. You're welcome.


The game results are stored using team IDs and, as we will see, so is the tournament bracket. You can use teams.women to look up which team each ID refers to.


Predicting matchup results

Now that you know everything that happened this year, you know how good all of the teams are, right? The Bradley-Terry model can help you make sense of these scores. According to this model, in game $i$ between home team $H_i$ and away team $A_i$, the number of points $y_i$ (which may be negative) by which the home team wins has the following distribution:

$$y_i\sim\mathcal N(\beta_{H_i} - \beta_{A_i}, \sigma^2)$$

where the $\beta$'s represent the unknown quality of the teams. We provide the bradley.terry function to estimate the $\beta$'s, $\sigma$, and the corresponding probabilities of each team beating each other team. This function returns a matrix of probabilities, with one row for each team and one column for each team. Each entry of the matrix gives the estimated probability of the team in that row beating the team in that column. Using this, we can estimate the probability that North Carolina beats Duke, for example.

prob.matrix = bradley.terry(games = games.women.2024)
prob.matrix["153", "150"] # What is the predicted probability that UNC beats Duke on a neutral site?
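To make the model above concrete, here is a minimal sketch of how a probability matrix could be built from point margins. It is not the package's implementation of bradley.terry: the teams A, B, and C and their margins are made up, the least-squares fit with lm is an assumption, and the sketch follows the formula as written, with no home-court term. Under the model, the probability that team $j$ beats team $k$ is $\Phi((\beta_j - \beta_k)/\sigma)$, which pnorm computes.

```r
# Toy data (hypothetical): margin = home score minus away score
games <- data.frame(
  home   = c("A", "B", "C", "A", "B", "C"),
  away   = c("B", "C", "A", "C", "A", "B"),
  margin = c(7, 3, -10, 12, -5, 4)
)
teams <- sort(unique(c(games$home, games$away)))

# Design matrix: +1 in the home team's column, -1 in the away team's column
X <- matrix(0, nrow(games), length(teams), dimnames = list(NULL, teams))
X[cbind(seq_len(nrow(games)), match(games$home, teams))] <- 1
X[cbind(seq_len(nrow(games)), match(games$away, teams))] <- -1

# The betas are identified only up to a constant, so fix the first team at 0
Xr <- X[, -1, drop = FALSE]
fit <- lm(margin ~ Xr - 1, data = games)
beta <- c(0, unname(coef(fit)))
names(beta) <- teams
sigma <- summary(fit)$sigma

# prob[j, k] = estimated probability that team j beats team k
prob <- pnorm(outer(beta, beta, "-") / sigma)
round(prob, 3)
```

Each entry and its mirror image sum to one, so the matrix only needs to be read in one direction: row beats column.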

Simulating the tournament

We've already taken care of those pesky first four games for you and stored the bracket in a vector called bracket.women.2024. This lists the tournament teams in order of overall seed.


If you want to play out your fantasy, though, you can specify any bracket you'd like, as long as it's a character vector of length 64. Once you've done so, you can use the sim.bracket function to play out your own personal tournament and the draw.bracket function to display the outcome.

outcome = sim.bracket(
  bracket.empty = bracket.women.2024,
  prob.matrix = prob.matrix
)
draw.bracket(bracket.empty = bracket.women.2024, bracket.filled = outcome)

Congratulations are in order for South Carolina, for winning the set.seed(2024) NCAA women's basketball tournament!
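For intuition about what a tournament simulation involves, here is a minimal sketch of playing out a single-elimination bracket from a probability matrix. It is not the sim.bracket code: the helper sim_single_elim is hypothetical, and it assumes a bracket ordered so that adjacent entries play each other, which differs from the overall-seed ordering used by bracket.women.2024.

```r
# Hypothetical helper: play a single-elimination bracket round by round.
# `bracket` is a character vector whose adjacent entries meet in round one;
# `prob[j, k]` is the probability that team j beats team k.
sim_single_elim <- function(bracket, prob) {
  rounds <- list()
  alive <- bracket
  while (length(alive) > 1) {
    a <- alive[seq(1, length(alive), by = 2)]
    b <- alive[seq(2, length(alive), by = 2)]
    p <- prob[cbind(a, b)]               # look up each matchup by team name
    a_wins <- runif(length(a)) < p       # one random draw per game
    alive <- ifelse(a_wins, a, b)
    rounds[[length(rounds) + 1]] <- alive
  }
  rounds                                 # winners of each round, in order
}

# Toy example: eight made-up teams, every game a coin flip
toy.teams <- paste0("T", 1:8)
toy.prob <- matrix(0.5, 8, 8, dimnames = list(toy.teams, toy.teams))
set.seed(2024)
toy.outcome <- sim_single_elim(toy.teams, toy.prob)
toy.outcome[[length(toy.outcome)]]       # the simulated champion
```

Repeating a simulation like this many times gives the distribution of outcomes against which a set of picks can be scored.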

Finding a good bracket

That's all well and good for South Carolina, but wouldn't you prefer to win something yourself? Now you can, using the find.bracket function. This function produces a number (num.candidates) of candidate brackets and then evaluates each of them across a number (num.sims) of simulations. The larger you make num.candidates and num.sims, the better the bracket will be, but at the cost of increased computation time. You can choose whether you want the bracket which maximizes your expected score, expected percentile within your pool, or even the probability of winning your pool!

You can also customize these results based on the scoring rules and size (excluding you) of your pool. Below we search for a good bracket to use if we want to maximize our chances of winning a pool with 30 other people in it, using the default scoring rules for CBS Sports.

my.bracket = find.bracket(
  bracket.empty = bracket.women.2024,
  prob.matrix = prob.matrix,
  league = "women",
  num.candidates = 100,
  num.sims = 1000,
  criterion = "win",
  pool.size = 30,
  bonus.round = c(1, 2, 4, 8, 16, 32),
  bonus.seed = rep(0, 16),
  bonus.combine = "add"
)
draw.bracket(bracket.empty = bracket.women.2024, bracket.filled = my.bracket)
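To see what the per-round points in bonus.round imply, here is a minimal sketch of scoring a set of picks against an outcome. The helper score_picks and the 63-entry layout (32 first-round winners, then 16, 8, 4, 2, and 1) are assumptions for illustration, not the package's internals, and the sketch ignores bonus.seed and bonus.combine, which is harmless here only because the seed bonuses are all zero.

```r
# Hypothetical helper: points earned by `picks` given the true `outcome`.
# Both are length-63 vectors of winning team IDs, ordered round by round.
score_picks <- function(picks, outcome, bonus.round = c(1, 2, 4, 8, 16, 32)) {
  round.of.game <- rep(1:6, times = c(32, 16, 8, 4, 2, 1))
  sum(bonus.round[round.of.game] * (picks == outcome))
}

# Toy check: a perfect bracket earns 32 points in every one of the 6 rounds
toy.result <- as.character(seq_len(63))
score_picks(toy.result, toy.result)

# Missing only the championship game costs the 32 points for that round
toy.picks <- toy.result
toy.picks[63] <- "someone else"
score_picks(toy.picks, toy.result)
```

With these rules every round is worth the same total, so a single late-round pick carries as much weight as many early-round picks combined.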

Testing your bracket

So you're picking South Carolina? Way to go out on a limb. Let's see how well we can expect this bracket to do, using the test.bracket function. This function simulates your pool to determine your expected score, expected percentile and probability of winning.

test = test.bracket(
  bracket.empty = bracket.women.2024,
  bracket.picks = my.bracket,
  prob.matrix = prob.matrix,
  league = "women",
  pool.size = 30,
  num.sims = 1000,
  bonus.round = c(1, 2, 4, 8, 16, 32),
  bonus.seed = rep(0, 16),
  bonus.combine = "add"
)
hist(test$score, breaks = 20)
hist(test$percentile, breaks = 20)

There you have it. You can expect this bracket to win 17.3% of the Groundhog Day replays of your bracket pool. Here's to the "real" outcome being one of those 17.3%!
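The three summaries mentioned above (expected score, expected percentile, and probability of winning) can all be read off simulated scores. The sketch below shows one way to compute them. The scores are random placeholders rather than output from test.bracket, and the handling of ties (a tie does not count as a win) is an assumption.

```r
set.seed(1)
num.sims <- 1000
pool.size <- 30

# Placeholder scores: one row per simulated tournament
my.score <- rnorm(num.sims, mean = 100, sd = 20)
opp.score <- matrix(rnorm(num.sims * pool.size, mean = 95, sd = 20),
                    nrow = num.sims, ncol = pool.size)

# Fraction of opponents beaten in each simulation
percentile <- rowMeans(opp.score < my.score)

# Outright win: strictly higher than every opponent in that simulation
win <- my.score > apply(opp.score, 1, max)

c(expected.score = mean(my.score),
  expected.percentile = mean(percentile),
  prob.win = mean(win))
```

Because the win probability is an average over num.sims simulations, it carries Monte Carlo error of roughly sqrt(p * (1 - p) / num.sims). For an estimate near 17% from 1000 simulations that is about 1.2 percentage points, so a larger num.sims gives a steadier number.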

elishayer/mRchmadness documentation built on March 27, 2024, 2:11 p.m.