knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
# ifelse(future::supportsMulticore(), future::plan("multicore"), future::plan("multisession"))

Running a Simulation

Loading the package is obligatory, so it is done first (along with dplyr for data wrangling and the pipe):

library(nflseedR)
library(dplyr, warn.conflicts = FALSE)
options(digits = 3)

Note: For this guide, we'll set an initial random seed of 4 at the beginning and simulations = 100 for the purposes of this document so you can follow along by entering the same code and get the same results shown here. We'll also set fresh_season = TRUE to blank out the existing results from the 2020 season, but normally when simulating an incomplete season, you wouldn't do these things.

set.seed(4)
sims <- simulate_nfl(
  nfl_season = 2020,
  fresh_season = TRUE,
  simulations = 100
)

The output contains a lot of pre-aggregated information, as well as the individual results from each game of each simulation. For example, let's look at the overall results for the Bears:

sims$overall |> dplyr::filter(team == "CHI") |> knitr::kable()

We can see the Bears got 10.8 wins on average. They made the playoffs 87% of simulations, won the division in 39%, won the Super Bowl in 4%, and only in 1% did they receive a top five draft pick. The teams section of the output will show how a team did in each simulated season.

sims$teams |>
  dplyr::filter(team == "CHI") |>
  dplyr::select(sim, team, wins, seed, draft_order) |> 
  utils::head(6) |>
  knitr::kable()

Let's check out the playoff games from the first simulation, where the Bears went 10-6 and got the 6th seed.

sims$games |> dplyr::filter(sim == 1, game_type != "REG") |> knitr::kable()

In this simulation, the Bears beat the Rams in a wildcard game by 26 points, then beat the Cowboys in the divisional round by 11 points, took the Eagles by a field goal in the NFC Championship Game, and finally defeated the Patriots by 6 in the Super Bowl.

As you may have gathered at this point, the default simulation code picks a random Elo for every team, and uses those as the starting Elo ratings for all 32 teams. However, the default code Elo will adjust independently within each simulation as each week is simulated. (The Elo model used is loosely based off of that of FiveThirtyEight.)

Use Your Own Model

But of course the real value of nflseedR is putting in your own model into the simulator. To accomplish this, you can write your own function which will determine the output of games instead. As an example, here's a very stupid model that makes the team earlier alphabetically win by 3 points 90% of the time, and lose by 3 points the other 10% of the time.

stupid_games_model <- function(teams, games, week_num, ...) {
  # make the earlier alphabetical team win 90% of the time
  games <- games |>
    dplyr::mutate(
      result = dplyr::case_when(
        !is.na(result) | week != week_num ~ result,
        away_team < home_team ~ sample(c(-3, 3), n(), prob = c(0.9, 0.1), replace = TRUE),
        away_team > home_team ~ sample(c(-3, 3), n(), prob = c(0.1, 0.9), replace = TRUE),
        TRUE ~ 0
      )
    )

  # return values
  return(list(teams = teams, games = games))
}

When you create this function, the first two inputs are data on the teams (one row per team per sim), and data on the games (one row per game per sim). The third argument is the week number currently being simulated, as only one week is processed at a time.

Your function's job - by whatever means you choose - is to update the result column for that week's games in each of the sims with the number of points the home team won by (or lost by if negative, or 0 if the game ended in a tie).

It returns both the teams and the games data. It does both because this way you can store information in new columns by team or by game to use in the next call. Make sure your code both accepts and returns the appropriate information, or the simulator will break!

For example, the default function updates a team's Elo after the game, and stores it in the teams data. When the simulator processes the next week, it uses the updated Elo rating to inform the team's next game.

!! Also, make sure you aren't overriding completed games or games that aren't in the current week of w. The simulator will not stop you from setting past, present, or future game results in your function, whether you meant to do so or not. !!

Let's run a simulation with stupid_games_model and see what happens:

sims2 <- simulate_nfl(
  nfl_season = 2020,
  process_games = stupid_games_model,
  fresh_season = TRUE,
  simulations = 100
)

sims2$overall |> dplyr::arrange(team) |> utils::head() |> knitr::kable()
sims2$overall |> dplyr::arrange(team) |> utils::tail() |> knitr::kable()

As you might expect, the earliest alphabetical teams win a lot. The Cardinals won the Super Bowl in 59% of seasons! Meanwhile, the teams at the bottom alphabetically are virtually certain to be at the top of the draft order.

Adding In Your Own Data

This is all well and good, you might be thinking, but your model works off of other data not in the simulator! How can that work? This is where we utilize R's ability to have generic arguments.

The ... at the end of the function definition means that the function can be called with any number of additional arguments. You can name these whatever you want, as long as they're not already the name of other defined arguments.

When you call the simulate_nfl() function, it too uses the ... syntax, which allows you to pass in any number of additional arguments to the function. The simulator will in turn pass these on to your function that processes games.

For example, let's slightly modify our last example:

biased_games_model <- function(teams, games, week_num, ...) {

  # arguments
  args <- list(...)
  best <- ""
  worst <- ""

  # best team?
  if ("best" %in% names(args)) {
    best <- args$best
  }

  # worst team?
  if ("worst" %in% names(args)) {
    worst <- args$worst
  }

  # make the best team always win and the worst team always lose
  # otherwise, make the earlier alphabetical team win 90% of the time
  games <- games |>
    dplyr::mutate(
      result = dplyr::case_when(
        !is.na(result) | week != week_num ~ result,
        away_team == best | home_team == worst ~ -3,
        away_team == worst | home_team == best ~ 3,
        away_team < home_team ~ sample(c(-3, 3), n(), prob = c(0.9, 0.1), replace = TRUE),
        away_team > home_team ~ sample(c(-3, 3), n(), prob = c(0.1, 0.9), replace = TRUE),
        TRUE ~ 0
      )
    )

  # return values
  return(list(teams = teams, games = games))
}

This allows us to define best and worst, and use that information to determine a result (in this case, have the best team always win and the worst team always lose). While best and worst are in this example single-length character vectors, they can be data frames or any other R data type.

Let's simulate using this:

sims3 <- simulate_nfl(
  nfl_season = 2020,
  process_games = biased_games_model, 
  fresh_season = TRUE, 
  simulations = 100,
  best = "CHI", 
  worst = "GB"
)

Now let nflseedR summarize the simulation for you by using summary() with the nflseedR simulation object. This will print a gt table.

summary(sims3)

And this shows exactly what we expect. By defining the Bears as the best team, they always go 16-0, win the division, and win the Super Bowl. Interestingly, they do not always get the #1 seed. This makes sense, however, as in games without the Bears or the Packers, the alphabetically earlier teams still wins 90% of the time. The Cardinals would therefore be expected to go 16-0 in some of the simulations, and in some of those have thee tiebreakers over the Bears. However, even in these simulations, they'll still lose to Bears in the end when they meet in the playoffs.

Similarly, the Packers always go 0-16, and never make the playoffs. While in these simulated seasons they got the #1 draft pick every time, they aren't guaranteed to do so. Using the same logic as above, sometimes the Washington Commanders will go 0-16 too, and may beat the Packers out for the #1 pick through tiebreakers.

Passing Data in from One Week to the Next

Sometimes though you want your data to keep updating as the simulation progresses. For example, an Elo-based model that updates each team's Elo after each game. You can pass in the starting Elo values per team, and as games are simulated, update the Elo values for each team and store them in the teams data. This column will be part of the teams data passed into your function when the following week is simulated and your function is called.

Read the comments in the code below for specific tips on doing this but here are good ones:

elo_model <- function(teams, games, week_num, ...) {

  # round out (away from zero)
  # this way the simulator never simulates a tie
  # the simulator will still allow ties to be simulated if you want
  # ... but not on playoff games
  round_out <- function(x) {
    x[!is.na(x) & x < 0] <- floor(x[!is.na(x) & x < 0])
    x[!is.na(x) & x > 0] <- ceiling(x[!is.na(x) & x > 0])
    return(x)
  }

  # we're going to store elo as a new columns in the teams data
  # it won't start off there of course, so we need to determine it
  # from our arguments
  if (!("elo" %in% colnames(teams))) {
    args <- list(...)
    if ("elo" %in% names(args)) {
      # pull the elo info from custom arguments
      teams <- teams |>
        dplyr::inner_join(args$elo |> dplyr::select(team, elo), by = c("team" = "team"))
    } else {
      # error with a friendly error message if no elo data is passed in
      stop("Pass in a tibble `elo` as an argument to `simulate_nfl()`")
    }
  }

  # isolate the ratings data by sim and by team only
  # we will want to join to the games data later and don't want excess columns
  ratings <- teams |> dplyr::select(sim, team, elo)

  # simulate game outcomes
  games <- games |>
    # add in the away team's elo to the game data
    # note we join on both `sim` and the team
    # always join on `sim` to make sure each sim cares about only its data
    dplyr::inner_join(ratings, by = c("sim" = "sim", "away_team" = "team")) |>
    dplyr::rename(away_elo = elo) |>
    # repeat for the home team as well
    dplyr::inner_join(ratings, by = c("sim" = "sim", "home_team" = "team")) |>
    dplyr::rename(home_elo = elo) |>
    dplyr::mutate(
      # calculate the elo difference
      elo_diff = home_elo - away_elo,
      # add in a small HFA amount if played at home
      elo_diff = elo_diff + ifelse(location == "Home", 20, 0),
      # make an adjustment for rest
      elo_diff = elo_diff + (home_rest - away_rest) / 7 * 25,
      # playoff games swing elo more
      elo_diff = elo_diff * ifelse(game_type == "REG", 1, 1.2),
      # from elo, we calculate the home team's win percentage
      wp = 1 / (10^(-elo_diff / 400) + 1),
      # we also can calculate the estimate (mean points home team wins by)
      estimate = elo_diff / 25,
      result = dplyr::case_when(
        # !!! ALWAYS DO THIS NEXT LINE IN YOUR `result` CHANGES !!!
        # you have to make sure you're only changing unfinished games in current week
        # if you don't do this, it will usually error out on a friendly error message
        is.na(result) & week == week_num ~ 
          as.integer(round_out(rnorm(n(), estimate, 13))),
        # if not this week or known result, leave as-is
        TRUE ~ as.integer(result)
      ),
      # simplify to 1 = win, 0 = loss, 0.5 = tie to help calculate elo shift
      outcome = dplyr::case_when(
        is.na(result) ~ NA_real_,
        result > 0 ~ 1,
        result < 0 ~ 0,
        TRUE ~ 0.5
      ),
      # calculate the amount to adjust home team's elo by
      elo_input = dplyr::case_when(
        is.na(result) ~ NA_real_,
        result > 0 ~ elo_diff * 0.001 + 2.2,
        result < 0 ~ -elo_diff * 0.001 + 2.2,
        TRUE ~ 1.0,
      ),
      elo_mult = log(pmax(abs(result), 1) + 1.0) * 2.2 / elo_input,
      elo_shift = 20 * elo_mult * (outcome - wp)
    ) |>
    # we don't want these columns in `games` any more
    # remove any columns you don't need when you're done
    # otherwise the next week they'll get joined as `col.x` and `col.y`
    # which will almost certainly break your script
    dplyr::select(
      -away_elo, -home_elo, -elo_diff, -wp, -estimate,
      -outcome, -elo_input, -elo_mult
    )

  # apply elo shifts
  teams <- teams |>
    # join games results from this week to away teams (within same sim!)
    # note this is a LEFT join, we don't want to remove any teams rows
    dplyr::left_join(games |>
        dplyr::filter(week == week_num) |>
        dplyr::select(sim, away_team, elo_shift),
      by = c("sim" = "sim", "team" = "away_team")
    ) |>
    # away team's elo gets subtracted by elo amount
    # if the team wasn't an away team, do nothing
    dplyr::mutate(elo = elo - ifelse(!is.na(elo_shift), elo_shift, 0)) |>
    # we don't want to keep `elo_shift` in `teams` either, remove it
    dplyr::select(-elo_shift) |>
    # repeat the above except now do it for the home team
    dplyr::left_join(games |>
        dplyr::filter(week == week_num) |>
        dplyr::select(sim, home_team, elo_shift),
      by = c("sim" = "sim", "team" = "home_team")
    ) |>
    # note that a team on a bye will have `elo_shift` as NA for both joins
    # this means it won't change, which is what we want
    dplyr::mutate(elo = elo + ifelse(!is.na(elo_shift), elo_shift, 0)) |>
    dplyr::select(-elo_shift)

  # we need to keep `elo_shift` out of `games` too and we're done with it
  games <- games |>
    dplyr::select(-elo_shift)

  # return the updated teams and games information
  # note that `teams` will now have an updated `elo` column which will
  # be used for the next week's games
  # note that starting `elo` values are the same per-team... 
  # ... but after that will differ per sim depending on that sim's results
  return(list(teams = teams, games = games))
}

Let's generate initial random Elo values for each team. To see how this works, we'll supply an test_week = 3 as an argument into simulate_nfl() which will abort after simulating Week 3, and instead return the result of our elo_model() function.

initial_elo <- tibble::tibble(
  team = unique(nflseedR::divisions$team),
  elo = rnorm(length(unique(nflseedR::divisions$team)), 1500, 150)
)
test <- simulate_nfl(
  nfl_season = 2020,
  process_games = elo_model,
  elo = initial_elo,
  fresh_season = TRUE,
  test_week = 3
)

Let's look at the Bears' Elo after Week 3 in the top handful of simulations:

test$teams |>
  dplyr::filter(team == "CHI") |>
  utils::head() |>
  knitr::kable()

You can see that different simulations have different Elo results for the Bears, as the simulated seasons had different results for the games, and the Elos were adjusted accordingly.

Let's examine the Bears' games in that first simulation:

test$games |>
  filter(sim == 1) |>
  filter(away_team == "CHI" | home_team == "CHI")

Note that only the first three weeks have the result filled in, while the others are NA, indicating that game hasn't yet occurred or been simulated. This is because the test_week = 3 input aborted the simulation after Week 3, which was useful for seeing the Elo above.

Simulation Configuration

There is a lot of flexibility in how you choose to run the simulation. These are the parameters and how to configure them when you run the simulate_nfl() function.

Simulation Output

The output of simulate_nfl(), assuming you don't put in a test_week to debug your function, is a list of class "nflseedR_simulation" that holds four data frames with simulation results as well as a list of parameters used in the simulation. Here are the contents of each:



nflverse/nflseedR documentation built on April 17, 2025, 9:37 p.m.