knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
future::plan("multisession")
options(dplyr.summarise.inform = FALSE)
options(nflreadr.verbose = FALSE)

If you are new to R or are having trouble understanding the code in the below sections we highly recommend the nflfastR beginner's guide in vignette("beginners_guide").

The Main Functions

nflfastR comes with a set of functions to access NFL play-by-play data and team rosters. This section provides a brief introduction to the essential functions.

nflfastR processes and cleans up play-by-play data and adds variables through it's models. Since some of these tasks are performed by separate functions, the easiest way to compute the complete nflfastR dataset is build_nflfastR_pbp(). The main input for that function is a set of game ids which can be accessed with fast_scraper_schedules(). The following code demonstrates how to build the nflfastR dataset for the Super Bowls of the 2017 - 2019 seasons.

library(nflfastR)
library(dplyr, warn.conflicts = FALSE)
ids <- nflfastR::fast_scraper_schedules(2017:2019) %>%
  dplyr::filter(game_type == "SB") %>%
  dplyr::pull(game_id)
pbp <- nflfastR::build_nflfastR_pbp(ids)

In most cases, however, it is not necessary to use this function for individual games, because nflfastR provides both a data repository and two main play-by-play functions: load_pbp() and update_db(). We cover load_pbp() below, and please see [Example 8: Using the built-in database function] for how to work with the database function update_db().

The easiest way to access the data in the data repository is the new function load_pbp(). It can load multiple seasons directly into memory and supports multiple data formats. Loading all play-by-play data of the 2018-2020 seasons is as easy as

pbp <- nflfastR::load_pbp(2018:2020)

Joining roster data to the play-by-play data set is possible as well. The data can be accessed with the function fast_scraper_roster() and its application is demonstrated in [Example 10: Working with roster and position data].

Application Examples

All examples listed below assume that the following two libraries are installed and loaded.

``` {r load, warning = FALSE, message = FALSE} library(nflfastR) library(tidyverse)

## Example 1: replicate nflscrapR with fast_scraper

The functionality of `nflscrapR` can be duplicated by using `fast_scraper()`. This obtains the same information contained in `nflscrapR` (plus some extra) but much more quickly. To compare to `nflscrapR`, we use their data repository as the program no longer functions now that the NFL has taken down the old Gamecenter feed. Note that EP differs from nflscrapR as we use a newer era-adjusted model (more on this in [this post on Open Source Football](https://www.opensourcefootball.com/posts/2020-09-28-nflfastr-ep-wp-and-cp-models/)).

This example also uses the built-in function `clean_pbp()` to create a 'name' column for the primary player involved (the QB on pass play or ball-carrier on run play).

``` {r ex1-nflscrapR, warning = FALSE, message = FALSE}
readr::read_csv("https://github.com/ryurko/nflscrapR-data/blob/master/play_by_play_data/regular_season/reg_pbp_2019.csv?raw=true") %>%
  dplyr::filter(home_team == "SF" & away_team == "SEA") %>%
  dplyr::select(desc, play_type, ep, epa, home_wp) %>%
  utils::head(6) %>%
  knitr::kable(digits = 3)

``` {r ex1-fs, warning = FALSE, message = FALSE} nflfastR::fast_scraper("2019_10_SEA_SF") %>% nflfastR::clean_pbp() %>% dplyr::select(desc, play_type, ep, epa, home_wp, name) %>% utils::head(6) %>% knitr::kable(digits = 3)

## Example 2: scrape a batch of games very quickly with fast_scraper

This is a demonstration of `nflfastR`'s capabilities. While `nflfastR` can scrape a batch of games very quickly, **please be respectful of Github's servers and use the [data repository](https://github.com/nflverse/nflfastR-data) which hosts all the scraped and cleaned data** whenever possible. The only reason to ever actually use the scraper is if it's in the middle of the season and we haven't updated the repository with recent games (but it is automatically updated overnight every day).

``` {r ex2-bigscrape, warning = FALSE, message = FALSE}
# get list of some games from 2019
games_2019 <- nflfastR::fast_scraper_schedules(2019) %>%
  utils::head(10) %>%
  dplyr::pull(game_id)
tictoc::tic(glue::glue("{length(games_2019)} games with nflfastR:"))
f <- nflfastR::fast_scraper(games_2019)
tictoc::toc()

Example 3: Completion Percentage Over Expected (CPOE)

Let's look at CPOE leaders from the 2009 regular season.

As discussed above, nflfastR has a data repository for old seasons, so there's no need to actually scrape them. Let's use that here with the convenience function load_pbp() which fetches data from the repository (for non-R users, .csv and .parquet are also available in the data repository).

``` {r ex3-cpoe, warning = FALSE, message = FALSE} tictoc::tic("loading all games from 2009") games_2009 <- nflfastR::load_pbp(2009) %>% dplyr::filter(season_type == "REG") tictoc::toc() games_2009 %>% dplyr::filter(!is.na(cpoe)) %>% dplyr::group_by(passer_player_name) %>% dplyr::summarize(cpoe = mean(cpoe), Atts = n()) %>% dplyr::filter(Atts > 200) %>% dplyr::arrange(-cpoe) %>% utils::head(5) %>% knitr::kable(digits = 1)

## Example 4: Using Drive Information

When working with `nflfastR`, drive results are automatically included. We use `fixed_drive` and `fixed_drive_result` since the NFL-provided information is a bit wonky. Let's look at how much more likely teams were to score starting from 1st & 10 at their own 20 yard line in 2015 (the last year before touchbacks on kickoffs changed to the 25) than in 2000.

``` {r ex4, warning = FALSE, message = FALSE}
pbp <- nflfastR::load_pbp(c(2003, 2015))

out <- pbp %>%
  dplyr::filter(season_type == "REG" & down == 1 & ydstogo == 10 & yardline_100 == 80) %>%
  dplyr::mutate(drive_score = dplyr::if_else(fixed_drive_result %in% c("Touchdown", "Field goal"), 1, 0)) %>%
  dplyr::group_by(season) %>%
  dplyr::summarize(drive_score = mean(drive_score))

out %>% 
  knitr::kable(digits = 3)

So r scales::percent(out$drive_score[1], accuracy = 0.1) of 1st & 10 plays from teams' own 20 would see the drive end up in a score in 2003, compared to r scales::percent(out$drive_score[2], accuracy = 0.1) in 2015. This has implications for Expected Points models (see this article).

Example 5: Plot offensive and defensive EPA per play for a given season

Let's build the NFL team tiers using offensive and defensive expected points added per play for the 2005 regular season. Creating data viz including NFL team logos (or wordmarks, or headshots), we recommend the nflverse R package nflplotR.

When using load_pbp(), the helper function clean_pbp() has already been run, which creates "rush" and "pass" columns that (a) properly count sacks and scrambles as pass plays and (b) properly include plays with penalties. Using this, we can keep only rush or pass plays.

library(nflplotR)
pbp <- nflfastR::load_pbp(2005) %>%
  dplyr::filter(season_type == "REG") %>%
  dplyr::filter(!is.na(posteam) & (rush == 1 | pass == 1))
offense <- pbp %>%
  dplyr::group_by(team = posteam) %>%
  dplyr::summarise(off_epa = mean(epa, na.rm = TRUE))
defense <- pbp %>%
  dplyr::group_by(team = defteam) %>%
  dplyr::summarise(def_epa = mean(epa, na.rm = TRUE))
offense %>%
  dplyr::inner_join(defense, by = "team") %>%
  ggplot2::ggplot(aes(x = off_epa, y = def_epa)) +
  ggplot2::geom_abline(slope = -1.5, intercept = c(.4, .3, .2, .1, 0, -.1, -.2, -.3), alpha = .2) +
  nflplotR::geom_mean_lines(aes(y0 = off_epa, x0 = def_epa)) +
  nflplotR::geom_nfl_logos(aes(team_abbr = team), width = 0.07, alpha = 0.7) +
  ggplot2::labs(
    x = "Offense EPA/play",
    y = "Defense EPA/play",
    caption = "Data: @nflfastR",
    title = "2005 NFL Offensive and Defensive EPA per Play"
  ) +
  ggplot2::theme_bw() +
  ggplot2::theme(
    plot.title = ggplot2::element_text(size = 12, hjust = 0.5, face = "bold")
  ) +
  ggplot2::scale_y_reverse()

Example 6: Expected Points calculator

We have provided a calculator for working with the Expected Points model. Here is an example of how to use it, looking for how the Expected Points on a drive beginning following a touchback has changed over time.

While I have put in 'SEA' for home_team and posteam, this only matters for figuring out whether the team with the ball is the home team (there's no actual effect for given team; it would be the same no matter what team is supplied).

``` {r ex6a} data <- tibble::tibble( "season" = 1999:2019, "home_team" = "SEA", "posteam" = "SEA", "roof" = "outdoors", "half_seconds_remaining" = 1800, "yardline_100" = c(rep(80, 17), rep(75, 4)), "down" = 1, "ydstogo" = 10, "posteam_timeouts_remaining" = 3, "defteam_timeouts_remaining" = 3 ) nflfastR::calculate_expected_points(data) %>% dplyr::select(season, yardline_100, td_prob, ep) %>% knitr::kable(digits = 2)

Not surprisingly, offenses have become much more successful over time, with the kickoff touchback moving from the 20 to the 25 in 2016 providing an additional boost. Note that the `td_prob` in this example is the probability that the next score within the same half will be a touchdown scored by team with the ball, **not** the probability that the current drive will end in a touchdown (this is why the numbers are different from Example 4 above).

We could compare the most recent four years to the expectation for playing in a dome by inputting all the same things and changing the `roof` input:

``` {r ex6b}
data <- tibble::tibble(
  "season" = 2016:2019,
  "week" = 5,
  "home_team" = "SEA",
  "posteam" = "SEA",
  "roof" = "dome",
  "half_seconds_remaining" = 1800,
  "yardline_100" = c(rep(75, 4)),
  "down" = 1,
  "ydstogo" = 10,
  "posteam_timeouts_remaining" = 3,
  "defteam_timeouts_remaining" = 3
)
nflfastR::calculate_expected_points(data) %>%
  dplyr::select(season, yardline_100, td_prob, ep) %>%
  knitr::kable(digits = 2)

So for 2018 and 2019, 1st & 10 from a home team's own 25 yard line had higher EP in domes than at home, which is to be expected.

Example 7: Win probability calculator

We have also provided a calculator for working with the win probability models. Here is an example of how to use it, looking for how the win probability to begin the game depends on the pre-game spread.

While I have put in 'SEA' for home_team and posteam, this only matters for figuring out whether the team with the ball is the home team (there's no actual effect for given team; it would be the same no matter what team is supplied).

``` {r ex7} data <- tibble::tibble( "receive_2h_ko" = 0, "home_team" = "SEA", "posteam" = "SEA", "score_differential" = 0, "half_seconds_remaining" = 1800, "game_seconds_remaining" = 3600, "spread_line" = c(1, 3, 4, 7, 14), "down" = 1, "ydstogo" = 10, "yardline_100" = 75, "posteam_timeouts_remaining" = 3, "defteam_timeouts_remaining" = 3 ) nflfastR::calculate_win_probability(data) %>% dplyr::select(spread_line, wp, vegas_wp) %>% knitr::kable(digits = 2)

Not surprisingly, `vegas_wp` increases with the amount a team was coming into the game favored by.

## Example 8: Using the built-in database function

If you're comfortable using `dplyr` functions to manipulate and tidy data, you're ready to use a database. Why should you use a database?

* The provided function in `nflfastR` makes it extremely easy to build a database and keep it updated
* Play-by-play data over 20+ seasons takes up a lot of memory: working with a database allows you to only bring into memory what you actually need
* R makes it *extremely* easy to work with databases.

### Start: install and load packages

To start, we need to install the two packages required for this that aren't installed automatically when `nflfastR` installs: `DBI` and `RSQLite` (advanced users can use other types of databases, but this example will use SQLite). The `if` statements make sure the packages won't be updated if they are already installed:

``` {r eval = FALSE}
if (!require("DBI")) install.packages("DBI")
if (!require("RSQLite")) install.packages("RSQLite")

As with always, you only need to install these once. They don't need to be loaded to build the database because nflfastR knows how to use them, but we do need them later on when working with the database.

library(DBI)
library(RSQLite)

Build database

There's exactly one function in nflfastR that works with databases: update_db(). Some notes:

Let's say I just want to dump a database into the current working directory. Here we go!

``` {r create-db} nflfastR::update_db()

This created a database in the current directory called `pbp_db`.

Wait, that's it? That's it! What if it's partway through the season and you want to make sure all the new games are added to the database? What do you run? `update_db()`! (just make sure you're in the directory the database is saved in or you supply the right file path)

``` {r update-db}
nflfastR::update_db()

If it's partway through a season and you want to re-build a season to allow for data corrections from the NFL to propagate into your database, you can specify one season to be rebuilt:

``` {r update-db-yr} nflfastR::update_db(force_rebuild = 2020)

### Connect to database

Now we can make a connection to the database. This is the only part that will look a little bit foreign, but all you need to know is where your database is located. If it's in your current working directory, this will work:

``` {r}
connection <- DBI::dbConnect(RSQLite::SQLite(), "./pbp_db")
connection

It looks like nothing happened, but we now have a connection to the database. Now we're ready to do stuff. If you aren't familiar with databases, they're organized around tables. Here's how to see which tables are present in our database:

DBI::dbListTables(connection)

Since we went with the defaults, there's a table called nflfastR_pbp. Another useful function is to see the fields (i.e., columns) in a table:

DBI::dbListFields(connection, "nflfastR_pbp") %>%
  utils::head(10)

This is the same list as the list of columns in nflfastR play-by-play. Notice we had to supply the name of the table above ("nflfastR_pbp").

With all that out of the way, there's only a couple more things to learn. The main driver here is tbl, which helps get output with a specific table in a database:

pbp_db <- dplyr::tbl(connection, "nflfastR_pbp")

And now, everything will magically just "work": you can forget you're even working with a database!

pbp_db %>%
  dplyr::group_by(season) %>%
  dplyr::summarize(n = dplyr::n())
pbp_db %>%
  dplyr::filter(rush == 1 | pass == 1, down <= 2, !is.na(epa), !is.na(posteam)) %>%
  dplyr::group_by(pass) %>%
  dplyr::summarize(mean_epa = mean(epa, na.rm = TRUE))

So far, everything has stayed in the database. If you want to bring a query into memory, just use collect() at the end:

russ <- pbp_db %>%
  dplyr::filter(name == "R.Wilson" & posteam == "SEA") %>%
  dplyr::select(desc, epa) %>%
  dplyr::collect()
russ

So we've searched through about 1 million rows of data across 300+ columns and only brought about r round(nrow(russ), -1) rows and two columns into memory. Pretty neat! This is how I supply the data to the shiny apps on rbsdm.com without running out of memory on the server. Now there's only one more thing to remember. When you're finished doing what you need with the database:

DBI::dbDisconnect(connection)

For more details on using a database with nflfastR, see Thomas Mock's life-changing post here. More detailed information on dbplyr (the dplyr database back-end) are given in the second edition of Hadley Wickham's R for Data Science (2e).

Example 9: working with the expected yards after catch model

The variables in xyac are as follows:

Some other notes:

Let's create measures for EPA and first downs over expected in 2015:

``` {r ex9-xyac, warning = FALSE, message = FALSE} nflfastR::load_pbp(2015) %>% dplyr::group_by(receiver, receiver_id, posteam) %>% dplyr::mutate(tgt = sum(complete_pass + incomplete_pass)) %>% dplyr::filter(tgt >= 50) %>% dplyr::filter(complete_pass == 1, air_yards < yardline_100, !is.na(xyac_epa)) %>% dplyr::summarize( epa_oe = mean(yac_epa - xyac_epa), actual_fd = mean(first_down), expected_fd = mean(xyac_fd), fd_oe = mean(first_down - xyac_fd), rec = dplyr::n() ) %>% dplyr::ungroup() %>% dplyr::select(receiver, posteam, actual_fd, expected_fd, fd_oe, epa_oe, rec) %>% dplyr::arrange(-epa_oe) %>% utils::head(10) %>% knitr::kable(digits = 3)

The presence of so many running backs on this list suggests that even though it takes into account target depth and pass direction, the model doesn't do a great job capturing space. Alternatively, running backs might be better at generating yards after the catch since running with the football is their primary role.

## Example 10: Working with roster and position data

At long last, there's a way to merge the new play-by-play data with roster information. Use the function to get the rosters:

``` {r roster}
roster <- nflfastR::fast_scraper_roster(2019)

Now let's load play-by-play data from 2019: ``` {r roster_pbp_load} games_2019 <- nflfastR::load_pbp(2019)

Here is what the player IDs look like because `nflfastR` now automatically decodes IDs to look like the old format with GSIS IDs:

``` {r roster_pbp}
games_2019 %>%
  dplyr::filter(rush == 1 | pass == 1, posteam == "SEA") %>%
  dplyr::select(name, id)

Now we're ready to join to the roster data using these IDs: ``` {r decode_join} joined <- games_2019 %>% dplyr::filter(!is.na(receiver_id)) %>% dplyr::select(posteam, season, desc, receiver, receiver_id, epa) %>% dplyr::left_join(roster, by = c("receiver_id" = "gsis_id"))

``` {r decode_table}
# the real work is done, this just makes a table and has it look nice
joined %>%
  dplyr::filter(position %in% c("WR", "TE", "RB")) %>%
  dplyr::group_by(receiver_id, receiver, position) %>%
  dplyr::summarize(tot_epa = sum(epa), n = n()) %>%
  dplyr::arrange(-tot_epa) %>%
  dplyr::ungroup() %>%
  dplyr::group_by(position) %>%
  dplyr::mutate(position_rank = 1:n()) %>%
  dplyr::filter(position_rank <= 5) %>%
  dplyr::rename(Pos_Rank = position_rank, Player = receiver, Pos = position, Tgt = n, EPA = tot_epa) %>%
  dplyr::select(Player, Pos, Pos_Rank, Tgt, EPA) %>%
  knitr::kable(digits = 0)

Not surprisingly, all 5 of the top 5 WRs in terms of EPA added come in ahead of the top RB. Note that the number of targets won't match official stats because we're including plays with penalties.

Example 11: Replicating official stats

The columns like name, passer, fantasy etc are nflfastR-created columns that mimic "real" football: i.e., excluding plays with spikes, counting scrambles and sacks as pass plays, etc. But if you're trying to replicate official statistics -- perhaps for fantasy purposes -- use the *_player_name and *_player_id columns.

Leaderboards

Let's try to replicate this page of passing leaders.

``` {r stats1} nflfastR::load_pbp(2020) %>% dplyr::filter(season_type == "REG", complete_pass == 1 | incomplete_pass == 1 | interception == 1, !is.na(down)) %>% dplyr::group_by(passer_player_name, posteam) %>% dplyr::summarize( yards = sum(passing_yards, na.rm = T), tds = sum(touchdown == 1 & td_team == posteam), ints = sum(interception), att = dplyr::n() ) %>% dplyr::arrange(-yards) %>% utils::head(10) %>% knitr::kable(digits = 0)

These match the official stats on NFL.com (note the filter for `season_type == "REG"` since official stats only count regular season games). Note that we're using `passing_yards` here because `yards_gained` is not equal to passing yards on plays with laterals.

While the above works, we've also provided a function that does this all for you: `calculate_player_stats()`. This function takes an `nflfastR` play-by-play dataframe as an input along with one other argument, `weekly`, which defaults to `FALSE`. When `weekly` is true, a week-by-week dataframe is returned (rather than an aggregate over the whole provided dataframe). Let's again replicate the top 10 players in passing yards:

``` {r stats2}
nflfastR::load_pbp(2020) %>%
  dplyr::filter(season_type == "REG") %>%
  nflfastR::calculate_player_stats() %>%
  dplyr::arrange(-passing_yards) %>%
  dplyr::select(player_name, recent_team, completions, attempts, passing_yards, passing_tds, interceptions) %>%
  utils::head(10) %>%
  knitr::kable(digits = 0)

We can do the same for rush attempts to replicate the NFL leaderboard:

``` {r stats_rush} nflfastR::load_pbp(2020) %>% dplyr::filter(season_type == "REG") %>% nflfastR::calculate_player_stats() %>% dplyr::arrange(-rushing_yards) %>% dplyr::select(player_name, recent_team, carries, rushing_yards, rushing_tds, rushing_fumbles_lost) %>% utils::head(10) %>% knitr::kable(digits = 0)

Again, this matches up exactly.

### Yards from scrimmage

What if we want total yards from scrimmage? We'll demonstrate three methods here. The hardest way is to use the `fantasy_player_name` column, which is the rusher on rush plays and receiver on receiving plays:

``` {r stats_fantasy}
nflfastR::load_pbp(2020) %>%
  dplyr::filter(season_type == "REG", !is.na(down)) %>%
  dplyr::group_by(fantasy_player_name, posteam) %>%
  dplyr::summarize(
    carries = sum(rush_attempt),
    receptions = sum(complete_pass),
    touches = sum(rush_attempt + complete_pass),
    yards = sum(yards_gained),
    tds = sum(touchdown == 1 & td_team == posteam)
  ) %>%
  dplyr::arrange(-yards) %>%
  utils::head(10) %>%
  knitr::kable(digits = 0)

Looking at the PFR scrimmage stats, these columns are an exact match.

But we could also just use calculate_player_stats() again:

``` {r stats_fantasy2} nflfastR::load_pbp(2020) %>% dplyr::filter(season_type == "REG") %>% nflfastR::calculate_player_stats() %>% dplyr::mutate( yards = rushing_yards + receiving_yards, touches = carries + receptions, tds = rushing_tds + receiving_tds ) %>% dplyr::arrange(-yards) %>% dplyr::select(player_name, recent_team, carries, receptions, touches, yards, tds) %>% utils::head(10) %>% knitr::kable(digits = 0)

And we get the same thing.

The third way is to use the `load_player_stats()` function, which can load a data frame of player-level stats for every week since 1999.

``` {r stats_fantasy3}
nflfastR::load_player_stats(seasons = 2020) %>%
  dplyr::filter(season_type == "REG") %>%
  dplyr::group_by(player_id) %>%
  dplyr::summarize(
    player_name = dplyr::first(player_name),
    recent_team = dplyr::first(recent_team),
    yards = sum(rushing_yards + receiving_yards),
    touches = sum(carries + receptions),
    carries = sum(carries),
    receptions = sum(receptions),
    tds = sum(rushing_tds + receiving_tds)
  ) %>%
  dplyr::ungroup() %>%
  dplyr::arrange(-yards) %>%
  dplyr::select(player_name, recent_team, carries, receptions, touches, yards, tds) %>%
  utils::head(10) %>%
  knitr::kable(digits = 0)

And again the output is identical.

Fantasy points

Let's calculate PPR fantasy points per game in the first 16 weeks of the season among wide receivers who appeared in more than 5 games.

{r stats_fantasy4} nflfastR::load_pbp(2020) %>% dplyr::filter(week <= 16) %>% nflfastR::calculate_player_stats() %>% dplyr::mutate( ppg = fantasy_points_ppr / games ) %>% dplyr::filter(games > 5) %>% # only keep the WRs dplyr::inner_join( nflfastR::fast_scraper_roster(2020) %>% dplyr::filter(position == "WR") %>% dplyr::select(player_id = gsis_id), by = "player_id" ) %>% dplyr::arrange(-ppg) %>% dplyr::select(player_name, recent_team, games, fantasy_points_ppr, ppg) %>% utils::head(10) %>% knitr::kable(digits = 1)

Comparing to the FantasyPros website, this is an exact match.

Frequent issues

The drive column looks wacky

Use fixed_drive and fixed_drive_result instead. See [Example 4: Using Drive Information].

Why are there so many win probability columns?

vegas_wp and vegas_home_wp incorporate the pregame spread and are much better models.

I'm trying to do X. Help!

Please ask in the Discord channel.

Links

This section is a helper that holds the hyperlinks to the above chapters. It's a workaround for the missing sections anchor bug in pkgdown which will hopefully be fixed with this pull request at some point in the future.



mrcaseb/nflfastR documentation built on May 11, 2024, 10:31 p.m.