Introduction

The lack of publicly available National Football League (NFL) data has been a major obstacle in the creation of modern, reproducible research in football analytics. While clean play-by-play and season-level data is available via open-source software packages in other sports, the equivalent datasets are not freely available for researchers interested in the statistical analysis of the NFL. We created a publicly available, open-source R package called nflscrapR that allows easy access to NFL data from 2009-2017. Using a JSON API maintained by the NFL, this package downloads, cleans, parses, and outputs datasets at the individual play, player, game, and season levels. In addition, this package uses win probability and expected points models, build using the data from this package, to provide advanced metrics for users. Our package allows for the advancement of NFL research in the public domain by allowing analysts to develop from a common source, enhancing reproducibility of NFL research. This document shows example use cases of the different functions of nflscrapR. These examples explore just a few different areas of analysis in the NFL showing the ease of using this package:

Loading the nflscrapR package

The devtools packages is needed to download nflscrapR as the package is currently hosted on Github:

library(devtools)

devtools::install_github(repo = "maksimhorowitz/nflscrapR")

# Load the package

library(nflscrapR, quietly = TRUE)
library(nflscrapR)

Games and GameID Function

The nflscrapR package provides a function which allows users to create a list of all games in a season along with each games associated GameID. Using the the season_games function, a dataframe with the home team, away team, the game date, and the GameID is created. The teams are denoted by their respective abbreviations. This function allows users to identify the GameIDs for matchups of interest to be used in the other functions of nflscrapR. See the example code below for how to use the season_games function:

Note: The season_games function takes a few minutes to run

# Loading all games from the 2014 season
games2014 <- season_games(Season = 2014)

head(games2014)

Play-by-Play Functions

The nflscrapR package contains two play-by-play functions. The single game function outputs a 99 column dataframe and the season long function outputs an 100 column dataframe, each with detailed information about each play including pre-snap information, play call, and post snap results, expected point values, and win probability measures. To explore the documentation of the function and to learn more about what each column describes, use the following code:

help(game_play_by_play)

Note: There are errors within the API maintained by the NFL. Numerous extra-point attempts are missing due to a bug in the API. Alas, this does not have a major effect on the data or the models.

Using game_play_by_play

Here, we will explore Superbowl XLVII between the Baltimore Ravens and the San Francisco 49ers. The game was won by the Ravens 34-31. Below we explore some interesting elements of the game using the game_play_by_play function:

# Downlaod the game data
superbowl47 <- game_play_by_play(GameID = 2013020300)

# Explore dataframe dimensions
dim(superbowl47) 

We see that Superbowl XLVII had r nrow(superbowl47) plays. Now we will explore whether one team dominated offensive possession over another:

# Counting Offensive Plays using dplyr
suppressMessages(library(dplyr))
superbowl47 %>% dplyr::group_by(posteam) %>% summarize(offensiveplays = n()) %>%
  filter(., posteam != "")

As seen above the Ravens ran more plays than the 49ers. It would be interestin to explore what this play differential means in terms of time of possession, scoring opportunities, and play selections. Some ideas on how to proceed are below:

For the sake of this example the last bullet is explored in more detail. To explore play level statistics we manipulate the following variables to get yards per play, points per play, and play duration:

Below using dplyr we add the aforementioned statistics to our dataframe and use ggplot2 to visualize the summarized data:

# Loading the ggplot2 library
suppressMessages(library(ggplot2))
suppressMessages(library(gridExtra))

# Using dplyr and knitr to find statistics 
sb_team_summary_stats <- superbowl47 %>% group_by(posteam) %>% 
                      summarize(offensiveplays = n(), 
                                avg.yards.gained = mean(Yards.Gained, 
                                                        na.rm = TRUE),
                                pointsperplay = max(PosTeamScore, na.rm = TRUE) / n(),
                                playduration = mean(PlayTimeDiff)) %>%
                        filter(., posteam != "") %>% 
                      as.data.frame() 

# Yards per play plot
plot_yards <- ggplot(sb_team_summary_stats, aes(x = posteam, y = avg.yards.gained)) +
             geom_bar(aes(fill = posteam), stat = "identity") +
             geom_label(aes(x = posteam, y = avg.yards.gained + .3, 
                            label = round(avg.yards.gained,2)),
                        size = 4, fontface = "bold") +
            labs(title = "Superbowl 47: Yards per Play by Team",
                 x = "Teams", y = "Average Yards per Play") +
            scale_fill_manual(values = c("#241773", "#B3995D")) +
  theme(plot.title = element_text(hjust = .5, face = "bold"))

# Points per play plot
plot_points <- ggplot(sb_team_summary_stats, aes(x = posteam, y = pointsperplay)) +
             geom_bar(aes(fill = posteam), stat = "identity") +
             geom_label(aes(x = posteam, y = pointsperplay + .05, 
                            label = round(pointsperplay,5)),
                        size = 4, fontface = "bold") +
            labs(title = "Superbowl 47: Points per Play by Team",
                 x = "Teams", y = "Points per Play") +
            scale_fill_manual(values = c("#241773", "#B3995D")) +
  theme(plot.title = element_text(hjust = .5, face = "bold"))

# Play duration plot
plot_time <- ggplot(sb_team_summary_stats, aes(x = posteam, y = playduration)) +
             geom_bar(aes(fill = posteam), stat = "identity") +
             geom_label(aes(x = posteam, y = playduration + .05, 
                            label = round(playduration,2)),
                        size = 4, fontface = "bold") +
            labs(title = "Superbowl 47: Average Play Time Duration \n by Team",
                 x = "Teams", y = "Average Play Duration") +
            scale_fill_manual(values = c("#241773", "#B3995D"))+
  theme(plot.title = element_text(hjust = .5, face = "bold"))

# Plotting the three charts together 
grid.arrange(plot_yards, plot_points, plot_time, ncol =2)

The season_play_by_play function

The above example is just one way to manipulate play-by-play dataframes to gather insights about a single football game. The season_play_by_play function allows users to gather further insights on the season level. To read an example of season_play_by_play, check out this analysis of Adrian Peterson's running tendancies posted on the CMU Sports Analytics club blog.

Detailed Boxscore Functions

Another set of useful and interesting functions in nflscrapR are the detailed box score functions. These functions output game and season level statistics ranging from passing to defense to kick returning. The season level functions output dataframes with 57 columns while the game level function outputs a dataframe with 56 columns.

There are three different detailed box score functions. The are summarized as follows:

Note: The season_player_game and agg_player_game functions take a few minutes to run.

Using season_player_game

Below, an example is shown using the season_player_game function. Explored in these plots are a number of ways to visualize player data across the seasons. The first plot shows Joe Flacco's passing yards by game across the past 8 seasons. Also, plotted below,s are his pass attempts per game to visualize his evolution from a game manager to a solid (average?) starting quarterback.

data(playerstats09, playerstats10, playerstats11, playerstats12,
     playerstats13, playerstats14, playerstats15)

allplayerstats <- dplyr::bind_rows(playerstats09, playerstats10, playerstats11,
                        playerstats12, playerstats13, playerstats14,
                        playerstats15, playerstats16)
# Loading all the statistics for each player by game from 2009-2016
# Note: The below code takes about 10 minutes to run.

playerstats09 <- season_player_game(Season = 2009)
playerstats10 <- season_player_game(Season = 2010)
playerstats11 <- season_player_game(Season = 2011)
playerstats12 <- season_player_game(Season = 2012)
playerstats13 <- season_player_game(Season = 2013)
playerstats14 <- season_player_game(Season = 2014)
playerstats15 <- season_player_game(Season = 2015)
playerstats16 <- season_player_game(Season = 2016)

# Combining into one dataframe using bind_rows()

allplayerstats <- dplyr::bind_rows(playerstats09, playerstats10, playerstats11,
                        playerstats12, playerstats13, playerstats14,
                        playerstats15, playerstats16)

# Examining dimensions:
dim(allplayerstats)
#### Using ggplot to explore Joe Flacco's passing tends ###

# filter for Flacco and arrange by game data
flacco_data <- dplyr::filter(allplayerstats, name == "J.Flacco") %>%
  arrange(date)

# Add games played. Note in 2015 Flacco was injured in Week 10:
flacco_data <- flacco_data %>%
  dplyr::group_by(Season) %>% 
  dplyr::mutate(gamenumber = row_number(date))

# Creating Passing Yards Plot by game
flacco_passyds_plot <- ggplot(flacco_data, aes(x = gamenumber, y = passyds)) + 
  theme_bw() +
  geom_bar(stat = "identity", aes(alpha = passyds), fill = "#241773") +
    theme(strip.background = element_rect(fill = "black", size = 1.5),
        strip.text = element_text(color = "white", face = "bold"),
        plot.title = element_text(hjust = 0.5, face = "bold")) +
  geom_label(aes(x = gamenumber, y = ifelse(passyds > 70, passyds - 60,
                                            passyds + 10), 
                 label = passyds), 
                 size = 3) + 
  ggtitle("Joe Flacco's Passing Yards by Game \n Across Seasons") +
  ylab("Passing Yards") + xlab("Game Number") + guides(fill = FALSE) +
  facet_wrap(~Season, ncol = 1) 

The plot of Flacco's passing yards across games provides an interesting visual. By examining the plot you can see that he usually has a strong first game followed by a much weaker second game (excluding the 2015 season). You can also see that in 2015 Flacco was on pace to throw for a career high in yards in a season before he was derailed by a torn ACL.

flacco_passyds_plot

Now examined are Flacco's pass attempts by game. This allows visualization of his progression from a game manager to the solid (average?) quarterback that he his today. One can observe the following insights from the bar plots below:

# Creating Passing Attempts Plot by Game across Seasons
ggplot(flacco_data, aes(x = gamenumber, y = pass.att)) + 
  theme_bw() +
  geom_bar(stat = "identity", aes(alpha = pass.att), fill = "#241773") +
    theme(strip.background = element_rect(fill = "black", size = 1.5),
        strip.text = element_text(color = "white", face = "bold"),
        plot.title = element_text(hjust = 0.5, face = "bold")) +
  geom_label(aes(x = gamenumber, y = ifelse(pass.att > 60, pass.att - 5,
                                            pass.att), 
                 label = pass.att), 
                 size = 3) + 
  ggtitle("Joe Flacco's Passing Attempts by Game \n Across Seasons") +
  ylab("Passing Attempts") + xlab("Game Number") + guides(fill = FALSE) +
  facet_wrap(~Season, ncol = 1) 

Using player_game and agg_player_season

Similar exploration of game level statistics can be done with with the player_game function. The nice thing about the player_game function is that it downloads, parses, and cleans the data very quickly so if you are interested in particular games, you can download the data instantaneously.

The agg_player_season function allows users to visualize season total statistics which is beneficial for building running lists of statistics or easily calculating totals from the available range of seasons.

Note, there is also a simple box score function available which separates each of the measured statistics into lists of dataframes. This is much more similar to what is seen in a standard box score. The function is named: simple_boxscore

Roster Function

Users can also download a data set of rosters for each team by season. The rosters include all players who recorded a measured statistic in the raw data. That is, if a player did not record either a passing, rushing, receiving, defensive, punt return, kick return, punting, or kicking statistic they will not be on the roster. To use the function, you must identify a season of interest and also identify the team abbreviations of your team of interest. To find team initials, load the nflteams data set stored in the package. The following code shows the roster for the Tennessee Titans in 2013:

# Find Titans Abbreviation

# Load team name dataset
data(nflteams)

# Find Tennessee's abbreviation
ten_abbr <- filter(nflteams, TeamName == "Tennessee Titans")$Abbr

# Load the Titans Roster from 2013
tenroster2013 <- season_rosters(Season = 2013, TeamInt = ten_abbr)

head(tenroster2013)

Drive Summary Function

The drive_summary function allows users to download datasets with information about each drive in a game. The input requires a GameID, and the outputted dataframe has 18 columns. The help documentation for the function describes each of the columns in more detail.

Using the drive_summary function

Displayed below is a dataset of the drive summaries of the last game of the 2015 season between the Minnesota Viking and the Green Bay Packers.

# Drive summary from final game of 2015 season
min_gb_drivesummary <- drive_summary(GameID = 2016010310)

head(min_gb_drivesummary, 8)


maksimhorowitz/nflscrapR documentation built on April 3, 2020, 7:40 p.m.