The lack of publicly available National Football League (NFL) data has been a major obstacle in the creation of modern, reproducible research in football analytics. While clean play-by-play and season-level data is available via open-source software packages in other sports, the equivalent datasets are not freely available for researchers interested in the statistical analysis of the NFL. We created a publicly available, open-source R package called nflscrapR
that allows easy access to NFL data from 2009-2017. Using a JSON API maintained by the NFL, this package downloads, cleans, parses, and outputs datasets at the individual play, player, game, and season levels. In addition, this package uses win probability and expected points models, build using the data from this package, to provide advanced metrics for users. Our package allows for the advancement of NFL research in the public domain by allowing analysts to develop from a common source, enhancing reproducibility of NFL research. This document shows example use cases of the different functions of nflscrapR. These examples explore just a few different areas of analysis in the NFL showing the ease of using this package:
nflscrapR
packageThe devtools packages is needed to download nflscrapR
as the package is currently hosted on Github:
library(devtools) devtools::install_github(repo = "maksimhorowitz/nflscrapR") # Load the package library(nflscrapR, quietly = TRUE)
library(nflscrapR)
The nflscrapR
package provides a function which allows users to create a list of all games in a season along with each games associated GameID. Using the the season_games
function, a dataframe with the home team, away team, the game date, and the GameID is created. The teams are denoted by their respective abbreviations. This function allows users to identify the GameIDs for matchups of interest to be used in the other functions of nflscrapR
. See the example code below for how to use the season_games function:
Note: The season_games
function takes a few minutes to run
# Loading all games from the 2014 season games2014 <- season_games(Season = 2014) head(games2014)
The nflscrapR
package contains two play-by-play
functions. The single game function outputs a 99 column dataframe and the season long function outputs an 100 column dataframe, each with detailed information about each play including pre-snap information, play call, and post snap results, expected point values, and win probability measures. To explore the documentation of the function and to learn more about what each column describes, use the following code:
help(game_play_by_play)
game_play_by_play
: extracts a single game's play-by-play dataseason_play_by_play
: extracts a full season's play-by-play dataNote: There are errors within the API maintained by the NFL. Numerous extra-point attempts are missing due to a bug in the API. Alas, this does not have a major effect on the data or the models.
game_play_by_play
Here, we will explore Superbowl XLVII between the Baltimore Ravens and the San
Francisco 49ers. The game was won by the Ravens 34-31. Below we explore some interesting elements of the game using the game_play_by_play
function:
# Downlaod the game data superbowl47 <- game_play_by_play(GameID = 2013020300) # Explore dataframe dimensions dim(superbowl47)
We see that Superbowl XLVII had r nrow(superbowl47)
plays. Now we will explore whether one team dominated offensive possession over another:
# Counting Offensive Plays using dplyr suppressMessages(library(dplyr)) superbowl47 %>% dplyr::group_by(posteam) %>% summarize(offensiveplays = n()) %>% filter(., posteam != "")
As seen above the Ravens ran more plays than the 49ers. It would be interestin to explore what this play differential means in terms of time of possession, scoring opportunities, and play selections. Some ideas on how to proceed are below:
For the sake of this example the last bullet is explored in more detail. To explore play level statistics we manipulate the following variables to get yards per play, points per play, and play duration:
Yards.Gained
PosTeamScore
PlayTimeDiff
Below using dplyr we add the aforementioned statistics to our dataframe and use
ggplot2
to visualize the summarized data:
# Loading the ggplot2 library suppressMessages(library(ggplot2)) suppressMessages(library(gridExtra)) # Using dplyr and knitr to find statistics sb_team_summary_stats <- superbowl47 %>% group_by(posteam) %>% summarize(offensiveplays = n(), avg.yards.gained = mean(Yards.Gained, na.rm = TRUE), pointsperplay = max(PosTeamScore, na.rm = TRUE) / n(), playduration = mean(PlayTimeDiff)) %>% filter(., posteam != "") %>% as.data.frame() # Yards per play plot plot_yards <- ggplot(sb_team_summary_stats, aes(x = posteam, y = avg.yards.gained)) + geom_bar(aes(fill = posteam), stat = "identity") + geom_label(aes(x = posteam, y = avg.yards.gained + .3, label = round(avg.yards.gained,2)), size = 4, fontface = "bold") + labs(title = "Superbowl 47: Yards per Play by Team", x = "Teams", y = "Average Yards per Play") + scale_fill_manual(values = c("#241773", "#B3995D")) + theme(plot.title = element_text(hjust = .5, face = "bold")) # Points per play plot plot_points <- ggplot(sb_team_summary_stats, aes(x = posteam, y = pointsperplay)) + geom_bar(aes(fill = posteam), stat = "identity") + geom_label(aes(x = posteam, y = pointsperplay + .05, label = round(pointsperplay,5)), size = 4, fontface = "bold") + labs(title = "Superbowl 47: Points per Play by Team", x = "Teams", y = "Points per Play") + scale_fill_manual(values = c("#241773", "#B3995D")) + theme(plot.title = element_text(hjust = .5, face = "bold")) # Play duration plot plot_time <- ggplot(sb_team_summary_stats, aes(x = posteam, y = playduration)) + geom_bar(aes(fill = posteam), stat = "identity") + geom_label(aes(x = posteam, y = playduration + .05, label = round(playduration,2)), size = 4, fontface = "bold") + labs(title = "Superbowl 47: Average Play Time Duration \n by Team", x = "Teams", y = "Average Play Duration") + scale_fill_manual(values = c("#241773", "#B3995D"))+ theme(plot.title = element_text(hjust = .5, face = "bold")) # Plotting the three charts together grid.arrange(plot_yards, plot_points, plot_time, ncol =2)
The above example is just one way to manipulate play-by-play dataframes to gather insights about a single football game. The season_play_by_play
function allows users to gather further insights on the season level. To read an example of season_play_by_play
, check out this analysis of Adrian Peterson's running tendancies posted on the CMU Sports Analytics club blog.
Another set of useful and interesting functions in nflscrapR are the detailed box score functions. These functions output game and season level statistics ranging from passing to defense to kick returning. The season level functions output dataframes with 57 columns while the game level function outputs a dataframe with 56 columns.
There are three different detailed box score functions. The are summarized as follows:
player_game
: outputs a dataframe for a single game with one line per player who recorded any
measurable statistics ranging from passing to kick returningseason_player_game
: outputs a dataframe for an entire season. One line per player per game. All measured statistics are included.agg_player_season
: outputs a dataframe with aggregate statistics for an entire season. There is one line per player in this dataframe.Note: The season_player_game
and agg_player_game
functions take a few minutes
to run.
Below, an example is shown using the season_player_game
function. Explored in these plots are a number of ways to visualize player data across the seasons. The first plot shows Joe Flacco's passing yards by game across the past 8 seasons. Also, plotted below,s are his pass attempts per game to visualize his evolution from a game manager to a solid (average?) starting quarterback.
data(playerstats09, playerstats10, playerstats11, playerstats12, playerstats13, playerstats14, playerstats15) allplayerstats <- dplyr::bind_rows(playerstats09, playerstats10, playerstats11, playerstats12, playerstats13, playerstats14, playerstats15, playerstats16)
# Loading all the statistics for each player by game from 2009-2016 # Note: The below code takes about 10 minutes to run. playerstats09 <- season_player_game(Season = 2009) playerstats10 <- season_player_game(Season = 2010) playerstats11 <- season_player_game(Season = 2011) playerstats12 <- season_player_game(Season = 2012) playerstats13 <- season_player_game(Season = 2013) playerstats14 <- season_player_game(Season = 2014) playerstats15 <- season_player_game(Season = 2015) playerstats16 <- season_player_game(Season = 2016) # Combining into one dataframe using bind_rows() allplayerstats <- dplyr::bind_rows(playerstats09, playerstats10, playerstats11, playerstats12, playerstats13, playerstats14, playerstats15, playerstats16) # Examining dimensions: dim(allplayerstats)
#### Using ggplot to explore Joe Flacco's passing tends ### # filter for Flacco and arrange by game data flacco_data <- dplyr::filter(allplayerstats, name == "J.Flacco") %>% arrange(date) # Add games played. Note in 2015 Flacco was injured in Week 10: flacco_data <- flacco_data %>% dplyr::group_by(Season) %>% dplyr::mutate(gamenumber = row_number(date)) # Creating Passing Yards Plot by game flacco_passyds_plot <- ggplot(flacco_data, aes(x = gamenumber, y = passyds)) + theme_bw() + geom_bar(stat = "identity", aes(alpha = passyds), fill = "#241773") + theme(strip.background = element_rect(fill = "black", size = 1.5), strip.text = element_text(color = "white", face = "bold"), plot.title = element_text(hjust = 0.5, face = "bold")) + geom_label(aes(x = gamenumber, y = ifelse(passyds > 70, passyds - 60, passyds + 10), label = passyds), size = 3) + ggtitle("Joe Flacco's Passing Yards by Game \n Across Seasons") + ylab("Passing Yards") + xlab("Game Number") + guides(fill = FALSE) + facet_wrap(~Season, ncol = 1)
The plot of Flacco's passing yards across games provides an interesting visual. By examining the plot you can see that he usually has a strong first game followed by a much weaker second game (excluding the 2015 season). You can also see that in 2015 Flacco was on pace to throw for a career high in yards in a season before he was derailed by a torn ACL.
flacco_passyds_plot
Now examined are Flacco's pass attempts by game. This allows visualization of his progression from a game manager to the solid (average?) quarterback that he his today. One can observe the following insights from the bar plots below:
# Creating Passing Attempts Plot by Game across Seasons ggplot(flacco_data, aes(x = gamenumber, y = pass.att)) + theme_bw() + geom_bar(stat = "identity", aes(alpha = pass.att), fill = "#241773") + theme(strip.background = element_rect(fill = "black", size = 1.5), strip.text = element_text(color = "white", face = "bold"), plot.title = element_text(hjust = 0.5, face = "bold")) + geom_label(aes(x = gamenumber, y = ifelse(pass.att > 60, pass.att - 5, pass.att), label = pass.att), size = 3) + ggtitle("Joe Flacco's Passing Attempts by Game \n Across Seasons") + ylab("Passing Attempts") + xlab("Game Number") + guides(fill = FALSE) + facet_wrap(~Season, ncol = 1)
player_game
and agg_player_season
Similar exploration of game level statistics can be done with with the player_game
function. The nice thing about the player_game
function is that it downloads, parses, and cleans the data very quickly so if you are interested in particular games, you can download the data instantaneously.
The agg_player_season
function allows users to visualize season total statistics which is beneficial for building running lists of statistics or easily calculating totals from the available range of seasons.
Note, there is also a simple box score function available which separates each of the
measured statistics into lists of dataframes. This is much more similar to what is seen in a standard box score. The function is named: simple_boxscore
Users can also download a data set of rosters for each team by season. The rosters include all players who recorded a measured statistic in the raw data. That is, if a player did not record either a passing, rushing, receiving, defensive, punt return, kick return, punting, or kicking statistic they will not be on the roster. To use the function, you must identify a season of interest and also identify the team abbreviations of your team of interest. To find team initials, load the nflteams data set stored in the package. The following code shows the roster for the Tennessee Titans in 2013:
# Find Titans Abbreviation # Load team name dataset data(nflteams) # Find Tennessee's abbreviation ten_abbr <- filter(nflteams, TeamName == "Tennessee Titans")$Abbr # Load the Titans Roster from 2013 tenroster2013 <- season_rosters(Season = 2013, TeamInt = ten_abbr) head(tenroster2013)
The drive_summary
function allows users to download datasets with information about each drive in a game. The input requires a GameID, and the outputted dataframe has
18 columns. The help documentation for the function describes each of the columns
in more detail.
Displayed below is a dataset of the drive summaries of the last game of the 2015 season between the Minnesota Viking and the Green Bay Packers.
# Drive summary from final game of 2015 season min_gb_drivesummary <- drive_summary(GameID = 2016010310) head(min_gb_drivesummary, 8)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.