README.md

nflscrapRextra

The goal of nflscrapRextra is to bundle many commonly used nflscrapR helper functions into an R package for easier use and better reproducibility. The original code comes from Lee Sharpe's GitHub and is labeled as such. Please note that I did not alter any of the code, but simply made it into a package that can be loaded rather than sourcing individual files.

Installation

You can install the development version of nflscrapRextra from GitHub with:

devtools::install_github("jthomasmock/nflscrapRextra")

PLEASE NOTE THE BELOW TEXT IS CREDITED TO LEE SHARPE

What does this code do?

Fix team abbreviations

Team abbreviations are used inconsistently, and the function fix_team_abbreviations() attempts to standardize them. In particular, it uses JAX to refer to the Jacksonville Jaguars, cleaning up old games that use JAC. Additionally, for some reason, the NFL uses LA to refer to the Los Angeles Rams, even though there are now two teams that play in Los Angeles. This function updates all of the Los Angeles Rams abbreviations to LAR. Every nflscrapR column where "team" is somewhere in the name of the column gets updated.

The fix_team_abbreviations() function has an optional argument old_to_new. When set to FALSE (the default), it only does the updates described above. When set to TRUE, it also goes back and updates older team abbreviations. For example, when FALSE, the San Diego Chargers are represented as SD. However, sometimes you want to group by franchise across seasons, and want past abbreviations to match the current ones. When TRUE, the old San Diego Chargers teams are set to LAC instead. You can make this modification easily with the code:

plays <- plays %>% fix_team_abbreviations(old_to_new=TRUE)
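
To make the behavior described above concrete, here is a minimal base R sketch of this kind of recoding. It is not the package's implementation: sketch_fix_abbr() is a hypothetical name, the toy data frame is made up, and only the mappings mentioned above (JAC to JAX, LA to LAR, and SD to LAC when old_to_new is TRUE) are included.

```r
# Hypothetical sketch, not the real fix_team_abbreviations().
sketch_fix_abbr <- function(df, old_to_new = FALSE) {
  # Every column with "team" somewhere in its name gets recoded
  team_cols <- grep("team", names(df), value = TRUE)
  for (col in team_cols) {
    x <- df[[col]]
    # which() drops NA comparisons, so missing teams stay NA
    x[which(x == "JAC")] <- "JAX"
    x[which(x == "LA")] <- "LAR"
    if (old_to_new) {
      # Only the franchise move named in the text above
      x[which(x == "SD")] <- "LAC"
    }
    df[[col]] <- x
  }
  df
}

# Toy play data (made up for illustration)
toy <- data.frame(
  posteam = c("JAC", "LA", "SD", NA),
  defteam = c("SD", "JAX", "LA", "NE"),
  yards_gained = c(3, 7, 0, 12),
  stringsAsFactors = FALSE
)

fixed_default <- sketch_fix_abbr(toy)
fixed_franchise <- sketch_fix_abbr(toy, old_to_new = TRUE)
```

With the default, SD is left alone; with old_to_new = TRUE it becomes LAC in both team columns, while yards_gained is never touched because its name does not contain "team".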

Add in columns about game data

If you don't care about these, you can safely ignore them. However, I think most people will find season and week particularly helpful. I know that I do.

Add in columns from Ben Baldwin's excellent nflscrapR tutorial

If you don't want to add these columns, you can set the input for this to FALSE at the top of the file. It's done through the function apply_baldwin_mutations().

Add team logos and team colors (NEW as of 2019-11-10)

For making NFL plots, often you want logos and colors. I usually just added them individually, but now I made a function apply_colors_and_logos() to just easily add them as follows:

team_epa <- plays %>%
  filter(season == max(season) & !is.na(epa)) %>%
  group_by(posteam) %>%
  summarize(mean_epa=mean(epa)) %>%
  ungroup() %>%
  apply_colors_and_logos()

This will add two columns:

  • use_color: The hexadecimal color value to use for that team. It will use their primary color unless it is quite dark, in which case it uses their secondary color.
  • logo: A URL that points to a transparent image file of the team's logo. Useful for geom_image plots.
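
The "primary unless quite dark" rule for use_color could be sketched as follows. This is only an illustration: pick_use_color() is a hypothetical helper, the brightness measure (a simple weighted sum of the RGB channels, without gamma correction) and the 0.2 cutoff are assumptions, and the colors are arbitrary examples rather than any team's real palette.

```r
# Hypothetical helper: keep the primary color unless it is dark,
# in which case fall back to the secondary color.
# The cutoff of 0.2 is an assumption chosen for illustration.
pick_use_color <- function(primary, secondary, cutoff = 0.2) {
  # col2rgb() returns a 3 x n matrix with rows red, green, blue (0-255)
  rgb <- grDevices::col2rgb(primary) / 255
  brightness <- 0.2126 * rgb["red", ] +
    0.7152 * rgb["green", ] +
    0.0722 * rgb["blue", ]
  unname(ifelse(brightness < cutoff, secondary, primary))
}

# Arbitrary example colors: black is too dark, gold is kept
use_color <- pick_use_color(
  primary = c("#000000", "#FFB612"),
  secondary = c("#FFFFFF", "#0000FF")
)
```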

The function has an additional optional argument to tell it which column in the existing data to use as the team abbreviation to join against. It will default to (in this order): team, posteam, defteam. It will raise an error if you don't specify a column and none of those columns are present.
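
The default lookup order described above can be sketched like this. The function and argument names (resolve_team_col, team_col) are hypothetical, since the actual argument name is not given here; only the order team, posteam, defteam and the error case come from the description.

```r
# Hypothetical sketch of choosing the join column.
resolve_team_col <- function(df, team_col = NULL) {
  if (!is.null(team_col)) {
    # An explicitly requested column must exist in the data
    if (!team_col %in% names(df)) {
      stop("Column '", team_col, "' not found in the data.")
    }
    return(team_col)
  }
  # Otherwise fall back to the documented defaults, in order
  candidates <- c("team", "posteam", "defteam")
  found <- candidates[candidates %in% names(df)]
  if (length(found) == 0) {
    stop("No team column given, and none of team, posteam, defteam found.")
  }
  found[1]
}

# Toy summary table (made-up values)
team_epa <- data.frame(
  posteam = c("JAX", "LAR"),
  mean_epa = c(0.05, -0.02),
  stringsAsFactors = FALSE
)
join_col <- resolve_team_col(team_epa)
```

Here join_col is "posteam" because there is no team column and posteam comes before defteam in the default order.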

Add in completion probability (NEW as of 2019-11-10)

Note: This requires the Ben Baldwin mutations from above

This uses a model designed by Ben Baldwin to estimate the completion probability of a pass based on air yards (how many yards from the line of scrimmage the receiver is when the pass arrives) and whether the pass is to the left, middle, or right side of the field. The completion probability is stored in a new column called cp. This column will be NA for plays where no pass was thrown, when there was no intended receiver (throwaways), or if the number of air yards is -10 or less (very rare).
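
The three NA conditions can be read as a simple eligibility filter. The sketch below only illustrates that filter, not the model itself; cp_eligible() and its argument names are hypothetical, and the toy values are made up.

```r
# Hypothetical sketch: TRUE where a play would get a cp value,
# FALSE where cp would be NA according to the rules above.
cp_eligible <- function(is_pass, receiver, air_yards) {
  is_pass &                 # a pass was thrown
    !is.na(receiver) &      # there was an intended receiver
    !is.na(air_yards) &     # air yards were recorded
    air_yards > -10         # air yards of -10 or less are excluded
}

# Toy plays (made up): a normal pass, a throwaway, a run, a deep backward pass
eligible <- cp_eligible(
  is_pass = c(TRUE, TRUE, FALSE, TRUE),
  receiver = c("A.Receiver", NA, NA, "B.Receiver"),
  air_yards = c(5, NA, NA, -12)
)
```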

Completion probability is used in calculating a metric called Completion Percentage Over Expected (CPOE), which is a highly stable metric from year to year for a given quarterback. Here's an example of code that calculates each quarterback's CPOE for the current season.

plays %>%
  filter(season == max(season) & !is.na(cp)) %>%
  group_by(name) %>%
  summarize(cpoe=100*mean(complete_pass-cp),count=n()) %>%
  ungroup() %>%
  filter(count >= 0.25*max(count)) %>%
  arrange(desc(cpoe))
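
As a quick check of the arithmetic in the summarize() step, CPOE is 100 times the mean of complete_pass minus cp. The four passes below are made-up numbers for illustration only.

```r
# Toy passes (made-up values): 1 = complete, 0 = incomplete
complete_pass <- c(1, 1, 0, 1)
# Made-up completion probabilities for the same four passes
cp <- c(0.8, 0.6, 0.5, 0.3)

# Each pass contributes (actual - expected); average, then scale
# to percentage points
cpoe <- 100 * mean(complete_pass - cp)
```

In this toy case the passer completed 75% of passes against an expected 55%, so cpoe comes out to about 20.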

In creating the cp column, I used Ben's model trained as follows:

  • 2009: This is the earliest season, so no previous training data exists. cp is NA for all 2009 plays.
  • 2010: This is trained from 2009 data.
  • 2011: This is trained from 2009 and 2010 data.
  • 2012+: Moving forward, each season is trained using the prior three seasons.
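
The training schedule above amounts to "up to three prior seasons, never earlier than 2009". A minimal sketch of that rule, with training_seasons() as a hypothetical helper name:

```r
# Hypothetical helper: which seasons feed the model for a given season?
training_seasons <- function(season, first_season = 2009L, window = 3L) {
  # The earliest season has no prior data to train on
  if (season <= first_season) {
    return(integer(0))
  }
  prior <- seq(first_season, season - 1L)
  # Keep at most the last `window` prior seasons
  tail(prior, window)
}

train_2009 <- training_seasons(2009L)  # no seasons
train_2010 <- training_seasons(2010L)  # 2009
train_2011 <- training_seasons(2011L)  # 2009, 2010
train_2019 <- training_seasons(2019L)  # 2016, 2017, 2018
```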

Add in columns for series data

If you don't want to add these columns, you can set the input for this to FALSE at the top of the file. It's done through the function apply_series_data().

This is something I've been working on for a while. (If you discover bugs, please let me know!) This code allows you to examine the play makeup of an individual series, and look at whether it succeeded.

A new series begins every time the offense receives a new first down. This can happen because a team gained enough yards on the previous play to advance the sticks, because a defensive penalty resulted in a first down, after a change in possession, or following a kickoff or punt. Much like the nflscrapR drive column, my new series column starts at 1 for each game, and increments each time there is a new series. Some plays will have NA when they aren't defined as part of a series, such as kickoffs or timeouts.
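
The numbering scheme can be illustrated on a single toy game. This is a simplified sketch, not the code behind apply_series_data(): number_series() is a hypothetical name, and it assumes you already have two logical flags per play, one marking plays that begin a new set of downs and one marking plays that belong to a series at all.

```r
# Hypothetical sketch: number the series within one game.
number_series <- function(starts_series, in_series) {
  # Count how many series have started up to and including each play
  series <- cumsum(starts_series & in_series)
  # Plays such as kickoffs or timeouts are not part of any series
  series[!in_series] <- NA
  series
}

# Toy game (made up): kickoff, three plays, a new first down,
# a timeout, then one more play in the second series
series <- number_series(
  starts_series = c(FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE),
  in_series     = c(FALSE, TRUE, TRUE,  TRUE,  TRUE, FALSE, TRUE)
)
```

The kickoff and the timeout get NA, the first three plays are series 1, and the remaining plays are series 2.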

A series is defined as a success if the team either scores a touchdown, or obtains a new first down in that series (creating another series). The new first down can be obtained through yardage or through a defensive penalty; either counts as a success. If the series results in a change of possession, a field goal attempt, or a punt, it is considered a failure. The new column series_success will report whether the current series ended up successful or not. This column will be NA when either there isn't a series (so series is also NA), or when defining success for a series does not make sense. This occurs when the series contains a quarterback spike or kneel, or when the series is ended by the half/game ending rather than a "clean" ending to the series.

  • series: What series number is this for this game? (Starts at 1 and increments.) NA for plays not in a series.
  • series_success: Did this series end in success? 1 when scoring a touchdown, getting enough yards for another first down, or a defensive penalty resulting in a first down. 0 when there is a change in possession, a field goal attempt, or a punt. Is instead NA if the series contains a QB spike or kneel, or if the series ends the half or game and does not have a clean success or failure.
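
The success rules in the list above can be written as a small lookup. This sketch assumes each series has already been reduced to a single ending label; the labels, the has_spike_or_kneel flag, and the name classify_series() are all hypothetical and are not columns produced by the package.

```r
# Hypothetical sketch: map how a series ended to series_success.
classify_series <- function(ending, has_spike_or_kneel = FALSE) {
  success_endings <- c("touchdown", "first_down", "defensive_penalty_first_down")
  failure_endings <- c("change_of_possession", "field_goal_attempt", "punt")

  # Anything unrecognized (e.g. the half ending) stays NA
  out <- rep(NA_integer_, length(ending))
  out[ending %in% success_endings] <- 1L
  out[ending %in% failure_endings] <- 0L
  # A spike or kneel anywhere in the series overrides the result
  out[has_spike_or_kneel] <- NA_integer_
  out
}

# Toy series endings (made up)
series_success <- classify_series(
  ending = c("touchdown", "punt", "end_of_half", "first_down", "field_goal_attempt"),
  has_spike_or_kneel = c(FALSE, FALSE, FALSE, TRUE, FALSE)
)
```

Note that a field goal attempt maps to 0 whether or not the kick is good, matching the definition above.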

A frequently asked question is why a field goal attempt is counted as a failure. The goal of the series is to either score a touchdown or keep the drive moving so you can score a touchdown on a later series in the drive. A field goal attempt is in this sense a failure. Scoring 3 points is better than scoring 0, but ultimately the goal of a drive is a touchdown. This is also why the attempt is a series failure, regardless of whether the field goal attempt results in a score.

When you first execute this, it can take a long time to run. It will report its progress through the vast amount of game data so you know it isn't hung. But after the initial execution, the run time isn't bad when just applying it to new games that have finished since the last execution.
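
The idea of only processing games that are new since the last run can be sketched with a set difference on game identifiers. This is an assumption about how such incremental processing might look, not a description of the package's internals; games_to_process() and the toy ids are hypothetical.

```r
# Hypothetical sketch: which games still need series data?
games_to_process <- function(all_game_ids, processed_game_ids) {
  # Keep only ids that have not been handled in an earlier run
  setdiff(all_game_ids, processed_game_ids)
}

# Toy ids (made up): two games were processed last time, one is new
new_games <- games_to_process(
  all_game_ids = c("game_1", "game_2", "game_3"),
  processed_game_ids = c("game_1", "game_2")
)
```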

Note: This is broken for the two games the 2013 Browns hosted in Week 12 and Week 13 due to the yards_gained column in nflscrapR not having the necessary data. Both columns just show NA for those games.



jthomasmock/nflscrapRextra documentation built on Dec. 31, 2019, 12:55 a.m.