    knitr::opts_chunk$set(
      collapse = TRUE,
      comment = "#>",
      fig.path = "../man/figures/introduction-",
      out.width = "100%"
    )

    library(fastRhockey)
    library(ggplot2)
    library(dplyr)
    library(janitor)
fastRhockey is an R package designed to pull play-by-play (and boxscore) data from the newest version of the Professional Women's Hockey League (PWHL) website. There have been a few scrapers for the PHF (formerly the NWHL) in the past, but they have all been deprecated since the formation of the PWHL and the accompanying change of websites.
With the first season of the league kicking off on January 1st, and games being broadcast on ESPN+, this package was created to allow access to play-by-play data to continue pushing women's hockey analytics forward.
In Spring of 2021, the Big Data Cup and the data they made available revolutionized what we were able to do, thanks to the detailed play-by-play data for the season and the x/y location data. That wave continued with the inaugural WHKYHAC conference in July, which produced some amazing conversations and projects in the women's hockey space.
In the past, the lack of data and poor access to data have been the biggest barriers to entry in women's hockey analytics, barriers that this package intends to lower.
You can install the released version of fastRhockey from GitHub with:
    # Install via the pacman package:
    if (!requireNamespace('pacman', quietly = TRUE)){
      install.packages('pacman')
    }
    pacman::p_load_current_gh("sportsdataverse/fastRhockey", dependencies = TRUE, update = TRUE)
If you would prefer the devtools installation:
    # Alternatively, install with the devtools package:
    if (!requireNamespace('devtools', quietly = TRUE)){
      install.packages('devtools')
    }
    devtools::install_github(repo = "sportsdataverse/fastRhockey")
Once the package has been installed, there's a ton of stuff you can do. Let's start by finding a game we're interested in, say, the 2021 Isobel Cup Championship that the Boston Pride won.
    # input the season that you're interested in looking up the schedule for
    phf_schedule(season = 2021) %>%
      dplyr::filter(game_type == "Playoffs") %>%
      dplyr::filter(home_team_short == "MIN" & away_team_short == "BOS") %>%
      dplyr::select(game_id, date_group, facility, game_type,
                    home_team, away_team, home_score, away_score, winner)
A couple of quick filters/selects later and we've pared down the data into a very manageable return. We can see that the Boston Pride beat the Minnesota Whitecaps 4-3 at Warrior Ice Arena on March 27th, 2021. The other important column in this return is the game_id column.
Let's take that game_id and plug it into another fastRhockey function, this time using the phf_team_box function to pull the boxscore data from this game.
    # game_id taken from the schedule lookup above
    x <- 379254
    box <- phf_team_box(game_id = x)
    box %>%
      dplyr::select(game_id, team, total_scoring, total_shots,
                    successful_power_play, power_play_opportunities,
                    faceoff_percent, takeaways)
Once again, I've selected some specific columns, but this is an example of the data that is returned by the phf_team_box function! We have counting stat data on shots/goals, both aggregated and by period, power play data, faceoff data, and how often a team takes/gives away the puck. It's definitely helpful data and I believe that there are some really fun projects that can be done with just the phf_team_box function, but the really good stuff is still coming.
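As a small illustration of what the boxscore columns allow, here is a minimal sketch that turns the two power play columns selected above into a conversion percentage. It runs on a made-up two-row data frame rather than real phf_team_box output, and it assumes the counts may arrive as text, so it coerces them first. It uses base R only so that it runs without any packages.

```r
# Toy stand-in for two rows of phf_team_box() output; the values are made up.
box_toy <- data.frame(
  team = c("Team A", "Team B"),
  successful_power_play = c("1", "0"),
  power_play_opportunities = c("4", "0"),
  stringsAsFactors = FALSE
)

# Coerce defensively in case the counts arrive as text (an assumption).
pp_goals <- as.numeric(box_toy$successful_power_play)
pp_chances <- as.numeric(box_toy$power_play_opportunities)

# Conversion rate in percent; NA when a team had no opportunities.
box_toy$power_play_pct <- ifelse(pp_chances > 0, 100 * pp_goals / pp_chances, NA_real_)

print(box_toy[, c("team", "power_play_pct")])
```

Returning NA for a team with zero opportunities keeps a 0-for-0 night from being read as a failed power play.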
Turn your attention to phf_pbp, the function that was created to return PHF play-by-play data for a given game (i.e. the whole reason that fastRhockey exists). It's a similar format to the boxscore function where the only input necessary is the game_id that you want.
    # time how long a single game takes to load
    a <- Sys.time()
    pbp <- phf_pbp(game_id = x)
    Sys.time() - a
Loading a single game should take about 5 seconds. Once it does, it's time to have some fun. The phf_pbp function returns a wide data frame (ncol(pbp) gives the exact column count), some of it "boring" data, like who the teams are. But then you get to the columns that look at how much time is remaining in a period, what the home skater vs away skater numbers are, what event occurred, who was involved, and so on.
dplyr::glimpse(pbp)
There's data on who took a shot (if a shot occurs), as well as who the primary (and secondary) assisters were and who the goalie was. Penalties are recorded + the time assigned for a trip to the box.
One of the more interesting findings from the PHF set-up was that they ID all five offensive players on the ice when a goal is scored, so that's available as well. Unfortunately, it's hard to derive any sort of plus/minus stat from this since it's only the offensive players at the time of a goal. If the offensive and defensive lineups were provided we could create a +/-, but that remains out of reach for now.
Here's an example of the things that one can now build with the play-by-play data that is generated from phf_pbp. This is a quick graph showing cumulative shot attempts by point in the game for Boston and Minnesota.
    pbp %>%
      # flag every shot attempt, then accumulate per team
      dplyr::mutate(shot = ifelse(play_type %in% c("PP Goal", "Goal", "Pen Shot", "Shot", "Shot BLK"), 1, 0)) %>%
      dplyr::group_by(team) %>%
      dplyr::mutate(total_shots = cumsum(shot)) %>%
      ggplot() +
      # linewidth replaces size for lines in ggplot2 >= 3.4.0
      geom_line(aes(x = sec_from_start, y = total_shots, color = team), linewidth = 2) +
      scale_color_manual(values = c("Boston Pride" = "#b18c1e", "Minnesota Whitecaps" = "#1c449c")) +
      labs(y = "Total Shots",
           title = "Boston Pride vs Minnesota Whitecaps - 3/27/2021",
           subtitle = "Total Shots by Minute of Game") +
      theme_minimal() +
      theme(
        panel.grid.minor = element_blank(),
        axis.line = element_line(linewidth = 1),
        legend.position = "bottom",
        axis.title.x = element_blank(),
        axis.text.x = element_text(size = 11),
        plot.title = element_text(face = "bold", hjust = 0.5, size = 14),
        plot.subtitle = element_text(face = "italic", hjust = 0.5, size = 12),
        legend.title = element_blank()
      ) +
      scale_x_continuous(breaks = c(1200, 2400, 3600, 3800),
                         labels = c("End 1st", "End 2nd", "End 3rd", " ")) +
      scale_y_continuous(limits = c(0, 40))
It's a simple graph, but one that can easily help illustrate game flow. The Pride's shots came in bunches, taking a ton about halfway through the first and third periods respectively. Minnesota started the game slowly, but their shots came fairly consistently throughout the game.
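To put numbers on "shots came in bunches", one option is to bin shot attempts by period. The sketch below uses a made-up toy data frame with the same three columns the plot relies on (team, play_type, sec_from_start) and the same shot-attempt definition. The 1200-second period length matches the axis breaks in the plot, and lumping everything after 3600 seconds into "OT" is an assumption made for illustration.

```r
# Toy play-by-play rows; real phf_pbp() output has many more columns.
pbp_toy <- data.frame(
  team = c("Team A", "Team B", "Team A", "Team A", "Team B", "Team A"),
  play_type = c("Shot", "Faceoff", "Goal", "Shot BLK", "Shot", "Shot"),
  sec_from_start = c(95, 300, 1250, 1300, 2500, 3550),
  stringsAsFactors = FALSE
)

# Same shot-attempt definition as the cumulative plot.
shot_types <- c("PP Goal", "Goal", "Pen Shot", "Shot", "Shot BLK")
shots <- pbp_toy[pbp_toy$play_type %in% shot_types, ]

# 1200 seconds per regulation period; anything later is lumped into "OT".
shots$period <- cut(
  shots$sec_from_start,
  breaks = c(0, 1200, 2400, 3600, Inf),
  labels = c("1st", "2nd", "3rd", "OT"),
  right = FALSE
)

# Team-by-period table of shot attempts.
shots_by_period <- table(shots$team, shots$period)
print(shots_by_period)
```

Narrower breaks (for example every 300 seconds) would show bunching within a period rather than between periods.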
There's so much more that can be explored from this play-by-play data, whether you want to explore how winning a faceoff leads to a shot attempt or the chaos that can follow giveaways.
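As a starting point for the faceoff question, here is a rough sketch that checks whether the team credited on a faceoff records a shot attempt within a fixed window. Everything about the data is an assumption: the toy rows are made up, the "Faceoff" and "Turnover" labels are hypothetical, and the sketch assumes the team column on a faceoff row names the winner. Check the real play_type values and columns in your own phf_pbp output before adapting it.

```r
# Toy play-by-play rows; labels and values are made up for illustration.
pbp_toy <- data.frame(
  team = c("Team A", "Team A", "Team B", "Team B", "Team A", "Team B"),
  play_type = c("Faceoff", "Shot", "Faceoff", "Turnover", "Faceoff", "Shot"),
  sec_from_start = c(0, 8, 100, 105, 200, 260),
  stringsAsFactors = FALSE
)

shot_types <- c("PP Goal", "Goal", "Pen Shot", "Shot", "Shot BLK")
window <- 10  # seconds after the faceoff to look for a shot attempt

faceoffs <- pbp_toy[pbp_toy$play_type == "Faceoff", ]
shots <- pbp_toy[pbp_toy$play_type %in% shot_types, ]

# For each faceoff, TRUE if the same team has a shot attempt within the window.
faceoffs$shot_followed <- vapply(seq_len(nrow(faceoffs)), function(i) {
  gap <- shots$sec_from_start - faceoffs$sec_from_start[i]
  any(shots$team == faceoffs$team[i] & gap >= 0 & gap <= window)
}, logical(1))

# Share of faceoffs followed by a quick shot attempt.
quick_shot_rate <- mean(faceoffs$shot_followed)
print(faceoffs[, c("team", "sec_from_start", "shot_followed")])
```

The window length is a modeling choice; a real analysis would also want to stop the search at the next stoppage rather than rely on time alone.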
That's a quick primer on the main functions of the package. phf_schedule returns schedule information and game_ids, which can be used in phf_team_box or phf_pbp to return boxscore or play-by-play data. phf_game_all wraps the boxscore/play-by-play and several other game summary tables into one and returns a list with the dataframes: plays, team_box, skaters, goalies, game_details, scoring_summary, shootout_summary, penalty_summary, officials, team_staff, timeouts.
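The schedule-to-game workflow extends naturally to more than one game. The sketch below shows one way to loop over several game_ids and stack the results while skipping games that fail to load. collect_games and fake_loader are hypothetical names introduced here, not part of fastRhockey; in real use the loader could be something like function(id) phf_pbp(game_id = id). It assumes that every successful call returns a data frame with the same columns and that a failed call raises an error.

```r
# Hypothetical helper: load several games and stack them into one data frame.
collect_games <- function(game_ids, loader) {
  results <- lapply(game_ids, function(id) {
    tryCatch(
      loader(id),
      error = function(e) {
        # Report the failure and keep going with the remaining games.
        message("Skipping game ", id, ": ", conditionMessage(e))
        NULL
      }
    )
  })
  # Drop failed games, then stack the rest (assumes matching columns).
  results <- Filter(Negate(is.null), results)
  if (length(results) == 0) return(data.frame())
  do.call(rbind, results)
}

# Stub loader so the sketch runs offline; game 2 fails on purpose.
fake_loader <- function(id) {
  if (id == 2) stop("no data for this game")
  data.frame(game_id = id, play_type = c("Faceoff", "Shot"))
}

all_games <- collect_games(c(1, 2, 3), fake_loader)
print(all_games)
```

Passing the loader in as an argument keeps the loop testable without a network connection and lets the same helper wrap phf_team_box just as easily.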
The last function that may be of some use is phf_league_info, which essentially pulls a lot of background info on the league and the IDs that are used. The output from this function gets wrapped into phf_schedule, which is its main purpose.
If you look with fastRhockey::, there are more functions available, but those are helper functions to pull raw data (phf_game_raw) and then to process the raw data into a usable format (helper_phf____).
To cite the fastRhockey R package in publications, use:
BibTeX Citation
    @misc{howell_gilani_fastRhockey_2021,
      author = {Ben Howell and Saiem Gilani},
      title = {fastRhockey: The SportsDataverse's R Package for Hockey Data.},
      url = {https://fastRhockey.sportsdataverse.org/},
      year = {2021}
    }