
Introducing fcscrapR

The goal of fcscrapR is to give R users quick access to the commentary for each soccer game available on ESPN. The commentary data includes basic events such as shot attempts, substitutions, fouls, cards, corners, and video reviews, along with information about the players involved. The data can be accessed in-game as ESPN updates its match commentary. This package was created to put data in the hands of soccer fans so they can do their own analysis and contribute to reproducible metrics.

Installation

You can install fcscrapR from GitHub with:

# install.packages("devtools")
devtools::install_github("ryurko/fcscrapR")

Example game scraping

Here’s an example of how to scrape a game using fcscrapR. The workhorse function of the package is scrape_commentary(), which takes a game id. The game id appears in the URL for a game, such as the group stage match between Serbia and Costa Rica in the 2018 World Cup: http://www.espn.com/soccer/commentary?gameId=498194

Using this game id, we can easily grab the commentary data frame:

library(fcscrapR)
#> Loading required package: magrittr
srb_crc_commentary <- scrape_commentary(498194)
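
If you have a list of ESPN URLs rather than bare ids, the id can be pulled out with base R. This is a minimal sketch: extract_game_id() is a hypothetical helper, not part of fcscrapR, and it assumes the URL contains a gameId= query parameter like the one shown above.

```r
# Hypothetical helper (not part of fcscrapR): pull the gameId value out of an ESPN URL.
extract_game_id <- function(url) {
  # Find the run of digits that follows "gameId="
  match <- regmatches(url, regexpr("gameId=[0-9]+", url))
  if (length(match) == 0) {
    stop("No gameId found in URL: ", url)
  }
  # Strip the parameter name and convert the digits to an integer
  as.integer(sub("gameId=", "", match, fixed = TRUE))
}

game_id <- extract_game_id("http://www.espn.com/soccer/commentary?gameId=498194")
game_id
#> [1] 498194
```

The resulting value can then be passed to scrape_commentary() in place of a hand-typed id.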

Check out the documentation for scrape_commentary() for a description of all of the columns in the commentary data:

colnames(srb_crc_commentary)
#>  [1] "game_id"                 "commentary"             
#>  [3] "match_time"              "team_one"               
#>  [5] "team_two"                "team_one_score"         
#>  [7] "team_two_score"          "half_end"               
#>  [9] "match_end"               "half_begins"            
#> [11] "shot_attempt"            "penalty_shot"           
#> [13] "shot_result"             "shot_by_player"         
#> [15] "shot_by_team"            "shot_with"              
#> [17] "shot_where"              "net_location"           
#> [19] "assist_by_player"        "foul"                   
#> [21] "foul_by_player"          "foul_by_team"           
#> [23] "follow_set_piece"        "assist_type"            
#> [25] "follow_corner"           "offside"                
#> [27] "offside_team"            "offside_player"         
#> [29] "offside_pass_from"       "shown_card"             
#> [31] "card_type"               "card_player"            
#> [33] "card_team"               "video_review"           
#> [35] "video_review_event"      "video_review_result"    
#> [37] "delay_in_match"          "delay_team"             
#> [39] "free_kick_won"           "free_kick_player"       
#> [41] "free_kick_team"          "free_kick_where"        
#> [43] "corner"                  "corner_team"            
#> [45] "corner_conceded_by"      "substitution"           
#> [47] "sub_injury"              "sub_team"               
#> [49] "sub_player"              "replaced_player"        
#> [51] "penalty"                 "team_drew_penalty"      
#> [53] "player_drew_penalty"     "player_conceded_penalty"
#> [55] "team_conceded_penalty"   "half"                   
#> [57] "comment_id"              "stoppage_time"          
#> [59] "team_one_penalty_score"  "team_two_penalty_score" 
#> [61] "match_time_numeric"

We can quickly make a chart showing the number of shot attempts for each team, broken down by outcome:

# install.packages("ggplot2")
library(ggplot2)
srb_crc_commentary %>%
  dplyr::filter(!is.na(shot_result)) %>%
  ggplot(aes(x = shot_by_team, fill = shot_result)) +
  geom_bar() + labs(x = "Team", y = "Count", 
                    fill = "Shot result",
                    title = "Distribution of shot attempts for Costa Rica vs Serbia by result",
                    caption = "Data from ESPN, accessed with fcscrapR") +
  scale_fill_manual(values = c("darkorange", "darkblue", "darkred", "darkcyan")) +
  theme_bw()
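
The same comparison can be read as a table of counts rather than a chart. The sketch below uses base R on a toy data frame so that it runs without a network connection. The rows and the shot_result labels are made up for illustration and are not real match data; only the column names shot_by_team and shot_result come from the commentary data.

```r
# Toy stand-in for the scraped commentary: made-up rows, NOT real match data.
toy_commentary <- data.frame(
  shot_by_team = c("Serbia", "Serbia", "Costa Rica", NA, "Costa Rica"),
  shot_result  = c("goal", "missed", "saved", NA, "missed"),
  stringsAsFactors = FALSE
)

# Keep only the rows that describe a shot, as the filter() call above does
shots <- toy_commentary[!is.na(toy_commentary$shot_result), ]

# Cross-tabulate team against shot result
shot_counts <- table(team = shots$shot_by_team, result = shots$shot_result)
shot_counts
```

Swapping toy_commentary for srb_crc_commentary should give the counts behind the bar chart.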

Gather game ids

Currently, the only function available for getting game ids is scrape_scoreboard_ids(), which pulls the game ids for all soccer matches on ESPN’s soccer scoreboard for a given league or tournament. You must use a league or tournament that has an associated URL in the league_url_data table provided in fcscrapR:

# install.packages("pander")
league_url_data %>%
  head() %>%
  pander::pander()

| name | url |
| :---: | :---: |
| show all leagues | http://www.espn.com/soccer/scoreboard/_/league/all |
| fifa world cup | http://www.espn.com/soccer/scoreboard/_/league/fifa.world |
| uefa champions league | http://www.espn.com/soccer/scoreboard/_/league/uefa.champions |
| uefa europa league | http://www.espn.com/soccer/scoreboard/_/league/uefa.europa |
| english premier league | http://www.espn.com/soccer/scoreboard/_/league/eng.1 |
| spanish primera división | http://www.espn.com/soccer/scoreboard/_/league/esp.1 |
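
Because the name has to match an entry in the table, it can help to check it before scraping. The following is only a sketch: lookup_league_url() is a hypothetical helper, not an fcscrapR function, and the small leagues data frame is a stand-in built from three of the rows shown above. It assumes the table has name and url columns and that names are stored in lower case.

```r
# Stand-in for the first rows of league_url_data, copied from the table above.
leagues <- data.frame(
  name = c("show all leagues", "fifa world cup", "uefa champions league"),
  url  = c("http://www.espn.com/soccer/scoreboard/_/league/all",
           "http://www.espn.com/soccer/scoreboard/_/league/fifa.world",
           "http://www.espn.com/soccer/scoreboard/_/league/uefa.champions"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the scoreboard url for a name, or fail loudly.
lookup_league_url <- function(league_name, league_table) {
  # Assumes names are stored in lower case, as in the rows shown above
  hit <- league_table$url[league_table$name == tolower(league_name)]
  if (length(hit) != 1) {
    stop("Unknown league or tournament: ", league_name)
  }
  hit
}

world_cup_url <- lookup_league_url("FIFA World Cup", leagues)
world_cup_url
#> [1] "http://www.espn.com/soccer/scoreboard/_/league/fifa.world"
```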

Here’s an example of grabbing the World Cup games from June 20th, 2018:

scrape_scoreboard_ids(scoreboard_name = "fifa world cup", 
                      game_date = "2018-06-20") %>%
  pander::pander()
#> Loading required package: XML
#> Loading required package: RCurl
#> Loading required package: bitops

| game_id | team_one | team_two |
| :---: | :---: | :---: |
| 498185 | Portugal | Morocco |
| 498184 | Uruguay | Saudi Arabia |
| 498183 | Iran | Spain |
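
The ids returned here can be fed back into scrape_commentary() one game at a time. A rough sketch of that loop follows. scrape_many() and fake_scraper() are hypothetical names introduced for illustration, not part of fcscrapR, and the sketch assumes every scraped game returns a data frame with the same columns so that the results can be stacked with rbind().

```r
# Hypothetical helper (not part of fcscrapR): scrape several games and stack them.
# `scraper` is any function that takes one game id and returns a data frame;
# with fcscrapR loaded it would be scrape_commentary.
scrape_many <- function(game_ids, scraper) {
  results <- lapply(game_ids, function(id) {
    # A failed game yields NULL instead of aborting the whole loop
    tryCatch(scraper(id), error = function(e) {
      warning("Skipping game ", id, ": ", conditionMessage(e))
      NULL
    })
  })
  # Drop failed games, then bind the rest row-wise
  results <- Filter(Negate(is.null), results)
  do.call(rbind, results)
}

# Offline stand-in scraper so the sketch runs without a network connection
fake_scraper <- function(id) {
  if (id == 498184) stop("page not available")
  data.frame(game_id = id, commentary = "placeholder", stringsAsFactors = FALSE)
}

all_games <- suppressWarnings(scrape_many(c(498185, 498184, 498183), fake_scraper))
all_games$game_id
#> [1] 498185 498183
```

With the package loaded, passing the game_id column from scrape_scoreboard_ids() as game_ids and scrape_commentary as scraper should produce one combined commentary data frame for the day.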

Acknowledgements

Many thanks to the sports analytics community on Twitter for guiding me to various resources of soccer data. Big thanks to Brendan Kent for pointing me to the commentary data.



ryurko/fcscrapR documentation built on Jan. 22, 2020, 1:01 p.m.