README.md

This package is used to analyze Serie A soccer (Calcio) data. It creates an accessible R data-frame with information about match results, as well as team stats, Elo ratings, and overall standings. This data-frame is used to generate visualizations on a Shiny App: https://datavisr.shinyapps.io/calcior/

Source Data

The data is sourced from https://github.com/openfootball, which contains the results of all Serie A matches since the 2013/14 season. The data is extracted using Ruby with the sportdb gem. Running the extraction creates a local SQLite database, sport.db, which can then be read into R.

#> List of 4
#>  $ teams :Classes 'tbl_df', 'tbl' and 'data.frame':  28 obs. of  10 variables:
#>   ..$ id        : int [1:28] 1 2 3 4 5 6 7 8 9 10 ...
#>   ..$ key       : chr [1:28] "milan" "inter" "lazio" "roma" ...
#>   ..$ title     : chr [1:28] "Milan" "Inter" "Lazio" "Roma" ...
#>   ..$ code      : chr [1:28] "MIL" "INT" "LAZ" "ROM" ...
#>   ..$ synonyms  : chr [1:28] "AC Milan|Associazione Calcio Milan" "Internazionale|FC Internazionale Milano" "SS Lazio|Società Sportiva Lazio|Lazio Roma" "AS Roma|Associazione Sportiva Roma" ...
#>   ..$ country_id: int [1:28] 117 117 117 117 117 117 117 117 117 117 ...
#>   ..$ club      : chr [1:28] "t" "t" "t" "t" ...
#>   ..$ since     : int [1:28] NA NA NA NA NA NA NA NA NA NA ...
#>   ..$ web       : chr [1:28] NA NA NA NA ...
#>   ..$ national  : chr [1:28] "f" "f" "f" "f" ...
#>  $ events:Classes 'tbl_df', 'tbl' and 'data.frame':  4 obs. of  8 variables:
#>   ..$ id       : int [1:4] 1 2 3 4
#>   ..$ key      : chr [1:4] "it.2016/17" "it.2015/16" "it.2014/15" "it.2013/14"
#>   ..$ league_id: int [1:4] 1 1 1 1
#>   ..$ season_id: int [1:4] 5 6 7 8
#>   ..$ start_at : chr [1:4] "2016-08-21" "2015-08-22" "2014-08-30" "2013-08-24"
#>   ..$ team3    : chr [1:4] "t" "t" "t" "t"
#>   ..$ sources  : chr [1:4] "seriea-i,seriea-ii" "seriea-i,seriea-ii" "seriea-i,seriea-ii" "seriea-i,seriea-ii"
#>   ..$ config   : chr [1:4] "seriea.yml" "seriea.yml" "seriea.yml" "seriea.yml"
#>  $ games :Classes 'tbl_df', 'tbl' and 'data.frame':  1523 obs. of  13 variables:
#>   ..$ id       : int [1:1523] 1 2 3 4 5 6 7 8 9 10 ...
#>   ..$ round_id : int [1:1523] 1 1 1 1 1 1 1 1 1 1 ...
#>   ..$ pos      : int [1:1523] 1 2 3 4 5 6 7 8 9 10 ...
#>   ..$ team1_id : int [1:1523] 4 7 10 17 11 16 5 1 15 20 ...
#>   ..$ team2_id : int [1:1523] 13 12 3 19 2 6 18 8 14 9 ...
#>   ..$ play_at  : chr [1:1523] "2016-08-20 12:00:00.000000" "2016-08-20 12:00:00.000000" "2016-08-21 12:00:00.000000" "2016-08-21 12:00:00.000000" ...
#>   ..$ postponed: chr [1:1523] "f" "f" "f" "f" ...
#>   ..$ knockout : chr [1:1523] "f" "f" "f" "f" ...
#>   ..$ home     : chr [1:1523] "t" "t" "t" "t" ...
#>   ..$ score1   : int [1:1523] 4 2 3 1 2 0 3 3 0 2 ...
#>   ..$ score2   : int [1:1523] 0 1 4 0 0 1 1 2 1 2 ...
#>   ..$ winner   : int [1:1523] 1 1 2 1 1 2 1 1 2 0 ...
#>   ..$ winner90 : int [1:1523] 1 1 2 1 1 2 1 1 2 0 ...
#>  $ rounds:Classes 'tbl_df', 'tbl' and 'data.frame':  154 obs. of  8 variables:
#>   ..$ id      : int [1:154] 1 2 3 4 5 6 7 8 9 10 ...
#>   ..$ event_id: int [1:154] 1 1 1 1 1 1 1 1 1 1 ...
#>   ..$ title   : chr [1:154] "1^ Giornata" "Pescara         1-2 Fiorentina  (19.Giornata)   02.02." "3^ Giornata" "4^ Giornata" ...
#>   ..$ pos     : int [1:154] 1 2 3 4 5 6 7 8 9 10 ...
#>   ..$ knockout: chr [1:154] "f" "f" "f" "f" ...
#>   ..$ start_at: chr [1:154] "2016-08-20" "2016-08-27" "2016-09-10" "2016-09-16" ...
#>   ..$ end_at  : chr [1:154] "2016-08-21" "2016-08-28" "2016-09-12" "2016-09-18" ...
#>   ..$ auto    : chr [1:154] "t" "t" "t" "t" ...
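
The README does not show the code that produced the listing above. The following is a minimal sketch of one way to load the four tables, assuming the DBI and RSQLite packages are installed, that sport.db sits in the working directory, and that the SQLite table names match the list names shown. The helper read_sport_tables is hypothetical and not part of this package.

```r
library(DBI)

# Hypothetical helper: read the named tables from an open DBI connection
# into a named list of data frames.
read_sport_tables <- function(con,
                              tables = c("teams", "events", "games", "rounds")) {
  out <- lapply(tables, function(tbl) DBI::dbReadTable(con, tbl))
  names(out) <- tables
  out
}

# Usage, guarded so the script still runs when sport.db is absent.
if (file.exists("sport.db")) {
  con <- DBI::dbConnect(RSQLite::SQLite(), "sport.db")
  sport <- read_sport_tables(con)
  DBI::dbDisconnect(con)
  str(sport, max.level = 1)
}
```

DBI::dbReadTable returns plain data frames; wrapping each element in tibble::as_tibble() would give the tbl_df class seen in the listing.
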

Processed Data

The source data is transformed from a set of relational tables into a single data-frame, serie_a, which uses list columns of data-frames to maintain the relationship of teams and matches to match_days (rounds) and season. Summary data and Elo ratings are also calculated (details below).

serie_a
#> # A tibble: 4 × 6
#>   season           results match_days_complete             teams
#>    <dbl>            <list>               <dbl>            <list>
#> 1      1 <tibble [38 × 2]>                  38 <tibble [20 × 1]>
#> 2      2 <tibble [38 × 2]>                  38 <tibble [20 × 1]>
#> 3      3 <tibble [38 × 2]>                  38 <tibble [20 × 1]>
#> 4      4 <tibble [38 × 2]>                  32 <tibble [20 × 1]>
#> # ... with 2 more variables: ratings <list>, standings <list>
season:

Serie A seasons from 2013/14 to 2016/17, numbered 1 to 4.

match_days_complete:

The number of match days completed so far in each season.

teams:

The teams playing in Serie A in each season. They change every season because the bottom 3 teams are relegated to Serie B and the top 3 teams from Serie B are promoted.

serie_a %>% select(season, teams) %>% tidyr::unnest(teams) %>% glimpse()
#> Observations: 80
#> Variables: 2
#> $ season <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
#> $ p_team <chr> "catania", "lazio", "juventus", "napoli", "chievoverona...
results:

For every season, match_day and team (p_team, for primary team), it shows the team's score (p_score), the opponent's score (o_score), whether the team played at home (p_home) and how many points the p_team earned from the result.

serie_a %>% select(season, results) %>% tidyr::unnest(results) %>% tidyr::unnest(data) %>% glimpse()
#> Observations: 3,040
#> Variables: 8
#> $ season    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
#> $ match_day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
#> $ p_team    <chr> "hellasverona", "sampdoria", "inter", "cagliari", "l...
#> $ o_team    <chr> "milan", "juventus", "genoa", "atalanta", "udinese",...
#> $ p_score   <int> 2, 0, 2, 2, 2, 0, 3, 0, 2, 2, 1, 1, 0, 1, 1, 2, 0, 0...
#> $ o_score   <int> 1, 1, 0, 1, 1, 2, 0, 0, 0, 1, 2, 0, 2, 2, 2, 0, 3, 0...
#> $ p_home    <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE...
#> $ points    <dbl> 3, 0, 3, 3, 3, 0, 3, 1, 3, 3, 0, 3, 0, 0, 0, 3, 0, 1...
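
The output above has two rows per match, one from each team's point of view. As a rough sketch of how such rows could be derived from a home/away match table, the base R example below uses two match-day 1 results taken from the output. The input column names (team1, team2, score1, score2) and the helper match_points are assumptions for illustration, not the package's actual code.

```r
# Toy stand-in for a match table with team keys already joined in.
games <- data.frame(
  match_day = c(1L, 1L),
  team1 = c("hellasverona", "sampdoria"),  # home side
  team2 = c("milan", "juventus"),          # away side
  score1 = c(2L, 0L),
  score2 = c(1L, 1L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: 3 points for a win, 1 for a draw, 0 for a loss.
match_points <- function(p_score, o_score) {
  ifelse(p_score > o_score, 3, ifelse(p_score == o_score, 1, 0))
}

# One row per team per match: the home view first, then the away view.
home <- data.frame(
  match_day = games$match_day,
  p_team = games$team1, o_team = games$team2,
  p_score = games$score1, o_score = games$score2,
  p_home = TRUE, stringsAsFactors = FALSE
)
away <- data.frame(
  match_day = games$match_day,
  p_team = games$team2, o_team = games$team1,
  p_score = games$score2, o_score = games$score1,
  p_home = FALSE, stringsAsFactors = FALSE
)
results <- rbind(home, away)
results$points <- match_points(results$p_score, results$o_score)
results
```
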
ratings:

For every season, match_day and team (p_team), it shows the team's Elo rating, r.

The Elo calculations are mostly based on this site: http://www.eloratings.net/system.html, with k = 20 and a season reverting factor of 0.25.

serie_a %>% select(season, ratings) %>% tidyr::unnest(ratings) %>% tidyr::unnest(data) %>% glimpse()
#> Observations: 3,120
#> Variables: 4
#> $ season    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
#> $ match_day <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
#> $ p_team    <chr> "atalanta", "bologna", "cagliari", "catania", "chiev...
#> $ r         <dbl> 1492.801, 1487.402, 1507.199, 1492.801, 1502.801, 15...
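
As a rough sketch of an Elo update in the style of the linked page, the example below applies R_new = R_old + K * G * (W - W_e) with k = 20. The 100-point home advantage and the goal difference index G are assumptions taken from that page's system rather than from this README. The reading of the reverting factor (each rating moves 25% of the way back to 1500 between seasons) is also an assumption. All function names are hypothetical.

```r
# Expected result for a team that is dr rating points ahead of its opponent.
elo_expected <- function(dr) 1 / (10^(-dr / 400) + 1)

# Goal difference index G: 1 for a draw or one-goal win, 1.5 for two goals,
# 1.75 for three goals, plus 1/8 for every further goal.
goal_index <- function(goal_diff) {
  n <- abs(goal_diff)
  if (n <= 1) 1 else if (n == 2) 1.5 else 1.75 + (n - 3) / 8
}

# Update both ratings after a single match (assumed 100-point home advantage).
elo_update <- function(r_home, r_away, score_home, score_away,
                       k = 20, home_adv = 100) {
  w <- if (score_home > score_away) 1 else if (score_home == score_away) 0.5 else 0
  we <- elo_expected(r_home + home_adv - r_away)
  delta <- k * goal_index(score_home - score_away) * (w - we)
  list(home = r_home + delta, away = r_away - delta)
}

# Assumed season reversion: pull a rating part of the way back to the mean.
revert_to_mean <- function(r, factor = 0.25, mean_rating = 1500) {
  r + factor * (mean_rating - r)
}

elo_update(1500, 1500, 2, 1)  # one-goal home win between two 1500-rated teams
revert_to_mean(1600)          # 1575
```

With both teams at 1500, a one-goal home win gives roughly 1507.199 and 1492.801, values that also appear in the match-day 1 output above. That agreement is suggestive but does not confirm that the package uses exactly these rules.
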
standings:

For every season, match_day and team (p_team), it shows the team's cumulative points, goals_for, goals_against and goal_diff, along with its position relative to the other teams.

serie_a %>% select(season, standings) %>% tidyr::unnest(standings) %>% tidyr::unnest(data) %>% glimpse()
#> Observations: 3,120
#> Variables: 9
#> $ season         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
#> $ match_day      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
#> $ p_team         <chr> "lazio", "juventus", "fiorentina", "cagliari", ...
#> $ position       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ...
#> $ points         <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 0, 0, 0, 0, 0,...
#> $ matches_played <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
#> $ goals_for      <dbl> 2, 1, 2, 2, 2, 2, 2, 2, 3, 0, 0, 0, 0, 0, 0, 1,...
#> $ goals_against  <dbl> 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 3, 2, 2, 2, 2,...
#> $ goal_diff      <dbl> 1, 1, 1, 1, 1, 2, 2, 2, 3, 0, 0, -3, -2, -2, -2...
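
The cumulative columns above can be sketched in base R as running sums per team, followed by a ranking within each match day. The toy results data is invented for illustration. The tie-break used here (points, then goal difference) is an assumption and may not match the ordering the package applies.

```r
# Invented toy results for four teams over two match days.
results <- data.frame(
  match_day = c(1, 1, 1, 1, 2, 2, 2, 2),
  p_team = c("a", "b", "c", "d", "a", "c", "b", "d"),
  p_score = c(2, 1, 0, 0, 1, 1, 3, 0),
  o_score = c(1, 2, 0, 0, 1, 1, 0, 3),
  points = c(3, 0, 1, 1, 1, 1, 3, 0),
  stringsAsFactors = FALSE
)

# Sort by team and match day so cumsum() runs in chronological order.
results <- results[order(results$p_team, results$match_day), ]

# Running totals per team.
standings <- with(results, data.frame(
  match_day = match_day,
  p_team = p_team,
  points = ave(points, p_team, FUN = cumsum),
  matches_played = ave(rep(1, length(p_team)), p_team, FUN = cumsum),
  goals_for = ave(p_score, p_team, FUN = cumsum),
  goals_against = ave(o_score, p_team, FUN = cumsum),
  stringsAsFactors = FALSE
))
standings$goal_diff <- standings$goals_for - standings$goals_against

# Rank within each match day: points first, then goal difference (assumed).
standings <- standings[order(standings$match_day,
                             -standings$points,
                             -standings$goal_diff), ]
standings$position <- ave(seq_len(nrow(standings)), standings$match_day,
                          FUN = seq_along)
standings
```
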


lromeo/CalcioR documentation built on May 21, 2019, 7:52 a.m.