knitr::opts_chunk$set(
  collapse = TRUE,
  warning = FALSE,
  message = FALSE,
  comment = "#>"
)

The epldata package is a set of nine comprehensive datasets covering players, teams, managers, goals and assists in the English Premier League from its incepton in August 1992 to the final week of the 2017/18 season. It is the intention to update the package annually, shortly after the end of each season

This vignette is a brief introduction to some aspects of the package and how it might be used. Several other packages are utilized. If you are unfamiliar with their functions you will need to refer to their documentation

Basics

# Download package if not on your system
#devtools::install_github("pssguy/epldata")
library(epldata)

## This lists the available datasets with a brief description
data(package="epldata")

Let's look at one of them

library(tidyverse) # for data manipulation

glimpse(players)

The tables are in SQL type with a key variable for linking separate datasets. For this data, it is the unique player_id, which also appears in the player_team dataset. The data provides some basic information on each of the, approaching 5,000, players who have appeared in the League


One-Table Analysis

Let's use the players data to obtain the percentage distribution of players born from 1990 onwards, by birth country

players %>% 
  filter(birth_date>"1989-12-31") %>% 
  group_by(birth_country) %>% 
  tally() %>% 
  mutate(pc=round(100*n/sum(n),2)) %>% 
  arrange(desc(pc))

Predictably, England dominates but Scotland only scrapes into the top 10


Multi-table Analyses

Much more commonly you will need to combine tables to produce interesting information

Which player has scored the most for each team?

player_goals <- players %>% 
  left_join(player_team) %>% 
  left_join(player_game) %>% 
  right_join(goals) %>% 
  mutate(name=paste(first_name,last_name)) %>% 
  group_by(player_id,name,team) %>% 
  tally() %>% 
  arrange(desc(n)) %>% 
  group_by(team) %>% 
  slice(1) %>% 
  ungroup() %>% 
  filter(!(is.na(team))) %>% 
  select(team,name,goals=n)



player_goals

Derived tables

The above example included quite a few joins which you will probably not wish to do for every analysis For instance, you might want to have available a summary of each match played

## goals by team for individual match

goals_by_team <- game_team %>% 
  left_join(player_game) %>% 
  right_join(goals) %>% 
  # sum goals for each team for each game
  group_by(team,team_game_id,game_id) %>% 
  tally() %>% 
  # need to include games in which no goals were scored by team
  right_join(game_team) %>% 
  mutate(GF=ifelse(is.na(n),0,n)) %>% 
  select(-c(venue,n))

goals_by_team

So we now have the goals scored by each team. The next step is to combine this table with itself to obtain the opposing team and the goals against

goals_by_game <-goals_by_team %>% 
  inner_join(goals_by_team,by="game_id") # specify otherwise it will also use team_game_id

head(goals_by_game)

We have duplication and wish to remove all those where team.x= team.y. as well as tidy up column names and calculate the points accrued for each match. This takes a few seconds to run

match_summary <- goals_by_game %>% 
  filter(team.x!=team.y) %>% 
  select(team=team.x,team_game_id=team_game_id.x,game_id,GF=GF.x,opponents=team.y,GA=GF.y) %>% 
  mutate(points=case_when(
       GF >GA  ~ 3,
       GF==GA ~ 1,
       GF<GA ~ 0

  ))

match_summary

To put the results into context, we need to add the game date, arrange it sequentially and split the results into seasons. This also takes a few seconds to run

years <- c(1992:2018)


library(lubridate) # for date manipulation

match_summary_full <- match_summary %>% 
  left_join(games) %>% 
  mutate(year=year(game_date),month=month(game_date)) %>% 
  mutate(season= case_when(
    month<=7 ~ paste(year-1,year,sep="/"),
    month>7 ~ paste(year,year+1,sep="/")
         )
  ) %>% 
  arrange(game_date) %>% 
  group_by(season,team) %>% 
  mutate(year_game_order=row_number())

 match_summary_full

This might be a useful derived table to save as a basis for further analyses including

  1. Create standings
  2. Team Head-to-Head matchups
  3. Sequences of results and scoring

Standings

We can now create a standings data.frame for each round of matches based on points, Goal difference , and Goals For

standings <- match_summary_full %>% 
  select(team,season,game_date,year_game_order,GF,GA,points) %>% 
  group_by(team,season) %>% 
  mutate(cum_points=cumsum(points),cum_GF=cumsum(GF),cum_GA=cumsum(GA),cum_GD=cum_GF-cum_GA) %>% 
  group_by(season,year_game_order) %>% 
  arrange(desc(cum_points),desc(cum_GD),desc(cum_GF),team) %>% 
  mutate(position=row_number()) %>% 
  select(season,team,round=year_game_order,position,GF=cum_GF,GA=cum_GA,GD=cum_GD,points=cum_points) %>% 
  ungroup() # important otherwise scres up later inc animation

standings

It is then a simple matter to create a function to get a table for any round of any year. e.g after 20 games in 1994/1995

table_year_round <- function(x,y){
  standings %>% 
    filter(season==x,round==y)
}

table_year_round("1994/1995",20)

Obviously you can vary what is in these derived tables to suit your own requirement

For the premiersoccerstats web site, I create around thirty derived tables weekly for speedy user interaction


Output

We have previously covered basic usage of datasets within the package including combining them to produce answers to questions and creating derived tables.

We will next look at more interesting output in the form of

This package is particularly suited to the first two options though there is some geographic data to play around with

You will need the the data.frames created earlier so if it they are not in your environment either load a saved version or re-run the code


Tables

I tend to use the DT package, but there are other options

Head to Head

Let's use the match_summary_full dataframe to calculate each team's head to head record. Over and above the current data, we need to create and sum the results

match_summary_full %>% 
  ungroup() %>%  #match_summary_full is grouped tbl_df
  group_by(team,opponents) %>% 
  mutate(result = case_when(
    GF > GA ~ "W", #win
    GF == GA ~ "D", #draw/tie
    GF < GA ~ "L" # loss
  )) %>% 
  select(team,opponents,result,GF,GA,points) %>%
  mutate(yesno = 1) %>%
  distinct %>%
  spread(result, yesno, fill = 0) %>% 
  summarize(P=n(),W=sum(W),D=sum(D),L=sum(L),ppg=round(sum(points)/P,2))%>%
  arrange(desc(ppg)) %>% 
   DT::datatable(class='compact stripe hover row-border order-column',rownames=FALSE,options= list(paging = TRUE, searching = TRUE,info=FALSE))

This provides a sortable, searchable table


Charts

Let's turn attention to players. Firstly I will create a data.frame for the goals and assists for a specified player

For ease of use below, I have created it as a function and provided an example player_id

player_game_data <- function(player) {
# collect goal information for specific player
df_goals <- players %>% 
  left_join(player_team) %>% 
  left_join(player_game) %>% 
  left_join(goals) %>% 
  filter(start==TRUE|time_on>0) %>% 
  select(player_id,last_name,player_game_id,goal_id,team_game_id) %>% 
  mutate(goal=ifelse(!is.na(goal_id),1,0)) %>% 
  group_by(player_id,last_name,team_game_id) %>% 
  summarize(tot_goals=sum(goal)) %>% 
   filter(player_id==player)

# likewise with assists
df_assists <- players %>% 
  left_join(player_team) %>% 
  left_join(player_game) %>% 
  left_join(assists) %>% 
  filter(start==TRUE|time_on>0) %>% 
  select(player_id,last_name,team_game_id,assist_id,player_game_id) %>% 
  mutate(assist=ifelse(!is.na(assist_id),1,0)) %>% 
  group_by(player_id,last_name,team_game_id) %>% 
  summarize(tot_assists=sum(assist)) %>% 
   filter(player_id==player)

# combine
df_all <- df_goals %>% 
  inner_join(df_assists) %>% 
# create a game order  
  left_join(game_team) %>%
  left_join(games) %>% 
  arrange(game_date) %>% 
  mutate(player_game_order=row_number()) %>% 
   ungroup() %>% #removes unwanted name and PLAYERID
  select(player_game_order,tot_goals,tot_assists) %>% 
  # gather into narrow format for plotting
  gather(category,count,-player_game_order) 


}

player_df <-player_game_data("SALAHM")
head(player_df)

You can see why you might want to create a derived player table first if you want to do varied detailed analyses particularly where the raw data is only updated annually .saves time and enhances user interactivity experience

Now just choose your plotting package of choice to display the data. I will use plotly as this allows for ease of info-activity including feature such as panning/zooming, hover, tooltips etc.

library(plotly)

player_df %>%
  plot_ly(x=~player_game_order, y= ~count,width = 600) %>%
  add_bars(color= ~category, colors=c("red","blue")) %>%
  layout(barmode="stack")

Lots of customization is available within the package.


Interactivity

Lets use the data to create some interactive output

Shiny

Lets say we use the match_summary_full data to plot a histogram of the goals scored by a team in the Premier League

I have set the eval = FALSE as appshot of Shiny app objects is not yet supported. but the code will run and give a similar output to that of the Crosstalk example below

library(shiny)
library(glue)

shinyApp(



  ui = fluidPage(

    ## calculate an ordered vector of teams to  select from
     teams <- match_summary_full %>%
      pull(team) %>%
      unique() %>%
      sort(),

    selectInput("team", "Select Team:", teams),

    plotlyOutput("goals_for")
  ),
  server = function(input, output) {

    output$goals_for <- renderPlotly({

      match_summary_full %>%
        filter(team == input$team) %>%
        plot_ly %>%
        add_histogram(x =  ~ GF) %>%
        layout(title = glue("Distribution of Goals scored by {input$team}"))

    })
  }
)

Crosstalk

This is an alternative method which does not require access to a server and allows for htmlwidgets to interact with each other

Filtering

Here is the equivalent input selection(minus a default) and chart as the shiny example above produces

library(crosstalk)

msf  <- SharedData$new(match_summary_full)
bscols(
  widths = c(12), # forces components into rows
filter_select(id="team",label="Select a Team",sharedData=msf, group =  ~team, multiple = FALSE),
plot_ly(msf, x = ~GF, showlegend = FALSE, width=600) %>% 
    add_histogram(color = ~team, colors = "red")
)

Animation

If you want eye-candy... We can use the standings dataset prepared earlier Let's look at how arch-rivals, Brighton and Crystal Palace, fared in 2017/18

# function to add cumulative line
# courtesy Carson Sievert

accumulate_by <- function(dat, var) {
  var <- lazyeval::f_eval(var, dat)
  lvls <- plotly:::getLevels(var)
  dats <- lapply(seq_along(lvls), function(x) {
    cbind(dat[var %in% lvls[seq(1, x)], ], frame = lvls[[x]])
  })
  dplyr::bind_rows(dats)
}

# select team(s) to display 
teams <- c("Brighton","Crystal P")

# add function to base data and year of interest
df <- standings %>% 
  filter(season=="2017/2018"&team %in% teams) %>% 
  accumulate_by(~round)

# static plot - scatter plot- uncolored
base <- df %>% 
  plot_ly(x=~round,y=~position,width=600) %>% 
layout(
    xaxis=list(title="Games Played"),
   yaxis=list(title="League Standing",range=c(20.5,0.5))
  ) %>% 
  config(displayModeBar = F,showLink = F) 


# add animation options and color-blind safe colors
 base %>%
   add_lines(color = ~team, colors="Set2", frame = ~frame, ids = ~team) %>%
  animation_opts(500, easing = "linear",mode='immediate') %>%
  animation_button(
    x = 1, xanchor = "right", y = 0, yanchor = "middle", font = list(color="red"), bgcolor="lightblue"
  ) %>%
  animation_slider(
    currentvalue = list(prefix = "Game ")
  ) 

Brighton, a promoted club, were expected to be struggle but Crystal Palace spent more of the season in the drop zone. In the end, they both survived relegation by placing higher than 18th



pssguy/epldata documentation built on May 12, 2019, 7:36 a.m.