library(knitr)

opts_chunk$set(echo=TRUE,
               warning=FALSE,
               message=FALSE,
               cache=FALSE)
# load own package
devtools::load_all(here::here())

Load data

Before we can do any tidying let's load the data we previously imported.

events_identifiers <- get_identifiers("events")

list_events <- events_identifiers %>%
  purrr::map(~load_for_same_name(type_of_data = "events", identifier = .))
player_identifiers <- get_identifiers("player")

list_players <- player_identifiers %>%
  purrr::map(~load_for_same_name(type_of_data = "player", identifier = .))
games_identifiers <- get_identifiers("games")

list_games <- games_identifiers %>%
  purrr::map(~load_for_same_name(type_of_data = "games", identifier = .))

Wrangling

The datasets we start off with are already tidy in a basic way:

However, for each dataset there's still some work to be done. E.g. all datasets should at least be bound together (since at the start they all exist of separate dataframes).

events <- dplyr::bind_rows(list_events)
games <- dplyr::bind_rows(list_games)
players <- dplyr::bind_rows(list_players)

We end up with a single dataframe.

For each dataset (events, games and players) we look at two things:

Events

Enough penalty events?

Let's make sure we have enough information about penalty events, if not all additional work is pointless.

types_of_events <- table(events$type)

penalty_events <- dplyr::filter(events,
                                stringr::str_detect(type, "penalty"))

There are a total of r nrow(penalty_events) penalty events which isn't too much but sufficient nonetheless.

Missing values

First have a look at the missing values by variable:

plot_count_bar(count_df = calc_n_na_by_var(events))

We don't have any missing values for the game id or type of event which is nice. However, the player id and time variables seem to be missing in quite a lot of cases.

Missing player id variable

A possible explanation for these missing values is that not all events are linked to a player (for example the start and end of the game are coded as events but don't involve any specific player), Let's see how many of the missing player ids can be explained by this:

type_events_missing_player_ids <- events %>%
  dplyr::filter(is.na(player_id)) %>% 
  .[["type"]] %>% 
  table(.) %>% 
  as.data.frame() %>% 
  setNames(c("type","count"))

plot_count_bar(variable = "type",
               count_df = type_events_missing_player_ids)

Almost all missing player ids occur when the event has nothing to do with a specific player. Makes sense. There's just one exception: free kick won. We imagine a free kick is always won by a player. If not, why say won when there can't be a winner? Let's see if this is just an accident de parcours by checking if there are also free kicks won where the player id is included.

free_kick_won_missing_values <- events %>%
  dplyr::mutate(has_player_id = dplyr::if_else(is.na(player_id), FALSE, TRUE)) %>% 
  dplyr::filter(type == "free kick won") %>% 
  dplyr::group_by(has_player_id) %>%
  dplyr::summarise(count = n())

nr_truly_incorrect_missing_values <- free_kick_won_missing_values %>% 
  dplyr::filter(has_player_id == FALSE) %>% 
  .[["count"]]

plot_count_bar(variable = "has_player_id",
               count_df = free_kick_won_missing_values)

Most free kick won events actually do have a player id linked to them. These r nr_truly_incorrect_missing_values which don't are clearly a mistake. Now we have to decide whether to discard these events or not. Certainly we shouldn't assume that just because the player ids are not there the rest of the event information is faulty as well. Let's zoom in on these free kick won events where the player id is missing.

free_kick_won_missing_player_id <- events %>%
  dplyr::filter(type == "free kick won" & is.na(player_id))

free_kick_won_missing_player_id

There seems nothing wrong with these events except for the fact the player id is missing. For reasons of consistency we decide to keep these events in the dataset.

Missing time variable

The time variable is missing for a number of events as well. What kind of events don't have a timestamp?

time_missing_by_event_type <- events %>% 
  dplyr::group_by(type) %>% 
  dplyr::summarise(count = sum(is.na(time))) %>% 
  dplyr::filter(count != 0)

plot_count_bar(variable = "type",
               count_df = time_missing_by_event_type)  

It's normal there's no time information for these events.

Games

Enough games?

We have info about r nrow(games) games. In itself this doesn't mean anything. The games dataset is only useful if we can link it to the events dataset (the data we're really interested in). The games dataset is linked to the events dataset through the game id. So to verify if our games dataset is large enough we must make sure each game id of events occurs in games as well.

game_ids_events <- unique(events$game_id)
game_ids_games <- unique(games$game_id)

all(game_ids_events %in% game_ids_games)

This is the case so we have enough games.

Missing values

What variables of games have missing values?

missing_values_games <- calc_n_na_by_var(games)

plot_count_bar(count_df = missing_values_games)

There seem to be about r max(missing_values_games$count) games with attendance, away score and home score missing. Luckily we don't have to worry about those missing variables. We can recalculate them based on the events. E.g if for a specific game a home player scores twice we know the score is 2-0.

Players

Enough players?

# nr_empty_rows_players <- complete
#   
# percentage_empty_rows_players <- 

First thing we notice is there are a lot of empty rows in the players dataset ()

Missing values



IsaacVerm/penalty documentation built on May 26, 2019, 7:28 a.m.