Introduction to volleystat datasets

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
  library(dplyr)
  library(ggplot2)

  data(matches, package = "volleystat")
  data(players, package = "volleystat")
  data(matchstats, package = "volleystat")
  data(staff, package = "volleystat")
  data(sets, package = "volleystat")

The volleystat package provides five different datasets.These datasets are based on the men's and women's first division volleyball league in Germany for the seasons 2013/2014 to 2017/2018. All data is publicly avalable on http://www.volleyball-bundesliga.de

Matches data

The matches data contains two observations of each match. One observation for the home team and one
for the away team.

knitr::kable(
matches %>% 
group_by(league_gender, match) %>% 
summarise(obs = n())
)

The variable league_gender provides information of the league's gender, the variable season_id marks the starting year of the season and the end year of the season. For example, season_id 14/15 means that the season started in autumn 2014 and ended in spring 2015. The distribution of matches across gender and seasons is depicted in the following table:

knitr::kable(
matches %>% filter(match == "home") %>% 
group_by(league_gender, season_id) %>%  
summarize(n_matches = n())
)

Both leagues have a round robin phase and a play-off phase. The variable competition_stage can be used to split matches by the stage of the competition. In additon, the variable match_day holds information on the number of the matchday in the main round. The table shows that the most matches are played in the main round.

knitr::kable(matches %>% filter(match == "home") %>% 
group_by(league_gender, competition_stage) %>%  summarize(n_matches = n()))

The variable spectators is the officialy reported number of spectators which attended the match. The distribution of the variable for boath leagues is depicted below.

matches %>% filter(match == "home") %>% 
group_by(league_gender) %>% select(league_gender, spectators)  %>%
ggplot(aes(x = factor(league_gender), y = spectators)) + 
geom_violin()

The variable match_duration is the officialy reported duration of the match in minutes. The distribution of the variable for boath leagues is depicted below.

matches %>% filter(match == "home") %>% 
group_by(league_gender) %>% select(league_gender, match_duration, set_won)  %>%
ggplot(aes(x = match_duration)) + 
geom_histogram(position = "dodge") +
facet_grid(.~factor(league_gender))

The variable team_id is the team identifier which can be used together with league_gender and season_id to join team information from the players or staff datasets. Note that while official team name changes over the seasons for some teams but team_id doesn't.

knitr::kable(
matches %>% filter(season_id == 1314) %>% 
group_by(league_gender, team_name) %>%  
summarize(n_matches = n()))

The variable set_won counts the number of sets won by each team. Since volleyball is played in the best-of-five mode this variable can be used to idetify wins and losses. If you want to see how many home team wins occured in the season 2015/2016 for men and women separately you can do it as shown here:

knitr::kable(
matches %>% 
filter(match == "home", season_id == 1516) %>% 
mutate(match_won = ifelse(set_won == 3, "wins", "losses")) %>%
group_by(league_gender, match_won) %>%  
summarize(n_matches = n())
)

Sets data

The sets dataset is similar to the matches dataset but it is on set level, i.e., each set of a match is included from the prespective of the home team and the away team. It contains four identifiers which can be used to join the set information to the match information:

Beside the set number (set) and the team name (team_name), it contains information on the duration of the set in minutes (set_durarion) and the points scored by the team in the set (pt_set). Suppose you want to compare the average length of the all set numbers. This works as following:

knitr::kable(
sets %>% 
filter(match == "home") %>% 
group_by(set) %>%  
summarize(obs = n(),
          mean_duration = mean(set_duration, na.rm = TRUE))
)

Players data

The players dataset contains four identifiers which can be used to link it to the other datasets, e.g., the matches dataset and the matchstats dataset:

The dataset contains the team name (team_name) and player's first and last name (first_name, last_name) and the official shirt number (shirt_number) she or he used to wear in the season. In addition, the dataset conatins several player characteristics:

For example, of you want to compare the height of the players by gender and postion for the season 2017/2018, you can use dplyr to compute relevant figures as following:

knitr::kable(
players %>%
filter(season_id == 1718) %>% 
group_by(gender, position) %>% 
summarise(obs = n(),
          mean_height = mean(height),
          sd_height = sd(height),
          min_height = min(height),
          max_height = max(height)))

Staff data

The staff dataset contains three iidentifiers which can be used to link it to the other datasets, e.g., the players dataset and the mathes dataset:

The dataset contains the team name (team_name) and staff member's first and last name (first_name, last_name) In addition, the dataset conatins several other characteristics:

For example, suppose you want to list all coaches and their nationality in the season 2014/2015. Then you can do it as following:

knitr::kable(
staff %>%
filter(season_id == 1415, position == "Coach") %>% 
select(league_gender, team_name, firstname, lastname, nationality) %>% 
arrange(league_gender, team_name)
)

Matchstats data

The matchstats dataset has been created from the official match reports of each match (if it was available). An example of an official match report created by the author of the volleystat package can be found here:

http://live.volleyball-bundesliga.de/2016-17/Women/&2058.pdf

The dataset contains a series of identifiers which can be used to join the dataset to the teams data and the matches data:

For example, let's say you want to take a look at all statistics on the reception of the libero of VC Wiesbaden in the season 2016/2017. You can use the teams dataset to select the libero and the team and then join it to the matchstats dataset:

knitr::kable(
players %>% 
filter(team_id == 2009, season_id == 1617, position == "Libero") %>% 
left_join(matchstats, by = c("season_id" = "season_id", 
                             "team_id" = "team_id", 
                             "player_id" = "player_id")) %>% 
arrange(match_id) %>% 
select(season_id, team_name, firstname, lastname, match_id, starts_with("rec_")))

The variable vote is extracted only if it is reported as an integer (in some match reports this value is reported using a three-point system which is not comparable to the numeric vote). The remaining variables of the dataset contain the statistics of each player who was fielded in a match for the categories Points (starts with pt_), Serve (starts with serv_), Reception (starts with att_), Attack (starts with att_), BK (starts with att_) (see http://live.volleyball-bundesliga.de/2016-17/Women/&2058.pdf for an example). Note that the starting position of the player is not included (yet) into the datatset (columns set in the example).

For example, if you want to compute how often the libero of VC Wiesbaden received the ball and how many errors she made in all matches in the season 2016/2017 you can modify the code from above:

knitr::kable(
players %>%
filter(team_id == 2009, season_id == 1617, position == "Libero") %>% 
left_join(matchstats, by = c("season_id" = "season_id", 
                             "team_id" = "team_id",
                             "player_id" = "player_id")) %>% 
select(rec_tot, rec_err) %>% 
summarise(rec_tot_sum = sum(rec_tot),
          rec_err_tot = sum(rec_err))
)


Try the volleystat package in your browser

Any scripts or data that you put into this service are public.

volleystat documentation built on May 21, 2019, 1:02 a.m.