Introduction to volleystat datasets

Viktor Bozhinov

2019-05-20

The volleystat package provides five different datasets. These datasets are based on the men’s and women’s first division volleyball leagues in Germany for the seasons 2013/2014 to 2018/2019. All data is publicly available at http://www.volleyball-bundesliga.de

Matches data

The matches data contains two observations for each match: one for the home team and one for the away team.

league_gender  match  obs
Men            away   797
Men            home   797
Women          away   899
Women          home   899

The variable league_gender identifies the league’s gender, and the variable season_id encodes the starting year and the end year of the season. For example, season_id 1415 means that the season started in autumn 2014 and ended in spring 2015. The distribution of matches across gender and seasons is shown in the following table:

league_gender  season_id  n_matches
Men            1314       113
Men            1415       134
Men            1516       130
Men            1617       135
Men            1718       130
Men            1819       155
Women          1314       134
Women          1415       154
Women          1516       179
Women          1617       154
Women          1718       126
Women          1819       152
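
The encoding of season_id described above can be unpacked with a little integer arithmetic. The following is a minimal sketch in base R; season_years is a hypothetical helper that is not part of volleystat, and it assumes that every season falls in the 2000s and that season_id is stored as a number or as a string of four digits.

```r
# Hypothetical helper (not part of volleystat): split a season_id such as
# 1415 into the calendar years in which the season starts and ends.
# Assumes all seasons lie in the 2000s.
season_years <- function(season_id) {
  season_id <- as.integer(season_id)
  data.frame(
    season_id  = season_id,
    start_year = 2000L + season_id %/% 100L,  # first two digits
    end_year   = 2000L + season_id %% 100L    # last two digits
  )
}

season_years(c(1314, 1415, 1819))
```

Keeping the result as a data frame makes it easy to attach the two year columns to any table that carries a season_id.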

Both leagues have a round-robin phase and a play-off phase. The variable competition_stage can be used to split matches by the stage of the competition. In addition, the variable match_day holds the number of the matchday in the main round. The table shows that most matches are played in the main round (662 of 797 for men and 772 of 899 for women).

league_gender  competition_stage  n_matches
Men            Finale             25
Men            Main round         662
Men            Playdown           3
Men            Pre-Playoff        17
Men            Quarterfinal       56
Men            Semifinal          34
Women          Finale             22
Women          Main round         772
Women          Pre-Playoff        17
Women          Quarterfinal       57
Women          Semifinal          31

The variable spectators is the officially reported number of spectators who attended the match. Its distribution in both leagues can be plotted as follows:

library(volleystat)
library(dplyr)
library(ggplot2)
matches %>%
  filter(match == "home") %>%
  select(league_gender, spectators) %>%
  ggplot(aes(x = factor(league_gender), y = spectators)) +
  geom_violin()

The variable match_duration is the officially reported duration of the match in minutes. Its distribution in both leagues can be plotted as follows:

# One row per match (home team only), one histogram panel per league
matches %>%
  filter(match == "home") %>%
  select(league_gender, match_duration) %>%
  ggplot(aes(x = match_duration)) +
  geom_histogram() +
  facet_grid(. ~ factor(league_gender))

The variable team_id is the team identifier, which can be used together with league_gender and season_id to join team information from the players or staff datasets. Note that the official team name changes over the seasons for some teams, but the team_id does not.

knitr::kable(
  matches %>%
    filter(season_id == 1314) %>%
    group_by(league_gender, team_name) %>%
    summarize(n_matches = n())
)

league_gender  team_name                 n_matches
Men            BERLIN RECYCLING Volleys  27
Men            CV Mitteldeutschland      22
Men            evivo Düren               20
Men            Generali Haching          24
Men            Moerser SC                20
Men            TV Ingersoll Bühl         23
Men            TV Rottenburg             22
Men            VC Dresden                20
Men            VfB Friedrichshafen       28
Men            VSG Coburg/Grub           20
Women          Allianz MTV Stuttgart     23
Women          Dresdner SC               28
Women          Köpenicker SC Berlin      22
Women          Ladies in black AACHEN    26
Women          Rote Raben Vilsbiburg     28
Women          SC Potsdam                22
Women          Schweriner SC             23
Women          USC Münster               25
Women          VC Wiesbaden              26
Women          VolleyStars Thüringen     25
Women          VT Aurubis Hamburg        20
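
Joining on team_id together with league_gender and season_id, as described above, can be sketched with two tiny stand-in tables. Everything in matches_toy and staff_toy is made up for illustration (including the fact that the same team_id appears in both leagues), and base R's merge is used here instead of a dplyr join so that the snippet runs without extra packages.

```r
# Toy stand-ins for two volleystat tables; all values are made up.
matches_toy <- data.frame(
  league_gender = c("Men", "Men", "Women"),
  season_id     = c(1314, 1415, 1314),
  team_id       = c(1, 1, 1),
  set_won       = c(3, 1, 3)
)
staff_toy <- data.frame(
  league_gender = c("Men", "Women"),
  season_id     = c(1314, 1314),
  team_id       = c(1, 1),
  lastname      = c("Coach A", "Coach B")
)

# Full key: each match row is paired only with staff of the same league,
# season, and team. merge() does an inner join by default.
joined_full <- merge(matches_toy, staff_toy,
                     by = c("league_gender", "season_id", "team_id"))

# team_id alone: every match row is paired with every staff row of team 1.
joined_loose <- merge(matches_toy, staff_toy, by = "team_id")

c(full_key = nrow(joined_full), team_id_only = nrow(joined_loose))
```

In this toy data the full key yields two rows, while the join on team_id alone yields six, most of them mixing leagues or seasons.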

The variable set_won counts the number of sets won by each team. Since volleyball matches are played as best of five sets, a team has won the match exactly when set_won equals 3, so this variable can be used to identify wins and losses. To see how many home-team wins occurred in the season 2015/2016, for men and women separately, you can proceed as follows:

knitr::kable(
  matches %>%
    filter(match == "home", season_id == 1516) %>%
    mutate(match_won = ifelse(set_won == 3, "wins", "losses")) %>%
    group_by(league_gender, match_won) %>%
    summarize(n_matches = n())
)

league_gender  match_won  n_matches
Men            losses     50
Men            wins       80
Women          losses     76
Women          wins       103
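
The counts in the table above can be turned into shares of home wins per league. This is a small sketch in base R that simply retypes the four counts from the table; the column name share is introduced here for illustration.

```r
# Home-team results for season 2015/2016, copied from the table above.
home_results <- data.frame(
  league_gender = c("Men", "Men", "Women", "Women"),
  match_won     = c("losses", "wins", "losses", "wins"),
  n_matches     = c(50, 80, 76, 103)
)

# Share of each outcome within a league: count divided by the league total.
league_total <- ave(home_results$n_matches, home_results$league_gender,
                    FUN = sum)
home_results$share <- round(home_results$n_matches / league_total, 3)

home_results[home_results$match_won == "wins", c("league_gender", "share")]
```

With these counts the home team won roughly 62% of the men's matches and 58% of the women's matches in that season.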

Sets data

The sets dataset is similar to the matches dataset but is at set level, i.e., each set of a match is included from the perspective of both the home team and the away team. It contains four identifiers which can be used to join the set information to the match information.

Besides the set number (set) and the team name (team_name), it contains the duration of the set in minutes (set_duration) and the points scored by the team in the set (pt_set). Suppose you want to compare the average duration across set numbers. This works as follows:

knitr::kable(
  sets %>%
    filter(match == "home") %>%
    group_by(set) %>%
    summarize(obs = n(),
              mean_duration = mean(set_duration, na.rm = TRUE))
)

set  obs   mean_duration
1    1696  25.07173
2    1696  25.63960
3    1696  25.80854
4    894   26.13273
5    348   16.27378
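
Because pt_set is recorded for both the home and the away team, the points can also be used to work out which side won each set. The sketch below uses a made-up match in base R; sets_toy, set_win, and the presence of a match_id column in the sets dataset are assumptions made for illustration.

```r
# One made-up match with four sets, stored once per team as in the sets data.
sets_toy <- data.frame(
  match_id = 1,
  set      = rep(1:4, times = 2),
  match    = rep(c("home", "away"), each = 4),
  pt_set   = c(25, 23, 25, 25,   # home team
               20, 25, 22, 18)   # away team
)

# Highest score within each set of each match; the team holding it won the set.
best <- ave(sets_toy$pt_set, sets_toy$match_id, sets_toy$set, FUN = max)
sets_toy$set_win <- sets_toy$pt_set == best

# Number of sets won per team, comparable to set_won in the matches data.
set_won_toy <- aggregate(set_win ~ match_id + match, data = sets_toy, FUN = sum)
set_won_toy
```

In this toy match the home team takes three sets and the away team one, so the home team wins the match.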

Players data

The players dataset contains four identifiers which can be used to link it to the other datasets, e.g., the matches dataset and the matchstats dataset.

The dataset contains the team name (team_name), the player’s first and last name (firstname, lastname), and the official shirt number (shirt_number) she or he wore in the season. In addition, the dataset contains several player characteristics.

For example, if you want to compare the height of the players by gender and position for the season 2017/2018, you can use dplyr to compute the relevant figures as follows:

knitr::kable(
  players %>%
    filter(season_id == 1718) %>%
    group_by(gender, position) %>%
    summarise(obs = n(),
              mean_height = mean(height),
              sd_height = sd(height),
              min_height = min(height),
              max_height = max(height))
)

gender  position        obs  mean_height  sd_height  min_height  max_height
female  Diagonal        20   186.1000     3.242806   180         192
female  Libero          17   171.5294     5.745843   158         183
female  Middle block    43   189.3256     3.920161   181         197
female  Outside spiker  50   183.9200     4.623984   172         194
female  Setter          23   179.4348     4.065666   172         190
female  Universal       1    189.0000     NA         189         189
male    Diagonal        21   200.6667     4.293406   192         210
male    Libero          14   183.6429     5.838956   175         195
male    Middle block    36   203.9722     3.359304   196         212
male    Outside spiker  47   196.1915     6.077894   180         212
male    Setter          23   192.5652     7.709652   179         206
male    Universal       6    198.3333     3.932768   192         202
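
The NA in the sd_height column for the single female "Universal" player is expected rather than a data problem: the sample standard deviation is undefined for one observation. A short base R illustration, using the height from the table and two arbitrary values for comparison:

```r
# The one female "Universal" player in the table above.
sd_single <- sd(189)           # NA: the sample sd divides by n - 1 = 0

# With at least two values the standard deviation is defined.
sd_pair <- sd(c(180, 192))

is.na(sd_single)
sd_pair
```

Groups with a single observation therefore need to be filtered out or handled separately before comparing standard deviations across positions.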

Staff data

The staff dataset contains three identifiers which can be used to link it to the other datasets, e.g., the players dataset and the matches dataset.

The dataset contains the team name (team_name) and the staff member’s first and last name (firstname, lastname). In addition, the dataset contains several other characteristics.

For example, suppose you want to list all coaches and their nationality in the season 2014/2015. You can do so as follows:

knitr::kable(
  staff %>%
    filter(season_id == 1415, position == "Coach") %>%
    select(league_gender, team_name, firstname, lastname, nationality) %>%
    arrange(league_gender, team_name)
)

league_gender  team_name                 firstname   lastname               nationality
Men            Berlin Recycling Volleys  Mark        Lebedew                Australia
Men            CV Mitteldeutschland      Ulf         Quell                  Germany
Men            Netzhoppers KW-Bestensee  Mirko       Culic                  Germany
Men            SVG Lueneburg             Stefan      Huebner                Germany
Men            SWD Powervolleys Dueren   Michael     Muecke                 Germany
Men            TSV Herrsching            Maximilian  Hauser                 Germany
Men            TV Ingersoll Buehl        Ruben       Wolochin               Argentina
Men            TV Rottenburg             Hans Peter  Mueller-Angstenberger  Germany
Men            VCO Berlin                Johan       Verstappen             Netherlands
Men            VfB Friedrichshafen       Stelian     Moculescu              Germany
Men            VSG Coburg/Grub           Milan       Maric                  Bosnia & Herzegovina
Women          Allianz MTV Stuttgart     Guillermo   Naranjo Hernandez      Spain
Women          Dresdner SC               Alexander   Waibl                  Germany
Women          Koepenicker SC            Bjoern      Matthes                Germany
Women          Ladies in Black Aachen    Marek       Rojko                  Slovakia
Women          Rote Raben Vilsbiburg     Jonas       Kronseder              Germany
Women          SC Potsdam                Alberto     Salomoni               Italy
Women          SSC Schwerin              Felix       Koslowski              Germany
Women          USC Muenster              Axel        Buering                Germany
Women          VC Wiesbaden              Andreas     Vollmer                Germany
Women          VCO Berlin                Jens        Tietboehl              Germany
Women          Volleystars Thueringen    Sebastian   Leipold                Germany
Women          VT Aurubis Hamburg        Dirk        Sauermann              Germany

Matchstats data

The matchstats dataset has been created from the official match report of each match (where one was available). An example of an official match report created by the author of the volleystat package can be found here:

http://live.volleyball-bundesliga.de/2016-17/Women/&2058.pdf

The dataset contains a series of identifiers which can be used to join it to the players data and the matches data.

For example, let’s say you want to take a look at all reception statistics of the libero of VC Wiesbaden in the season 2016/2017. You can use the players dataset to select the libero and the team and then join it to the matchstats dataset:

knitr::kable(
  players %>%
    filter(team_id == 2009, season_id == 1617, position == "Libero") %>%
    left_join(matchstats, by = c("season_id", "team_id", "player_id")) %>%
    arrange(match_id) %>%
    select(season_id, team_name, firstname, lastname, match_id, starts_with("rec_"))
)

season_id  team_name     firstname  lastname  match_id  rec_tot  rec_err  rec_pos  rec_per
1617       VC Wiesbaden  Alyssa     Longo     2003      15       0        60       33
1617       VC Wiesbaden  Alyssa     Longo     2013      31       1        61       45
1617       VC Wiesbaden  Alyssa     Longo     2017      26       4        38       15
1617       VC Wiesbaden  Alyssa     Longo     2021      11       1        64       27
1617       VC Wiesbaden  Alyssa     Longo     2025      23       0        39       26
1617       VC Wiesbaden  Alyssa     Longo     2029      7        1        29       0
1617       VC Wiesbaden  Alyssa     Longo     2032      10       0        30       10
1617       VC Wiesbaden  Alyssa     Longo     2046      11       1        27       18
1617       VC Wiesbaden  Alyssa     Longo     2053      18       1        17       6
1617       VC Wiesbaden  Alyssa     Longo     2058      33       1        33       27
1617       VC Wiesbaden  Alyssa     Longo     2064      18       5        33       11
1617       VC Wiesbaden  Alyssa     Longo     2069      12       1        42       33
1617       VC Wiesbaden  Alyssa     Longo     2074      24       7        42       25
1617       VC Wiesbaden  Alyssa     Longo     2082      22       2        45       18
1617       VC Wiesbaden  Alyssa     Longo     2092      10       0        70       50
1617       VC Wiesbaden  Alyssa     Longo     2100      10       0        60       20
1617       VC Wiesbaden  Alyssa     Longo     2102      11       1        45       9
1617       VC Wiesbaden  Alyssa     Longo     2106      24       3        33       21
1617       VC Wiesbaden  Alyssa     Longo     2111      29       5        38       10
1617       VC Wiesbaden  Alyssa     Longo     2118      21       4        48       29
1617       VC Wiesbaden  Alyssa     Longo     2122      17       2        41       12
1617       VC Wiesbaden  Alyssa     Longo     2132      17       1        24       0
1617       VC Wiesbaden  Alyssa     Longo     2515      24       0        58       21
1617       VC Wiesbaden  Alyssa     Longo     2516      18       0        50       33
1617       VC Wiesbaden  Alyssa     Longo     2517      22       4        41       9
1617       VC Wiesbaden  Alyssa     Longo     2518      22       3        41       18
1617       VC Wiesbaden  Alyssa     Longo     2519      26       0        58       31
The variable vote is extracted only if it is reported as an integer (some match reports use a three-point system, which is not comparable to the numeric vote). The remaining variables contain the statistics of each player who was fielded in a match, grouped into the categories Points (starts with pt_), Serve (starts with serv_), Reception (starts with rec_), Attack (starts with att_), and BK, i.e. block (starts with blo_) (see http://live.volleyball-bundesliga.de/2016-17/Women/&2058.pdf for an example). Note that the starting position of the player (the Set columns in the example report) is not (yet) included in the dataset.
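
When reception statistics are averaged over several matches, it matters whether each match or each reception gets equal weight. The sketch below retypes the first three rows of the reception table above and assumes that rec_pos is the percentage of positive receptions relative to rec_tot; that interpretation is consistent with the numbers shown but is not stated in the text.

```r
# First three matches from the reception table above.
rec <- data.frame(
  match_id = c(2003, 2013, 2017),
  rec_tot  = c(15, 31, 26),   # number of receptions
  rec_pos  = c(60, 61, 38)    # assumed: positive receptions in percent of rec_tot
)

# The plain mean treats every match alike; the weighted mean accounts for
# how many receptions each match contributed.
plain_mean    <- mean(rec$rec_pos)
weighted_mean <- weighted.mean(rec$rec_pos, w = rec$rec_tot)

round(c(plain = plain_mean, weighted = weighted_mean), 1)
```

The two figures differ only slightly here, but the gap grows when matches with very few receptions have extreme percentages.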

For example, if you want to compute how often the libero of VC Wiesbaden received the ball and how many errors she made in all matches in the season 2016/2017 you can modify the code from above:

knitr::kable(
  players %>%
    filter(team_id == 2009, season_id == 1617, position == "Libero") %>%
    left_join(matchstats, by = c("season_id", "team_id", "player_id")) %>%
    select(rec_tot, rec_err) %>%
    summarise(rec_tot_sum = sum(rec_tot),
              rec_err_tot = sum(rec_err))
)

rec_tot_sum  rec_err_tot
512          48
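
The two totals can be combined into an error rate. The following base R sketch retypes the totals from the table above; it also illustrates one caveat of summing after a left join, namely that a player without any rows in matchstats would contribute NA values, which sum() propagates unless na.rm = TRUE is set. The vector in the second part is made up for illustration.

```r
rec_tot_sum <- 512   # total receptions, from the table above
rec_err_tot <- 48    # reception errors, from the table above

# Share of receptions that ended in an error, in percent.
error_pct <- 100 * rec_err_tot / rec_tot_sum
error_pct

# sum() returns NA as soon as one value is missing; na.rm = TRUE skips them.
with_gap  <- c(15, NA, 26)               # made-up values with one gap
sum_plain <- sum(with_gap)               # NA
sum_clean <- sum(with_gap, na.rm = TRUE) # 41
```

The libero thus made an error on a little under one in ten receptions over the season.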