library(dplyr)
library(comperes)

The comperes package is designed for working with competition results (CR).

This vignette describes supported formats for CR and operations with them.

Formats

It is assumed that a competition consists of multiple games (matches, comparisons, etc.). In general, games can consist of a variable number of players. Inside a game all players are treated equally. In every game every player has some score: a value of arbitrary nature that fully characterizes the player's performance in that particular game (in most cases it is a numeric value).

Long CR format

Long format is the most general way to represent CR because it naturally allows a game to consist of a variable number of players. Results should be in a "data.frame-like" format where the observational unit (row) is the score of a particular player in a particular game.

In comperes this format is supported by the longcr S3 class, which inherits from tibble. Data in this format should have at least three columns with the following names: game (game identifier), player (player identifier) and score (score of the player in the game).

Extra columns are allowed. Note that if a longcr object is converted to widecr (the wide format described next), extra columns will be dropped. So it is better to store extra information about a game-player pair in score as a list-column, which will stay untouched.

As an example of a longcr object one can use ncaa2005, which is a built-in dataset in the comperes package:

print(ncaa2005, n = 6)

There is an S3 generic for easy conversion to longcr: to_longcr. Its default method converts the argument to a tibble and adds longcr to the result's class. If the argument of to_longcr is a proper longcr object, it stays untouched. In the case of widecr, an actual conversion to longcr is made, preserving all extra columns.

to_longcr has an argument repair. If TRUE, the result is ensured to have proper structure: there will be columns game, player and score, and there will be no duplicated game-player pairs. If any of these columns is not detected in the original data, it is created as a new column of NAs and an appropriate message is given. In case of imperfect matching of column names there will also be a message. For more details see ?to_longcr.
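The two structural requirements named above (the three columns and unique game-player pairs) can be checked by hand in base R. This is only a sketch of the idea, not the package's implementation: the data frame raw and the choice to keep the first of several duplicated pairs are assumptions made for illustration.

```r
# Hypothetical raw results: the score column is missing and
# the pair (game 2, player "C") appears twice.
raw <- data.frame(
  game   = c(1, 1, 2, 2, 2),
  player = c("A", "B", "A", "C", "C")
)

# Create every required column that is absent, filled with NA
required <- c("game", "player", "score")
missing_cols <- setdiff(required, names(raw))
for (col in missing_cols) {
  raw[[col]] <- NA
}

# Flag repeated game-player pairs and keep the first occurrence (assumption)
is_dup <- duplicated(raw[, c("game", "player")])
repaired <- raw[!is_dup, ]
```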

Wide CR format

Wide format is preferred if all games consist of a constant number of players. Results should be in a "data.frame-like" format where the observational unit (row) is one particular game. Data should be organized in player-score pairs of columns. The identifier of a pair should go after the respective keyword and consist only of digits. For example: player1, score1, player2, score2. The order of columns doesn't matter. Extra columns are allowed.

To account for R's standard string ordering, identifiers of pairs should be padded with leading zeros where needed. For example: player01, score01, ..., player10, score10.
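The reason for the padding can be seen with plain base R sorting. The names below are made up for illustration.

```r
# Without padding, "player10" sorts before "player2"
unpadded <- sort(c("player1", "player2", "player10"))

# With zero-padded identifiers the string order matches the numeric order
padded <- sort(sprintf("player%02d", c(1L, 2L, 10L)))
```

Here unpadded is "player1", "player10", "player2", while padded is "player01", "player02", "player10".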

The column game with the game identifier is optional. If present, it will be used in conversion to longcr format via to_longcr.

Here is the widecr version of ncaa2005 dataset:

print(to_widecr(ncaa2005, repair = FALSE), n = 3)

As with longcr, there is an S3 generic for easy conversion to widecr: to_widecr. Its default method converts the argument to a tibble and adds widecr to the result's class. If the argument of to_widecr is a proper widecr object, it stays untouched. In the case of longcr, an actual conversion to widecr is made using only the columns game, player and score.

to_widecr has an argument repair. If TRUE, it detects possible player-score pairs by the identifier of a pair (the characters that go after the respective keywords). If some column doesn't have its pair, the missing one is created as a new column of NAs. For more details see ?to_widecr.
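To make the relation between the two layouts concrete, here is a rough base R sketch of turning long results into player-score column pairs with reshape(). It is not how to_widecr is implemented; the data frame long_cr and the helper column id are assumptions for illustration.

```r
# Hypothetical long-format results: two games with two players each
long_cr <- data.frame(
  game   = c(1, 1, 2, 2),
  player = c("A", "B", "A", "C"),
  score  = c(10, 7, 3, 5)
)

# Number the players inside each game; this becomes the pair identifier
long_cr$id <- ave(seq_along(long_cr$game), long_cr$game, FUN = seq_along)

# One row per game, with columns player1, score1, player2, score2
wide_cr <- reshape(
  long_cr,
  idvar = "game", timevar = "id",
  v.names = c("player", "score"),
  direction = "wide", sep = ""
)
```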

A useful special case of the wide CR format is pairgames: CR in which every game is held between two players. It is simply a widecr object with two players, and it is the most popular form of CR for rating and ranking systems. The function to_pairgames creates pairgames from general CR: it drops games with one player and transforms every game with three or more players into a set of separate games between unordered pairs of players. It accepts CR in a format ready for to_longcr. For more details see ?to_pairgames. A usage example:

cr_data <- data.frame(game = rep(1, 3), player = 11:13, score = 101:103)
to_pairgames(cr_data)
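The splitting of a three-player game into unordered pairs can be reproduced by hand with combn(). The sketch below only illustrates the idea; in particular, numbering the new games 1, 2, 3 is an assumption and may differ from what to_pairgames returns.

```r
# The same single game with three players as in the example above
one_game <- data.frame(game = rep(1, 3), player = 11:13, score = 101:103)

# All unordered pairs of row indices: (1, 2), (1, 3), (2, 3)
idx <- combn(nrow(one_game), 2)

# One two-player game per pair (new game identifiers are an assumption)
pairs <- data.frame(
  game    = seq_len(ncol(idx)),
  player1 = one_game$player[idx[1, ]],
  score1  = one_game$score[idx[1, ]],
  player2 = one_game$player[idx[2, ]],
  score2  = one_game$score[idx[2, ]]
)
```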

Operations

After CR are converted into an appropriate format, they can be used for several types of operations.

Compute Head-to-Head matrix

A Head-to-Head value is a measure of the quality of direct confrontation between two players. It is assumed that this value can be computed based only on the players' scores in their common games. If that is not true in some case, the competition results should be changed by transformation or by adding more information (in the form of extra columns or an extra field in the score column(s), making them list-column(s)).

A Head-to-Head value is computed for an ordered pair of players based on their matchups. This means that the Head-to-Head value for "player1"-"player2" may differ from the one for "player2"-"player1". This is done in order to allow asymmetric Head-to-Head values.

There is a function for computing multiple Head-to-Head values in matrix form (a Head-to-Head matrix): get_h2h. It accepts CR data and a Head-to-Head function h2h_fun (for more details see ?get_h2h). It returns an object of class h2h: a square matrix with the number of rows (and columns) equal to the number of players for which it is computed. The Head-to-Head matrix of ncaa2005 with the Head-to-Head value being the number of wins of the second player in matchups:

get_h2h(ncaa2005, h2h_fun = h2h_num_wins)
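The notion of matchups and of a value for an ordered pair can be sketched in base R by joining long results with themselves by game. This is not the code behind get_h2h; the toy data and the convention that rows hold the first player and columns the second are assumptions.

```r
# Hypothetical long-format results: A and B meet twice, B and C once
cr <- data.frame(
  game   = c(1, 1, 2, 2, 3, 3),
  player = c("A", "B", "A", "B", "B", "C"),
  score  = c(3, 1, 0, 2, 1, 1)
)

# Every ordered pair of players that shared a game (including self-pairs)
matchups <- merge(cr, cr, by = "game", suffixes = c("1", "2"))

# Number of wins of the second player over the first one
matchups$win2 <- as.integer(matchups$score2 > matchups$score1)
wins <- tapply(matchups$win2, list(matchups$player1, matchups$player2), sum)
```

In wins the entries for "A"-"B" and "B"-"A" are counted separately, and pairs that never met (here "A" and "C") are NA.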

For the list of implemented h2h_funs see the help page for head-to-head-functions.

get_h2h has an argument players. By default it is NULL, which means that Head-to-Head values are computed for all players present in CR. If it is not NULL, Head-to-Head values are computed only for pairs of players from players. Be careful with Head-to-Head values of players with themselves: they can be misleading when players is not NULL, because they are based on all of a player's games in CR, not only on games against the listed players. An example with the Head-to-Head value being the number of games played:

get_h2h(ncaa2005, h2h_fun = h2h_num, players = c("Duke", "Miami"))

The output can be wrongly interpreted as a Head-to-Head matrix based on CR from which only games between "Duke" and "Miami" are taken. The correct interpretation is a Head-to-Head matrix based on matchups from the whole ncaa2005 between players from the given set. So the number of games played by "Duke" in the supplied CR is 4. This can be rephrased as the number of matchups of "Duke" with itself, so the output is conceptually correct.
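The difference between the two interpretations can be checked on a small made-up data set in base R. The sketch does not call get_h2h; the data, the player set keep and all object names are assumptions.

```r
# Hypothetical results: A and B meet twice, B also plays C once
cr <- data.frame(
  game   = c(1, 1, 2, 2, 3, 3),
  player = c("A", "B", "A", "B", "B", "C"),
  score  = c(3, 1, 0, 2, 1, 1)
)
keep <- c("A", "B")

# Interpretation from the text: matchups from the whole CR,
# restricted afterwards to pairs of players from 'keep'
matchups <- merge(cr, cr, by = "game", suffixes = c("1", "2"))
in_set <- matchups$player1 %in% keep & matchups$player2 %in% keep
whole_cr <- table(matchups$player1[in_set], matchups$player2[in_set])

# Wrong interpretation: first drop games with players outside 'keep'
games_ok <- tapply(cr$player %in% keep, cr$game, all)
cr_small <- cr[cr$game %in% names(games_ok)[games_ok], ]
matchups_small <- merge(cr_small, cr_small, by = "game", suffixes = c("1", "2"))
only_keep <- table(matchups_small$player1, matchups_small$player2)
```

The diagonal entry for "B" is 3 in whole_cr (all of its games) but 2 in only_keep, while the "A"-"B" entry is 2 in both.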

Argument players can also contain values that are not present in CR data. The resulting rows and columns will be filled with NAs. For dealing with absent data in the Head-to-Head matrix, get_h2h has two more arguments, absent_players and absent_h2h (for detailed information see ?get_h2h).

Examples of usage with extra players:

get_h2h(
  ncaa2005,
  h2h_fun = h2h_num,
  players = c("Duke", "Miami", "Extra"),
  absent_players = skip_action
)

# Use extra argument 'fill' to supply value for 'fill_h2h'
get_h2h(
  ncaa2005,
  h2h_fun = h2h_num,
  players = c("Duke", "Miami", "Extra"),
  absent_players = skip_action,
  absent_h2h = fill_h2h, fill = 0
)

Compute item summary

Given CR, it can be interesting to compute summaries of it. Of course this can be done pretty easily with a combination of dplyr verbs and grouping, but comperes provides a function for that: get_item_summary(cr_data, item, summary_fun = NULL, ...). CR should be ready for to_longcr, and all further actions are done on the longcr version of CR.

Argument item defines the columns by which data is grouped for computing the item summary. Argument summary_fun defines the function that performs the summary computation. Basically, get_item_summary applies summary_fun to the groups of the longcr version of CR data defined by item.

summary_fun can be NULL, in which case the result is a tibble with the columns named in item, containing all unique values of that item (set of columns) present in CR.

Examples:

get_item_summary(ncaa2005, item = "player",
                 summary_fun = summary_min_max_score)
get_item_summary(ncaa2005, item = "game",
                 summary_fun = NULL)

For the list of implemented summary_funs see the help page for item-summary-functions.

There are also wrappers around get_item_summary for the most common items:

# The same as previous code
get_player_summary(ncaa2005, summary_fun = summary_min_max_score)
get_game_summary(ncaa2005, summary_fun = NULL)

Of course item can define multiple columns:

ncaa2005 %>%
  mutate(season = rep(1:2, each = 10)) %>%
  get_item_summary(item = c("season", "player"),
                   summary_fun = summary_min_max_score)
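As noted at the start of this section, similar summaries can be computed by hand with dplyr verbs and grouping. The sketch below shows that route on made-up data; the column names min_score and max_score are assumptions and need not match what summary_min_max_score returns.

```r
library(dplyr)

# Hypothetical long-format results
cr <- data.frame(
  game   = c(1, 1, 2, 2, 3, 3),
  player = c("A", "B", "A", "B", "B", "C"),
  score  = c(3, 1, 0, 2, 1, 1)
)

# Grouping by the item ("player") and summarising each group
player_summary <- cr %>%
  group_by(player) %>%
  summarise(min_score = min(score), max_score = max(score))

# Analogue of summary_fun = NULL: just the unique values of the item
games <- cr %>% distinct(game)
```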

Add item summary

In order to modify scores in CR so that they fully characterize a player's performance in a particular game, one might need to use item summaries. Instead of manually computing them with get_item_summary and applying left_join to the result, one can use add_item_summary provided by comperes. For example, suppose the goal of players in ncaa2005 was not to score more points than the opponent but to score as close to the opponent as possible. In this case score doesn't fully describe the player's performance. Instead, the distance from the game's mean score can describe it:

ncaa2005 %>%
  add_item_summary(item = "game",
                   summary_fun = summary_mean_sd_score) %>%
  mutate(score = abs(score - meanScore)) %>%
  print(n = 6)

Using add_item_summary may be redundant in this example, but for complex CR with a variable number of players it can be quite useful.

There are also wrappers for the most common items: add_game_summary and add_player_summary:

# The same as previous example
ncaa2005 %>%
  add_game_summary(summary_fun = summary_mean_sd_score) %>%
  mutate(score = abs(score - meanScore)) %>%
  print(n = 6)
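For comparison, the manual route mentioned at the start of this section (compute the summary, then left_join it back) could look roughly like this in plain dplyr. The data and the column name mean_score are assumptions made for this sketch.

```r
library(dplyr)

# Hypothetical long-format results
cr <- data.frame(
  game   = c(1, 1, 2, 2, 3, 3),
  player = c("A", "B", "A", "B", "B", "C"),
  score  = c(3, 1, 0, 2, 1, 1)
)

# Step 1: summary per game
game_means <- cr %>%
  group_by(game) %>%
  summarise(mean_score = mean(score))

# Step 2: attach the summary to every row and recompute the score
adjusted <- cr %>%
  left_join(game_means, by = "game") %>%
  mutate(score = abs(score - mean_score))
```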


echasnovski/comperes documentation built on June 21, 2017, 1:17 a.m.