knitr::opts_chunk$set(warning = F, comment = "#>", collapse = TRUE, tidy = F) options(digits = 3)
The primary purpose of this package is the calculation of similarity scores for a focus player. These scores are intended to identify players with similar statistics to the focus player over a specified timeframe. This vignette will walk through an example of calculating similarity scores for Domonic Brown from ages 24 to 26.
After adjusting for park, year, and level effects (see the "How the Updates Work" vignette), Domonic Brown's statline from 2012-2014 (his age 24-26 seasons) looks like
setwd("N:/Apps/simScoresApp") load("b-comparison-app/bat_simscore_data.RData") library(simScores) age_grp %>% filter(name == "Brown; Domonic L.",age >= 24, age <= 26) %>% group_by(name) %>% summarise_each(funs(sum), g:k) %>% mutate(avg = (x1b + x2b + x3b + hr) / ab, obp = (x1b + x2b + x3b + hr + bb + hbp) / pa, slg = (x1b + 2*x2b + 3*x3b + 4*hr) / ab)
The goal of these similarity scores is to find players with similar statlines to Brown. So we aggregate batting statlines for other players from their age 24-26 seasons.
weights <- c(position = .05, pa = .9, bb.perc = .66, k.perc = .84, iso = .84, babip = .23, obp = .92) dat <- age_grp %>% filter(age >= 24, age <= 26) %>% group_by(name) %>% summarise_each(funs(sum), g:k) %>% mutate(avg = (x1b + x2b + x3b + hr) / ab, obp = (x1b + x2b + x3b + hr + bb + hbp) / pa, slg = (x1b + 2*x2b + 3*x3b + 4*hr) / ab) %>% left_join(age_comps("Brown; Domonic L.", age_grp, weights, 3, "bat")$table) %>% filter(pa > 500) dat %>% select(-mlbid, -years, -score)
It is clear from just glancing at these statlines that Dustin Ackley is more similar to Brown than Mitchell Abeita. However, there are over a thousand ballplayers who have played between their age 24 and 26 seasons. Visually comparing each statline to Brown's would take too long and would be imprecise. So we calculate similarity scores which compare each player's statline to Brown. Players with higher scores have statistics that are more similar to Brown.
dat %>% select(name, score)
This package implements two ways of calculating similarity scores.
1) The first aggregates all data over the specified timeframe then finds similarity scores based on the aggregated data. This is the process used by the age_comps
function.
2) The second aggregates data over the timeframe at each level, calculates similarity scores at each level then averages the scores across the different levels. This is implemented by the level_comps
function.
age_comps()
The data used by this method are obtained using other functions in this package (see the "How the Updates Work" vignette). These data are located in the age_grp
data.frame in the "bat_simscore_data.RData" file in the "b-comparison-app" folder.
setwd("N:/Apps/simScoresApp") load("b-comparison-app/bat_simscore_data.RData")
We'll also need to load the simScores
package that has been put together by the Phillies to calculate these scores. If this has not been installed on your computer, contact Scott Freedman for support. This loads the dplyr
package as well. The dplyr
package provides most of the data manipulation functions that are used to calculate the similarity scores. The chaining operator %>%
also comes from the dplyr
package. These functions are explained in the "Introduction to dplyr" vignette. This can be viewed by running vignette("introduction", package = "dplyr")
in R.
library(simScores)
Let's begin by taking a look at the data.
age_grp
There is a bunch of biographical information (the mlbid:year columns), followed by the player's age and the year that the player had these statistics. Next is the player's primary position, his professional experience, and the number of days he spent on the DL in that year. These are followed by the player's batting statistics for that year.
These statistics have been summarized by year so that, even though Bobby Abreu played 98 games for the Phillies and 58 for the Yankees in 2006, there is only one row for Abreu in 2006 which shows he played $98+58=156$ games.
age_grp %>% select(mlbid, name, year, g, pa, hr)
Additionally, note that home runs are a decimal. This is because these statistics have been adjusted for park and league effects.
For this example, we only want to compare player to Domonic Brown based on his age 24 to 26 seasons. So the first step of the age_comps
function is to remove all entries which are not associated with the specified time frame.
filtered_dat <- age_grp %>% filter(age >= 24, age <= 26) filtered_dat %>% select(mlbid, name, year, g, pa, hr)
The original data had r nrow(age_grp)
rows, but the filtered_dat
object only has r nrow(filtered_dat)
rows because we have removed everything where the age wasn't between 24 and 26.
To compare players to Domonic Brown based on their entire performance from 24 to 26, we need to summarize their statistics over that timeframe. In R, this is accomplished by grouping the biographical info (because it is constant for each player), then summarizing the statistics. To group, we use the group_by
function.
grouped_dat <- filtered_dat %>% group_by(mlbid, name, height, weight, bats, throws, roo)
We summarize each players statistics using the summarise
function, then remove the groupings.
summed_dat <- grouped_dat %>% summarise( years = paste(min(year), max(year), sep = ":"), # years they played position = position[which.max(pa)], # the position they played the most experience = last(experience), # most experience dl.days = sum(dl.days), # the sum of all the days on the DL g = sum(g), pa = sum(pa), ab = sum(ab), x1b = sum(x1b), x2b = sum(x2b), x3b = sum(x3b), hr = sum(hr), h = sum(x1b + x2b + x3b + hr), bb = sum(bb + hbp), k = sum(k) ) %>% ungroup() summed_dat %>% select(mlbid, name, years, g, pa, hr)
Felipe Lopez was 25 in 2005 and 26 in 2006 so he had two entries in filtered_dat
. In summed_data
he only has one entry which shows his combined statistics for 2005 and 2006.
We then calculate statistics like OBP, ISO, etc. for each player.
mutated_dat <- summed_dat %>% mutate( pa.g = pa / g, obp = (h + bb) / pa, bb.perc = bb / pa * 100, k.perc = k / pa * 100, iso = (x2b + 2*x3b + 3*hr) / (pa - bb), babip = (x1b + x2b + x3b) / (pa - bb - k - hr) ) mutated_dat %>% select(mlbid, name, years, g, pa, hr, obp, iso)
Now that the data are summarized so there is one row for each player, the statistics for each player can be compared to the focus player. This is done by passing the data, the name of the focus player, and the weights for each statistic to the calcScores2
function. weights
is a named vector which specifies how much each statistics matters when comparing players.
weights <- c(position = .05, pa = .9, bb.perc = .66, k.perc = .84, iso = .84, babip = .23, obp = .92) scores <- calcScores2(mutated_dat, "Brown; Domonic L.", weights)
This function calculates similarity scores by finding the dissimilarity between each player's statistics and the focus player's statistics. This gets us a measure of the difference between each player and the focus player. A similarity score is then defined as $similarity = 1000 * (1-dissimilarity)$.
Dissimilarities are calculated using Gower's distance metric. Gower's distance metric was chosen because it can handle categorical and missing data. The formula is not presented here, but it can be found in the help page for the daisy
function in the cluster
package or in Gower's paper: Gower, J. C. (1971) A general coefficient of similarity and some of its properties, Biometrics 27, 857–874.
calcScores2
FunctionThis function serves as a wrapper for the findDiffs2
function. calcScores2
processes the data. Processed data are passed to findDiffs2
to calculate the dissimilarities.
stats_fp <- mutated_dat %>% filter(name == "Brown; Domonic L.") stats_fp %>% select(mlbid, name, years, g, pa, hr, obp, iso)
The weights
vector contains the names of all the statistics that are used for the comparison. So we remove all the columns which are not named in weights
.
stats_fp <- stats_fp[, names(weights)] stats_fp stats_comps <- mutated_dat[, names(weights)] stats_comps
This information is then passed to the findDiffs2
function to obtain the dissimilarity between each player and the focus player.
diffs <- findDiffs2(stats_fp, stats_comps, weights)
findDiffs2
FunctionHere, a broad overview of the methodology is provided rather than the specifics. For details of how the function is implemented, see the help page for findDiffs2
.
A data.frame is created with identical rows. Each row contains the statistics for the focus player. The dimensions for this data.frame are equal to the dimensions of the stats_comps
data.frame.
head(stats_fp) dat_fp <- data.frame( matrix(NA, nrow = nrow(stats_comps), ncol = ncol(stats_comps)) ) %>% setNames(names(stats_comps)) for(i in 1:nrow(stats_comps)) dat_fp[i, ] <- stats_fp head(dat_fp) dim(dat_fp) == dim(stats_comps)
It then uses Gower's formula to calculate the dissimilarity between the $i^{th}$ row of dat_fp
and the $i^{th}$ row of stats_comps
. This is the dissimilarity for the $i^{th}$ player. It then returns this vector of dissimilarities.
diffs <- findDiffs2(stats_fp, stats_comps, weights) head(diffs)
The calcScores2
function takes this vector, and turns it into similarity scores. Similarity scores are on a 0-1000 scale where 1000 implies identical statistics (so the focus player's similarity score with himself is 1000).
$$similarity = 1000 * (1 - dissimilarity)$$
scores <- calcScores2(mutated_dat, "Brown; Domonic L.", weights) scores %>% arrange(desc(score))
This is the information returned to the age_comps
function. This information is then merged with mutated_dat
to get the years when each player was between 24 and 26.
scores %>% inner_join(mutated_dat %>% select(mlbid, name, years)) %>% arrange(desc(score)) %>% select(mlbid, name, years, score)
age_comps
FunctionThe age_comps
function performs this entire process. It requires three arguments and has two optional arguments:
1) the name of the focus player (required)
2) the data.frame of statistics (required)
3) the weights to be used (required)
4) the time frame for comparison (defaults to most recent three years)
5) whether the focus player is a batter or pitcher (defaults to batter)
It returns the scores comparing each player in the age_grp
data to the focus player over the timeframe. It also returns the timeframe for later use.
age_comps("Brown; Domonic L.", age_grp, weights, 3, "bat")
The process to calculate similarity scores using the age_comps
function can be broken into the following steps
1) filter the data over the specified timeframe 2) summarize the statistics for each player 3) add additional statistics for each player 4) calculate dissimilarities comparing the statistics for each player to the statistics for the focus player 5) calculate similarity scores from the dissimilarities
level_comps
This function follows the same general steps as the age_comps
function. However, in age_comps
, similarity scores were calculated based on the focus player's entire performance over the specified timeframe. In the level_comps
function, similarity scores are calculated at level the focus player played at during the specified timeframe. Then each player's scores are averaged together.
For this function, the level_grp
data are used. These data are in a similar format to the age_grp
data. The difference is the level
column. Each row in this data.frame contains the statistics for a player at a single level in a year.
In 2012, Domonic Brown played at the AAA and MLB levels. In the age_grp
data, his statistics across these levels were summed together.
age_grp %>% filter(name == "Brown; Domonic L.", year == 2012) %>% select(mlbid, name, year, g, pa, hr)
In the level_grp
data however, these are recorded as two separate entries.
level_grp %>% filter(name == "Brown; Domonic L.", year == 2012) %>% select(mlbid, name, year, level, g, pa, hr)
As before, the data are filtered by age. We also remove all levels at which the focus player did not play. For example, between 2012 and 2014, Domonic Brown only played at AAA and MLB so rows corresponding to all other levels are removed.
filtered_dat2 <- level_grp %>% filter(age >= 24, age <= 26, level %in% c("AAA", "MLB")) filtered_dat2 %>% select(mlbid, name, year, level, g, pa, hr)
This time, when grouping, we want to make sure that we group by level as well.
grouped_dat2 <- filtered_dat2 %>% group_by(mlbid, name, height, weight, bats, throws, roo, level)
Summarizing counting statistics and creating new statistics at each level occurs in the same manner as before.
mutated_levels <- grouped_dat2 %>% summarise( years = paste(min(year), max(year), sep = ":"), position = position[which.max(pa)], # the position they played the most experience = last(experience), # most experience dl.days = sum(dl.days), # the sum of all the days on the DL g = sum(g), pa = sum(pa), ab = sum(ab), x1b = sum(x1b), x2b = sum(x2b), x3b = sum(x3b), hr = sum(hr), h = sum(x1b + x2b + x3b + hr), bb = sum(bb + hbp), k = sum(k) ) %>% ungroup() %>% mutate( pa.g = pa / g, obp = (h + bb) / pa, bb.perc = bb / pa * 100, k.perc = k / pa * 100, iso = (x2b + 2*x3b + 3*hr) / (pa - bb), babip = (x1b + x2b + x3b) / (pa - bb - k - hr) ) mutated_levels %>% select(mlbid, name, level, years, g, pa, hr, obp, iso)
We are going to calculate similarity scores at each level for each player in the mutated_levels
data. This means that because Michael Cuddyer played at both AAA and MLB between ages 24 and 26, we will have two similarity scores for Michael Cuddyer: one comparing his AAA stats to Domonic Brown's AAA stats, and one comparing their MLB stats. To get one overall similarity score comparing Brown and Cuddyer across levels, we will take a weighted average of these two scores.
The weights for averaging score by level will be the amount of playing time for the focus player at each level. Since Domonic Brown spent much more time in the majors from age 24 to 26, similarity will be much more important in the majors.
level_wts <- mutated_levels %>% filter(name == "Brown; Domonic L.") %>% select(level, pa) %>% setNames(c("level", "wt")) level_wts
So similarity in the majors will be about 5 times as important as similarity in the minors.
The calcScores2
function can be used to calculate scores once again. This time we have to make sure to calculate scores at each level. This can be done by using the group_by
and do
functions in R.
level_scores <- mutated_levels %>% group_by(level) %>% do(calcScores2(., "Brown; Domonic L.", weights)) %>% ungroup() level_scores %>% arrange(mlbid)
Some players, such as Adrian Beltre, did not play at both levels. Beltre only played at MLB so, to make the comparison simple, his AAA score is set as the smallest observed AAA score. This says that Adrian Beltre's AAA stats are as different as possible from Domonic Brown's AAA stats.
605 was the lowest similarity score we obtained at the AAA level so Beltre's AAA score is set to 605.
level_scores <- level_scores %>% tidyr::spread(level, score) %>% mutate_each(funs(x = ifelse(is.na(.), min(., na.rm = T), .)), -name) %>% tidyr::gather(level, score, -mlbid, -name) level_scores %>% arrange(mlbid)
The scores are joined with the level_wts
, then the scores are averaged by player.
avg_scores <- level_scores %>% inner_join(level_wts) %>% group_by(mlbid, name) %>% summarise(score = weighted.mean(score, wt)) %>% ungroup() %>% arrange(desc(score)) avg_scores
Finally, the years each player was 24-26 are added for completeness.
filtered_dat2 %>% filter(mlbid %in% avg_scores$mlbid) %>% #removing players who don't have a score group_by(mlbid) %>% summarise(years = paste(min(year), max(year), sep = ":")) %>% inner_join(avg_scores) %>% select(name, years, score) %>% arrange(desc(score))
level_comps
function.This entire processed is implemented in the level_comps
function. The function uses the same arguments as age_comps
. It only returns the similarity scores.
level_comps("Brown; Domonic L.", level_grp, weights)
The steps for this function are very similar to the steps for the age_comps()
function. The differences are noted in bold
1) filter the data over the specified timeframe and over the focus player's levels 2) summarize the statistics for each player 3) add additional statistics for each player 3.5) calculate focus player's playing time at each level 4) calculate dissimilarities comparing the statistics for each player to the statistics for the focus player at each level 5) calculate similarity scores from the dissimilarities 6) average the similarity scores by level
Similarity scores for Chase Utley based on the last 5 years.
level_comps("Utley; Chase C.", level_grp, weights, 5)
Similarity scores for Chase Utley from age 27 to 29.
level_comps("Utley; Chase C.", level_grp, weights, c(27, 29))
To calculate similarity scores, the pit_simscores_data.RData
file in the p-comparison-app folder must be loaded. The weights for pitchers must also be specified.
setwd("H:/simScoresApp") # or wherever the apps folder is located load("p-comparison-app/pit_simscore_data.RData") p_weights <- c(throws = .01, ip = .62, k.9 = .83, bb.9 = .83, hr.9 = .91, babip = .55, fip = .35)
Similarity scores for David Buchanan based on the last three years. It must be specified that we are now comparing pitchers.
level_comps("Buchanan; David A.", level_grp, p_weights, type = "pit")
Using age_comps
for Buchanan
age_comps("Buchanan; David A.", level_grp, p_weights, type = "pit")
Similarity scores for Cole Hamels based on the last two years.
level_comps("Hamels; Cole M.", level_grp, p_weights, years = 2, type = "pit")
Similarity scores for Cole Hamels from age 24 to 26
level_comps("Hamels; Cole M.", level_grp, p_weights, years = c(24, 26), type = "pit")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.