Here are a few useful sanity checks for the data compiled by the openWAR
package.
library(openWAR) library(openWARData) library(mosaic) data(MLBAM2013)
First, some basic accounting to make sure that our data is accurate. Compare these with Baseball-reference.com
sort(tally(~event, data = MLBAM2013), decreasing = TRUE)
MLBAM2013 %>% mutate(bat_team = ifelse(half == "top", as.character(away_team), as.character(home_team))) %>% group_by(bat_team) %>% summarise(G = length(unique(gameId)), PA = sum(isPA), R = sum(runsOnPlay) , H = sum(isHit), HR = sum(event == "Home Run") , K = sum(event %in% c("Strikeout", "Strikeout - DP")) , OBP = sum(isHit | event %in% c("Walk", "Intent Walk", "Hit By Pitch")) / sum(isPA & !event %in% c("Sac Bunt", "Sacrifice Bunt DP"))) %>% arrange(desc(OBP)) %>% print.data.frame()
These numbers are very close.
In this case we have r nrow(MLBAM2013)
observations. We can use this model to recover a Run Expectancy Matrix, similar to this one computed by Baseball Prospectus from 2012.
Note that our notation for the baseCode is different from theirs.
fit.rem <- getRunEx(MLBAM2013) rem <- outer(0:7, 0:2, fit.rem) rem
The coefficients are the same, but the regression framework gives us a principled way to address the error in these estimates. We can also visualize our model.
xyplot(jitter(runsFuture) ~ jitter(startCode), groups = startOuts, data = MLBAM2013, alpha = 0.3, pch = 19, auto.key = list(columns = 3), xlab = "Base Code before Plate Appearance", ylab = "Runs Following Plate Appearance") ladd(panel.lines(row.names(rem), rem[, 1], col = 1, lwd = 3)) ladd(panel.lines(row.names(rem), rem[, 2], col = 2, lwd = 3)) ladd(panel.lines(row.names(rem), rem[, 3], col = 3, lwd = 3))
We note that in general, each higher baseCode
is associated with a larger number of expected $futureRuns$, holding the number of outs
constant. We can see this more clearly by looking simply at the model estimates.
xyplot(runs ~ startCode, groups = startOuts, data = states, type = c("p", "l"), auto.key = list(columns = 3), xlab = "Base Code before Plate Appearance", ylab = "Expected Runs Following Plate Appearance")
It is also instructive to examine the distribution of runs scored on (not futureRuns
, but rather runsOnPlay
) a particular plate appearance.
xyplot(jitter(runsOnPlay) ~ jitter(startCode), groups = startOuts, data = MLBAM2013, auto.key = list(columns = 3), xlab = "Base Code before Plate Appearance", ylab = "Runs Generated by Plate Appearance")
We employ the notion of conservation of runs. That is, every run gained by the offense is a run lost by the defense. Thus, our first task is to determine, for each play, how many expected runs were gained by the offensive team. The defensive team must have lost that same number of expected runs. Let $r_i$ be actual number of runs scored on the $i^{th}$ play, and $\rho(b,o)$ be the true expected value of the state $(b,o)$. Then we define $$ \delta_i = \rho(b_{i+1},o_{i+1}) - \rho(b_i,o_i) + r_{i} $$ to be the true number of expected runs gained on the $i^{th}$ play. We stress that this quantity is unknown, since we can only estimate the true expected value associated with each $(base, out)$ state.
One of the nice features of this type of model is that in $n$ completed innings, the number of expected runs generated should be exactly equal to the number of runs scored.
$$ \sum_{j \in innings} \sum_{i \in plays} \delta_{i}^j = \sum_{j \in innings} \sum_{i \in plays} \rho^j(b_{i+1}, o_{i+1}) - \rho^j(b_{i}, o_{i}) + r_{i}^j $$ $$ = \sum_{j \in innings} \rho (0, 3) - \rho (0,0) + \sum_{i \in plays} r_i^j = - n \cdot \rho(0,0) + \sum_{j \in innings} \sum_{i \in plays} r_i^j $$ This leads us to the estimate that $$ \hat{\rho}(0,0) = \frac{1}{n} \sum_{j \in innings} \sum_{i \in plays} r_i^j $$
If runs are conserved, then the sum of all of the $\delta_i$'s should be zero.
ds = makeWAR(ds)
ds.complete = subset(ds, outsInInning == 3) sum(ds.complete$delta)
Moreover, the total number of runs scored should match the product of the number of completed innings times $\hat{\rho}(0,0)$.
sum(ds.complete$runsOnPlay)
## [1] 11783
N * fit.rem(0, 0)
## 1 ## 11767
Note that the runs are not equally distributed across innings. The first inning is by far the most favorable to the offense (positive expected runs created), while the 9th inning is most favorable to the defense (closer pitching).
summarise(group_by(MLBAM2013, inning), N = length(inning), G = length(unique(gameId)), D = sum(delta, na.rm = TRUE))
## inning N G D ## 1 1 12096 1414 140.36844 ## 2 2 11955 1414 -16.63156 ## 3 3 11893 1414 9.36844 ## 4 4 12042 1414 103.36844 ## 5 5 12060 1414 111.36844 ## 6 6 12081 1414 117.83295 ## 7 7 12021 1412 8.69098 ## 8 8 11850 1410 -109.91549 ## 9 9 8934 1410 -180.96177 ## 10 10 1267 150 23.95861 ## 11 11 580 70 -7.78690 ## 12 12 316 39 -16.93594 ## 13 13 256 31 -4.45114 ## 14 14 182 19 13.68642 ## 15 15 91 10 -3.19916 ## 16 16 62 8 -3.18540 ## 17 17 35 5 -3.55407 ## 18 18 31 4 -1.37831 ## 19 19 14 2 0.23299 ## 20 20 8 1 0.07098
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.