Here are a few useful sanity checks for the data compiled by the openWAR package.

library(openWAR)
library(openWARData)
library(mosaic)
data(MLBAM2013)

Data Accuracy

First, some basic accounting to make sure that our data is accurate. Compare these with Baseball-reference.com

sort(tally(~event, data = MLBAM2013), decreasing = TRUE)
MLBAM2013 %>%
  mutate(bat_team = ifelse(half == "top", as.character(away_team), as.character(home_team))) %>%
  group_by(bat_team) %>%
  summarise(G = length(unique(gameId)), PA = sum(isPA), R = sum(runsOnPlay)
            , H = sum(isHit), HR = sum(event == "Home Run")
            , K = sum(event %in% c("Strikeout", "Strikeout - DP"))
            , OBP = sum(isHit | event %in% c("Walk", "Intent Walk", "Hit By Pitch")) / sum(isPA & !event %in% c("Sac Bunt", "Sacrifice Bunt DP"))) %>%
  arrange(desc(OBP)) %>%
  print.data.frame()

These numbers are very close.

Modeling the Expected Runs Matrix (REM)

In this case we have r nrow(MLBAM2013) observations. We can use this model to recover a Run Expectancy Matrix, similar to this one computed by Baseball Prospectus from 2012.

Note that our notation for the baseCode is different from theirs.

fit.rem <- getRunEx(MLBAM2013)
rem <- outer(0:7, 0:2, fit.rem)
rem

The coefficients are the same, but the regression framework gives us a principled way to address the error in these estimates. We can also visualize our model.

xyplot(jitter(runsFuture) ~ jitter(startCode), groups = startOuts, data = MLBAM2013, 
    alpha = 0.3, pch = 19, auto.key = list(columns = 3), xlab = "Base Code before Plate Appearance", 
    ylab = "Runs Following Plate Appearance")
ladd(panel.lines(row.names(rem), rem[, 1], col = 1, lwd = 3))
ladd(panel.lines(row.names(rem), rem[, 2], col = 2, lwd = 3))
ladd(panel.lines(row.names(rem), rem[, 3], col = 3, lwd = 3))

We note that in general, each higher baseCode is associated with a larger number of expected $futureRuns$, holding the number of outs constant. We can see this more clearly by looking simply at the model estimates.

xyplot(runs ~ startCode, groups = startOuts, data = states, type = c("p", "l"), 
    auto.key = list(columns = 3), xlab = "Base Code before Plate Appearance", 
    ylab = "Expected Runs Following Plate Appearance")

It is also instructive to examine the distribution of runs scored on (not futureRuns, but rather runsOnPlay) a particular plate appearance.

xyplot(jitter(runsOnPlay) ~ jitter(startCode), groups = startOuts, data = MLBAM2013, 
    auto.key = list(columns = 3), xlab = "Base Code before Plate Appearance", 
    ylab = "Runs Generated by Plate Appearance")

Conservation of Expected Runs

We employ the notion of conservation of runs. That is, every run gained by the offense is a run lost by the defense. Thus, our first task is to determine, for each play, how many expected runs were gained by the offensive team. The defensive team must have lost that same number of expected runs. Let $r_i$ be actual number of runs scored on the $i^{th}$ play, and $\rho(b,o)$ be the true expected value of the state $(b,o)$. Then we define $$ \delta_i = \rho(b_{i+1},o_{i+1}) - \rho(b_i,o_i) + r_{i} $$ to be the true number of expected runs gained on the $i^{th}$ play. We stress that this quantity is unknown, since we can only estimate the true expected value associated with each $(base, out)$ state.

One of the nice features of this type of model is that in $n$ completed innings, the number of expected runs generated should be exactly equal to the number of runs scored.

$$ \sum_{j \in innings} \sum_{i \in plays} \delta_{i}^j = \sum_{j \in innings} \sum_{i \in plays} \rho^j(b_{i+1}, o_{i+1}) - \rho^j(b_{i}, o_{i}) + r_{i}^j $$ $$ = \sum_{j \in innings} \rho (0, 3) - \rho (0,0) + \sum_{i \in plays} r_i^j = - n \cdot \rho(0,0) + \sum_{j \in innings} \sum_{i \in plays} r_i^j $$ This leads us to the estimate that $$ \hat{\rho}(0,0) = \frac{1}{n} \sum_{j \in innings} \sum_{i \in plays} r_i^j $$

If runs are conserved, then the sum of all of the $\delta_i$'s should be zero.

ds = makeWAR(ds)
ds.complete = subset(ds, outsInInning == 3)
sum(ds.complete$delta)

Moreover, the total number of runs scored should match the product of the number of completed innings times $\hat{\rho}(0,0)$.

sum(ds.complete$runsOnPlay)
## [1] 11783
N * fit.rem(0, 0)
##     1 
## 11767

Note that the runs are not equally distributed across innings. The first inning is by far the most favorable to the offense (positive expected runs created), while the 9th inning is most favorable to the defense (closer pitching).

summarise(group_by(MLBAM2013, inning), N = length(inning), G = length(unique(gameId)), 
    D = sum(delta, na.rm = TRUE))
##    inning     N    G          D
## 1       1 12096 1414  140.36844
## 2       2 11955 1414  -16.63156
## 3       3 11893 1414    9.36844
## 4       4 12042 1414  103.36844
## 5       5 12060 1414  111.36844
## 6       6 12081 1414  117.83295
## 7       7 12021 1412    8.69098
## 8       8 11850 1410 -109.91549
## 9       9  8934 1410 -180.96177
## 10     10  1267  150   23.95861
## 11     11   580   70   -7.78690
## 12     12   316   39  -16.93594
## 13     13   256   31   -4.45114
## 14     14   182   19   13.68642
## 15     15    91   10   -3.19916
## 16     16    62    8   -3.18540
## 17     17    35    5   -3.55407
## 18     18    31    4   -1.37831
## 19     19    14    2    0.23299
## 20     20     8    1    0.07098


beanumber/openWARData documentation built on May 12, 2019, 9:47 a.m.