library(knitr)

Data Acquisition

We have developed an R package to compute our implementation of WAR. The first step in that process is to get meaningful play-by-play data. We have written parsers that will download and extract data from the Major League Baseball Advanced Media (MLBAM) GameDay server. This data is updated live, and available to the public. Thus, although this data is not "free as in freedom", it is "free as in beer."

Using our openWAR package, a single game's worth of play-by-play data can be retrieved from the GameDay servers and processed into a data frame. The R class gameday contains an object that includes the relevant URLs and XML files, as well as the processed data. The XML files and processed data are cached locally for faster retrieval. Note that since the MLBAM data is not transferrable, all data must be downloaded by the user at least once.

require(openWAR)
gd = gameday()
# Equivalently
data(MetsBraves)

Single Game data

The default game was played on August 12th, 2012 between the New York Mets and the Atlanta Braves.

gd$gameId

The directory on the GameDay server that contains that actual XML files is located here.

gd$base

In this game, the Braves beat the Mets, 6-5.

summary(gd)

Our primary interest will be in analyzing the play-by-play data that we have processed for this game. This data contains a complete record of what happened in the game. For example, this game started with Michael Bourn leading off with a double. After a walk to Martin Prado and a strikeout of Jason Heyward, Chipper Jones grounded into an inning-ending 5-4-3 double play.

head(gd$ds)

Many games

More often, we'll be interested in investigated data from many games. The function getData() will load (or download) data over any time interval in which you are interested. Let's figure out how many home runs were hit on May 14th, 2013.

ds = getData(start = "2013-05-14")
subset(ds, event == "Home Run", select = c("gameId", "batterId", "description"))

Visualizing the data

The best part about the MLBAM data is that it contains an $(x,y)$-coordinate indicated the location of each batted ball hit into play. We can visualize this.

plot(data = ds)

Modeling

In order to compute openWAR, we need to model several quantities. The first thing we need to understand is the relative value of each "state" of a half-inning. Since there are three bases, each of which can be either occupied or unoccupied, and there are three possible numbers of outs, each plate appearance begins with the half-inning in one of 25 possible states (the 24 states, plus one last state for three outs). We would like to assign a value to each one of these states that indicates the expected number of runs that will be scored in the remainder of that half-inning. We have precomputed the states and the number of futureRuns associated with each play.

Thus, we want to fit the model $$ futureRuns \sim baseCode + outs + baseCode \cdot outs, $$ where $baseCode$ is a description of the configuration of the baserunners, and $outs$ is the number of outs in the half-inning.

For example, consider the bottom of the 1st inning of our game:

subset(gd$ds, inning == 1 & half == "bottom", select=c("runsFuture", "runsOnPlay", "startCode", "startOuts", "description"))

The Mets scored two runs in the inning, and thus, when Ruben Tejada opened the inning, there were no runners on base, no outs, but two $futureRuns$ were associated with this play. After Tejada flew out, there was one out, but still no one on base and two $futureRuns$. After Mike Baxter singles, David Wright came to the plate with a runner on first (bc_before = 1), one out, and two $futureRuns$. His double scored one run, so Ike Davis followed with a runner on third, one out, and now only one $futureRuns$. By the time Daniel Murphy bats, there are no further $futureRuns$ in the inning.

Every inning begins with no one on and no one out. In this example, two runs scored in the inning. By averaging over all innings, we create an estimate of the expected $futureRuns$ for the state $(0,0)$. But we can just as easily do the same for all states.

Building a model for expected runs

The simplest way to build a model for $futureRuns$ is to take the average over all observations. To do this, we'll need more data.

# Will take a loooong time -- the first time
# ds = getDataWeekly("2013-04-01")
# ds = getDataWeekly("2013-04-08")
# ds = getDataWeekly("2013-04-15")
# ds = getDataWeekly("2013-04-22")
# ds = getData("2013-03-31")
# 2013 first half
# ds = getData("2013-03-31", end="2013-07-14")

# ds = getDataMonthly(2013, 6)
# MLBAM2013 = ds
# save(MLBAM2013, file="data/MLBAM2013.rda")
data(MLBAM2013)
ds = MLBAM2013

For example, consider the half inning we visited previously.

subset(gd$ds, inning == 1 & half == "bottom", select=c("runsFuture", "runsOnPlay", "startCode", "startOuts", "description"))

The inning began in the state $(0,0)$. Our estimate $\hat{\rho}(0,0)$ of the expected value (in runs) of that state is:

fit.rem = getRunEx(ds)
fit.rem(baseCode = 0, outs = 0)
# Note this is equivalent to 
# rem[1,1]

On the first play of the inning, Ruben Tejada flied out. This moved the inning into the state $(0,1)$, since there were still no runners on base, but now there was one out. The value of this state is

fit.rem(0,1)

The difference between these two states is $\hat{\delta}_i$:

fit.rem(0,1) - fit.rem(0,0)

In modeling this play, our goal is to apportion the value of $\hat{\delta}_i$ to each of the offensive players. In this case, Tejada was the only offensive player involved, so he gets the full amount. Moreover, $-\hat{\delta}_i$ must also be attributed to the defense. In this case, some of that credit will go to the pitcher, and some will go to the centerfielder. The details of this apportionment scheme will be revealed later.

The second batter, Mike Baxter, singled. This moved the inning from $(0,1)$ to $(1, 1)$. Accordingly, Baxter would receive:

fit.rem(1,1) - fit.rem(0,1)

So far, so good. The next play is particularly complicated. David Wright doubles homes Baxter, and then advances to third on a throwing error by the rightfielder. Let's assume for a moment that the error didn't happen, and that Wright end the play on second base. In this case, the ending state is $(2,1)$, but in addition, one run scored. Thus, the change in expected runs is:

fit.rem(2,1) - fit.rem(1,1) + 1

Clearly, much of the credit here should go to Wright, for hitting the double. But what about Baxter, who scored from first on a double? Our plan is to assume "ghostrunner" rules, wherein the number of bases advanced by each baserunner is determined by the type of hit. Since Wright hit a double, Baxter should have advanced two bases, leaving the inning in the state $(6,1)$. The additional base that he advanced (from third to home) should then be given to Baxter. Thus, as a batter, Wright accrues:

fit.rem(6,1) - fit.rem(1,1)

While Baxter accrues the remainder:

fit.rem(2,1) - fit.rem(6,1) + 1

But now let's revisit what actually happened. Heyward's error allowed Wright to move to third. Thus, the state before the error occurred was $(2,1)$ and it led to $(4,1)$. The difference

fit.rem(4,1) - fit.rem(2,1)

goes to Heyward as a rightfielder, and Wright as a baserunner.

Reference for the openWAR model

For a complete description of the openWAR model, please see our paper on the subject:



frogman141/openWAR documentation built on Dec. 20, 2021, 8:52 a.m.