We have developed an R package to compute our implementation of WAR. The first step in that process is to get meaningful play-by-play data. We have written parsers that will download and extract data from the Major League Baseball Advanced Media (MLBAM) GameDay server. This data is updated live, and available to the public. Thus, although this data is not "free as in freedom", it is "free as in beer."
Using our openWAR package, a single game's worth of play-by-play data can be retrieved from the GameDay servers and processed into a data frame. An object of the R class gameday contains the relevant URLs and XML files, as well as the processed data. The XML files and processed data are cached locally for faster retrieval. Note that since the MLBAM data is not transferable, all data must be downloaded by the user at least once.
require(openWAR)
## Warning: replacing previous import by 'mosaic::do' when loading 'openWAR'
## Warning: replacing previous import by 'mosaic::tally' when loading
## 'openWAR'
## Warning: replacing previous import by 'mosaic::count' when loading
## 'openWAR'
## Warning: replacing previous import by 'stringr::%>%' when loading 'openWAR'
## Warning: replacing previous import by 'dplyr::do' when loading 'openWAR'
## Warning: replacing previous import by 'dplyr::group_by' when loading
## 'openWAR'
## Warning: replacing previous import by 'dplyr::mutate' when loading
## 'openWAR'
## Warning: replacing previous import by 'dplyr::summarize' when loading
## 'openWAR'
## Warning: replacing previous import by 'stringr::str_count' when loading
## 'openWAR'
## Warning: replacing previous import by 'stringr::str_split' when loading
## 'openWAR'
gd = gameday()
# Equivalently
data(MetsBraves)
The default game was played on August 12th, 2012 between the New York Mets and the Atlanta Braves.
gd$gameId
## [1] "gid_2012_08_12_atlmlb_nynmlb_1"
The directory on the GameDay server that contains the actual XML files is located here.
gd$base
## [1] "http://gd2.mlb.com/components/game/mlb/year_2012/month_08/day_12/"
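The game identifier and the directory URL printed above share the same date components. As a minimal sketch, assuming the pattern seen in these two values holds for other games (this is not necessarily how openWAR builds the URL internally), the directory can be reconstructed from the gameId in base R. The helper name gameday_base_url is hypothetical.

```r
# Hypothetical helper: rebuild the GameDay directory from a gameId,
# assuming the "gid_YYYY_MM_DD_<away>mlb_<home>mlb_<n>" pattern shown above.
gameday_base_url <- function(game_id) {
  parts <- strsplit(game_id, "_", fixed = TRUE)[[1]]
  sprintf(
    "http://gd2.mlb.com/components/game/mlb/year_%s/month_%s/day_%s/",
    parts[2], parts[3], parts[4]
  )
}

base_url <- gameday_base_url("gid_2012_08_12_atlmlb_nynmlb_1")
print(base_url)
```

For the default game this reproduces the value of gd$base shown above.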
In this game, the Braves beat the Mets, 6-5.
summary(gd)
## Length Class Mode
## gameId 1 -none- character
## base 1 -none- character
## url 5 -none- character
## ds 62 data.frame list
Our primary interest will be in analyzing the play-by-play data that we have processed for this game. This data contains a complete record of what happened in the game. For example, this game started with Michael Bourn leading off with a double. After a walk to Martin Prado and a strikeout of Jason Heyward, Chipper Jones grounded into an inning-ending 5-4-3 double play.
head(gd$ds)
## pitcherId batterId field_teamId ab_num inning half balls strikes
## 1 477003 456422 121 1 1 top 0 1
## 2 477003 445988 121 2 1 top 4 0
## 3 477003 518792 121 3 1 top 1 3
## 4 477003 116706 121 4 1 top 1 2
## 5 282656 514913 144 5 1 bottom 2 2
## 6 282656 488689 144 6 1 bottom 1 1
## endOuts event actionId
## 1 0 Double NA
## 2 0 Walk NA
## 3 1 Strikeout NA
## 4 3 Grounded Into DP NA
## 5 1 Flyout NA
## 6 1 Single NA
## description
## 1 Michael Bourn doubles (21) on a line drive to left fielder Jordany Valdespin.
## 2 Martin Prado walks.
## 3 Jason Heyward strikes out swinging.
## 4 Chipper Jones grounds into a double play, third baseman David Wright to second baseman Daniel Murphy to first baseman Ike Davis. Martin Prado out at 2nd.
## 5 Ruben Tejada flies out to center fielder Michael Bourn.
## 6 Mike Baxter singles on a line drive to right fielder Jason Heyward.
## stand throws
## 1 L L
## 2 R L
## 3 L L
## 4 R L
## 5 R R
## 6 L R
## runnerMovement
## 1 [456422::2B::Double]
## 2 [445988::1B::Walk]
## 3
## 4 [456422:2B:3B::Stolen Base 3B][445988:1B:::Grounded Into DP][456422:3B:::Grounded Into DP]
## 5
## 6 [488689::1B::Single]
## x y game_type home_team home_teamId home_lg away_team
## 1 61.24 120.48 R nyn 121 NL atl
## 2 NA NA R nyn 121 NL atl
## 3 NA NA R nyn 121 NL atl
## 4 NA NA R nyn 121 NL atl
## 5 111.45 80.32 R nyn 121 NL atl
## 6 158.63 106.43 R nyn 121 NL atl
## away_teamId away_lg venueId stadium timestamp playerId.C
## 1 144 NL 3289 Citi Field 2012-08-13 00:06:39 453531
## 2 144 NL 3289 Citi Field 2012-08-13 00:07:43 453531
## 3 144 NL 3289 Citi Field 2012-08-13 00:09:29 453531
## 4 144 NL 3289 Citi Field 2012-08-13 00:11:16 453531
## 5 144 NL 3289 Citi Field 2012-08-13 00:15:11 435263
## 6 144 NL 3289 Citi Field 2012-08-13 00:17:48 435263
## playerId.1B playerId.2B playerId.3B playerId.SS playerId.LF playerId.CF
## 1 477195 502517 431151 514913 518170 400083
## 2 477195 502517 431151 514913 518170 400083
## 3 477195 502517 431151 514913 518170 400083
## 4 477195 502517 431151 514913 518170 400083
## 5 518692 462564 116706 457926 445988 456422
## 6 518692 462564 116706 457926 445988 456422
## playerId.RF batterPos batterName pitcherName runsOnPlay startOuts
## 1 488689 CF Bourn Niese 0 0
## 2 488689 LF Prado Niese 0 0
## 3 488689 RF Heyward Niese 0 0
## 4 488689 3B Jones, C Niese 0 1
## 5 518792 SS Tejada, R Sheets 0 0
## 6 518792 RF Baxter Sheets 0 1
## runsInInning runsITD runsFuture start1B start2B start3B end1B end2B
## 1 0 0 0 <NA> <NA> <NA> <NA> 456422
## 2 0 0 0 <NA> 456422 <NA> 445988 456422
## 3 0 0 0 445988 456422 <NA> 445988 456422
## 4 0 0 0 445988 456422 <NA> <NA> <NA>
## 5 2 0 2 <NA> <NA> <NA> <NA> <NA>
## 6 2 0 2 <NA> <NA> <NA> 488689 <NA>
## end3B outsInInning startCode endCode fielderId
## 1 <NA> 3 0 2 NA
## 2 <NA> 3 2 3 NA
## 3 <NA> 3 3 3 NA
## 4 <NA> 3 3 0 431151
## 5 <NA> 3 0 0 456422
## 6 <NA> 3 0 1 NA
## gameId isPA isAB isHit isBIP our.x
## 1 gid_2012_08_12_atlmlb_nynmlb_1 TRUE TRUE TRUE TRUE -159.12398
## 2 gid_2012_08_12_atlmlb_nynmlb_1 TRUE FALSE FALSE FALSE NA
## 3 gid_2012_08_12_atlmlb_nynmlb_1 TRUE TRUE FALSE FALSE NA
## 4 gid_2012_08_12_atlmlb_nynmlb_1 TRUE TRUE FALSE FALSE NA
## 5 gid_2012_08_12_atlmlb_nynmlb_1 TRUE TRUE FALSE TRUE -33.81634
## 6 gid_2012_08_12_atlmlb_nynmlb_1 TRUE TRUE TRUE TRUE 83.92942
## our.y r theta
## 1 195.9601 252.4298 2.252825
## 2 NA NA NA
## 3 NA NA NA
## 4 NA NA NA
## 5 296.1862 298.1104 1.684477
## 6 231.0243 245.7974 1.222329
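In the rows printed above, r and theta match the polar coordinates of (our.x, our.y). The following sketch checks that reading against the three balls in play shown. It is an inference from the printed output, not a statement about how the package computes these columns.

```r
# Values copied from rows 1, 5 and 6 of the output above.
bip <- data.frame(
  our.x = c(-159.12398, -33.81634, 83.92942),
  our.y = c(195.9601, 296.1862, 231.0243),
  r     = c(252.4298, 298.1104, 245.7974),
  theta = c(2.252825, 1.684477, 1.222329)
)

# Polar coordinates: distance from the origin and angle in radians.
r_check     <- sqrt(bip$our.x^2 + bip$our.y^2)
theta_check <- atan2(bip$our.y, bip$our.x)

# Compare with the printed columns, allowing for rounding in the printout.
polar_ok <- isTRUE(all.equal(r_check, bip$r, tolerance = 1e-5)) &&
  isTRUE(all.equal(theta_check, bip$theta, tolerance = 1e-5))
print(polar_ok)
```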
More often, we'll be interested in investigating data from many games. The function getData() will load (or download) data over any time interval in which you are interested. Let's figure out how many home runs were hit on May 14th, 2013.
ds = getData(start = "2013-05-14")
##
## Retrieving data from 2013-05-14 ...
## ...found 15 games
subset(ds, event == "Home Run", select = c("gameId", "batterId", "description"))
## gameId batterId
## 70 gid_2013_05_14_bosmlb_tbamlb_1 120074
## 144 gid_2013_05_14_chamlb_minmlb_1 276055
## 145 gid_2013_05_14_chamlb_minmlb_1 493364
## 286 gid_2013_05_14_clemlb_phimlb_1 435623
## 330 gid_2013_05_14_clemlb_phimlb_1 502126
## 369 gid_2013_05_14_colmlb_chnmlb_1 458913
## 374 gid_2013_05_14_colmlb_chnmlb_1 471865
## 398 gid_2013_05_14_colmlb_chnmlb_1 446381
## 427 gid_2013_05_14_colmlb_chnmlb_1 471865
## 493 gid_2013_05_14_houmlb_detmlb_1 408234
## 533 gid_2013_05_14_kcamlb_anamlb_1 405395
## 536 gid_2013_05_14_kcamlb_anamlb_1 435062
## 549 gid_2013_05_14_kcamlb_anamlb_1 456714
## 552 gid_2013_05_14_kcamlb_anamlb_1 285078
## 562 gid_2013_05_14_kcamlb_anamlb_1 545361
## 579 gid_2013_05_14_milmlb_pitmlb_1 516416
## 674 gid_2013_05_14_milmlb_pitmlb_1 457705
## 721 gid_2013_05_14_nynmlb_slnmlb_1 136860
## 729 gid_2013_05_14_nynmlb_slnmlb_1 407781
## 732 gid_2013_05_14_nynmlb_slnmlb_1 445055
## 759 gid_2013_05_14_sdnmlb_balmlb_1 435041
## 773 gid_2013_05_14_sdnmlb_balmlb_1 475247
## 865 gid_2013_05_14_seamlb_nyamlb_1 116380
## 930 gid_2013_05_14_sfnmlb_tormlb_1 474832
## 980 gid_2013_05_14_sfnmlb_tormlb_1 467055
## 1007 gid_2013_05_14_texmlb_oakmlb_1 519048
## 1060 gid_2013_05_14_texmlb_oakmlb_1 134181
## 1062 gid_2013_05_14_texmlb_oakmlb_1 519048
## description
## 70 David Ortiz homers (5) on a line drive to right field. Jacoby Ellsbury scores. Dustin Pedroia scores.
## 144 Adam Dunn homers (7) on a fly ball to left center field.
## 145 Dayan Viciedo homers (3) on a fly ball to left field.
## 286 Kevin Frandsen homers (2) on a fly ball to left field.
## 330 Domonic Brown homers (7) on a line drive to right field.
## 369 Eric Young Jr. homers (1) on a fly ball to center field. Josh Rutledge scores.
## 374 Carlos Gonzalez homers (8) on a line drive to right field.
## 398 Darwin Barney homers (2) on a fly ball to left field.
## 427 Carlos Gonzalez homers (9) on a fly ball to right field.
## 493 Miguel Cabrera homers (8) on a fly ball to left field.
## 533 Albert Pujols homers (6) on a fly ball to left field.
## 536 Howie Kendrick homers (6) on a fly ball to center field.
## 549 Billy Butler homers (5) on a fly ball to right center field.
## 552 Josh Hamilton homers (5) on a fly ball to center field.
## 562 Mike Trout homers (7) on a fly ball to left center field.
## 579 Jean Segura homers (7) on a fly ball to left field.
## 674 Andrew McCutchen homers (5) on a fly ball to center field.
## 721 Carlos Beltran homers (10) on a fly ball to left field. Pete Kozma scores. John Gast scores.
## 729 Marlon Byrd homers (3) on a fly ball to left center field. John Buck scores.
## 732 Jon Jay homers (4) on a fly ball to center field.
## 759 Carlos Quentin homers (4) on a fly ball to left field.
## 773 Ryan Flaherty homers (2) on a line drive to left field.
## 865 Raul Ibanez homers (4) on a line drive to right field. Kelly Shoppach scores.
## 930 Brandon Belt homers (5) on a fly ball to right center field.
## 980 Pablo Sandoval homers (7) on a fly ball to left center field. Andres Torres scores. Marco Scutaro scores.
## 1007 Mitch Moreland homers (8) on a fly ball to right center field. Adrian Beltre scores.
## 1060 Adrian Beltre homers (9) on a fly ball to left center field.
## 1062 Mitch Moreland homers (9) on a fly ball to center field.
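The listing above contains 28 rows, one per home run. To get that count directly rather than by reading the printout, one could tally the event column. The sketch below uses a small made-up data frame with the same column names so that it runs without downloading anything; with the real data, the ds returned by getData() would take the place of toy.

```r
# Toy stand-in for the play-by-play data (made-up rows, real column names).
toy <- data.frame(
  gameId = c("game_A", "game_A", "game_A", "game_B", "game_B"),
  event  = c("Home Run", "Single", "Home Run", "Strikeout", "Home Run"),
  stringsAsFactors = FALSE
)

# Total number of home runs.
n_hr <- sum(toy$event == "Home Run")

# Home runs per game.
hr_by_game <- table(toy$gameId[toy$event == "Home Run"])

print(n_hr)
print(hr_by_game)
```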
The best part about the MLBAM data is that it contains an $(x,y)$-coordinate indicating the location of each batted ball hit into play. We can visualize this.
plot(data = ds)
In order to compute openWAR, we need to model several quantities. The first thing we need to understand is the relative value of each "state" of a half-inning. Since there are three bases, each of which can be either occupied or unoccupied, and there are three possible numbers of outs, each plate appearance begins with the half-inning in one of $2^3 \times 3 = 24$ base-out states; adding one final state for three outs gives 25 possible states in all. We would like to assign a value to each one of these states that indicates the expected number of runs that will be scored in the remainder of that half-inning. We have precomputed the states and the number of futureRuns associated with each play (the startCode, startOuts, and runsFuture columns of the data frame).
Thus, we want to fit the model $$ futureRuns \sim baseCode + outs + baseCode \cdot outs, $$ where $baseCode$ is a description of the configuration of the baserunners, and $outs$ is the number of outs in the half-inning.
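Because the model includes the full interaction, it is saturated when $baseCode$ and $outs$ are treated as categorical: each state gets its own fitted value, which is simply the mean of $futureRuns$ in that state. The sketch below illustrates this on made-up numbers, using the column names from the data frame above (startCode, startOuts, runsFuture). It uses only base R and restricts to six states to stay short.

```r
# Toy data (hypothetical values, not real MLBAM data); two plays per state.
toy <- data.frame(
  startCode  = c(0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1),
  startOuts  = c(0, 0, 1, 1, 0, 0, 1, 1, 2, 2, 2, 2),
  runsFuture = c(0, 2, 0, 1, 1, 3, 0, 1, 0, 0, 0, 1)
)

# Average futureRuns within each (baseCode, outs) cell.
cell_means <- with(toy, tapply(runsFuture, list(startCode, startOuts), mean))

# The same quantity from a linear model with main effects and interaction.
fit  <- lm(runsFuture ~ factor(startCode) * factor(startOuts), data = toy)
pred <- predict(fit, newdata = data.frame(startCode = 1, startOuts = 0))

print(cell_means)
print(unname(pred))  # equals cell_means["1", "0"]
```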
For example, consider the bottom of the 1st inning of our game:
subset(gd$ds, inning == 1 & half == "bottom", select=c("runsFuture", "runsOnPlay", "startCode", "startOuts", "description"))
## runsFuture runsOnPlay startCode startOuts
## 5 2 0 0 0
## 6 2 0 0 1
## 7 2 1 1 1
## 8 1 1 4 1
## 9 0 0 1 1
## 10 0 0 1 2
## description
## 5 Ruben Tejada flies out to center fielder Michael Bourn.
## 6 Mike Baxter singles on a line drive to right fielder Jason Heyward.
## 7 David Wright doubles (34) on a line drive to right fielder Jason Heyward. Mike Baxter scores. David Wright advances to 3rd, on a throwing error by right fielder Jason Heyward.
## 8 Ike Davis singles on a line drive to right fielder Jason Heyward. David Wright scores.
## 9 Daniel Murphy lines out to left fielder Martin Prado.
## 10 Jordany Valdespin strikes out on a foul tip.
The Mets scored two runs in the inning, and thus, when Ruben Tejada opened the inning, there were no runners on base, no outs, but two $futureRuns$ were associated with this play. After Tejada flied out, there was one out, but still no one on base and two $futureRuns$. After Mike Baxter singled, David Wright came to the plate with a runner on first (startCode = 1), one out, and two $futureRuns$. His double scored one run, so Ike Davis followed with a runner on third, one out, and now only one of the $futureRuns$ remaining. By the time Daniel Murphy batted, there were no further $futureRuns$ in the inning.
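The startCode values in this table are consistent with a bitmask in which first base contributes 1, second base 2, and third base 4: a runner on first gives 1, a runner on third gives 4, and, in the top of the first shown earlier, runners on first and second give 3. Assuming that encoding (it is inferred from the printed output, and decode_base_code is a hypothetical helper, not part of openWAR), a code can be unpacked as follows.

```r
# Hypothetical helper: unpack a base code into base occupancy,
# assuming first = 1, second = 2, third = 4.
decode_base_code <- function(code) {
  code <- as.integer(code)
  c(
    first  = bitwAnd(code, 1L) > 0,
    second = bitwAnd(code, 2L) > 0,
    third  = bitwAnd(code, 4L) > 0
  )
}

on_third         <- decode_base_code(4)
first_and_second <- decode_base_code(3)
second_and_third <- decode_base_code(6)
print(second_and_third)
```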
Every inning begins with no one on and no one out. In this example, two runs scored in the inning. By averaging over all innings, we create an estimate of the expected $futureRuns$ for the state $(0,0)$. But we can just as easily do the same for all states.
The simplest way to build a model for $futureRuns$ is to take the average over all observations. To do this, we'll need more data.
# Will take a loooong time -- the first time
# ds = getDataWeekly("2013-04-01")
# ds = getDataWeekly("2013-04-08")
# ds = getDataWeekly("2013-04-15")
# ds = getDataWeekly("2013-04-22")
# ds = getData("2013-03-31")
# 2013 first half
# ds = getData("2013-03-31", end="2013-07-14")
# ds = getDataMonthly(2013, 6)
# MLBAM2013 = ds
# save(MLBAM2013, file="data/MLBAM2013.rda")
data(MLBAM2013)
ds = MLBAM2013
For example, consider the half-inning we visited previously.
subset(gd$ds, inning == 1 & half == "bottom", select=c("runsFuture", "runsOnPlay", "startCode", "startOuts", "description"))
## runsFuture runsOnPlay startCode startOuts
## 5 2 0 0 0
## 6 2 0 0 1
## 7 2 1 1 1
## 8 1 1 4 1
## 9 0 0 1 1
## 10 0 0 1 2
## description
## 5 Ruben Tejada flies out to center fielder Michael Bourn.
## 6 Mike Baxter singles on a line drive to right fielder Jason Heyward.
## 7 David Wright doubles (34) on a line drive to right fielder Jason Heyward. Mike Baxter scores. David Wright advances to 3rd, on a throwing error by right fielder Jason Heyward.
## 8 Ike Davis singles on a line drive to right fielder Jason Heyward. David Wright scores.
## 9 Daniel Murphy lines out to left fielder Martin Prado.
## 10 Jordany Valdespin strikes out on a foul tip.
The inning began in the state $(0,0)$. Our estimate $\hat{\rho}(0,0)$ of the expected value (in runs) of that state is:
fit.rem = getRunEx(ds)
fit.rem(baseCode = 0, outs = 0)
## [1] 0.4565559
# Note this is equivalent to
# rem[1,1]
On the first play of the inning, Ruben Tejada flied out. This moved the inning into the state $(0,1)$, since there were still no runners on base, but now there was one out. The value of this state is
fit.rem(0,1)
## [1] 0.2359775
The difference between these two states is $\hat{\delta}_i$:
fit.rem(0,1) - fit.rem(0,0)
## [1] -0.2205784
In modeling this play, our goal is to apportion the value of $\hat{\delta}_i$ to each of the offensive players. In this case, Tejada was the only offensive player involved, so he gets the full amount. Moreover, $-\hat{\delta}_i$ must also be attributed to the defense. In this case, some of that credit will go to the pitcher, and some will go to the centerfielder. The details of this apportionment scheme will be revealed later.
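The bookkeeping for a single play can be written compactly. In the sketch below, rho is a stand-in lookup holding only the two values printed above, and play_delta is a hypothetical helper (not an openWAR function) that returns the change in run expectancy plus any runs scored on the play. The offense is credited with that amount and the defense with its negative, so the two always sum to zero.

```r
# Stand-in for fit.rem, holding only the two values printed above.
rho <- function(baseCode, outs) {
  values <- c("0,0" = 0.4565559, "0,1" = 0.2359775)
  unname(values[paste(baseCode, outs, sep = ",")])
}

# Hypothetical helper: run value of a play that moves the half-inning
# from `start` to `end`, each given as c(baseCode, outs).
play_delta <- function(start, end, runs_scored = 0) {
  rho(end[1], end[2]) - rho(start[1], start[2]) + runs_scored
}

delta   <- play_delta(start = c(0, 0), end = c(0, 1))
offense <- delta    # all of it goes to Tejada on this play
defense <- -delta   # to be split between the pitcher and the fielder
print(c(offense = offense, defense = defense))
```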
The second batter, Mike Baxter, singled. This moved the inning from $(0,1)$ to $(1, 1)$. Accordingly, Baxter would receive:
fit.rem(1,1) - fit.rem(0,1)
## [1] 0.287121
So far, so good. The next play is particularly complicated. David Wright doubles home Baxter, and then advances to third on a throwing error by the rightfielder. Let's assume for a moment that the error didn't happen, and that Wright ended the play on second base. In this case, the ending state is $(2,1)$, but in addition, one run scored. Thus, the change in expected runs is:
fit.rem(2,1) - fit.rem(1,1) + 1
## [1] 1.115999
Clearly, much of the credit here should go to Wright, for hitting the double. But what about Baxter, who scored from first on a double? Our plan is to assume "ghostrunner" rules, wherein the number of bases advanced by each baserunner is determined by the type of hit. Since Wright hit a double, Baxter should have advanced two bases, leaving the inning in the state $(6,1)$, with runners on second and third. The credit for the additional base that he advanced (from third to home) should then go to Baxter. Thus, as a batter, Wright accrues:
fit.rem(6,1) - fit.rem(1,1)
## [1] 0.7966985
While Baxter accrues the remainder:
fit.rem(2,1) - fit.rem(6,1) + 1
## [1] 0.3193008
But now let's revisit what actually happened. Heyward's error allowed Wright to move to third. Thus, the state before the error occurred was $(2,1)$ and it led to $(4,1)$. The difference
fit.rem(4,1) - fit.rem(2,1)
## [1] 0.2706767
goes to Heyward as a rightfielder, and Wright as a baserunner.
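As a check on the arithmetic, the three pieces assigned on this play should add up to the total change in run expectancy from the state Wright inherited, $(1,1)$, to the state in which the play ended, $(4,1)$, plus the run that scored, because the intermediate states cancel. The sketch below uses only the numbers printed above, so agreement is up to rounding.

```r
# Values copied from the output above.
wright_batting  <- 0.7966985  # fit.rem(6,1) - fit.rem(1,1)
baxter_running  <- 0.3193008  # fit.rem(2,1) - fit.rem(6,1) + 1
error_advance   <- 0.2706767  # fit.rem(4,1) - fit.rem(2,1)
double_no_error <- 1.115999   # fit.rem(2,1) - fit.rem(1,1) + 1

# Batter and runner shares recover the value of the error-free double.
shares_ok <- isTRUE(all.equal(wright_batting + baxter_running,
                              double_no_error, tolerance = 1e-6))

# Adding the error gives the value of the whole play, which equals
# fit.rem(4,1) - fit.rem(1,1) + 1 once the intermediate terms cancel.
whole_play <- wright_batting + baxter_running + error_advance

print(shares_ok)
print(round(whole_play, 3))  # 1.387
```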