In nxskok/poistan: Fit Poisson model to goal-scoring data via Stan

Illustrating the use of `Poistan`

We illustrate how the package works using the two included data sets.

The fictitious data

The data set m is made-up, but illustrates the kind of data that the package uses as input:

library(poistan)
data(m)
m

The data frame has four columns: the names of the home and away teams, and the number of goals scored by each in the game where they met. Thus, each row represents a game.

If necessary, the Stan model object must be compiled first:

poisson.sc=compiled_model()
poisson.sc

As long as the object poisson.sc is kept around, this only ever needs to be done once (eg.\ when the package is first installed). Fitting the same model to different data does not require recompiling the model.

The model is this: each team has an offensive rating o and a defensive rating d that measures how likely the team is to score and concede goals respectively. There is also a home field advantage h (constant for all teams). Each has a prior normal distribution with mean 0 and SD 3. The number of goals scored by team $i$ playing at home against team $j$ has a Poisson distribution with mean $\exp(o_i-d_j+h)$, and the number of goals scored by team $j$ playing away against team $i$ has an independent Poisson distribution with mean $\exp(o_j-d_i)$. Stan is used to sample from the joint posterior distribution of the $o_i$, the $d_i$ and $h$.

Before fitting the model, we have to create a design matrix X and response matrix y that represent the teams by numbers (and maintain a separate list of team names). This is done using makexy:

xy=makexy(m)
xy

Then we fit the model using the object poisson.xc and the xy object that we just created:

z=sample_model(poisson.sc,xy)
z

Our object z shows the (simulated) marginal posterior distributions for each of the parameters, along with some convergence information. This isn't so easy to read (since the team names have been replaced by numbers). To list the ratings of the teams, we do this:

show_rating(z)

This shows the offensive and defensive ratings for each team, along with their sum (indicating the team's overall strength) and their difference (indicating whether each team is offensively or defensively minded given their overall strength). By default, the teams are sorted by their overall strength. Here, a is the strongest team and c is the weakest. Team c is also the most defensively minded: their offensive rating is clearly worst of all three teams, while their defensive rating is not that much worse than team b's.

The ratings are on a log scale and are not that easy to interpret. What makes more sense is a predictive distribution. This is obtained as below, for team a at home against b. For each random sample from the posterior distribution, the Poisson means are calculated and one random sample is drawn from the appropriate Poisson distributions:

do_predict(z,1,2)

The most likely result is 1-0, with a having a 72% chance of winning and a 19% chance of drawing against b.

English Premier League

The other data set included in this package is from the English Premier League soccer, 2014-15 season. There are 349 rows of data (games), rather than the ${20 \choose 2}=380$ games of the complete season, because (at the time of writing) the season is not complete yet:

data(england)
dim(england)
knitr::kable(head(england))

The procedure to fit the Poisson model is as before:

xy=makexy(england)
z=sample_model(poisson.sc,xy)
england.1=show_rating(z)
knitr::kable(england.1$r)
england.1$h

Chelsea are the strongest team and Burnley the weakest. The most offensively-minded teams are Tottenham and Manchester City; two of the most defensively-minded are Aston Villa and Burnley.

Note that the row names of the output data frame are the numbers of the teams as required by the predictive distribution.

We would expect Tottenham and Manchester City to have a high-scoring game and Aston Villa and Burnley a low-scoring one. Let's see whether the predictive distributions bear this out. First Tottenham and Manchester City:

do_predict(z,18,10)

Some of the most likely scores are 1-1, 2-1 (both ways) and 2-2.

Now Aston Villa and Burnley (with Burnley at home):

do_predict(z,3,2)

The most likely scores here are 0-0 and 1-0 (both ways), as we'd expect.

Southampton are even more defensively-minded than Aston Villa, but they are quite a bit stronger, so we would expect a few more goals when they play Burnley:

do_predict(z,3,14)

Low scores are still the most likely, but Burnley are not likely to win.

Let's return to the home field advantage:

england.1$h

This seems modest. What does its posterior distribution look like?

print(z$z,pars="h")

A 95% credible interval goes from 0.13 to 0.31, so we are near enough certain that the home field advantage is positive, even though it is not very big.

nxskok/poistan documentation built on May 24, 2019, 11:51 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com