Part 2: Poisson regression – Eliteserien 2018

library(reshape)
library(ggplot2)
filepath <- "https://www.math.ntnu.no/emner/TMA4315/2018h/eliteserien2018"
tippeliga <- read.table(file = filepath, header = TRUE, colClasses = c("character",
                                                                       "character", "numeric", "numeric"))

In this exercise we want to answer the following question for the 2018 Eliteserien: How likely or unlikely are Rosenborg to become champions? Our data consists of the results of all played football matches of the 2018 Eliteserien, with one row per match, containing the home team: tippeliga$home, the away team: tippeliga$away, and the number of goals scored by each team, tippeliga$yh and tippeliga$ya. Using this data, we assume that the score of the home team is independent of the score of the away team and that each team has a single parameter measuring its strength. These strength parameters are denoted by $\beta$, such that $\beta_{Tromsoe}$ is the strength parameter of Tromsoe, $\beta_{Molde}$ is the strength parameter of Molde, etc. Then, for a match between two teams A and B where A is the home team, the score of team A will have a Poisson distribution with mean $\lambda_A$ where $\ln\lambda_A = \beta_0+\beta_{home}+\beta_A-\beta_B$, and team B's score will have a Poisson distribution with mean $\lambda_B$ where $\ln\lambda_B = \beta_0-\beta_A+\beta_B$. Here, $\beta_0$ is our intercept and $\beta_{home}$ is a home advantage parameter.
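
To make the model concrete, here is a minimal sketch with made-up parameter values (the actual estimates are obtained in part c below; beta_A and beta_B are purely illustrative):

beta_0 <- 0.1; beta_home <- 0.3; beta_A <- 0.5; beta_B <- -0.2  # made-up values, not estimates
lambda_A <- exp(beta_0 + beta_home + beta_A - beta_B)  # expected goals for home team A
lambda_B <- exp(beta_0 - beta_A + beta_B)              # expected goals for away team B
c(lambda_A, lambda_B)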

a) Testing independence between home and away team with contingency test

We wish to test the assumption of independence between the goals scored by the home and away team. To test this independence, we construct a contingency table and perform Pearson's $\chi^2$ test.

We construct the contingency table in R, defining the groups (the number of goals scored, with three or more goals collapsed into a single category) so that at least 80% of the cells in the table have a count of 5 or more, and none of the cells have count 0 (these are assumptions for Pearson's chi-squared test to be valid).

temp_ya <- tippeliga$ya
temp_yh <- tippeliga$yh
temp_ya[temp_ya>3] <- rep(3, length(temp_ya[temp_ya>3]))
temp_yh[temp_yh>3] <- rep(3, length(temp_yh[temp_yh>3]))
goals_table <- table(temp_yh, temp_ya)
goals_table

The row names of this table denote the number of goals scored by the home team, while the column names denote the number of goals scored by the away team. Then the element (0,0) in the table is the number of games where both teams scored zero goals (eight games had this result), and the element (2,3) is the number of games where the home team scored 2 goals and the away team scored 3 goals or more (ten games had this result), as examples.

Let $H_i$ be the event that the home team scores $i$ goals and similarly $A_j$ be the event that the away team scores $j$ goals, and let $p_i = P(H_i)$, $q_j = P(A_j)$ and $p_{ij} = P(H_i, A_j)$. Under our assumption that $H_i$ and $A_j$ are independent, $p_{ij} = p_i\cdot q_j$. So for each pair $(i,j)$, we compare the expected count $n\hat p_i\hat q_j$, where $\hat p_i$ and $\hat q_j$ are estimated from the row and column totals of the table, to the actual observed number of games with the combination $(i,j)$, which we denote $k_{ij}$. This is what is done in Pearson's $\chi^2$ test. Our test statistic is the sum over all rows and columns of the table of $\frac{(k_{ij}-n\hat p_i\hat q_j)^2}{n\hat p_i\hat q_j}$. If this test statistic is larger than $\chi^2_{\alpha, (4-1)(4-1)}$, we reject $H_0$ at the $\alpha$ level of significance. These calculations could be done manually, as sketched below, or we can use the R function chisq.test().
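
A minimal sketch of the manual computation (the result should match the chisq.test() output below; n, p_hat, q_hat and expected are helper names introduced here):

n <- sum(goals_table)
p_hat <- rowSums(goals_table) / n    # estimated marginal distribution of home goals
q_hat <- colSums(goals_table) / n    # estimated marginal distribution of away goals
expected <- n * outer(p_hat, q_hat)  # expected counts under independence
sum((goals_table - expected)^2 / expected)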

test <- chisq.test(goals_table)
test
df <- (nrow(goals_table) - 1) * (ncol(goals_table) - 1)
d <- test$statistic
X2 <- qchisq(1 - 0.1, df)
p <- pchisq(d, df, lower.tail = FALSE)

We see that our observed test statistic has a value of `r d`, while $\chi^2_{0.1,9}$ = `r X2`, along with a p-value of `r p`. Since the test statistic does not exceed the critical value (equivalently, the p-value is large), we do not reject $H_0$, and the assumption that the score of the home team and the score of the away team are independent seems quite reasonable.

b) Team rankings

After a match, the winning team gets 3 points and the loser gets 0, and if it is a draw both teams get 1 point each. We want to calculate the ranking for each team, as well as the goal differences.

tippeliga_table <- data.frame(team = unique(tippeliga$home),
                              points = vector("numeric", length(unique(tippeliga$home))),
                              goal_diff = vector("numeric", length(unique(tippeliga$home))))
for(i in 1:length(tippeliga_table$team)){
  tippeliga_table$points[i] <-( sum(3*( tippeliga$yh[tippeliga$home==tippeliga_table$team[i]] > tippeliga$ya[tippeliga$home==tippeliga_table$team[i]] )) 
                                + sum(3*( tippeliga$ya[tippeliga$away==tippeliga_table$team[i]] > tippeliga$yh[tippeliga$away==tippeliga_table$team[i]] ))
                                + sum(1*( tippeliga$ya[tippeliga$away==tippeliga_table$team[i]] == tippeliga$yh[tippeliga$away==tippeliga_table$team[i]] ))
                                + sum(1*( tippeliga$yh[tippeliga$home==tippeliga_table$team[i]] == tippeliga$ya[tippeliga$home==tippeliga_table$team[i]] ))
  )


  tippeliga_table$goal_diff[i] <- (sum(tippeliga$yh[tippeliga$home==tippeliga_table$team[i]]) + sum(tippeliga$ya[tippeliga$away==tippeliga_table$team[i]]) 
                                   - sum(tippeliga$ya[tippeliga$home==tippeliga_table$team[i]]) - sum(tippeliga$yh[tippeliga$away==tippeliga_table$team[i]])
  )
}

tippeliga_table <- tippeliga_table[order(-tippeliga_table$points, -tippeliga_table$goal_diff),]
rownames(tippeliga_table) <- 1:length(tippeliga_table$team)
tippeliga_table
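
As a cross-check, the same point totals can be computed more compactly; a sketch using only base R (home_pts and away_pts are helper names introduced here):

home_pts <- ifelse(tippeliga$yh > tippeliga$ya, 3, ifelse(tippeliga$yh == tippeliga$ya, 1, 0))
away_pts <- ifelse(tippeliga$ya > tippeliga$yh, 3, ifelse(tippeliga$ya == tippeliga$yh, 1, 0))
sort(tapply(c(home_pts, away_pts), c(tippeliga$home, tippeliga$away), sum), decreasing = TRUE)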

c) Performing Poisson regression

We now implement the actual Poisson regression to estimate the intercept, home advantage and strength parameter for each team.

Our linear predictors are of the form

$$ \ln\text{E}(Y) = \beta_0+\beta_{\text{home}}x_{\text{home}}+\beta_{\text{BodoeGlimt}}x_{\text{BodoeGlimt}}+\dots+\beta_{\text{Vaalerenga}}x_{\text{Vaalerenga}}. $$ Here, $x_{\text{home}}$ is equal to 1 if the score $Y$ is for the home team, and 0 if it is for the away team. If A is the home team and B is the away team, then for team A's score we set $x_A = 1$, $x_B = -1$ and the covariates of all other teams to 0, while for team B's score we set $x_A = -1$ and $x_B = 1$.

We begin by constructing the design matrix $X$.

# Design matrix: ln E(Y) = X beta
# One row per (match, team) score: 2 * 192 = 384 rows, and 18 columns
# (intercept, home advantage and 16 team strengths)
X <- matrix(0, nrow = 2 * nrow(tippeliga), ncol = 18)
X[, 1] <- 1
colnames(X) <- c("Intercept", "HomeAdvantage", as.character(tippeliga_table$team))
# Rows 1 to 192: home-team scores (home advantage 1, home team +1, away team -1)
for(i in 1:length(tippeliga$home)){
  X[i, 2] <- 1
  X[i, colnames(X) == tippeliga$home[i]] <- 1
  X[i, colnames(X) == tippeliga$away[i]] <- -1
}

# Rows 193 to 384: away-team scores (no home advantage, home team -1, away team +1)
for(i in 1:length(tippeliga$home)){
  X[i + length(tippeliga$home), 2] <- 0
  X[i + length(tippeliga$home), colnames(X) == tippeliga$home[i]] <- -1
  X[i + length(tippeliga$home), colnames(X) == tippeliga$away[i]] <- 1
}

Y <- c(tippeliga$yh, tippeliga$ya)
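
As a quick sanity check of the coding, we can inspect the non-zero entries of the first row of $X$ and compare them to the first match:

# The first row should have Intercept = 1, HomeAdvantage = 1,
# +1 for the home team and -1 for the away team of the first match
X[1, X[1, ] != 0]
tippeliga[1, c("home", "away")]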

Next we implement the loglikelihood function, in order to estimate the parameters. The likelihood function is $$ L(\beta) = \prod_{i=1}^n\frac{\lambda_i^{y_i}}{y_i!}e^{-\lambda_i}, $$ yielding the loglikelihood $$ l(\beta) = \sum_{i=1}^n[y_i\ln\lambda_i-\lambda_i-\ln(y_i!)], $$ where $\lambda_i = e^{\eta_i} = e^{x_i^T\beta}$.

# Negative log-likelihood; beta_optim holds the free parameters, while
# beta_SF is the fixed strength of the team in the last column of X
loglik <- function(beta_optim, Y, X, beta_SF){
  beta <- c(beta_optim, beta_SF)
  sum <- 0
  for(i in 1:length(Y)){
    lambda_i <- exp(X[i,] %*% beta)
    sum <- sum + Y[i]*log(lambda_i) - lambda_i - log(factorial(Y[i]))
  }
  return(-sum)
}
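
The loop can also be written in vectorized form, which gives the same value and runs faster; a minimal sketch (loglik_vec is a name introduced here and not used below; lfactorial() is numerically safer than log(factorial())):

loglik_vec <- function(beta_optim, Y, X, beta_SF){
  lambda <- exp(as.vector(X %*% c(beta_optim, beta_SF)))  # Poisson means
  -sum(Y * log(lambda) - lambda - lfactorial(Y))          # negative log-likelihood
}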

By now using optim() in R, we obtain the estimates for the intercept, home advantage, and strength parameters for every team. Note that the model is over-parameterized: adding a constant to all team strengths leaves the linear predictors unchanged. We therefore fix the strength of the team in the last column of $X$ (the beta_SF argument, here set to 1) and optimize only over the remaining 17 parameters.

optim_ret <- optim(par=(1:(dim(X)[2]-1)), loglik, X=X, Y=Y, beta_SF=1, method="BFGS")
beta <- c(optim_ret$par, 1)
names(beta) <- colnames(X)

All the parameter estimates are:

beta
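
As a cross-check, the same model can be fitted with glm(); a sketch (glm_fit is a name introduced here). Because of the +1/-1 coding one team column is aliased and reported as NA, which amounts to fixing that team's strength at 0 rather than 1, so the intercept and home-advantage estimates should agree with those from optim(), while the team strengths should agree up to an additive constant:

# Cross-check of the optim() estimates with glm()
glm_fit <- glm(Y ~ -1 + X, family = poisson())
coef(glm_fit)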

Based on these estimates we also make a ranking for all the teams:

t(t(sort(tail(beta, n=16), decreasing=TRUE)))

Comparing this ranking to the one produced in b), we note that the top and bottom rankings are the same, but that there is some variation among the middle rankings. The first ranking simply counts the points earned up until October 1st, while the second ranking is based on the estimated strengths of the teams. The two can differ because the points ranking does not account for the quality of the opposition: a team that has so far met mostly strong opponents may have relatively few points but still a high estimated strength, and is then ranked higher in 2c than in 2b, and vice versa.

d) Simulation

We now want to investigate the results of future matches using our estimated strengths. We simulate the remaining matches of this year's Eliteserien. We do this by constructing the design matrix $X_{up}$ of these matches and computing the corresponding mean for each team in each match by $$ \hat\lambda_{up} = \exp(X_{up}\hat\beta), $$ where the exponential is applied element-wise. The simulation then consists of drawing $N=1000$ realizations of the scores from Poisson distributions with these means.

filepath <- "https://www.math.ntnu.no/emner/TMA4315/2018h/unplayed2018"
unplayed <- read.table(file = filepath, header = TRUE, colClasses = c("character", "character"))

X_up <- matrix(0, nrow = 2 * length(unplayed$home), ncol = 18)
X_up[, 1] <- 1
colnames(X_up) <- c("Intercept", "HomeAdvantage", as.character(tippeliga_table$team))
for(i in 1:length(unplayed$home)){
  X_up[i,2] <- 1
  X_up[i, colnames(X_up) == unplayed$home[i]] <- 1
  X_up[i, colnames(X_up) == unplayed$away[i]] <- -1
}

for(i in 1:length(unplayed$home)){
  X_up[i+length(unplayed$home),2] <- 0
  X_up[i+length(unplayed$home), colnames(X_up) == unplayed$home[i]] <- -1
  X_up[i+length(unplayed$home), colnames(X_up) == unplayed$away[i]] <- 1
}

lambda <- exp(X_up %*% beta)
n_matches <- length(lambda)
N <- 1000  # number of simulations

position_simulation <- matrix(0, nrow = N, ncol = 16)
colnames(position_simulation) <- tippeliga_table$team
rownames(position_simulation) <- 1:N


for(j in 1:N){
  tippeliga_table_sim <- tippeliga_table
  unplayed$yh <- rpois(n_matches/2, lambda = head(lambda, n=(n_matches/2)))
  unplayed$ya <- rpois(n_matches/2, lambda = tail(lambda, n=(n_matches/2)))



  for(i in 1:length(tippeliga_table_sim$team)){
    tippeliga_table_sim$points[i] <- (tippeliga_table_sim$points[i] 
                                      + sum(3*( unplayed$yh[unplayed$home==tippeliga_table_sim$team[i]] > unplayed$ya[unplayed$home==tippeliga_table_sim$team[i]] )) 
                                      + sum(3*( unplayed$ya[unplayed$away==tippeliga_table_sim$team[i]] > unplayed$yh[unplayed$away==tippeliga_table_sim$team[i]] ))
                                      + sum(1*( unplayed$ya[unplayed$away==tippeliga_table_sim$team[i]] == unplayed$yh[unplayed$away==tippeliga_table_sim$team[i]] ))
                                      + sum(1*( unplayed$yh[unplayed$home==tippeliga_table_sim$team[i]] == unplayed$ya[unplayed$home==tippeliga_table_sim$team[i]] ))
    )


    tippeliga_table_sim$goal_diff[i] <- (tippeliga_table_sim$goal_diff[i]
                                         + sum(unplayed$yh[unplayed$home==tippeliga_table_sim$team[i]]) + sum(unplayed$ya[unplayed$away==tippeliga_table_sim$team[i]]) 
                                         - sum(unplayed$ya[unplayed$home==tippeliga_table_sim$team[i]]) - sum(unplayed$yh[unplayed$away==tippeliga_table_sim$team[i]])
    )
  }

  tippeliga_table_sim <- tippeliga_table_sim[order(-tippeliga_table_sim$points, -tippeliga_table_sim$goal_diff),]
  rownames(tippeliga_table_sim) <- 1:length(tippeliga_table$team)

  for(k in 1:length(tippeliga_table_sim$team)){
    position_simulation[j,tippeliga_table_sim[k,]$team==colnames(position_simulation)] <- k
  }
}

table_sim <- colMeans(position_simulation)
table_sim <- t(t(table_sim))
colnames(table_sim) <- "Mean position"
table_sim <- data.frame(table_sim)
table_sim$old.position <- 1:16
# Best ("highest") and worst ("lowest") finishing position observed per team
lowest <- vector("numeric", 16)
highest <- vector("numeric", 16)
for (i in 1:16){
  lowest[i] <- max(position_simulation[,i])   # worst (numerically largest) position
  highest[i] <- min(position_simulation[,i])  # best (numerically smallest) position
}
table_sim$highest.position <- highest
table_sim$lowest.position <- lowest
table_sim <- table_sim[order(table_sim$Mean.position), , drop = FALSE]
table_sim$team <- rownames(table_sim)
table_sim <- table_sim[,c(5,1,2,3,4)]
rownames(table_sim) <- 1:16
table_sim

The result of the $N=1000$ simulations is displayed in table_sim, where the column old.position indicates where the team was placed as of October 1st, while the table is sorted by the mean placement in the simulations. It is clear that Rosenborg end up as the winner in most of the simulations, while Sandefjord ends up as the loser in most. The remaining teams have a somewhat larger spread in their placements, but most teams finish in the same position as they had on October 1st, with some exceptions.
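
Returning to the original question, the probability that Rosenborg become champions can be estimated directly from the simulated final positions; a minimal sketch (to be run before position_simulation is melted below, and assuming the team is spelled "Rosenborg" in the data):

# Estimated probability that Rosenborg finishes in first place
mean(position_simulation[, "Rosenborg"] == 1)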

To get a clearer view of the spread in the placements, we consider the histogram of each team's placement.

position_simulation <- melt(data.frame(position_simulation))
ggplot(data = position_simulation) +
  geom_histogram(aes(x = value), binwidth = 0.5) +
  facet_wrap(~ variable) +
  xlab("")

It is clear from the histograms that the teams at the top and bottom of the table have a clear tendency to stay where they are, with large spikes around their October 1st positions, while the simulated positions of the teams in the middle are more spread out. This is consistent with our observations concerning the strength parameters, $\beta$: there are many teams around the middle with very similar strengths, while the teams at the top and at the bottom stand out with significantly higher and lower strengths, respectively.


