In RobinHankin/hyper2: The Hyperdirichlet Distribution, Mark 2

set.seed(0)
knitr::opts_chunk$set(echo = TRUE)
library("hyper2")
library("magrittr")
options("digits" = 5)

Formula 1

(takes about two hours to run without cache)

To cite the hyper2 package in publications, please use @hankin2017_rmd. Formula 1 racing is an important and prestigious motor sport. Here I analyse a series of racing results using the hyper2 package which implements the Plackett-Luce likelihood function.

Consider the following dataset, taken from Wikipedia:

f2017 <- read.table("formula1_2017.txt",header=TRUE)
f2017 <- f2017[,1:20]  # final column is points
head(f2017)

(in the package this is object F1_table_2017). Each row is a driver and each column (after the first) a venue. We see that Hamilton, the first row, came second in Australia, first in China, second in Bahrain, fourth in Russia, and so on. In the first column, we see the result from Australia (AUS) in which Hamilton came second, Vettel first, Bottas third, and so on. Notation used also includes six different classes of no-score such as "Ret" for retired, "WD" for withdrawn, and so on.

We can represent this data in a different form using wikitable_to_ranktable():

wikitable_to_ranktable(f2017)

Looking at the first column ("c1" is read "came first") we see who came first at each venue: Vettel in Australia (AUS), Hamilton in China (CHN), Vettel in Bahrain (BHR), and so on. I now use bespoke R function \code{ordertable2supp()}, that translates data of this type into a log-likelihood. The function is similar to, but slightly different from, the analysis in eurovision.Rmd.

L2017 <- ordertable2supp(f2017)
head(L2017,3)

Above, we see a small part of the Placket-Luce likelihood function for the 2017 results. But from now on we will use built-in dataset formula1, which contains extra information including non-finishing competitors. We can find the maximum likelihood estimator by using the maxp() function:

rm(L2017)
mL2017 <- maxp(formula1)
mL2017
mL2017 <- mL2017[-which(names(mL2017) %in% c("Resta","Button"))]
dotchart(mL2017,pch=16,main='2017 Formula 1')
dotchart(log(mL2017)[1:23],pch=16,main='2017 Formula 1, log scale')

(in the diagram above, we remove the bottom-ranked two players, Resta and Button, whose estimated strength is zero).

Zipf's law

Do the players follow Zipf's law?

zipf <- function(n){
    out <- 1/seq_len(n)
    out/sum(out)
}
zipf(9)

Then there are two natural ways to proceed:

mf <- maxp(formula1)
l0 <- loglik(indep(mf),formula1)
l1 <- loglik(indep(zipf(size(formula1))),formula1)
c(l0,l1)
pchisq(2*(l0-l1),df=size(formula1)-1,lower.tail=FALSE)

However, this uses Zipf's law according to the players' points rating, and it might be better to rank the players according to their likelihood rating.

mf
mf[rank(-mf)] <- zipf(size(formula1))
mf
l2 <- loglik(indep(mf),formula1)
l2
pchisq(2*(l0-l1),df=size(formula1)-1,lower.tail=FALSE)

which is even worse.

Analysis of nulls:

Hypothesis: all players have equal strengths

Here we write $H_0\colon p_1=p_2=\cdots =p_n=\frac{1}{n}$.

equalp.test(formula1)

highly signficant!

Null of Hamilton being average

We can consider Hamilton and ask various questions about his playing strength. One natural hypothesis would be that Hamilton's strength is equal to the mean strength, or $\frac{1}{25}$.

$H_A\colon\sum p_i=1$
$H_0\colon p_\mathrm{Hamilton} < \frac{1}{25}$

(we attempt to reject the null $H_0$). Noting that the evaluate for Hamilton's strength $\widehat{p_\mathrm{Hamilton}}$ is about 0.29, we ask how much support is lost, relative to the support at the unconstrained evaluate.

specificp.test(formula1,"Hamilton")  # default value is 1/25=0.04

(note that the constrained maximum is on the boundary of the admissible region with $p_\mathrm{Hamilton}\approx\frac{1}{25}$).

Null of Hamilton being better than all other players

We may now test

$H_4\colon p_\mathrm{Hamilton} > \max_{i\neq\mathrm{Hamilton}} p_i$

This is quite difficult to evaluate formally but we can evaluate it with no loss in accuracy by observing that the second strongest competitor is Bottas and considering instead

$H_4'\colon p_\mathrm{Hamilton} > p_\mathrm{Bottas}$.

samep.test(formula1,c("Hamilton","Bottas"))

All years data

We will consider four years data:

jj <- read.table("formula1_2014.txt",header=TRUE)
F1_2014 <- ordertable2supp(jj[,seq_len(ncol(jj)-1)], incomplete = TRUE)

jj <- read.table("formula1_2015.txt",header=TRUE)
F1_2015 <- ordertable2supp(jj[,seq_len(ncol(jj)-1)], incomplete = TRUE)

jj <- read.table("formula1_2016.txt",header=TRUE)
F1_2016 <- ordertable2supp(jj[,seq_len(ncol(jj)-1)], incomplete = TRUE)

jj <- read.table("formula1_2017.txt",header=TRUE)
F1_2017 <- ordertable2supp(jj[,seq_len(ncol(jj)-1)], incomplete = TRUE)



a <- list(F1_2014, F1_2015, F1_2016, F1_2017)
alldrivers <- unique(sort(unlist(lapply(a,pnames))))



F1_total <- F1_2014 + F1_2015 + F1_2016 + F1_2017

head(F1_total)

So object F1_total is a likelihood function for all four years data, on the assumption of independent observations and in this case driver strength being independent of year. We can find the evaluate in the same way:

mallyears <- maxp(F1_total)

and the plot:

dotchart(mallyears,pch=16,main='Formula 1, 2014-7')

We see that Hamilton has the highest estimated strength.

Conditions

(details <- read.table("formula1_conditions.txt",header=TRUE))
drivers <- pnames(formula1)

We cannot use ordertable2supp() here because we somehow need to incorporate pole position and conditions into our likelihood function. The first step is to modify f2017 so that non-finishers have a zero position:

At this point, we introduce a new player: Ham_wet, for "Hamilton in the wet", and we can test the hypothesis that the Hamilton == Ham_wet. We will cycle through all the venues and, if it is wet (according to dataframe conditions) we will credit Hamilton's placing to Ham_wet.

f2017 <- read.table("formula1_2017.txt",header=TRUE)
f2017 <- f2017[,-ncol(f2017)]
details <- read.table("formula1_conditions.txt",header=TRUE)

drivers <- rownames(f2017)
alldrivers <- c("Ham_dry","Ham_wet",drivers[-1])

L2017 <- hyper2(pnames=alldrivers) # uninformative
for(i in seq_len(ncol(f2017))){
    venue <- colnames(f2017)[i]
    conditions <- details[which(details$venue == venue),4]
    o <- f2017[,i]
        suppressWarnings(o <- as.numeric(levels(o))[o])
        o[is.na(o)] <- 0
    names(o) <- rownames(f2017)
    if(conditions == "wet"){
       names(o)[names(o) == "Hamilton"] <- "Ham_wet"
    } else if(conditions == "dry"){
       names(o)[names(o) == "Hamilton"] <- "Ham_dry"
    } else {
       stop("neither wet nor dry?")
    }
    L2017 <- L2017 + suppfun(o)
}  # for(i) loop closes
head(L2017)

We can now test the hypothesis that Hamilton's performance in the wet is the same as his performance in the dry:

options(width = 80)
samep.test(L2017,c("Ham_wet","Ham_dry"))

Try the same thing but with function pwa()

f2017 <- read.table("formula1_2017.txt",header=TRUE)
details <- read.table("formula1_conditions.txt",header=TRUE)

H <- hyper2(pnames=c(rownames(f2017),"S"))
noscore <- c("Ret", "WD", "DNS", "DSQ", "DNP", "NC")

for(i in seq_len(ncol(f2017)-1)){
    y <- f2017[,i]
    if(any(y %in% noscore)){y[y%in%noscore] <- 0}
    y <- as.numeric(y)
    names(y) <- rownames(f2017)
    cond <- details$conditions[i]
        jj <- suppfun(y)
    if(cond == "dry"){
            pnames(jj) <- pnames(H)  # do nothing!
    } else if (cond == "wet"){
            jj <- pwa(jj,"Hamilton")
    } else {
            stop("neither wet nor dry?")
    }  # if wet closes
        H <- H + jj
} # i loop closes
head(H)
maxp(H)

Now in principle we can test the hypothesis that Hamilton's wet strength S is zero but I have not done this yet.

Pole position

We now introduce another reified player, pole, whose strength helps the driver in pole position.

f2017 <- read.table("formula1_2017.txt",header=TRUE)
details <- read.table("formula1_conditions.txt",header=TRUE)
HP <- hyper2(pnames=c(rownames(f2017),"pole"))
noscore <- c("Ret", "WD", "DNS", "DSQ", "DNP", "NC")

for(i in seq_len(ncol(f2017)-1)){
    y <- f2017[,i]
    if(any(y %in% noscore)){y[y%in%noscore] <- 0}
    y <- as.numeric(y)
    names(y) <- rownames(f2017)
    HP <- HP + pwa(suppfun(y),details$pole[i],"pole")
} # i loop closes
head(HP)
(mHP <- maxp(HP))
pie(mHP)
mHP <- mHP[mHP > 1e-2]
mHP
pie(mHP)

test Hamilton == bottas:

samep.test(HP,c("Hamilton","Vettel"))
samep.test(HP,c("Hamilton","Bottas"))

Now profile likelihood for pole strength:

p <- seq(from=0.15,to=0.25,len=10)
u07 <- profsupp(HP,"pole",p,n=7)
u07
plot(p,u07-max(u07))
u27 <- profsupp(HP,"pole",p,n=27)
u27
plot(p,u27-max(u27))