In ss15859/tennis.wta: Predicts Tennis Matches

knitr::opts_chunk$set(echo = TRUE)

Introduction

The aim of this document is to outline how the functions and data in the package tennis.wta were used to produce the results in our report.

Let us begin by loading the package.

library(tennis.wta)
library(tidyverse)
set.seed(50)

Now we read in the .csv data found within the package. Each file contains a year of tennis matches from the WTA tour and we have years 2000-2019.

path_allfiles <- system.file("extdata", package = "tennis.wta")
file_list = list.files(path_allfiles)
#variables is the list of which columns we will be using
variables <- c('tourney_date','surface', 'winner_id', 'loser_id', 'winner_age', 'loser_age')
my_data <- load_data(file_list, variables)

Each row in our data contains information about a single tennis match from the WTA tour beginning in the year 2000. It contains the date of the tournament in which the match was played, the surface the match was played on, then the IDs and ages of the winner and loser.

head(my_data)

Bradley-Terry Model

We shall first seek to fit a Bradley-Terry model to our data, to try and predict the outcome of the matches in 2019.

We begin by splitting the data into a testing and training set

train_data <- my_data %>% filter(tourney_date < "2019-01-01")
test_data <- my_data %>% filter(tourney_date > "2019-01-01")

# find the IDs and number of players in the training set

players <- unique(c(train_data$winner_id, train_data$loser_id))
no_players <- length(players)

Next we must create a matrix $w_{i,j}$ containg the number of times a player $i$ beat player $j$ throughout all the training matches. Furthermore, we require a matrix $n_{i,j}$ containing the number of matches between player $i$ and $j$.

#initialise the variables

w <- array(0, c(no_players, no_players))
n <- array(0, c(no_players, no_players))
row.names(w) <- players
colnames(w) <- row.names(w)

We call the function w_fill() to create this matrix from the data. Let us look at the size of this matrix.

w <- w_fill(train_data, players)
n <- w + t(w)
dim(w)

As mentioned in the report. There is a sufficient condition for the convergence of the EM algorithm. We ensure that the directed graph, where a node goes from $i$ to $j$ is present if player $i$ beat player $j$ at least once, is strongly connected.

Let us call the function strongly_connected() to see whether the graph generated by the matrix $w$ is strongly connected.

strongly_connected(w)

Since it is not, we shall reduce the set of matches to those containing the nodes of the largest strongly connected component. Let us also see how much this reduces our data.

reduce <- reduce_to_connected(w)
W <- reduce$matrix
reduced_indexes <- reduce$new_index
dim(W)

Finally let us verify that this new matrix defines a strongly connected graph.

strongly_connected(W)

Let us create a subset of the original players that are found within this reduced matrix.

reduced_players <- players[reduced_indexes]

We now call the EM function to calculate the estimated scores of all the players with prior parameters a=1, b=0.

lambda_hat <- EM(W,150,1,0,0.00001)

Let's quickly look at which player we have assigned the highest skill parameter to.

reduced_players[which(lambda_hat==max(lambda_hat))]

This is in fact the player ID of Serena Williams. Somewhat promising...

Let us now attempt to predict the winner in the test dataset and compute what proportion of times it predicts it as the most likely outcome.

probs <- predict_winner(test_data,lambda_hat,reduced_players)
sum(probs>=0.5)/length(probs)

However, as mentioned in the report, this is not necessarily a useful metric of performance since it does not distinguish between a correct prediction with varying probabilities. We shall therefore adopt the log scoring rule, and use that to compare our models.

log_sr <- sum(log(probs))/length(probs)
log_sr

Let us also look at the distribution of the predicted probabilities of the true winner winning. As well, we shall split this into when it makes a correct binary prediction and when it does not. Let us also look at the log probabilities that are used to generate the average proper score.

df <- tibble(probs)
df1 <- filter(df,probs>0.5)
df2 <- filter(df,probs<0.5)
df3 <- tibble(log(probs))
ggplot(df, aes(x=probs)) + geom_histogram(binwidth = 0.05,fill= 'limegreen')

ggplot(df1, aes(x=probs)) +geom_histogram(binwidth = 0.05,fill= 'limegreen')

ggplot(df2, aes(x=probs)) + geom_histogram(binwidth = 0.05,fill= 'limegreen')

Here we add the line marking the point of a uniformly random guess.

ggplot(df3, aes(x=log(probs))) + geom_histogram(binwidth = 0.05,fill= 'limegreen') + geom_vline(xintercept =  log(0.5), color='red',size=1.5)

Logistic Regression

We shall now move on to fitting a logistic regression to the data and attempt to classify the outcome of the test matches. Let us first call our function that constructs the feature matrix.

features <- feature_matrix_train(my_data)
feature_matrix <- features$matrix

Here we have one hot encoded the form of each player on the surface that the match is being played on. We have the age, the percentage wins of each player, the head to head percentage wins and we attach a row of ones that will constitute the intercept. A 1 in the winner row corresponds to player 1 winning the match, and a -1 for the opposite result.

feature_matrix[,1:5]

Let us also construct the test matrix.

test_matrix <- feature_matrix_test(my_data)

We are now in a position to call our logistic regression function.

# We split the feature matrix into input and output.

x_dat <- feature_matrix[-length(feature_matrix[,1]),]

y_dat <- feature_matrix[length(feature_matrix[,1]),]

# We perform logistic regression.

what <- log_reg(x_dat,y_dat)

x_test <- test_matrix$matrix

f <- t(what)%*%as.matrix(x_test)
pred <- sign(sigma_f(f)-0.5)

Let us compute the prediction accuracy. That is the proportion of matches in the test set that we predict correctly.

sum(pred==1)/length(pred)

This seems an improvement on our Bradley-Terry model, however this metric could be misleading. Again let us consider the log proper scoring rule.

probs <- as.numeric(sigma_f(f))
log_sr1 <- sum(log(probs))/length(probs)
log_sr1

df <- tibble(probs)
df1 <- filter(df,probs>0.5)
df2 <- filter(df,probs<0.5)
df3 <- tibble(log(probs))
ggplot(df, aes(x=probs)) + geom_histogram(binwidth = 0.05,fill= 'limegreen')

ggplot(df1, aes(x=probs)) +geom_histogram(binwidth = 0.05,fill= 'limegreen')

ggplot(df2, aes(x=probs)) + geom_histogram(binwidth = 0.05,fill= 'limegreen')

ggplot(df3, aes(x=log(probs))) + geom_histogram(binwidth = 0.05,fill= 'limegreen') + geom_vline(xintercept =  log(0.5), color='red',size=1.5)

We observe a cluster of predictions achieving a significantly lower score. This must be affecting the total log score of this model.

Let us look at the magnitudes of the weights to see what features the model finds most important

tib <- tibble(Features=rownames(x_dat),Magnitude=(what))
ggplot(data=tib, aes(x=Features, y=Magnitude)) +
  geom_bar(stat="identity",fill= 'limegreen')

We observe a significantly larger magnitude of the weight on the p1_headtohead feature. This might be the cause of the poorer performance.

Logistic Regression with Bradley-Terry

Let us construct the new feature matrix. To do this we need to extract the Bradley-Terry scores of the players playing in each match.

# We create a table of the players in each of the matches

train_players <- t(features$players)
train_players <- data.frame(winner_id=train_players[,1],loser_id=train_players[,2])

# And we pass this into our predict winner function, which returns the Bradley-Terry probabilities.

BTprob <- predict_winner(train_players,lambda_hat,reduced_players)

# We then attach this to the feature matrix.

winner <- feature_matrix[length(feature_matrix[,1]),]
feature_matrix_BT <- feature_matrix[-length(feature_matrix[,1]),]
feature_matrix_BT <- rbind(feature_matrix_BT,BTprob,winner)

# We then repeat these steps to the test matrix.

test_players <- t(feature_matrix_test(my_data)$players)
test_players <- data.frame(winner_id=test_players[,1],loser_id=test_players[,2])


BTprob_test <- predict_winner(test_players,lambda_hat,reduced_players)
test_matrix_BT <- rbind(test_matrix$matrix,BTprob_test)

Now again performing logistic regression.

x_dat <- feature_matrix_BT[-length(feature_matrix_BT[,1]),]

y_dat <- feature_matrix_BT[length(feature_matrix_BT[,1]),]

what <- log_reg(x_dat,y_dat)

x_test <- test_matrix_BT

f <- t(what)%*%as.matrix(x_test)
pred <- sign(sigma_f(f)-0.5)

Here again is our prediction accuracy.

sum(pred==1)/length(pred)

And the log proper scoring rule.

probs <- as.numeric(sigma_f(f))
log_sr2 <- sum(log(probs))/length(probs)
log_sr2

df <- tibble(probs)
df1 <- filter(df,probs>0.5)
df2 <- filter(df,probs<0.5)
df3 <- tibble(log(probs))
ggplot(df, aes(x=probs)) + geom_histogram(binwidth = 0.05,fill= 'limegreen')

ggplot(df1, aes(x=probs)) +geom_histogram(binwidth = 0.05,fill= 'limegreen')

ggplot(df2, aes(x=probs)) + geom_histogram(binwidth = 0.05,fill= 'limegreen')

ggplot(df3, aes(x=log(probs))) + geom_histogram(binwidth = 0.05,fill= 'limegreen') + geom_vline(xintercept =  log(0.5), color='red',size=1.5)

tib <- tibble(Features=rownames(x_dat),Magnitude=(what))
ggplot(data=tib, aes(x=Features, y=Magnitude)) +
  geom_bar(stat="identity",fill= 'limegreen')

Logistic Regression without `p1_headtohead`

Let us precede, without p1_headtohead in our model. Repeating each step we did earlier.

features <- feature_matrix_train(my_data, match_date = "2019-01-01",features = c('surface_form', 'age', 'match_form'))
feature_matrix <- features$matrix

test_matrix <- feature_matrix_test(my_data,features = c('surface_form', 'age', 'match_form'))

x_dat <- feature_matrix[-length(feature_matrix[,1]),]

y_dat <- feature_matrix[length(feature_matrix[,1]),]

what <- log_reg(x_dat,y_dat)

x_test <- test_matrix$matrix

f <- t(what)%*%as.matrix(x_test)
pred <- sign(sigma_f(f)-0.5)

Here is our new prediction accuracy.

sum(pred==1)/length(pred)

Our new log score:

probs <- as.numeric(sigma_f(f))
log_sr3 <- sum(log(probs))/length(probs)
log_sr3

As well as the same plots:

df <- tibble(probs)
df1 <- filter(df,probs>0.5)
df2 <- filter(df,probs<0.5)
df3 <- tibble(log(probs))
ggplot(df, aes(x=probs)) + geom_histogram(binwidth = 0.05,fill= 'limegreen')

ggplot(df1, aes(x=probs)) +geom_histogram(binwidth = 0.05,fill= 'limegreen')

ggplot(df2, aes(x=probs)) + geom_histogram(binwidth = 0.05,fill= 'limegreen')

ggplot(df3, aes(x=log(probs))) + geom_histogram(binwidth = 0.05,fill= 'limegreen') + geom_vline(xintercept =  log(0.5), color='red',size=1.5)

We no longer have the cluster of poorly predicted probabilities.

tib <- tibble(Features=rownames(x_dat),Magnitude=(what))
ggplot(data=tib, aes(x=Features, y=Magnitude)) +
  geom_bar(stat="identity",fill= 'limegreen')

The model seems to be assigning most importance to the form of the players. Not surprisingly...

Logistic Regression with Bradley-Terry probabilities but without `p1_headtohead`

Finally, let us add the Bradley-Terry probabilities to the logistic regression. We shall add these as the final feature to our model and we shall continue without the p1_headtohead feature.

We need to attach the Bradley-Terry probabilities once again.

winner <- feature_matrix[length(feature_matrix[,1]),]
feature_matrix_BT <- feature_matrix[-length(feature_matrix[,1]),]
feature_matrix_BT <- rbind(feature_matrix_BT,BTprob,winner)

test_matrix_BT <- rbind(test_matrix$matrix,BTprob_test)

Now again performing logistic regression.

x_dat <- feature_matrix_BT[-length(feature_matrix_BT[,1]),]

y_dat <- feature_matrix_BT[length(feature_matrix_BT[,1]),]

what <- log_reg(x_dat,y_dat)

x_test <- test_matrix_BT

f <- t(what)%*%as.matrix(x_test)
pred <- sign(sigma_f(f)-0.5)

Here again is our prediction accuracy.

sum(pred==1)/length(pred)

And the log proper scoring rule.

probs <- as.numeric(sigma_f(f))
log_sr4 <- sum(log(probs))/length(probs)
log_sr4

df <- tibble(probs)
df1 <- filter(df,probs>0.5)
df2 <- filter(df,probs<0.5)
df3 <- tibble(log(probs))
ggplot(df, aes(x=probs)) + geom_histogram(binwidth = 0.05,fill= 'limegreen')

ggplot(df1, aes(x=probs)) +geom_histogram(binwidth = 0.05,fill= 'limegreen')

ggplot(df2, aes(x=probs)) + geom_histogram(binwidth = 0.05,fill= 'limegreen')

ggplot(df3, aes(x=log(probs))) + geom_histogram(binwidth = 0.05,fill= 'limegreen') + geom_vline(xintercept =  log(0.5), color='red',size=1.5)

tib <- tibble(Features=rownames(x_dat),Magnitude=(what))
ggplot(data=tib, aes(x=Features, y=Magnitude)) +
  geom_bar(stat="identity",fill= 'limegreen')

Finally let us compare the log score of all our models.

table <- data.frame(Model=c('VBT','VLR','LRBT','VLR-H2H','LRBT-H2H'),Log_Score = c(log_sr,log_sr1,log_sr2,log_sr3,log_sr4))
table

ss15859/tennis.wta documentation built on Jan. 24, 2021, 12:46 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

ss15859/tennis.wta
Predicts Tennis Matches

In ss15859/tennis.wta: Predicts Tennis Matches

Introduction

Bradley-Terry Model

Logistic Regression

Logistic Regression with Bradley-Terry

Logistic Regression without `p1_headtohead`

Logistic Regression with Bradley-Terry probabilities but without `p1_headtohead`

R Package Documentation

Browse R Packages

We want your feedback!

ss15859/tennis.wta Predicts Tennis Matches

In ss15859/tennis.wta: Predicts Tennis Matches

Introduction

Bradley-Terry Model

Logistic Regression

Logistic Regression with Bradley-Terry

Logistic Regression without p1_headtohead

Logistic Regression with Bradley-Terry probabilities but without p1_headtohead

R Package Documentation

Browse R Packages

We want your feedback!

ss15859/tennis.wta
Predicts Tennis Matches

Logistic Regression without `p1_headtohead`

Logistic Regression with Bradley-Terry probabilities but without `p1_headtohead`