knitr::opts_chunk$set(echo = TRUE)
The aim of this document is to outline how the functions and data in the package tennis.wta
were used to produce the results in our report.
Let us begin by loading the package.
library(tennis.wta) library(tidyverse) set.seed(50)
Now we read in the .csv data found within the package. Each file contains a year of tennis matches from the WTA tour and we have years 2000-2019.
path_allfiles <- system.file("extdata", package = "tennis.wta") file_list = list.files(path_allfiles) #variables is the list of which columns we will be using variables <- c('tourney_date','surface', 'winner_id', 'loser_id', 'winner_age', 'loser_age') my_data <- load_data(file_list, variables)
Each row in our data contains information about a single tennis match from the WTA tour beginning in the year 2000. It contains the date of the tournament in which the match was played, the surface the match was played on, then the IDs and ages of the winner and loser.
head(my_data)
We shall first seek to fit a Bradley-Terry model to our data, to try and predict the outcome of the matches in 2019.
We begin by splitting the data into a testing and training set
train_data <- my_data %>% filter(tourney_date < "2019-01-01") test_data <- my_data %>% filter(tourney_date > "2019-01-01") # find the IDs and number of players in the training set players <- unique(c(train_data$winner_id, train_data$loser_id)) no_players <- length(players)
Next we must create a matrix $w_{i,j}$ containg the number of times a player $i$ beat player $j$ throughout all the training matches. Furthermore, we require a matrix $n_{i,j}$ containing the number of matches between player $i$ and $j$.
#initialise the variables w <- array(0, c(no_players, no_players)) n <- array(0, c(no_players, no_players)) row.names(w) <- players colnames(w) <- row.names(w)
We call the function w_fill()
to create this matrix from the data. Let us look at the size of this matrix.
w <- w_fill(train_data, players) n <- w + t(w) dim(w)
As mentioned in the report. There is a sufficient condition for the convergence of the EM algorithm. We ensure that the directed graph, where a node goes from $i$ to $j$ is present if player $i$ beat player $j$ at least once, is strongly connected.
Let us call the function strongly_connected()
to see whether the graph generated by the matrix $w$ is strongly connected.
strongly_connected(w)
Since it is not, we shall reduce the set of matches to those containing the nodes of the largest strongly connected component. Let us also see how much this reduces our data.
reduce <- reduce_to_connected(w) W <- reduce$matrix reduced_indexes <- reduce$new_index dim(W)
Finally let us verify that this new matrix defines a strongly connected graph.
strongly_connected(W)
Let us create a subset of the original players that are found within this reduced matrix.
reduced_players <- players[reduced_indexes]
We now call the EM function to calculate the estimated scores of all the players with prior parameters a=1, b=0.
lambda_hat <- EM(W,150,1,0,0.00001)
Let's quickly look at which player we have assigned the highest skill parameter to.
reduced_players[which(lambda_hat==max(lambda_hat))]
This is in fact the player ID of Serena Williams. Somewhat promising...
Let us now attempt to predict the winner in the test dataset and compute what proportion of times it predicts it as the most likely outcome.
probs <- predict_winner(test_data,lambda_hat,reduced_players) sum(probs>=0.5)/length(probs)
However, as mentioned in the report, this is not necessarily a useful metric of performance since it does not distinguish between a correct prediction with varying probabilities. We shall therefore adopt the log scoring rule, and use that to compare our models.
log_sr <- sum(log(probs))/length(probs) log_sr
Let us also look at the distribution of the predicted probabilities of the true winner winning. As well, we shall split this into when it makes a correct binary prediction and when it does not. Let us also look at the log probabilities that are used to generate the average proper score.
df <- tibble(probs) df1 <- filter(df,probs>0.5) df2 <- filter(df,probs<0.5) df3 <- tibble(log(probs)) ggplot(df, aes(x=probs)) + geom_histogram(binwidth = 0.05,fill= 'limegreen')
ggplot(df1, aes(x=probs)) +geom_histogram(binwidth = 0.05,fill= 'limegreen')
ggplot(df2, aes(x=probs)) + geom_histogram(binwidth = 0.05,fill= 'limegreen')
Here we add the line marking the point of a uniformly random guess.
ggplot(df3, aes(x=log(probs))) + geom_histogram(binwidth = 0.05,fill= 'limegreen') + geom_vline(xintercept = log(0.5), color='red',size=1.5)
We shall now move on to fitting a logistic regression to the data and attempt to classify the outcome of the test matches. Let us first call our function that constructs the feature matrix.
features <- feature_matrix_train(my_data) feature_matrix <- features$matrix
Here we have one hot encoded the form of each player on the surface that the match is being played on. We have the age, the percentage wins of each player, the head to head percentage wins and we attach a row of ones that will constitute the intercept. A 1 in the winner row corresponds to player 1 winning the match, and a -1 for the opposite result.
feature_matrix[,1:5]
Let us also construct the test matrix.
test_matrix <- feature_matrix_test(my_data)
We are now in a position to call our logistic regression function.
# We split the feature matrix into input and output. x_dat <- feature_matrix[-length(feature_matrix[,1]),] y_dat <- feature_matrix[length(feature_matrix[,1]),] # We perform logistic regression. what <- log_reg(x_dat,y_dat) x_test <- test_matrix$matrix f <- t(what)%*%as.matrix(x_test) pred <- sign(sigma_f(f)-0.5)
Let us compute the prediction accuracy. That is the proportion of matches in the test set that we predict correctly.
sum(pred==1)/length(pred)
This seems an improvement on our Bradley-Terry model, however this metric could be misleading. Again let us consider the log proper scoring rule.
probs <- as.numeric(sigma_f(f)) log_sr1 <- sum(log(probs))/length(probs) log_sr1
df <- tibble(probs) df1 <- filter(df,probs>0.5) df2 <- filter(df,probs<0.5) df3 <- tibble(log(probs)) ggplot(df, aes(x=probs)) + geom_histogram(binwidth = 0.05,fill= 'limegreen')
ggplot(df1, aes(x=probs)) +geom_histogram(binwidth = 0.05,fill= 'limegreen')
ggplot(df2, aes(x=probs)) + geom_histogram(binwidth = 0.05,fill= 'limegreen')
ggplot(df3, aes(x=log(probs))) + geom_histogram(binwidth = 0.05,fill= 'limegreen') + geom_vline(xintercept = log(0.5), color='red',size=1.5)
We observe a cluster of predictions achieving a significantly lower score. This must be affecting the total log score of this model.
Let us look at the magnitudes of the weights to see what features the model finds most important
tib <- tibble(Features=rownames(x_dat),Magnitude=(what)) ggplot(data=tib, aes(x=Features, y=Magnitude)) + geom_bar(stat="identity",fill= 'limegreen')
We observe a significantly larger magnitude of the weight on the p1_headtohead
feature. This might be the cause of the poorer performance.
Let us construct the new feature matrix. To do this we need to extract the Bradley-Terry scores of the players playing in each match.
# We create a table of the players in each of the matches train_players <- t(features$players) train_players <- data.frame(winner_id=train_players[,1],loser_id=train_players[,2]) # And we pass this into our predict winner function, which returns the Bradley-Terry probabilities. BTprob <- predict_winner(train_players,lambda_hat,reduced_players) # We then attach this to the feature matrix. winner <- feature_matrix[length(feature_matrix[,1]),] feature_matrix_BT <- feature_matrix[-length(feature_matrix[,1]),] feature_matrix_BT <- rbind(feature_matrix_BT,BTprob,winner) # We then repeat these steps to the test matrix. test_players <- t(feature_matrix_test(my_data)$players) test_players <- data.frame(winner_id=test_players[,1],loser_id=test_players[,2]) BTprob_test <- predict_winner(test_players,lambda_hat,reduced_players) test_matrix_BT <- rbind(test_matrix$matrix,BTprob_test)
Now again performing logistic regression.
x_dat <- feature_matrix_BT[-length(feature_matrix_BT[,1]),] y_dat <- feature_matrix_BT[length(feature_matrix_BT[,1]),] what <- log_reg(x_dat,y_dat) x_test <- test_matrix_BT f <- t(what)%*%as.matrix(x_test) pred <- sign(sigma_f(f)-0.5)
Here again is our prediction accuracy.
sum(pred==1)/length(pred)
And the log proper scoring rule.
probs <- as.numeric(sigma_f(f)) log_sr2 <- sum(log(probs))/length(probs) log_sr2
df <- tibble(probs) df1 <- filter(df,probs>0.5) df2 <- filter(df,probs<0.5) df3 <- tibble(log(probs)) ggplot(df, aes(x=probs)) + geom_histogram(binwidth = 0.05,fill= 'limegreen')
ggplot(df1, aes(x=probs)) +geom_histogram(binwidth = 0.05,fill= 'limegreen')
ggplot(df2, aes(x=probs)) + geom_histogram(binwidth = 0.05,fill= 'limegreen')
ggplot(df3, aes(x=log(probs))) + geom_histogram(binwidth = 0.05,fill= 'limegreen') + geom_vline(xintercept = log(0.5), color='red',size=1.5)
tib <- tibble(Features=rownames(x_dat),Magnitude=(what)) ggplot(data=tib, aes(x=Features, y=Magnitude)) + geom_bar(stat="identity",fill= 'limegreen')
p1_headtohead
Let us precede, without p1_headtohead
in our model. Repeating each step we did earlier.
features <- feature_matrix_train(my_data, match_date = "2019-01-01",features = c('surface_form', 'age', 'match_form')) feature_matrix <- features$matrix
test_matrix <- feature_matrix_test(my_data,features = c('surface_form', 'age', 'match_form'))
x_dat <- feature_matrix[-length(feature_matrix[,1]),] y_dat <- feature_matrix[length(feature_matrix[,1]),] what <- log_reg(x_dat,y_dat) x_test <- test_matrix$matrix f <- t(what)%*%as.matrix(x_test) pred <- sign(sigma_f(f)-0.5)
Here is our new prediction accuracy.
sum(pred==1)/length(pred)
Our new log score:
probs <- as.numeric(sigma_f(f)) log_sr3 <- sum(log(probs))/length(probs) log_sr3
As well as the same plots:
df <- tibble(probs) df1 <- filter(df,probs>0.5) df2 <- filter(df,probs<0.5) df3 <- tibble(log(probs)) ggplot(df, aes(x=probs)) + geom_histogram(binwidth = 0.05,fill= 'limegreen')
ggplot(df1, aes(x=probs)) +geom_histogram(binwidth = 0.05,fill= 'limegreen')
ggplot(df2, aes(x=probs)) + geom_histogram(binwidth = 0.05,fill= 'limegreen')
ggplot(df3, aes(x=log(probs))) + geom_histogram(binwidth = 0.05,fill= 'limegreen') + geom_vline(xintercept = log(0.5), color='red',size=1.5)
We no longer have the cluster of poorly predicted probabilities.
tib <- tibble(Features=rownames(x_dat),Magnitude=(what)) ggplot(data=tib, aes(x=Features, y=Magnitude)) + geom_bar(stat="identity",fill= 'limegreen')
The model seems to be assigning most importance to the form of the players. Not surprisingly...
p1_headtohead
Finally, let us add the Bradley-Terry probabilities to the logistic regression. We shall add these as the final feature to our model and we shall continue without the p1_headtohead
feature.
We need to attach the Bradley-Terry probabilities once again.
winner <- feature_matrix[length(feature_matrix[,1]),] feature_matrix_BT <- feature_matrix[-length(feature_matrix[,1]),] feature_matrix_BT <- rbind(feature_matrix_BT,BTprob,winner) test_matrix_BT <- rbind(test_matrix$matrix,BTprob_test)
Now again performing logistic regression.
x_dat <- feature_matrix_BT[-length(feature_matrix_BT[,1]),] y_dat <- feature_matrix_BT[length(feature_matrix_BT[,1]),] what <- log_reg(x_dat,y_dat) x_test <- test_matrix_BT f <- t(what)%*%as.matrix(x_test) pred <- sign(sigma_f(f)-0.5)
Here again is our prediction accuracy.
sum(pred==1)/length(pred)
And the log proper scoring rule.
probs <- as.numeric(sigma_f(f)) log_sr4 <- sum(log(probs))/length(probs) log_sr4
df <- tibble(probs) df1 <- filter(df,probs>0.5) df2 <- filter(df,probs<0.5) df3 <- tibble(log(probs)) ggplot(df, aes(x=probs)) + geom_histogram(binwidth = 0.05,fill= 'limegreen')
ggplot(df1, aes(x=probs)) +geom_histogram(binwidth = 0.05,fill= 'limegreen')
ggplot(df2, aes(x=probs)) + geom_histogram(binwidth = 0.05,fill= 'limegreen')
ggplot(df3, aes(x=log(probs))) + geom_histogram(binwidth = 0.05,fill= 'limegreen') + geom_vline(xintercept = log(0.5), color='red',size=1.5)
tib <- tibble(Features=rownames(x_dat),Magnitude=(what)) ggplot(data=tib, aes(x=Features, y=Magnitude)) + geom_bar(stat="identity",fill= 'limegreen')
Finally let us compare the log score of all our models.
table <- data.frame(Model=c('VBT','VLR','LRBT','VLR-H2H','LRBT-H2H'),Log_Score = c(log_sr,log_sr1,log_sr2,log_sr3,log_sr4)) table
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.