1 Introduction

Ever since watching Super Bowl XXXIV in my friend's basement in January 2000, I have become obsessed with the National Football League. Much like the rest of America, Sundays in the fall in my apartment are spent doing nothing but watching football. Over the years, I began to realize I can combine my two passions, football and math, to determine when teams clinch playoff spots, division titles, first-round byes, or home-field advantage. Despite having the computing power to make these determinations, I still like to do the math on my own, because I find it to be fun. However, this doesn't mean I can't create a function in R to simulate playoff seedings based on game outcomes for both myself and other fans. As I wrote this function, and other to break the possible ties in the standings, I took into account that certain simulators are available online such as ESPN.com or NFL.com. However, these only allow teams to enter winners to simulate the results, while the function I wrote allows teams to enter specific game scores, simulate the scores of games not specified, and allows for the possibility of tie games, which the other functions do not. This paper is broken down into three sections about the R package ProjectNFL that I created for this simulation: Section 2 discusses the methods of determining the teams for the NFL playoffs, Section 3 goes into the functions to scrape data from NFL.com and break any ties in the standings, and Section 4 will go into the main function of the package NFLSim, which creates multiple simulations based on specified game results along with a function created to visulaize the data.

2 NFL Background

The National Football League (NFL) consists of 32 teams across 30 cities in the United States. The teams are broken down into 2 conferences of 16 teams each based on traditional alignment (before NFL merger with the American Football League (AFL)). Those conferences are each broken down into 4 divisions of 4 teams based on both geography and to maintain long-standing rivalries. The breakdown of the teams is shown in Table 1 . Each of the 32 teams plays a total of 16 games over 17 weeks. 6 of those games are within the division, and 12 are within the conference, with opponents determined by a rotating schedule. The playoff teams are determined based on a team's overall win-loss record followed by a meticulous set of tiebreakers to determine the six teams from each conference that make the playoffs. The official definition of seedings in each conference are listed below:

\begin{enumerate} \item The division champion with the best record \item The division champion with the second-best record \item The division champion with the third-best record \item The division champion with the fourth-best record \item The non-division champion with the best record \item The non-division champion with the second-best record \end{enumerate}

\newpage \begin{center} \textit{Table 1: List of NFL Teams by Conference and Division} \end{center} \begin{center}\begin{tabular}{l|l} American Football Conference (AFC) & National Football Conference (NFC) \ \hline \hline AFC East & NFC East \ \hline Buffalo Bills & Dallas Cowboys \ Miami Dolphins & New York Giants \ New England Patriots & Philadelphia Eagles \ New York Jets & Washington Redskins \ \hline AFC North & NFC North \ \hline Baltimore Ravens & Chicago Bears \ Cincinnati Bengals & Detroit Lions \ Cleveland Browns & Green Bay Packers \ Pittsburgh Steelers & Minnesota Vikings \ \hline AFC South & NFC South \ \hline Houston Texans & Atlanta Falcons \ Indianapolis Colts & Carolina Panthers \ Jacksonville Jaguars & New Orleans Saints \ Tennessee Titans & Tampa Bay Buccaneers \ \hline AFC West & NFC West \ \hline Denver Broncos & Arizona Cardinals \ Kansas City Chiefs & Los Angeles Rams \ Los Angeles Chargers & San Francisco 49ers \ Oakland Raiders & Seattle Seahawks \end{tabular}\end{center}.

The non-division champions that make the playoffs are officially known as Wild Card teams. In breaking tied win-loss records in the National Football League, a different set of tiebreakers exists for teams in the same division versus teams not in the same division. To break ties within divisions, the following procedure is used, regardless of the number of tied teams:

\begin{enumerate} \item Head-to-Head (best win-loss record in games among the tied teams) \item Best win-loss record in games within the division \item Best win-loss record in games between common opponents \item Best win-loss record in games within the conference \item Strength of victory (win-loss record of all teams defeated) \item Strength of Schedule (win-loss record of all teams played) \item Best combined ranking among conference teams in terms of points scored and points allowed \item Best combined ranking all conference teams in terms of points scored and points allowed \item Best net points in games between common opponents \item Best net points in all games \item Best net touchdowns in all games \item Coin toss \end{enumerate}

In cases of multiple tied teams, once a team is eliminated due to an lower statistic than the others, the tiebreaking procedure starts over for all the remaining teams. Also, once the top team is determined from the tiebreaking procedure, the procedure starts again for the remaining teams to break the ties. In the cases of breaking ties between teams within the conference, but in different divisions, the process is slightly different.

\begin{enumerate} \item Apply division tiebreakers to eliminate all but the highest ranked club in each division \item Head-to-Head (only applies if one team beat all the other tied teams or one team lost to all the other tied teams) \item Best win-loss record in games within the conference \item Best win-loss record in games between common opponents \item Strength of victory \item Strength of Schedule \item Best combined ranking among conference teams in terms of points scored and points allowed \item Best combined ranking all conference teams in terms of points scored and points allowed \item Best net points in conference games \item Best net points in all games \item Best net touchdowns in all games \item Coin toss \end{enumerate}

If the number of common opponents between the tied teams is less than four, that step is skipped. The resetting of the procedure after teams have been eliminated or a top team is determined is done the same as with divisions. Once all of this procedure has been done, all 12 playoff teams will have been determined, and the playoffs are ready to begin!

3 Tiebreaking Code

The following is information about functions and data frames in the package ProjectNFL to obtain current seedings. To be able to break ties, we first need to obtain the scores of the games to determine the seedings. Those scores are scraped from NFL.com using the following two formulas.

WeeklyScores <- function(Week=1){
  require(rvest)
  require(tidyverse)
  url <- paste("http://www.nfl.com/schedules/2016/REG",Week,sep="")
  html <- read_html(url)
  NFL <- html %>% html_nodes(".schedules-list-date,.time,.away,.home") %>% html_text()
  NFL <- gsub("\r","",gsub("\n","",gsub("\t","",NFL)))
  Games <- length(which(NFL=="FINAL"))
  for (i in 1:Games){
    Date <- ifelse(NFL[(8*(i-1)+1)]!="FINAL",NFL[(8*(i-1)+1)],Date)
    NFL <- if(i==1) {NFL} else if(NFL[(8*(i-1)+1)]==Date) {NFL} else {c(NFL[1:(8*(i-1))],Date,NFL[(8*(i-1)+1):(length(NFL))])}
  }
  tmp <- data.frame(matrix(NFL,nrow=Games,byrow=TRUE))
  tmp <- tmp %>%
    mutate(Date=as.character(X1),AwayTeam=as.character(X3),AwayScore=as.numeric(as.character(X5)),HomeTeam=as.character(X8),HomeScore=as.numeric(as.character(X6))) %>%
    select(Date,AwayTeam,AwayScore,HomeTeam,HomeScore)
  tmp
}

WeeklyGames <- function(Week=1){
  require(rvest)
  require(tidyverse)
  url <- paste("http://www.nfl.com/schedules/2016/REG",Week,sep="")
  html <- read_html(url)
  NFL <- html %>% html_nodes(".schedules-list-date,.time,.team-name") %>% html_text()
  NFL <- gsub("\r","",gsub("\n","",gsub("\t","",NFL)))
  Games <- length(which(NFL=="FINAL"))
  for (i in 1:Games){
    Date <- ifelse(NFL[(4*(i-1)+1)]!="FINAL",NFL[(4*(i-1)+1)],Date)
    NFL <- if(i==1) {NFL} else if(NFL[(4*(i-1)+1)]==Date) {NFL} else {c(NFL[1:(4*(i-1))],Date,NFL[(4*(i-1)+1):(length(NFL))])}
  }
  tmp <- data.frame(matrix(NFL,nrow=Games,byrow=TRUE))
  tmp <- tmp %>%
    mutate(Date=as.character(X1),AwayTeam=as.character(X3),AwayScore=rep("",Games),HomeTeam=as.character(X4),HomeScore=rep("",Games)) %>%
    select(Date,AwayTeam,AwayScore,HomeTeam,HomeScore)
  tmp
}

Each of these output a dataframe of scores or games, and all 17 weeks data are stored in one dataset using the WeeklyUpdate function, where the input is the Week that was just completed. These scores are then aggregated to obtain necessary statistics for breaking the ties using the function UpdateTeams. Three other tiebreaking statistics, head-to-head, best win-loss record in games between common opponents, and best net points in games between common opponents, are determined based on the teams for whom the ties will be broken, so additional functions to accomodate these tiebreaking statistics must be added. These are available in HeadtoHead, CommonGames, and `CommonGamesPts respectively. Note, the output for HeadtoHead is a team's wins-losses versus the other teams selected, CommonGames is a winning percentage, and CommonGamesPts is a net point total. Because the tiebreaking procedure is different for division vs. conference, and the tiebreakers restart once teams break the tie or are eliminated from the tiebreaker, multiple functions have been created to account for this structure. TwoTieDiv, ThreeTieDiv, FourTieDiv, TwoTieConf, ThreeTieConf, and FourTieConf are used to break ties between random teams entered into the function, and these are then aggregated to create the final division rankings and conference rankings using FinalDivRank and FinalRank. The output in both FinalDivRank and FinalRank is the same as in UpdateTeams, but additional columns are added to show a teams ranking in the division and ranking in the conference.

4 Simulation Function

The main function used in this package in NFLSim, which incorporates FinalRank with simulated game results to project what a team's final seeding in the conference playoffs might be. The special part of this function is that it allows users to enter specific game results so that they may perform more specific anaylses than may otherwise be available. Not only can users enter result of the games in terms of winners or losers, but specific game scores are available to be entered as well. The full function is shown below, and more about how the function works in terms of simulating other games will be discussed as well.

NFLSim <- function(Games=NULL,sims=100,data=WeeklyUpdate()){
  require(rvest)
  require(tidyverse)
  rawscore <- data %>% gather(key="ID",value="Team",c(3,5)) %>% mutate(Score=ifelse(ID=="AwayTeam",AwayScore,HomeScore)) %>% select(Team,Score)
  rawagainst <- data %>% gather(key="ID",value="Team",c(3,5)) %>% mutate(Score=ifelse(ID=="AwayTeam",HomeScore,AwayScore)) %>% select(Team,Score)
  statsscore <- rawscore %>% group_by(Team) %>% summarise(mean=mean(Score,na.rm=TRUE),sd=sd(Score,na.rm=TRUE))
  statsagainst <- rawagainst %>% group_by(Team) %>% summarise(mean=mean(Score,na.rm=TRUE),sd=sd(Score,na.rm=TRUE))
  tmp <- data
  x <- Games
  if (is.null(x)==FALSE){
    y <- gregexpr("s[^A-Za-z]*[0-9]+",x)
    Spot <- as.vector(y[[1]])
    Lens <- attr(y[[1]],"match.length")
    z <- matrix(c(Spot,Lens),nrow=2,byrow=TRUE)
    z[2,] <- apply(z,2,function(x){x[2]=x[1]+x[2]})
    z[2,ncol(z)] <- nchar(x)
    a <- as.vector(c(0,z))
    b <- 0
    for (i in 1:(length(a)-1)) {
      b[i] <- substr(x,a[i]+1,a[i+1])
    }
    b[seq_along(b)[seq_along(b) %% 2 ==0]] <- gsub("[^0-9]","",b[seq_along(b) %% 2 == 0])
    c <- data.frame(matrix(b,ncol=4,byrow=TRUE))
    colnames(c) <- c("AwayTeam","AwayScore","HomeTeam","HomeScore")
    for (j in 1:nrow(c)){
      if (length(which(data$AwayTeam==c$AwayTeam[1] & data$HomeTeam==c$HomeTeam[1] & is.na(data$AwayScore))) == 0) {
        stop(paste(c$AwayTeam[1]," at ",c$HomeTeam[1]," is not a game still remaining to be played.",sep=""))
      }
      tmp$AwayScore[tmp$AwayTeam==c$AwayTeam[j] & tmp$HomeTeam==c$HomeTeam[j] & is.na(tmp$AwayScore)] <- c$AwayScore[j]
      tmp$HomeScore[tmp$AwayTeam==c$AwayTeam[j] & tmp$HomeTeam==c$HomeTeam[j] & is.na(tmp$HomeScore)] <- c$HomeScore[j]
    }
  }
  for (s in 1:sims){
    tmp$AwayScore <- as.numeric(apply(dat[,3:6],1,function(x){
      ifelse(is.na(x[2]),
        ifelse(max(round(rnorm(1,
          mean=mean(c(statsscore$mean[statsscore$Team==x[1]],statsagainst$mean[statsagainst$Team==x[3]])),
          sd=sqrt((statsscore$sd[statsscore$Team==x[1]]^2+(statsagainst$sd[statsagainst$Team==x[3]])^2)/4))),0)==1,
          2,
          max(round(rnorm(1,
          mean=mean(c(statsscore$mean[statsscore$Team==x[1]],statsagainst$mean[statsagainst$Team==x[3]])),
          sd=sqrt((statsscore$sd[statsscore$Team==x[1]]^2+(statsagainst$sd[statsagainst$Team==x[3]])^2)/4))),0)),x[2])}))
    tmp$HomeScore <- as.numeric(apply(dat[,3:6],1,function(x){
      ifelse(is.na(x[4]),
        ifelse(max(round(rnorm(1,
          mean=mean(c(statsscore$mean[statsscore$Team==x[3]],statsagainst$mean[statsagainst$Team==x[1]])),
          sd=sqrt((statsscore$sd[statsscore$Team==x[3]]^2+(statsagainst$sd[statsagainst$Team==x[1]])^2)/4))),0)==1,
          2,
          max(round(rnorm(1,
          mean=mean(c(statsscore$mean[statsscore$Team==x[3]],statsagainst$mean[statsagainst$Team==x[1]])),
          sd=sqrt((statsscore$sd[statsscore$Team==x[3]]^2+(statsagainst$sd[statsagainst$Team==x[1]])^2)/4))),0)),x[4])}))
    tm3 <- FinalRank(scores=tmp)
    y <- tm3 %>% select(Team,ConfRank)
    if (s==1) {x<-y} else {x<-full_join(x,y,by="Team")}
    tmp <- data
  }
  n <- c(1:sims)
  colnames(x) <- c("Team",sapply(n,function(x){paste("Sim",x,sep="")}))
  x
}

As can be seen, this function has three inputs: Games, sims, and data. Games is the place where users can enter the specific game scores they want to see happen. The way to enter the scores is to put in order Road team, road score, home team, home score separated by any character (or nothing) other then a letter or a number. Also, if multiple scores want to be inputted, the user simply needs to enter the next game result in the same string as the first along with all other entered scores. The function will place the specified scores into a new data frame. Sims tells the function how many simulations will be run and data specifies the scores of the games already played from the WeeklyUpdate function.

The next part of this function to explain is how the scores of the non-user defined games are simulated by the function. As I thought about this, two statistics from the game scores stood out to me to determine how the game scores should be projected: points scored and points allowed. So the following steps are used to simulate the team scores. Let $X_1$ be a random variable for points scored by team 1 with mean and standard deviation $\mu_1$ and $\sigma_1$ respectively and $X_2$ be a random variable for points scored by team 2, who is playing team 1, with mean and standard deviation $\mu_2$ and $\sigma_2$ respectively. Now, assume that both $X_1$ and $X_2$ are independent and normally distributed. Them we can say $\frac{X_1 + X_2}{2} \sim N(\frac{\mu_1+\mu_2}{2},\frac{\sigma_1^2+\sigma_2^2}{4})$, and this is the distribution used by this function to simulate the score for team 1 in a game against team 2, where the $\mu$ and $\sigma$ parameters are estimated by the mean and standard deviations of the necessary scores of actual games up to that point. One last note, the score are rounded to whole numbers and rounded up to 2 if need be, since it is mathematically impossible to score exactly 1 point in an NFL game.

The output of this function is a data frame where the first column (Team) specifies one of the 32 NFL teams, and the remaining columns are the Conference Rank columns from each FinalRank() for each of the Simulations for the specified number of sims in the function input. While this data is useful, it would be nice to be able to visualize the data. For this issue, I have created a function to take the output of NFLSim and create a barplot, using both ggplot and plotly, showing the frequency of conference seeds for a specific team.

SeedPlot <- function(TeamX="Bills",Sim=NFLSim(data=WeeklyUpdate()),Plotly=TRUE){
  require(tidyverse)
  require(plotly)
  t <- Sim %>% filter(Team==TeamX) %>% select(-Team)
  count <- sapply(c(1:16),function(x){length(which(t==x))})
  num <- data.frame(count)
  num$seed <- c(1:16)
  gg <- ggplot(num,aes(x=seed,y=count))+geom_bar(stat="identity")+
    xlab("Seed")+ylab("Frequency")+ggtitle(paste("Seeding Plot of",TeamX))+
    theme(plot.title = element_text(hjust = 0.5))
  if(Plotly==TRUE) {ggplotly(gg)} else {gg}
}

The inputs for this function are TeamX, which specifies the team whose simulation results will be shown in the graph and Sim, specifying the simulation results from NFLSim. An example of both NFLSim and SeedPlot are shown below, though the plotly aspect of the graph will be unavailable.

set.seed(513)
library(devtools)
install_github("mstuart2097/STAT585Final")
library(ProjectNFL)
dat <- WeeklyUpdate(16)
tmp <- NFLSim(data=dat)
head(tmp[,c(1:11)])
pl <- SeedPlot(TeamX="Raiders",Sim=tmp,Plotly=FALSE)
pl

5 Discussion

One flaw regarding the simulation study is the model structure for determining a team's score. First of all, this assumes that the points scored and points against are independent of one another, which may not be true depending on the opponents one team played and whether or not they are the same. Second, because there are only 16 games in a regular season, the central limit theorem does not necessarily apply to this data and the normality assumption of the scores. Finally, the model fails to take into account other confounding variables, such as player injuries or suspensions, how well a team has done recently, or whether or not teams have decided to rest their players because they have already secured whatever seed they will achieve. Any of these points can be accounted for in the model going forward, but for the purposes of this paper, the goal was to right a package allowing a user to simulate an NFL season using previous game scores and vizualize the data in an appropriate manner. Other graphs or charts can be added as well, but this gives a basic glimpse into simulating an NFL package and how nifty an R package can really be.



mstuart2097/STAT585Final documentation built on May 23, 2019, 8:16 a.m.