
Hello, friends.

You, like me, might be a big college football fan who is interested in statistics and analytics and wants to get into the game yourself. This simple tutorial is just the way to do that!

The cfbscrapR package is the place to start. cfbscrapR is a wrapper for the CollegeFootballData.com API, which lets you pull data directly from the site into the statistical software R for analysis. So what can you do with all that data?

We'll learn how to do three things in this document:

  1. Pull and clean play-level and game-level data for current and past seasons.
  2. Calculate yards per play, success rate, and rushing and passing EPA.
  3. Learn some basics of visualizing those stats.

I hope you enjoy! Know that there is a learning curve to all this! It's difficult, and at times can be frustrating. The best way to learn this stuff is to play around with it, understand the syntax, and beat your head against the wall for a little bit until it works.

-Parker Fleming | @statsowar

Setting up R and RStudio

First things first, you need to install R and download RStudio. I'll direct you to this link. Go do that now, then come back.

Now that we have R and RStudio all put together, it's time to get organized. That's often the first step of any project working with data, and the more organized you are, the easier a time you'll have wrangling and analyzing the data.

Notice there are four quadrants to your RStudio setup: the top right is the Environment tab, which shows you the data, values, and functions you currently have loaded. The bottom right is where you'll see your charts and graphs in the Viewer or Plots tab. The bottom left is the console, where output goes. When you execute code, the console responds and shows you some stuff down there. The top left is the place you'll live; it's where you can edit scripts. We'll load a new project and then get started.

In the top right of your RStudio window, there will be a small drop-down menu that says "Project: (None)". Click it, and select "New Project." Title your new folder something like "CFB Analysis".

This takes care of setting your working directory, so everything you create and output will be saved in that folder.
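If you ever want to double-check, run getwd() ("get working directory") in the console and R prints the folder it's currently saving to:

getwd()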

What should pop up then is a completely fresh R project. Click the white square with a green plus sign in the top left of the window and select "New R Script". A script is the place you will write your commands before you execute them. You can work in the console, which is just below the scripts and where output is displayed, but I find it easier to stick to scripts - that way, you can save what you've done and replicate it easily, like I do with the EPA rankings, for example.

Throughout this document, I'll have lines of code interspersed with commentary. What I'd advise is that you read the commentary, then copy the code into your script and then run the code from there. Then, at the end, you'll have a complete script that will guide you through CFB Data analysis.

Getting the Data

What we'll do here is build a template for an initial data ingest. It will let you pull the play-by-play data, combine it with the game-level data, and then filter out FCS games and garbage time, should you so choose.

Install the necessary packages.

To use the packages, we need to do two things. First, one time only on your machine, you need to install the packages. Then, every time you open up RStudio, you will need to load the packages.

To install the packages, simply run each line of code below once (without the '#' at the beginning):

#install.packages('tidyverse')
#install.packages('easypackages')
#install.packages("devtools")
#devtools::install_github("meysubb/cfbscrapR")
#install.packages('ggimage')
#install.packages('gt')

Don't worry about what all that means for now. After running those lines of code, you have all the packages you need to get started with college football analytics. Let's load the packages and download the play-by-play data for 2019. First we load the packages, then we'll read in a full season of play-by-play data - with cfbscrapR's EPA model already attached - from the cfbscrapR-data repository on GitHub.

We're going to pull the data, clean it, select some relevant variables, and then move on to working with the data.

Pull the data

Note in the code below how we define what pbp_2019 is: we use the little arrow. If you're on a Mac, you can press Option and the dash key (Alt and dash on Windows) to pull one of those little guys up. That's just R's fun way of saying "name that thing this".
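For example, this toy line (no need to run it) stores the text "TCU" under the name my_team, and my_team then shows up in your Environment tab:

my_team <- "TCU"   #read: name the value "TCU" my_team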

#Load the packages (do this every session)
easypackages::libraries('tidyverse', 'cfbscrapR', 'ggimage', 'gt')

#Read in the full 2019 play-by-play data from the cfbscrapR-data repository
seasons <- 2019
pbp_2019 <- purrr::map_df(seasons, function(x) {
  readRDS(
    url(
      glue::glue("https://raw.githubusercontent.com/saiemgilani/cfbscrapR-data/master/data/rds/pbp_players_pos_{x}.rds")
    )
  )
})

This code will take a few minutes (~3:10) to run - it's pulling a lot of data from the web at once. Give your R console a break until it finishes! You'll know it's done running when you can see "pbp_2019" in the "Environment" window in the top right of RStudio (and when the little caret > pops up in your console, at the bottom left).

Now, we have the entire season's worth of play-by-play data stored in an object called "pbp_2019". If you'd rather pull straight from the API, the cfb_pbp_data() function is customizable: you can select only a specific week, you can pull data from bowl games by changing season_type to "both", and setting epa_wpa = TRUE adds cfbscrapR's EPA model and a bunch of useful stats and codes to the data!
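For instance, a call along these lines should grab just week 1 (a hedged sketch - double-check the argument names against the cfb_pbp_data() documentation, and expect the API to be slower than the pre-computed data above):

#Pull one week directly from the API (slower than the pre-computed data)
wk1_2019 <- cfb_pbp_data(year = 2019, week = 1, season_type = "regular", epa_wpa = TRUE)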

Let's pull the game data, using the cfb_game_info() function, and merge the two. Then, we'll make use of the filter and mutate functions from the tidyverse.

#Game level data
games_19 <- cfb_game_info(2019)

#Join Games and Play-by-Play
plays19 <- left_join(pbp_2019, games_19, by = c("game_id" = "game_id"))

Above, we pulled 2019 game data from the API into an object called "games_19", and then we used left_join to merge each play with the information of the game it happened in, storing the result in an object called "plays19". We matched the plays to the games by telling R that the column "game_id" in pbp_2019 is the same as the column "game_id" in the games_19 object.

Clean the Data

Before we break down what we did to create the pbp object, we need to discuss two things:

  1. Filtering on conditions: We will use the double equality (==) to identify conditions we need to meet for our filter function. We can also use !=, "not equals", to select values on that condition (e.g., offense_play != "TCU" would remove TCU from the data), and we can use the standard greater than > and less than <. You might have to stop and think a bit before you write your filter function, so just make sure you know what it is you are actually selecting.

  2. The pipe: %>% is the magrittr pipe. The pipe is extremely handy. Instead of having to type a function and select the data we want to apply it to over and over again, the pipe tells R what to do in the structure of: "take this thing and then do this to it." (See the toy comparison just below.)
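To make the pipe concrete, here's a toy comparison using the plays19 object we just built. Both lines do the same thing, but the piped version reads left to right:

#Without the pipe: function(data, condition)
tcu_plays <- filter(plays19, offense_play == "TCU")
#With the pipe: "take plays19, and then filter it"
tcu_plays <- plays19 %>% filter(offense_play == "TCU")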

I'm naming our filtered play-by-play dataset "pbp". I don't like to overwrite my data once I've imported it, so I'm going to give it a new name. That way, if I mess something up along the way, I can go back to the start and clean up without too much hassle.

We'll use the indicator variables "rush" and "pass" to select every play categorized as rushes or passes. (Yes, this leaves out special teams and yes this leaves out penalties. We're starting small, and can always go back later.)

Also note we use the | symbol to denote "or" in our filter function. We'll filter the data for any observations that are runs (rush == 1) or passes (pass == 1) and store the result in a new object.

But we won't stop there! The beauty of the pipe is that we can do multiple operations, all strung together. So, in the code below, we'll create a new object called "pbp". To create this object, we'll do three things:

  1. Take plays19, remove all plays that aren't rushes or passes: filter(rush == 1 | pass == 1) AND THEN
  2. Remove all FCS teams. This is a personal preference of mine, and if you'd like to leave those in the data, you can certainly remove this line! It's a good way to illustrate how you'd do this, though. By using filter(), I tell R to select all observations that are not (!) missing the variable home_conference or away_conference. As all FBS teams have a conference in the data, this will remove all FCS teams. Note: is.na(x) means "is missing variable x", so !is.na(x) means "is not missing variable x."
  3. Create new variables! mutate() is R's weird way of saying "make a new variable!" So, we'll create three new variables:
    i. abs_diff is the absolute value of the score difference, created using the abs() function. This variable helps us define the garbage time filter.
    ii. garbage is an indicator variable for whether a play is in garbage time, as defined by our friend Bill Connelly: the score difference is greater than 43 points in the first quarter, 37 in the second, 27 in the third, or 22 in the fourth. To create this variable, we use the ifelse() function, which takes three arguments: a condition, a value if yes, and a value if no. ifelse() basically says "if this condition is met, make the variable this value; otherwise, make it that value." We've strung multiple ifelse() calls together to cover all of our bases.
    iii. success: We use the same ifelse() logic to create a binary variable that indicates whether a play was successful or not, again, as defined by Bill C.
#Create Garbage time filter, eliminate FCS games, 
#filter for only rushes and passes, create success variable
pbp <- plays19 %>% filter(rush == 1 | pass == 1) %>% 
  filter(!is.na(home_conference) & !is.na(away_conference)) %>%
  mutate(abs_diff = abs(score_diff),
         garbage = ifelse(period == 1 & abs_diff > 43, 1, 
                   ifelse(period == 2 & abs_diff > 37, 1,
                   ifelse(period == 3 & abs_diff > 27, 1,
                   ifelse(period == 4 & abs_diff > 22, 1, 0)))),
         success = ifelse(down == 1 & yards_gained > .5 * distance, 1,
                   ifelse(down == 2 & yards_gained > .7 * distance, 1,
                   ifelse((down == 3 | down == 4) & yards_gained >= distance, 1, 0))))
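As an aside: if those nested ifelse() calls get hard to read, dplyr's case_when() builds the exact same garbage flag with one condition per line. This is an optional, equivalent sketch - you don't need to run it if the code above worked:

#Optional: the same garbage flag, written with case_when()
pbp <- pbp %>% 
  mutate(garbage = case_when(period == 1 & abs_diff > 43 ~ 1,
                             period == 2 & abs_diff > 37 ~ 1,
                             period == 3 & abs_diff > 27 ~ 1,
                             period == 4 & abs_diff > 22 ~ 1,
                             TRUE ~ 0))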

Now we have a clean dataset of every rush and pass that happened in 2019. (Quick note: unfortunately, due to the way the pbp data is generated on ESPN, QB scrambles from dropbacks are coded as rushes. It's not ideal, but it's not a dealbreaker; just something to remember as you do your analysis).

Let's look at it!

glimpse(pbp)

Ok, wow! We have more than 100,000 observations in our dataset - run ncol(pbp) to see just how many variables. These variables include play details, like down, distance, and yard line (yards_to_goal), as well as results - was it a rush or a pass? How many yards were gained? Was it a scoring play? For our initial analysis today, we'll focus on yards gained, success rate, and expected points added, so let's use the select() function to grab some play and team details, plus those variables.

plays <- pbp %>% select(offense_play, defense_play, down, distance, yards_to_goal, rush, pass, yards_gained, play_text, success, EPA, garbage)

Now we have our relevant data, so let's make some stats!

Creating Some Stats

Ok, now let's play with the tidyverse a little more. We've used filter(), which is an extremely useful function. The next most useful functions actually do things to the data: group_by() and summarise().

summarise() is the way of creating season-long or game-by-game stats. Let's start with some season-long raw offense numbers: yards per attempt (passing) and yards per rush. We're going to use the pipe (%>%) to tell R to grab our plays dataset, group it by offense team, and summarise each team's yards per attempt and yards per rush in a new object called offense.

To create a conditional mean, like "the mean of yards_gained when pass == 1", all you'll do is add the condition you want in brackets inside your mean() function.

offense <- plays %>% group_by(offense_play) %>% 
  summarise(ypa = mean(yards_gained[pass == 1]),
            ypr = mean(yards_gained[rush == 1]))

Now we can ask and answer some fun questions. Who had the best rushing offense, on a per play basis? Who had the best passing offense, on a per-play basis? We will do this using the arrange(desc()) function, which tells R to order the data from greatest to least (descending).

offense %>% arrange(desc(ypr))
offense %>% arrange(ypa)

I took out the desc() part in the second line to display the worst passing offenses, for fun. We see that Oklahoma had the highest yards per rush, followed by Kentucky and Louisiana. The top passing teams were Air Force, Alabama, LSU, and Oklahoma. Bonus points if you can tell me why Air Force is ranked that highly!

The worst offenses for rushing and passing were West Virginia, who averaged only 3.15 yards per rush, and Northwestern, who averaged an inconceivable 3.96 yards per passing attempt.

We can easily do the defensive side of the ball as well, grouping plays instead by defense_play. Let's make a dataset and then use left_join() to put offenses and defenses together, but let's do EPA instead of yards, because EPA is more fun. (What's EPA? Glad you asked: see An EPA Primer.)

If you would like to read more on the expected points model fundamentals, read the cfbscrapR tutorial on the subject.

For the tutorial, I've included the head() function as a quick way to look at the top rows of our data. In your console, though, the View() command (note the capital V) pulls up the data in a traditional spreadsheet kind of format, which can help you parse through and see what you're working with, or see results you've created.
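For example, since we already built the plays object, this opens it in that spreadsheet-style tab:

View(plays)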

offense <- plays %>% group_by(offense_play) %>% 
  summarise(epa.pass.off = mean(EPA[pass==1]), epa.rush.off = mean(EPA[rush==1]))
defense <- plays %>% group_by(defense_play) %>% 
  summarise(epa.pass.def = mean(EPA[pass==1]), epa.rush.def = mean(EPA[rush==1]))
team.epa <- left_join(offense, defense, by = c("offense_play" = "defense_play")) 
head(team.epa)

In analytics, we often want to break down situations. Let's use filter to select only first down plays and see which offenses performed the best on first downs only:

offense <- plays %>% filter(down == 1) %>% group_by(offense_play) %>% 
  summarise(epa.pass.off = mean(EPA[pass==1]), epa.rush.off = mean(EPA[rush==1]))
defense <- plays %>% filter(down == 1) %>% group_by(defense_play) %>% 
  summarise(epa.pass.def = mean(EPA[pass==1]), epa.rush.def = mean(EPA[rush==1]))
firstdown.epa <- left_join(offense, defense, by = c("offense_play" = "defense_play")) 

head(firstdown.epa)

Easy! You can make any kind of filter you can think of, and analyze EPA or yards gained in any situation.
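For instance, here's a hedged sketch of one more: a red-zone filter - plays snapped inside the opponent's 20-yard line - using only columns we already kept in plays:

#Red zone EPA per play, best offenses first
redzone <- plays %>% filter(yards_to_goal <= 20) %>% 
  group_by(offense_play) %>% 
  summarise(epa.rz = mean(EPA))
redzone %>% arrange(desc(epa.rz)) %>% head()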

Lastly, let's do an example of creating some rates and using some advanced filters: we'll compute success rate and rush rate on early downs, filtering out garbage time. We'll use the same filter, summarise, and mean functions as above!

success <- plays %>% filter(garbage == 0 & down < 3) %>%
  group_by(offense_play) %>% 
  summarise(success.rte = mean(success),
            rush.rte = mean(rush))
head(success)

Visualization

We have some pretty data, now let's learn how to make a basic ggplot() to visualize it! We didn't have to load ggplot2 separately - it's included in the tidyverse. We'll take the success data from above, and start with a basic plot, then add some fancy formatting.

The syntax here is pretty simple: we grab our dataset and use the pipe to say "open a plot". Then we tell the plot what the x and y coordinates are, then indicate the markers, in this case a scatterplot with geom_point(). Notice that we use + instead of %>% once we've called ggplot(). The syntax tells R to take the "success" object and open a plot with the x-axis rush.rte and the y-axis success.rte, PLUS all these features. That can be a little confusing to keep straight, but R will tell you what's going on with an error if you mix that up!

success %>% ggplot(aes(x=rush.rte, y=success.rte)) + geom_point()

Ok, that's pretty cool! Let's dress it up a little bit by adding a title and labelling the axes, using the labs() option. We'll also throw in a couple of reference lines - a vertical and a horizontal line indicating the mean values of each statistic, to help us compare. Plus, here are some tweaks I like that make the graph look a little better (all the theme() stuff).

success %>% ggplot(aes(x = rush.rte, y = success.rte)) + geom_point() + 
  geom_vline(xintercept = mean(success$rush.rte), linetype = "dashed", color = "red", alpha = 0.5) +
  geom_hline(yintercept = mean(success$success.rte), linetype = "dashed", color = "red", alpha = 0.5) +
  labs(x = "Early Downs Rush Rate", y = "Success Rate",
       title = "2019 FBS Early Downs Rush Rate and Success") +
  theme_minimal() +
  theme(axis.title = element_text(size = 12),
        axis.text = element_text(size = 10),
        plot.title = element_text(size = 12),
        plot.subtitle = element_text(size = 12),
        plot.caption = element_text(size = 12),
        panel.grid.minor = element_blank())

Adding Logos

If you want to get really fancy, and I know you do, you can include team logos on these graphs. You'll need the ggimage package, which we installed and loaded at the start.

There are probably better/faster ways to do this, but they're all more complicated. For now, what I've done is posted the CSV from collegefootballdata.com's team page to my github, and we're just going to rip the logos from there. Not ideal, but it works. You don't need to change this line at all to get the logos data! We select school and logo because those are the only two things we need for the graphs.

cfblogos <- read.csv("https://raw.githubusercontent.com/spfleming/CFB/master/logos.csv") %>% select(school, logo)
chartdata <- success %>% left_join(cfblogos, by = c("offense_play" = "school"))

Now, we have our created success object and the team logos in the chartdata object. We'll make use of our graph from above, telling the plot that we want to use the logo column of chartdata to identify teams instead of a regular ol' marker. This plot takes a minute to run - that's a consequence of pulling the logos from the web each time. It's not ideal, but again, this is the easiest way to make these graphs, so I opted for simplicity.

It might take up to two minutes to load! Don't freak out! Don't close R! Just let it load! You'll see a white screen in your Plots window, but I promise the graph is coming.

chartdata %>% ggplot(aes(x = rush.rte, y = success.rte)) + geom_image(image = chartdata$logo, asp = 16/9) + 
  geom_vline(xintercept = mean(chartdata$rush.rte), linetype = "dashed", color = "red", alpha = 0.5) +
  geom_hline(yintercept = mean(chartdata$success.rte), linetype = "dashed", color = "red", alpha = 0.5) +
  labs(x = "Early Downs Rush Rate", y = "Success Rate",
       title = "2019 FBS Early Downs Rush Rate and Success") +
  theme_minimal() +
  theme(axis.title = element_text(size = 12),
        axis.text = element_text(size = 10),
        plot.title = element_text(size = 12),
        plot.subtitle = element_text(size = 12),
        plot.caption = element_text(size = 12),
        panel.grid.minor = element_blank())

You can save the graphs with ggsave(). That gives you a nice, high-res png file that you can post to Twitter or put in a blog post.
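Run it right after making the plot you want to keep, since ggsave() saves the most recent plot by default (the filename is just a placeholder - call it whatever you like):

ggsave('rush_rate_success.png', height = 7, width = 13, dpi = 300)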

More Visualizations

Another useful visualization might be to see the EPA of each of a team's plays by location on the field. We will filter the dataset to include only our favorite team, and then plot the EPA of each play to examine outliers.

Here's the 2019 TCU Offense:

tcu <- plays %>% filter(offense_play == "TCU") 

tcu %>%
  ggplot(aes(x = yards_to_goal, y = EPA)) +
  geom_point() +
  labs(x = "Yard Line",
       y = "EPA",
       title = "Expected Points Added by Field Position",
       subtitle = "TCU Offense 2019") +
  geom_hline(yintercept = 0, alpha = 0.5, col = "purple") +
  theme_minimal() +
  theme(axis.title = element_text(size = 12),
        axis.text = element_text(size = 10),
        plot.title = element_text(size = 12),
        plot.subtitle = element_text(size = 12),
        plot.caption = element_text(size = 12),
        panel.grid.minor = element_blank())

I added a couple of fun features here: a geom_hline() reference line at zero, to help us better understand the graph, and a subtitle to clarify what's on the graph. Notice that TCU's EPA was spread almost evenly across the field, but you can see some serious negative outliers that weighed them down. We can call those outliers into our console to explore them more:

plays %>% filter(offense_play == "TCU" & EPA < -4) %>% select(offense_play, defense_play, play_text, down, distance, yards_to_goal)

Lastly, let's explore the gt package to build a table of the top ten passing offenses in the country. We will take our dataset, sort it by the value we want, add a rank variable using mutate(), then apply gt(). Simple enough.

#Passing
team.epa %>% arrange(desc(epa.pass.off)) %>% mutate(rank = dense_rank(desc(epa.pass.off))) %>% 
  filter(rank < 11) %>% gt()

Now, over in your viewer (bottom right), you should see a sleek table. Ok, well, not so sleek. We will do three things to this table to make it a little more palatable. First, we'll add a title, second, we'll switch the order of the variables, and third, we will change the column titles.

team.epa %>% arrange(desc(epa.pass.off)) %>% mutate(rank = dense_rank(desc(epa.pass.off))) %>%
  select(rank, offense_play, epa.pass.off) %>% 
  filter(rank < 11) %>% gt() %>%
  tab_header(title = "Best Passing Teams") %>%
  cols_label(rank = "Rank", offense_play = "Offense", epa.pass.off = "EPA/Attempt")
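One optional bit of polish, as a hedged sketch: gt's fmt_number() rounds numeric columns, so tacking it onto the end of the same chain trims the EPA values to three decimal places (the vars() syntax matches the version of gt current when this was written):

team.epa %>% arrange(desc(epa.pass.off)) %>% mutate(rank = dense_rank(desc(epa.pass.off))) %>%
  select(rank, offense_play, epa.pass.off) %>% 
  filter(rank < 11) %>% gt() %>%
  tab_header(title = "Best Passing Teams") %>%
  cols_label(rank = "Rank", offense_play = "Offense", epa.pass.off = "EPA/Attempt") %>%
  fmt_number(columns = vars(epa.pass.off), decimals = 3)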

This is just a little taste of what you can do with college football data, thanks to collegefootballdata.com and #cfbscrapR.

Let me know of any work you do, resources you can think of, or code you want to share, then I'll link to it below!


Other Great Resources:

Glossary of terms and functions

[cfbscrapR](https://saiemgilani.github.io/cfbscrapR/index.html): the R package that helps you get the college football data; documentation can be found at that link.

Functions in cfbscrapR

Functions we used in the tidyverse

[gt](https://gt.rstudio.com/index.html): a nice package for getting publication-quality tables; see Tom's Cookbook to help tweak gt() with all sorts of features.


