knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(B581Final)

Installation Instructions

To install the B581Final package, run the commands below.

# install.packages("devtools")
# library(devtools)
# install_github("madeline-peyton/B581Final")

Motivation/Background

Introduction

The primary aim of the B581Final package is to scrape MLB data and build data sets that let users run analyses in a simple, straightforward way. The package consists of three main sets of functions: data collection, manipulation and calculation, and visualization. For collection, the package currently scrapes MLB Season, Team, Player, Post-Season, and Triple Crown data. The analysis functions take user inputs for team, year, player, or post-season series, along with a user-defined statistic of interest, and calculate averages and maximums based on these inputs. Lastly, the Shiny app lets users enter an MLB season year, team, and statistic of interest and outputs a histogram of the specified data.

Challenges
  1. Multiple websites publish MLB statistics, some overlapping and some not. It can be difficult to find a source that includes all the variables a user is interested in without including many more than needed. It is also difficult to view more than one set of data at a time when working with online sources. The B581Final package makes it easy for users to scrape only the data they need and to look at multiple data sets side by side for comparison.

  2. When scraping data from the internet, R does not always load data sets in the most usable format. When I began this project, I ran into many cases where numeric columns were loaded as character data and where the data frame picked up unwanted content from other parts of the web page. This package cleans the data and converts it to a functional format so that users can manipulate and analyze it more easily.

  3. Another challenge I faced when looking at MLB data online or through other scraping packages was the lack of interaction with, and visualization of, the data. This package therefore includes a Shiny application where users can interact with the data sets and create visualizations based on their interests.
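
The cleaning step described in the second challenge can be sketched in base R as follows. This is only an illustration under stated assumptions, not the package's actual code: the helper name clean_scraped_table() and the sample data frame are hypothetical. The sketch drops rows that repeat the table header (a common artifact of scraped HTML tables) and converts a character column to numeric only when every non-missing value parses as a number.

```r
# Hypothetical helper (not part of B581Final): tidy a scraped table.
clean_scraped_table <- function(df) {
  # Drop repeated header rows, i.e. rows whose first cell equals the
  # name of the first column. Rows with a missing first cell are kept.
  keep <- is.na(df[[1]]) | df[[1]] != names(df)[1]
  df <- df[keep, , drop = FALSE]

  for (col in names(df)) {
    if (is.character(df[[col]])) {
      converted <- suppressWarnings(as.numeric(df[[col]]))
      # Convert only if no non-missing value failed to parse.
      # Note: empty strings do not parse, so such columns stay character.
      if (!any(is.na(converted) & !is.na(df[[col]]))) {
        df[[col]] <- converted
      }
    }
  }
  rownames(df) <- NULL
  df
}

# Made-up sample resembling a scraped table with a repeated header row.
raw <- data.frame(
  Name = c("Player A", "Name", "Player B"),
  HR   = c("12", "HR", "30"),
  BA   = c(".287", "BA", ".301"),
  stringsAsFactors = FALSE
)
cleaned <- clean_scraped_table(raw)
str(cleaned)
```

After cleaning, HR and BA are numeric while Name stays character, so functions such as mean() and max() can be applied directly.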

Existing Packages
  1. mlbstatR - This package’s aim is to “give users the ability to work with Major League Baseball data in a clean and detailed way. Which provides users with a variety of ways to improve visualizations”. The biggest issue with this package is that all of its documentation is in Spanish. Further, the GitHub link provided in the documentation leads to a 404 error. The B581Final package will have documentation written in English, for the convenience of users whose primary language is English, and a working, up-to-date GitHub repository.

  2. baseballr - This package’s description is “baseballr is a package written for R focused on baseball analysis. It includes functions for scraping various data from websites, such as FanGraphs.com, Baseball-Reference.com, and baseballsavant.com. It also includes functions for calculating metrics, such as wOBA, FIP, and team-level consistency over custom time frames.” This package mainly focuses on NCAA baseball data rather than MLB data. In addition, it can be hard to find the MLB-focused functions in the package and its documentation, since there is much more coverage of NCAA data. B581Final therefore aims to fill this gap in MLB data coverage.

  3. Lahman - This R package “Provides the tables from the 'Sean Lahman Baseball Database' as a set of R data frames. It uses the data on pitching, hitting and fielding performance and other tables from 1871 through 2019, as recorded in the 2020 version of the database.” Because this package is based on an existing database that is not updated, it will not stay relevant when new seasons begin. Because B581Final scrapes the web for data, it will stay up to date and continue to build its data sets.

Functionality/Examples of Use

Primary Data Collection Functions:

MLB_Player() - This function scrapes for MLB Player career data for a user-specified player including year, team name, age, league, batting statistics and awards earned.

MLB_PostSeason_Batting() - This function scrapes for MLB Post-Season batting data. The user inputs the year and series of interest and chooses either the winning or the losing team; the function outputs a data set including the player names and batting statistics.

MLB_Season() - This function scrapes for MLB Season data for a user-specified year and outputs a data set including team names and batting or pitching statistics, as chosen by the user.

MLB_Team() - This function scrapes for MLB Team data for a user-specified team and year and outputs a data set including player names and batting or pitching statistics, as chosen by the user.

MLB_Triple_Crown() - This function scrapes for MLB Triple Crown winner data for a user-specified MLB League.

Data Analysis Functions:

MLB_Player_Stat_Avg() - This function takes an input of a player name and statistic of interest and calculates the player’s average of the statistic of interest throughout their career and prints a summary statement.

MLB_Player_Stat_Max() - This function takes an input of a player name and statistic of interest and calculates the player’s maximum of the statistic of interest throughout their career and prints a summary statement.

MLB_PostSeason_Stat() - This function takes an input of year, series and statistic of interest and calculates the average of that statistic of interest for the winning and losing team of that series and prints a summary statement.

MLB_Season_Stat() - This function takes an input of year, statistic, and batting/pitching, finds the team with the highest value of the statistic of interest for that season, and prints a summary statement.

MLB_Team_Stat_Avg() - This function takes an input of team abbreviation, year, statistic, and batting/pitching and calculates the team’s average for that statistic for that season and prints a summary statement.

MLB_Team_Stat_Max() - This function takes an input of team abbreviation, year, statistic, and batting/pitching and calculates the team’s maximum for that statistic for that season and prints a summary statement.

MLB_TripleCrown_Stat() - This function takes an input of MLB league and calculates the highest HR, RBI, and BA overall for all triple crown winners in that league and prints a summary statement.
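
The common pattern behind these analysis functions (pick a statistic column, reduce it to an average or maximum, print a summary statement) can be sketched as below. This is a minimal illustration rather than the package's implementation: the helper stat_summary(), its arguments, and the sample career table are all hypothetical.

```r
# Hypothetical helper (not part of B581Final): reduce one statistic column
# with a summary function and print a one-line summary statement.
stat_summary <- function(df, statistic, fun = mean, label = "average") {
  if (!statistic %in% names(df)) {
    stop("Unknown statistic: ", statistic)
  }
  value <- fun(df[[statistic]], na.rm = TRUE)
  cat(sprintf("The %s %s is %.2f.\n", label, statistic, value))
  invisible(value)
}

# Made-up career table standing in for scraped player data.
career <- data.frame(Year = 2001:2003, HR = c(20, 35, 29))

avg_hr <- stat_summary(career, "HR")
max_hr <- stat_summary(career, "HR", fun = max, label = "maximum")
```

Returning the value invisibly keeps the printed statement as the visible result while still allowing the number to be stored and reused.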

Data Visualization App

app.R - This shiny app allows a user to input a year, team abbreviation and statistic of interest and displays a histogram for the specified data.

Functionality

Below I give examples to test the functionality of the primary data collection functions.

To create a database of Joey Votto's career statistics we can use MLB_Player().

name <- "Joey Votto"
MLB_Player(name)

To create a database of batting statistics for the losing team of the 1966 World Series we can use MLB_PostSeason_Batting().

year <- 1966
series <- "WS"
WorL <- "L"
MLB_PostSeason_Batting(year, series, WorL)

To create a database of 2014 Batting statistics we can use MLB_Season().

year <- 2014
BorP <- "B"
MLB_Season(year, BorP)

To create a database of 1987 Cincinnati Reds Pitching statistics we can use MLB_Team().

TeamAbbr <- "CIN"
year <- 1987
BorP <- "P"
MLB_Team(TeamAbbr, year, BorP)

To create a database of American League Triple Crown winners we can use MLB_Triple_Crown().

League <- "AL"
MLB_Triple_Crown(League)

Examples of Use

Use Case One: Calculating Johnny Bench's average home runs in a season throughout his career

name <- "Johnny Bench"
statistic <- "HR"
MLB_Player_Stat_Avg(name, statistic)

Use Case Two: Calculating Johnny Bench's highest number of home runs in a season in his career

name <- "Johnny Bench"
statistic <- "HR"
MLB_Player_Stat_Max(name, statistic)

Use Case Three: Finding the team with the highest ERA in the 1999 MLB Season

year <- 1999
statistic <- "ERA"
BorP <- "P"
MLB_Season_Stat(year, statistic, BorP)

Use Case Four: Calculating the Houston Astros players' average number of strikeouts in 2008

TeamAbbr <- 'HOU'
year <- 2008
statistic <- 'SO'
BorP <- 'B'
MLB_Team_Stat_Avg(TeamAbbr, year, statistic, BorP)

Use Case Five: Calculating the highest number of strikeouts by a Houston Astros player in 2008

TeamAbbr <- 'HOU'
year <- 2008
statistic <- 'SO'
BorP <- 'B'
MLB_Team_Stat_Max(TeamAbbr, year, statistic, BorP)

Use Case Six: Creating a histogram of 2021 Cincinnati Reds OBP

TeamAbbr <- 'CIN'
year <- 2021
statistic <- 'OBP'
MLB_Team_Graphic(TeamAbbr, year, statistic)

Future Work and Plans

Plan 1

In the future, I hope to scrape for and build a database that includes all MLB players for a specific season. I would like to create a function where the user simply inputs a year and the function generates a database of every MLB player for that year with their team name, position, and batting/pitching statistics. I believe this function would allow users to compare all players in the league holistically. Along with this, I would like to create a function that sorts and analyzes the data, for example by sorting by highest batting average or finding the player with the most at bats for the season. To accomplish this, I believe I could start by scraping all team data for a specific year and then binding these team data sets together. However, I have not yet worked out how to handle players who switch teams mid-season, appear on more than one roster, and would therefore be included in the new database multiple times.
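
One possible way to approach the binding step and the mid-season trade problem is sketched below in base R. Everything here is an assumption made for illustration: the two team tables, the column names (Name, AB, H), and the team abbreviations are made up, and the real output of MLB_Team() may look different. The sketch tags each row with its team, stacks the tables, and then collapses players who appear on several rosters by summing their counting statistics.

```r
# Made-up per-team tables; "Player B" appears on both rosters.
team_a <- data.frame(Name = c("Player A", "Player B"),
                     AB = c(300, 150), H = c(90, 40),
                     stringsAsFactors = FALSE)
team_b <- data.frame(Name = c("Player B", "Player C"),
                     AB = c(200, 410), H = c(60, 120),
                     stringsAsFactors = FALSE)
teams <- list(AAA = team_a, BBB = team_b)  # hypothetical abbreviations

# Tag each row with its team abbreviation, then stack all teams.
tagged <- Map(function(df, abbr) {
  data.frame(Team = abbr, df, stringsAsFactors = FALSE)
}, teams, names(teams))
season <- do.call(rbind, tagged)
rownames(season) <- NULL

# One row per player: sum counting statistics across teams.
totals <- aggregate(cbind(AB, H) ~ Name, data = season, FUN = sum)

# Record every team a player appeared for, e.g. "AAA/BBB".
team_lists <- tapply(season$Team, season$Name, paste, collapse = "/")
totals$Teams <- as.vector(team_lists[as.character(totals$Name)])

# Rate statistics must be recomputed from the summed totals, not averaged.
totals$BA <- round(totals$H / totals$AB, 3)
print(totals)
```

Grouping by name alone would merge two different players who share a name, so a unique player identifier, if the scraped source provides one, would be a safer grouping key.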

Plan 2

In the future, I hope to be able to scrape for all MLB players for a specific team, such as creating a database of all players who ever appeared on the Cincinnati Reds roster. My hope would be for this function to accept a team abbreviation as an input and output a database with each player's name, years played on that team, and average batting/pitching statistics. I would also like to create a function to analyze this database and compare the players' batting/pitching statistics and time on the team. To do this, I would need to use my MLB_Player() function to calculate time on specific teams and average batting/pitching statistics. However, I would have to loop through many player databases to determine which players were on each team, and this could take a long time to run.
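
The loop described above might look roughly like the following sketch. To keep it runnable without scraping, a stand-in function fetch_player() returns made-up career tables; in practice MLB_Player() would take its place. The column names (Year, Tm, HR), the player names, and the helper team_history() are all assumptions, since the exact shape of the scraped output is not shown here.

```r
# Made-up career tables keyed by player name (stand-in for scraped data).
fake_careers <- list(
  "Player A" = data.frame(Year = 2001:2003,
                          Tm = c("CIN", "CIN", "HOU"),
                          HR = c(10, 20, 15),
                          stringsAsFactors = FALSE),
  "Player B" = data.frame(Year = 2002:2003,
                          Tm = c("HOU", "HOU"),
                          HR = c(5, 7),
                          stringsAsFactors = FALSE)
)

# Stand-in for MLB_Player(); swapping in the real scraper is the slow part.
fetch_player <- function(name) fake_careers[[name]]

# Hypothetical helper: summarize each player's time on one team.
team_history <- function(player_names, team_abbr, fetch = fetch_player) {
  rows <- lapply(player_names, function(name) {
    career <- fetch(name)
    on_team <- career[career$Tm == team_abbr, , drop = FALSE]
    if (nrow(on_team) == 0) return(NULL)  # never played for this team
    data.frame(
      Name = name,
      FirstYear = min(on_team$Year),
      LastYear = max(on_team$Year),
      Seasons = nrow(on_team),
      AvgHR = mean(on_team$HR),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)  # NULL entries are dropped when binding
}

reds <- team_history(names(fake_careers), "CIN")
print(reds)
```

Because each call to the fetch function would be a separate web request, saving the fetched career tables to disk and reusing them is one way to avoid repeating the slow step.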

Plan 3

In the future, I hope to give the Shiny app more customizable options for visualizing the data, such as choosing a different graph or multiple graphs, or allowing inputs to select colors and styles. To allow other graphs, I would need to render more plots in the server and possibly create a check box input listing the graph options, such as a boxplot. To allow other colors or styles, I could create another selection input for fill and border color, or a new slider input to select the number of bins for the histogram.
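
As a rough sketch of the plotting side of this plan, the base R function below draws a histogram with a user-chosen number of bins, fill color, and border color. The function name plot_stat_hist(), its arguments, and the sample values are hypothetical and not part of the current app. Inside a Shiny server, a function like this could be called from renderPlot(), with its arguments supplied by a sliderInput() for the bins and selectInput() controls for the colors.

```r
# Hypothetical plotting helper (not part of the current app).
plot_stat_hist <- function(values, bins = 10, fill = "steelblue",
                           border = "white", stat_name = "Statistic") {
  values <- values[!is.na(values)]
  # Build equal-width breaks so that 'bins' is honored exactly; a single
  # number passed to hist(breaks = ) is treated only as a suggestion.
  # This assumes the values are not all identical.
  breaks <- seq(min(values), max(values), length.out = bins + 1)
  hist(values, breaks = breaks, col = fill, border = border,
       main = paste("Histogram of", stat_name), xlab = stat_name)
}

# Made-up on-base percentages for illustration.
obp <- c(0.310, 0.325, 0.298, 0.352, 0.340, 0.301, 0.366, 0.289)
h <- plot_stat_hist(obp, bins = 4, fill = "red3", border = "black",
                    stat_name = "OBP")
```

hist() returns the binned counts invisibly, so the same call can both draw the plot and give the app access to the underlying numbers.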



madeline-peyton/B581Final documentation built on Dec. 23, 2021, 11:16 p.m.