library(fivethirtyeight)
library(ggplot2)
library(dplyr)
library(readr)
library(knitr)
library(tibble)

# Pull all dataset names
all_datasets <- datasets_master %>% 
  pull(`Data Frame Name`) %>% 
  unique()


# Pull all fivethirtyeightdata dataset names
all_fivethirtyeightdata_datasets <- datasets_master %>% 
  filter(`In fivethirtyeightdata?` == "Y") %>% 
  pull(`Data Frame Name`) %>% 
  unique() %>% 
  sort()

if(FALSE){
  # Get data set names as listed in pkg
  pkg_data_list <- data(package = "fivethirtyeightdata")[["results"]] %>% 
    as_tibble() %>% 
    pull(Item) %>% 
    sort()

  # This should yield TRUE
  identical(all_fivethirtyeightdata_datasets, pkg_data_list)
}


# Pull all fivethirtyeight dataset names
all_fivethirtyeight_datasets <- datasets_master %>% 
  filter(is.na(`In fivethirtyeightdata?`)) %>% 
  pull(`Data Frame Name`) %>% 
  unique() %>% 
  sort()

if(FALSE){
  # Get data set names as listed in pkg
  pkg_data_list <- data(package = "fivethirtyeight")[["results"]] %>% 
    as_tibble() %>% 
    filter(Item != "datasets_master") %>% 
    pull(Item) %>% 
    sort()

  # This should yield TRUE
  identical(all_fivethirtyeight_datasets, pkg_data_list)
}

Acknowledgment

We are aware of this tweet by Mona Chalabi. Although we have not yet decided the future of the fivethirtyeight package (and, consequently, the fivethirtyeightdata package), we reiterate that this package is not officially published by 538.

Note on large datasets

There are `r all_fivethirtyeight_datasets %>% length()` datasets included in the fivethirtyeight package. However, there are also `r all_fivethirtyeightdata_datasets %>% length()` datasets that could not be included in fivethirtyeight due to CRAN package size restrictions:

all_fivethirtyeightdata_datasets

These `r all_fivethirtyeightdata_datasets %>% length()` datasets are included in the fivethirtyeightdata add-on package^[The fivethirtyeightdata package is hosted via a drat repository], which you can install by running:

install.packages('fivethirtyeightdata', repos = 'https://fivethirtyeightdata.github.io/drat/', type = 'source')

So for example, to load the senators dataset, run:

library(fivethirtyeight)
library(fivethirtyeightdata)
senators
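
Because fivethirtyeightdata is not on CRAN, a script that depends on it may want to install it only when it is missing. The following is a minimal sketch of that pattern; `install_if_missing()` is a hypothetical helper written for this illustration and is not part of either package.

```r
# Hypothetical helper (not part of fivethirtyeight or fivethirtyeightdata):
# install a package from the given repository only if it is not yet available.
install_if_missing <- function(pkg, repos, type = "source") {
  if (requireNamespace(pkg, quietly = TRUE)) {
    return(invisible(FALSE))  # already installed, nothing to do
  }
  utils::install.packages(pkg, repos = repos, type = type)
  invisible(TRUE)             # an installation was attempted
}

# Not run here because it needs network access:
# install_if_missing("fivethirtyeightdata",
#                    repos = "https://fivethirtyeightdata.github.io/drat/")
```

The helper returns `FALSE` invisibly when the package was already present, so repeated calls at the top of a script avoid reinstalling it.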

All datasets

All `r all_fivethirtyeight_datasets %>% length()` + `r all_fivethirtyeightdata_datasets %>% length()` = `r all_datasets %>% length()` datasets between the fivethirtyeight and fivethirtyeightdata packages are listed here.

datasets_master %>% 
  mutate(`Data Frame Name` = paste("`", `Data Frame Name`, "`", sep=""),
         `In fivethirtyeightdata?` = ifelse(is.na(`In fivethirtyeightdata?`), "", "Yes")) %>% 
  kable()

Motivation

The motivation for creating this package is articulated in The fivethirtyeight R Package: "Tame Data" Principles for Introductory Statistics and Data Science Courses by Kim, Ismay, and Chunn (2018) published in Volume 11, Issue 1 of the journal "Technology Innovations in Statistics Education". Here is an executive summary.

We are involved in statistics and data science education, in particular at the introductory undergraduate level. As such, we are always looking for data sets that balance being:

  1. Rich enough to answer meaningful questions with, real enough to ensure that there is context, and realistic enough to convey to students that data as it exists "in the wild" often needs processing.
  2. Easily and quickly accessible to novices, so that we minimize the prerequisites to research.

It has been our experience that many data sets that exist in R packages, such as the nycflights13, babynames, and gapminder packages, are of great pedagogical value as they:

It is along these lines that we present fivethirtyeight: an R package of data and code behind the stories and interactives at FiveThirtyEight.com, a data-driven journalism website founded by Nate Silver and owned by ESPN. FiveThirtyEight has been very forward-thinking in making the data used in many of their articles open and accessible on GitHub, a web-based repository for collaboration on code and data.

With consultation from Andrew Flowers and Andrei Scheinkman of FiveThirtyEight, we go one step further by:

  1. Doing just enough pre-processing (i.e. data "taming") so that statistics and data science novices can sink their teeth into the data right away.
  2. Packaging it all in an easy-to-load format: package installation instead of working with CSV files.
  3. Providing easily accessible documentation: The help file for each data set includes a thorough description of the observational unit and all variables, a link to the original article, and (if listed) the data sources.

"Tame" data principles

In order to make the data easily accessible to R novices, we pre-process the original data sets as they exist in the 538 GitHub repository to adhere to the following "tame" data guidelines:

  1. Naming conventions for data frame and variable names:
    1. Whenever possible, all names should be no more than 20 characters long. Exceptions to this rule exist when shortening a name to 20 characters or fewer would lead to a loss of information.
    2. Use only lower case characters and replace all spaces with underscores. This format is known as snake_case and is an alternative to camelCase, where successive words are delineated with upper case characters.
    3. In the case of variable (column) names within a data frame, use underscores instead of spaces.
  2. Variables identifying observational units:
    1. Any variables uniquely identifying each observational unit should be in the left-hand columns.
  3. Dates:
    1. If only a year variable exists, then it should be represented as a numerical variable.
    2. If there are year and month variables, then convert them to Date objects as year-month-01. In other words, assign every observation from the same month a day of 01 so that a valid Date object can be created.
    3. If there are year, month, and day variables, then convert them to Date objects as year-month-day.
  4. Ordered Factors, Factors, Characters, and Logicals:
    1. Ordinal categorical variables are represented as ordered factors.
    2. Categorical variables with a fixed and known set of levels are represented as regular factors.
    3. Categorical variables whose possible levels are either unknown or of a very large number are represented as characters.
    4. Any "yes/no" character encoding of binary variables is converted to TRUE/FALSE logical variables.
  5. Tidy data format:
    1. Whenever possible, save all data frames in "tidy" data format as defined by Hadley Wickham: a) Each variable forms a column. b) Each observation forms a row. c) Each type of observational unit forms a table.
    2. If converting the raw data to "tidy" data format alters the dataset too much, then make the code to convert to tidy format easily accessible.
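
To make guidelines 1 through 4 above concrete, here is a minimal base R sketch that tames a small data frame. The `raw` data, its column names, and the grade ordering are all invented for illustration; this is not the package's actual pre-processing code.

```r
# Hypothetical raw data, invented for this example
raw <- data.frame(
  "Year"             = c(2015, 2015, 2016),
  "Month"            = c(1, 2, 11),
  "Poll Grade"       = c("B", "A", "C"),
  "Registered Voter" = c("yes", "no", "yes"),
  "Respondent ID"    = c("r01", "r02", "r03"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Guideline 1: lower case names with underscores instead of spaces (snake_case)
names(raw) <- tolower(gsub(" ", "_", names(raw)))

tame <- data.frame(
  # Guideline 2: the identifying variable goes in the left-most column
  respondent_id = raw$respondent_id,
  # Guideline 3: year and month become a Date with the day set to 01
  date = as.Date(sprintf("%d-%02d-01",
                         as.integer(raw$year), as.integer(raw$month))),
  # Guideline 4: ordinal variable as an ordered factor (assumed order C < B < A)
  poll_grade = factor(raw$poll_grade, levels = c("C", "B", "A"), ordered = TRUE),
  # Guideline 4: "yes"/"no" strings become TRUE/FALSE
  registered_voter = raw$registered_voter == "yes",
  stringsAsFactors = FALSE
)

str(tame)
```

Running `str(tame)` shows one `chr`, one `Date`, one `Ord.factor`, and one `logi` column, which is the mix of types the guidelines aim for.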

Note: The code used to pre-process the data can be found on the GitHub repository for the package in the process_data_sets.R files. These can serve as data manipulation/wrangling examples and exercises for more advanced students.
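
As a small wrangling exercise in the same spirit, the tidy data guideline can be sketched in base R. The `wide` data frame below is invented for illustration and does not come from either package; it stores one column per year, which the sketch reshapes so that each row is one state-year observation.

```r
# Hypothetical "wide" data: one row per state, one column per year
wide <- data.frame(
  state  = c("OR", "WA"),
  `2015` = c(10, 20),
  `2016` = c(12, 25),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

year_cols <- c("2015", "2016")

# Tidy version: each variable (state, year, count) forms a column and
# each state-year observation forms a row
long <- data.frame(
  state = rep(wide$state, times = length(year_cols)),
  year  = rep(as.numeric(year_cols), each = nrow(wide)),  # year kept numeric
  count = unlist(wide[year_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

long
```

With the tidyr package installed, `pivot_longer()` would be a more common way to do the same reshaping; base R is used here only to keep the sketch free of dependencies.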



rudeboybert/fivethirtyeight documentation built on Jan. 1, 2023, 10:17 p.m.