
# startr

A template for data journalism projects in R.

This project structures the data analysis process, reducing the amount of time you'll spend setting up and maintaining a project. Essentially, it's an "opinionated framework" like Django, Ruby on Rails or React, but for data journalism.

Broadly, startr does a few things:

- Standardizes the analysis workflow into distinct processing, analysis and visualization steps
- Cuts down the time you spend setting up and maintaining a project
- Makes your analysis easier to reproduce and easier for collaborators to jump into

## Table of contents

- [Installation](#installation)
- [Philosophy on data analysis](#philosophy-on-data-analysis)
- [Workflow](#workflow)
- [Helper functions](#helper-functions)
- [Tips](#tips)
- [Directory structure](#directory-structure)
- [See also](#see-also)
- [Version](#version)
- [License](#license)
- [Get in touch](#get-in-touch)

## Installation

This template works with R and RStudio, so you'll need both of those installed. To scaffold a new startr project, we recommend using our command-line tool, startr-cli, which will copy down the folder structure, rename some files, configure the project and initialize an empty Git repository.

Using startr-cli, you can scaffold a new project by simply running create-startr in your terminal and following the prompts:

*(GIF: the startr-cli scaffolding prompts)*

Alternatively, you can run:

```sh
git clone https://github.com/globeandmail/startr.git <your-project-name-here>
```

(But if you do that, be sure to rename the startr.Rproj file to <your-project-name-here>.Rproj and fill in your settings in config.R manually.)

Once a fresh project is ready, double-click on the .Rproj file to start a scoped RStudio instance.

You can then start copying in your data and writing your analysis. At The Globe, we like to work in a code editor like Atom or Sublime Text, and use something like r-exec to send code chunks to RStudio.

## Philosophy on data analysis

This analysis framework is designed to be flexible, reproducible and easy for a new user to jump into. startr works best when you adopt The Globe’s own philosophy on data analysis.

## Workflow

The heart of the project lies in these three files:

- process.R, which imports and tidies your raw data
- analyze.R, which holds your exploratory data analysis
- visualize.R, which turns that analysis into graphics

There's also an optional (but recommended) RMarkdown file (notebook.Rmd) you can use to generate an HTML codebook – especially useful for longer-term projects where you need to document the questions you're asking.

### Step 1: Set up your project

The bulk of any startr project's code lives within the R directory, in files that are sourced and run in sequence by run.R at the project's root.

Many of the core functions for this project are managed by a specialty package, upstartr. That package is installed and imported in run.R automatically.

Before starting an analysis, you'll need to set up your config.R file.

That file calls the initialize_startr() function, which prepares the environment for analysis and loads all the packages you'll need. To pull in another library (say, cancensus), just add 'cancensus' to the packages vector. Suggested packages for GIS work, scraping, dataset summaries and so on are included in commented-out form to avoid bloat. The function also takes several other optional parameters; for a full list, see our documentation.
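In practice, the packages block in config.R might look something like this. (This is a sketch: we're assuming the vector is passed to initialize_startr() as its packages argument, and the scaffolded file includes other parameters and a longer default package list.)

```r
initialize_startr(
  # Packages loaded for every step of the analysis; 'cancensus' is the
  # extra library from the example above.
  packages = c(
    'tidyverse',
    'lubridate',
    'cancensus'
  )
)
```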

Once you've listed the packages you want to import, you'll want to reference your raw data filenames so that you can read them in during process.R. For instance, if you're adding pizza delivery data, you'd add this line to the filenames block in config.R:

```r
pizza.raw.file <- dir_data_raw('Citywide Pizza Deliveries 1998-2016.xlsx')
```

Our naming convention is to append .raw to variables that reference raw data, and .file to variables that are just filename strings.

### Step 2: Import and process data

In process.R, you'll read in the data for the filename variables you assigned in config.R, do some clean-up, rename variables, deal with any errors, convert multiple files to a common data structure if necessary, and finally save out the result. It might look something like this:

```r
pizza.raw <- read_excel(pizza.raw.file, skip = 2) %>%
  select(-one_of('X1', 'X2')) %>%
  rename(
    date = 'Date',
    time = 'Time',
    day = 'Day',
    occurrence_id = 'Occurrence Identification Number',
    lat = 'Latitude',
    lng = 'Longitude',
    person = 'Delivery Person',
    size = 'Pizza Size (in inches)',
    price = 'Pizza bill \n after taxes'
  ) %>%
  mutate(
    price = parse_number(price),
    year_month = format(date, '%Y-%m-01'),
    date = ymd(date)
  ) %>%
  filter(!is.na(date))

write_feather(pizza.raw, dir_data_processed('pizza.feather'))
```

When process.R is run via the run_process() function in run.R, any variables generated during processing are removed once the step completes, keeping the working environment clean for analysis.
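For reference, the sequence in run.R boils down to something like the sketch below. run_process() is described above; run_analyze() and run_visualize() are assumed here to be its counterparts for the later steps, and the scaffolded file may differ in its exact calls.

```r
# A simplified sketch of run.R: load the project configuration,
# then run each step of the analysis in sequence.
source('config.R')   # packages, filename variables and project settings

run_process()        # sources R/process.R, then clears intermediate variables
run_analyze()        # sources R/analyze.R
run_visualize()      # sources R/visualize.R
```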

We prefer to write out our processed files in the binary .feather format, which is designed to read and write files extremely quickly (at roughly 600 MB/s). Feather files can also be opened in other analysis frameworks (e.g. Jupyter notebooks) and, most importantly, embed column types into the data, so you don't have to re-declare a column as a logical, date or character later on. If you'd rather save your files in a different format, just use a different function, like the tidyverse's write_csv().
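For instance, swapping in a CSV output is a one-line change (the filename here is just illustrative):

```r
# Save the processed data as a CSV instead of a feather file
write_csv(pizza.raw, dir_data_processed('pizza.csv'))
```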

Output files are written to /data/processed using the dir_data_processed() function. By design, processed files aren't checked into Git — you should be able to reproduce the analysis-ready files from someone else's project by running process.R.

### Step 3: Analyze

This part's as simple as consuming that file in analyze.R and running with it. It might look something like this:

```r
pizza <- read_feather(dir_data_processed('pizza.feather'))

delivery_person_counts <- pizza %>%
  group_by(person) %>%
  count() %>%
  arrange(desc(n))

deliveries_monthly <- pizza %>%
  group_by(year_month) %>%
  summarise(
    n = n(),
    unique_persons = n_distinct(person)
  )
```

### Step 4: Visualize

You can use visualize.R to consume the variables created in analyze.R. For instance:

```r
plot_delivery_persons <- delivery_person_counts %>%
  ggplot(aes(x = person, y = n)) +
  geom_col() +
  coord_flip()

plot_delivery_persons

write_plot(plot_delivery_persons)

plot_deliveries_monthly <- deliveries_monthly %>%
  ggplot(aes(x = year_month, y = n)) +
  geom_col()

plot_deliveries_monthly

write_plot(plot_deliveries_monthly)
```

### Step 5: Write a notebook

TKTKTKTK
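The notebook lives at reports/notebook.Rmd and is compiled into an HTML file by run.R (see the directory structure below). If you want to knit it on its own, something like this should work:

```r
# Knit the analysis notebook to HTML outside of run.R
rmarkdown::render('reports/notebook.Rmd')
```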

## Helper functions

startr's companion package, upstartr, comes with several functions to support startr, plus helpers we've found useful in daily data journalism tasks. A full list can be found on the package's reference page. Below is a partial list of some of its most handy functions, several of which appear in the examples above:

- initialize_startr(), which prepares the environment and loads the packages declared in config.R
- dir_data_raw() and dir_data_processed(), which build paths to the project's data directories
- run_process(), which runs process.R and clears out intermediate variables once it's done
- write_plot(), which saves a ggplot2 object to the plots/ directory

## Tips

## Directory structure

```
├── data/
│   ├── raw/          # The original data files. Treat this directory as read-only.
│   ├── cache/        # Cached files, mostly used when scraping or dealing with packages such as `cancensus`. Disposable, ignored by version control software.
│   ├── processed/    # Imported and tidied data used throughout the analysis. Disposable, ignored by version control software.
│   └── out/          # Exports of data at key steps or as a final output. Disposable, ignored by version control software.
├── R/
│   ├── process.R     # Basic data processing (fixing column types, setting dates, pre-emptive filtering, etc.) ahead of analysis.
│   ├── analyze.R     # Your exploratory data analysis.
│   ├── visualize.R   # Where your visualization code goes.
│   └── functions.R   # Project-specific functions.
├── plots/            # Your generated graphics go here.
├── reports/
│   └── notebook.Rmd  # Your analysis notebook. Will be compiled into an .html file by `run.R`.
├── scrape/
│   └── scrape.R      # Scraping scripts that save collected data to the `/data/raw/` directory.
├── config.R          # Global project variables including packages, key project paths and data sources.
├── run.R             # Wrapper file to run the analysis steps, either inline or sourced from component R files.
└── startr.Rproj      # Rproj file for RStudio.
```

An .nvmrc file is included at the project root for Node.js-based scraping. If you prefer to scrape in Python, be sure to add a venv and a requirements.txt; if you're working in Ruby, add a Gemfile.
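Whatever the language, the pattern matches the scrape/scrape.R entry in the directory tree above: collect the data, then save it into data/raw/ for process.R to pick up. In R, a minimal sketch might look like this (the URL, selector and filename are hypothetical):

```r
# scrape/scrape.R: a hypothetical scraper that saves its results to data/raw/
library(upstartr)   # for dir_data_raw()
library(rvest)
library(readr)

page <- read_html('https://example.com/pizza-deliveries')

deliveries <- page %>%
  html_element('table') %>%
  html_table()

write_csv(deliveries, dir_data_raw('scraped-deliveries.csv'))
```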

## See also

startr is part of a small ecosystem of R utilities. Those include:

- upstartr, the companion package that provides startr's core helper functions
- startr-cli, the command-line tool for scaffolding new startr projects

## Version

1.1.0

## License

startr © 2020 The Globe and Mail. It is free software, and may be redistributed under the terms specified in our MIT license.

## Get in touch

If you've got any questions, feel free to send us an email, or give us a shout on Twitter:

| Tom Cardoso | Michael Pereira |
| --- | --- |
| @tom_cardoso | @__m_pereira |


