Summary

You can use the {oystr} package to process Oyster card journey data in CSV files provided by Transport for London (TfL).

Provide oy_read() with the path to a folder containing the CSVs, then use oy_clean() to prepare the data and extract new features ready for further processing and summarisation.

remotes::install_github("matt-dray/oystr")
library(oystr)

data_raw <- oy_read(path = "data/")  # supply your own path
data_clean <- oy_clean(x = data_raw)

Background

Transport for London (TfL) is responsible for the majority of the transport network in the UK's capital. Transport options include overground and underground trains, buses, trams and riverboats.

TfL operates the Oyster card system on its network. You can load the card with funds and access the network by tapping it on a card reader.

Your card has a unique number and this is recorded each time you 'tap in' and 'tap out' of the network. TfL stores this data for consumption by users for eight weeks. You can view your history on the TfL website or you can be sent a monthly digest by email.

In both cases you can collect the data as a CSV file. The file is laid out for human readers rather than machines, which makes it awkward to process programmatically.

Oyster data

TfL provide the data in a consistent format. Here are the first few rows from an anonymised example where the locations and services have been replaced.

example_raw <- read.csv(file = "../inst/extdata/july.csv")
head(example_raw)

The resulting data frame has some problems.

The aim of the {oystr} package is to help address these issues and more.

Read

The oy_read() function reads and combines multiple CSV files of Oyster data. The function takes one argument: a string that is the path to a folder containing your Oyster data files.

The names of the files don't matter, but the structure of each of them must be in the format provided by TfL.

For example, the file path in the example below leads to a folder containing four files: two are CSVs containing monthly Oyster data in the format supplied by TfL (july.csv and august.csv) and the other two are a CSV and a text file that don't contain Oyster data (not_oyster.csv and not_oyster.txt).

# Print the file names in a folder of Oyster data
list.files("data/")  # supply your own path
list.files("../inst/extdata/")  # example folder used in this document

So we need to read the two CSVs of Oyster data and ignore the others. The {oystr} package can help you. You can install it from GitHub and then load it.

# Install the {remotes} package to help install {oystr} 
install.packages("remotes")
remotes::install_github("matt-dray/oystr")

# Load the package
library(oystr)

The functions of the {oystr} package are now available to you.

Use the oy_read() function to read the data sets. Pass the 'path' argument a file path to a folder containing your files. Here I've used the file path from the section above.

# Read data from a folder
data_raw <- oy_read("data/")  # supply your own path
data_raw <- oy_read("../inst/extdata/")  # example folder used in this document

This gave a warning for one CSV file that was in the wrong format -- the function identified that it didn't contain Oyster data. The text file was ignored.
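The behaviour described above (consider only the CSVs in a folder, warn about and skip any whose structure doesn't match, then combine the rest) can be sketched in base R. This is a minimal illustration and not the {oystr} implementation: the helper name read_matching_csvs(), the 'expected_cols' argument and the column names Date and Charge are all assumptions made for the example.

```r
# Hypothetical helper: a base-R sketch, NOT how oy_read() is implemented.
# `expected_cols` is an assumed set of column names, for illustration only.
read_matching_csvs <- function(path, expected_cols) {
  # Only files ending in .csv are considered; other file types are ignored
  files <- list.files(path, pattern = "\\.csv$", full.names = TRUE)

  frames <- lapply(files, function(f) {
    df <- read.csv(f, stringsAsFactors = FALSE)
    if (!identical(names(df), expected_cols)) {
      # Warn and skip CSVs whose columns don't match the expected format
      warning("Skipping file with unexpected columns: ", basename(f), call. = FALSE)
      return(NULL)
    }
    df
  })

  # Drop the skipped files and stack the remaining data frames row-wise
  frames <- Filter(Negate(is.null), frames)
  do.call(rbind, frames)
}

# Build a throwaway folder: two matching CSVs, one non-matching CSV, one text file
demo_dir <- tempfile("csv_demo")
dir.create(demo_dir)
write.csv(data.frame(Date = "01-Jul-2019", Charge = 1.5),
          file.path(demo_dir, "july.csv"), row.names = FALSE)
write.csv(data.frame(Date = "01-Aug-2019", Charge = 2.4),
          file.path(demo_dir, "august.csv"), row.names = FALSE)
write.csv(data.frame(a = 1, b = 2),
          file.path(demo_dir, "not_oyster.csv"), row.names = FALSE)
writeLines("hello", file.path(demo_dir, "not_oyster.txt"))

# Emits one warning (for not_oyster.csv) and returns the two matching rows
combined <- read_matching_csvs(demo_dir, expected_cols = c("Date", "Charge"))
combined
```

Checking column names is only one possible way to detect the wrong format; the real function may use a different test.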

Let's take a look at the structure of the data.

str(data_raw)

The output data frame contains rows of data from both CSV files. We can tell because it has data from both months of the input files (July and August in this example).

# Count rows from each month of input data
july <- sum(grepl("Jul", data_raw$Date))
august <- sum(grepl("Aug", data_raw$Date))

# Check their sum equals the output from oy_read()
july + august == nrow(data_raw)

Clean

There are still a number of problems after the data is read in and the blank first row is removed.

You can clean up the formatting and engineer new features with oy_clean().

# Clean the output from oy_read()
data_clean <- oy_clean(data_raw)

# See the data frame structure
str(data_clean) 

So what did oy_clean() do? You can see there are some new and altered columns.

The data are now in a suitable shape for further processing, such as summarisation and visualisation.
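As a rough idea of what further processing could look like, here is a base-R sketch that summarises a small stand-in data frame. The values are made up and the 'weekday' column is hypothetical; only the names journey_duration and balance are taken from this document, and the real output of oy_clean() may be structured differently.

```r
# Toy stand-in for cleaned data: values and the `weekday` column are invented
toy <- data.frame(
  weekday = c("Mon", "Mon", "Tue", "Tue", "Tue"),
  journey_duration = c(30, 34, 28, 40, 31),  # assumed to be minutes
  balance = c(20.0, 17.6, 15.2, 12.8, 10.4)
)

# Mean journey duration for each weekday
mean_duration <- aggregate(journey_duration ~ weekday, data = toy, FUN = mean)
mean_duration

# Lowest balance seen in the toy data
min(toy$balance)
```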

Summarise

For now, you can summarise the data in two ways: as a list of summary statistics or as a plot. These functions are underdeveloped while I wait for more information about the possible formats of data from the Journey/Action column.

List

You can use the oy_summary() function to create a list where each element is some statistic. By default, the function works for train journeys.

oy_summary(data_clean)

You can also specify bus journeys.

oy_summary(data_clean, mode = "Bus")

See ?oy_summary for details.

Plot

There is limited support for plotting too. You can call oy_lineplot() for a very simple line plot of some continuous variables over time. For example, here's the balance remaining on the card over time.

oy_lineplot(data_clean, y_var = "balance")

And here's journey duration with weekends removed.

oy_lineplot(data_clean, y_var = "journey_duration", weekdays = FALSE)
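If you want to drop weekends yourself before plotting or summarising, one base-R approach is to test the day of the week for each date. This is a sketch on made-up dates and is not how oy_lineplot() filters its input; the 'dates' vector is an assumption for illustration.

```r
# Made-up dates: a Friday, a Saturday, a Sunday and a Monday
dates <- as.Date(c("2019-07-05", "2019-07-06", "2019-07-07", "2019-07-08"))

# as.POSIXlt()$wday is locale-independent: 0 = Sunday, 6 = Saturday
is_weekend <- as.POSIXlt(dates)$wday %in% c(0, 6)

# Keep only the weekday dates
weekday_dates <- dates[!is_weekend]
weekday_dates
```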


matt-dray/oystr documentation built on Jan. 20, 2021, 6:43 a.m.