library(knitr)
opts_chunk$set(fig.width = 6, fig.height = 4)

ETL

citibike is an R package that facilitates Extract - Transform - Load operations for NYC CitiBike Data. citibike inherits functions from etl , which is the parent package. Similarly to etl, citibike allows the user to pass the date arguments and get a populated SQL database back.

ETL citiBike

citibike is dependent on etl. To get started, both packages need to be loaded.

install.packages("devtools")
devtools::install_github("beanumber/citibike")
library(etl)
library(citibike)

Instantiate an etl object using a string citibike that determines the class of the object. Specifiy the local directory for data storage.

bikes <- etl("citibike", dir = "local_path")
class(bikes)

Connect to a local or remote database

etl works with a local or remote database, so as citibike. If no SQL source has been specified, a local RSQLite database will be created for the user. However, the user could also specify the path to an existing database using dplyr::src_sql.

Extract

Extract takes years and months parameters and allows user to fetch the data of specific date of interest. Note that the data is updated monthly. The default date is the starting month, which is July 2013. If the user enters any date before that, the user will get an error message notification. The user could check the raw folder in the local directory that's created before.

# default
etl_extract(bikes)

# a specific month
etl_extract(bikes, years = 2014, months = 9)

# duration 
etl_extract(bikes, years = 2015, moths = 3:6)

# invalid date will bring up error message
etl_extract(bikes, months = 3:5)

Transform

By default, etl_transform takes July 2013 data file and transform the raw data into CSV files. Similar to etl_extract, the user could specify the dates. The user could check the load folder of the local directory that has been created.

bikes %>%
  etl_transform()

Load

Import the CSV files into SQL and populate the SQL database with the transformed data.

bikes %>%
  etl_load()

About the dataset

All of the tables contain the following columns:

The user could choose which ones to look at for further analysis.

Examples

library(lubridate)
library(dplyr)

The following code loads citibike data from 2015 January into a SQLite database.

bikes <- etl("citibike")
bikes %>%
  etl_create(years = 2015, months = 2)

trips <- tbl(bikes,"trips")
head(trips, 4)

trips <- trips %>%
  collect()

citibike_time <- trips %>%
  select(starttime, stoptime) %>%
  mutate(starttime = dmy_hm(starttime)) %>%
  mutate(stoptime = dmy_hm(stoptime)) %>%
  mutate(duration_time = as.numeric(stoptime-starttime))

citibike_time <- citibike_time %>%
  filter(duration_time >= 0)

mean(citibike_time$duration_time)

citibike_start <- trips %>%
  select(start.station.name, start.station.longitude, start.station.latitude) %>%
  rename(longitude = start.station.longitude) %>%
  rename(latitude = start.station.latitude)

if (require(leaflet)) {
  leaflet(data = citibike_start) %>% 
    addTiles() %>%
    addCircles()
}

citibike_end <- trips %>%
  select(end.station.name, end.station.longitude, end.station.latitude) %>%
  rename(longitude = end.station.longitude) %>%
  rename(latitude = end.station.latitude)

if (require(leaflet)) {
  leaflet(data = citibike_end) %>% 
    addTiles() %>%
    addCircles()
}


beanumber/citibike documentation built on June 4, 2019, 9:46 a.m.