library(knitr) opts_chunk$set(fig.width = 6, fig.height = 4)
citibike
is an R package that facilitates Extract - Transform - Load operations for NYC CitiBike Data. citibike
inherits functions from etl , which is the parent package. Similarly to etl
, citibike
allows the user to pass the date arguments and get a populated SQL database back.
citibike
is dependent on etl
. To get started, both packages need to be loaded.
install.packages("devtools") devtools::install_github("beanumber/citibike")
library(etl) library(citibike)
Instantiate an etl
object using a string citibike that determines the class of the object. Specifiy the local directory for data storage.
bikes <- etl("citibike", dir = "local_path") class(bikes)
etl
works with a local or remote database, so as citibike
. If no SQL source has been specified, a local RSQLite
database will be created for the user. However, the user could also specify the path to an existing database using dplyr::src_sql
.
Extract takes years
and months
parameters and allows user to fetch the data of specific date of interest. Note that the data is updated monthly. The default date is the starting month, which is July 2013. If the user enters any date before that, the user will get an error message notification. The user could check the raw folder in the local directory that's created before.
# default etl_extract(bikes) # a specific month etl_extract(bikes, years = 2014, months = 9) # duration etl_extract(bikes, years = 2015, moths = 3:6) # invalid date will bring up error message etl_extract(bikes, months = 3:5)
By default, etl_transform
takes July 2013 data file and transform the raw data into CSV files. Similar to etl_extract
, the user could specify the dates. The user could check the load folder of the local directory that has been created.
bikes %>% etl_transform()
Import the CSV files into SQL and populate the SQL database with the transformed data.
bikes %>% etl_load()
All of the tables contain the following columns:
The user could choose which ones to look at for further analysis.
library(lubridate) library(dplyr)
The following code loads citibike data from 2015 January into a SQLite database.
bikes <- etl("citibike") bikes %>% etl_create(years = 2015, months = 2) trips <- tbl(bikes,"trips") head(trips, 4) trips <- trips %>% collect() citibike_time <- trips %>% select(starttime, stoptime) %>% mutate(starttime = dmy_hm(starttime)) %>% mutate(stoptime = dmy_hm(stoptime)) %>% mutate(duration_time = as.numeric(stoptime-starttime)) citibike_time <- citibike_time %>% filter(duration_time >= 0) mean(citibike_time$duration_time) citibike_start <- trips %>% select(start.station.name, start.station.longitude, start.station.latitude) %>% rename(longitude = start.station.longitude) %>% rename(latitude = start.station.latitude) if (require(leaflet)) { leaflet(data = citibike_start) %>% addTiles() %>% addCircles() } citibike_end <- trips %>% select(end.station.name, end.station.longitude, end.station.latitude) %>% rename(longitude = end.station.longitude) %>% rename(latitude = end.station.latitude) if (require(leaflet)) { leaflet(data = citibike_end) %>% addTiles() %>% addCircles() }
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.