Data and nyctaxi Package

What data are included in this package?

The nyctaxi R package allows users to download, clean, and load data into SQL databasses. There are four types of data that are available for users to get access to, and they are trip level yellow taxi data from 2009 to the most recent month, trip level green taxi data from August 2013 to the most recent month, uber pick-up data from April to September 2014 and from Janaury to June 2015, and weekly-aggregated lyft trip data from 2016 to the most recent week.

Yellow Taxi

The total size of all yellow taxi trip data csv files (from Jan 2010 to Dec 2016) is 191.38 GB, and NYC yellow taxi trip data from Jan 2009 to the most recent month can be found on the NYC Taxi & Limousine Commission (TLC) website. [@datayellow] The data were collected and provided to the NYC TLC by technology providers authorized under the Taxicab & Livery Passenger Enhancement Programs (TPEP/LPEP).

The yellow taxi trip records include the following fields: pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts.

Green Taxi

The total size of green taxi trip data csv files (from Aug 2013 to Dec 2016) is 7.8 GB, and green taxi trip data from Aug 2013 to the most recent month can be downloaded from NYC Taxi & Limousine Commission (TLC). [@datayellow] Green taxi trip records include the same variables as yellow taxi trip records do.

TLC Summary Report

The New York City TLC publishes summary reports that include aggregate statistics about taxi, Uber, and Lyft usage. These are in addition to the trip-level data; although the summary reports contain much less detail, they’re updated more frequently, which provides a more current glimpse into the state of the cutthroat NYC taxi market. [@datayellowmonth]

In addition, the trip level NYC Uber data only covers two periods, from April to September 2014 and from January to June 2015. However, the summary reports cover weekly-aggreagated data from 2015 to the most recent week.

The data can be accessed by using the following commands:

download.file("http://www.nyc.gov/html/tlc/downloads/csv/data_reports_monthly_indicators_yellow.csv", destfile = "~/Desktop/yellow_monthly_data.csv")
download.file("http://data.cityofnewyork.us/api/views/2v9c-2k7f/rows.csv?accessType=DOWNLOAD", destfile = "~/Desktop/fhv_weekly_data.csv")

Uber

The total size of Uber pick-up data (from Apr to Sep 2014 and from Jan to June 2015) is 900 MB, and thanks to FiveThirtyEight who obtained the data from NYC TLC by submitting a Freedom of Information Law request on July 20, 2015, these data are now open to public. [@datauber]

The 2014 Uber data contains four variables: Date/Time (the date and time of the Uber pick-up), Lat (the latitude of the Uber pick-up), Lon (the longitude of the Uber pick-up), and Base (the TLC base company code affiliated with the Uber pickup).

The 2015 Uber data contains four variables: Dispatching_base_num (the TLC base company code of the base that dispatched the Uber), Pickup_date (the date of the Uber pick-up), Affiliated_base_num (the TLC base company code affiliated with the Uber pickup), and locationID (the pick-up location ID affiliated with the Uber pickup).

NYC Open Data also provides weekly-aggreagated Uber pick-up data from 2015 to the most recent month. [@datauberweek]

Lyft

The total size of weekly-aggregated Lyft trip data (from Jan 2015 to Dec 2016) is 914.9 MB, and these data are open to public and weekly-aggregated Lyft data from 2015 to the most recent week can be found on NYC OpenData website. [@datalyft]

Taxi zone shapefile

Two datasets are attached to nyctaxi. The first one is called taxi_zone_lookup, and this dataset contains information, such as taxi zone location IDs, location names, and corresponding boroughs for each ID. [@datayellow] A shapefile containing the boundaries for the taxi zones, taxi_zones, is also included in the package for users to do spatial analysis.

data("taxi_zones")

ETL nyctaxi Package

etl is the parent package of nyctax. etl provides a framework that allows R users to work with medium data without any knowledge in SQL database. Users can run SQL queries by using dplyr commands in R and choose to only return the final result, which could be a summary table, from SQL database into R Environment in order to aovid R from crashing. The user interaction takes place solely within R.

etl framework has three operations -Extract, Transfer, and Load- which bring real-time data into local or remote SQL databases. Users can specify which type of SQL database they prefer to connect to. etl-dependent packages, such as nyctaxi, make medium data more accessible to a wider audience. [@baumer2014]

nyctaxi was initially designed to work with New York City taxi data, but later on Uber and Lyft data were added and the ETL functions are modified to be specialized in working with these data. This package compiled three major sources of hail service in New York City so that it is convenient for users to compare and contrast the performance of these three services. [@pkgnyctaxi]

This package inherits functions from many packages: etl, dplyr, DBI, rlang, and stringr.

Since SQL databases are good tools for medium data analysis, ETL functions build connection to a SQL database at the back end and convert R code automatically into SQL queries and send them to the SQL database to get data tables containing data of each hail service. Thus, users do not need to have any knowledge of SQL queries and they can draw in any subsets of the data from the SQL database in R.

Getting started

To start using the package, load the 'nyctaxi' package into your R session. Since the 'nyctaxi' package currently lives on CRAN, you can install it by running the follwoing commands:

install.packages("nyctaxi")
library(nyctaxi)

Extract-Transform-Load Operation

Users need to create an etl object in order to apply the etl operations to it, and only the name of the SQL database, working directory, and type of SQL database need to be specified during initialization. If the type of SQL database is not specified, a local RSQLite database will be generated as default.

# initializing an etl object
db <- src_mysql("nyctaxi", user = "urname", host = "host", password = "pw")
taxi <- etl("nyctaxi", dir = "~/Desktop/nyctaxi", db)

SQL Database Initialization

init.mysql() is written under nyctaxi to help users to set up five basic table structures for MySQL database. yellow_old is created for Yellow Taxi data that are prior to August 2016, and yellow is created for data later than July 2016. green, uber, and lyft are also initiated for the three transportations.

etl_init() can be run after a database connection is built to process to process init.mysql() to initialize a MySQL database, and default columns with the correct variable names and typed defined will be automatically generated.

taxi %>%
  etl_init()

Extract

etl_extract.etl_nyctaxi() allows users to download New York City yellow taxi, green taxi, Uber, and Lyft data from the corresponding data sources. It takes the years, months, and type parameters and download the New York City taxi data specified by users. New York City Yellow and Green Taxi data are updated on NYC Taxi & Limousine Commission (TLC) website on a monthly basis.

taxi %>% 
   etl_extract(years = 2014:2016, months = 1:12, type = c("yellow", "green"))

Transform

etl_transform.etl_nyctaxi() allows users to transform New York City yellow taxi, green taxi, Uber, and Lyft data into cleaned formats, and it utlizes different data cleaning techniques when it transforms data for each transportation type. In general, it cleans the data and creates a new csv file in the load directory to store the cleaned data. It helps us to retain and protect raw data from being modified or destroyed. Users are allowed to specify the month of interest in order to only transform the data that they are interested in. This functionality helps people to be more efficient with their use of time.

By default, it takes the current year Yellow taxi trip records data files, and save copies of them in the load diectory. It skips the cleaning step, because the raw Yellow Taxi data downloaded from TLC is already in a desired format with all variables correctly labelled.

taxi %>% 
   etl_transform(years = 2014:2016, months = 1:12, type = c("yellow", "green", "uber", "lyft"))

Load

etl_load.etl_nyctaxi() allows users to load New York City yellow taxi, green taxi, Uber, and Lyft data into different data tables in a SQL database. It populates a SQL database with data cleaned by etl_transform.

taxi %>% 
   etl_load(years = 2014:2016, months = 1:12, type = c("yellow", "green", "uber", "lyft"))

Case Study

library(nyctaxi)
library(dplyr)
library(leaflet)
library(lubridate)

Using NYC Taxi Trip Data

4 datasets are attached to this package: green_2016_01_sample and yellow_2016_01_sample are two of them. They are random samples of 100 observations generated by the sample() function in base R from the January 2016 green and yellow taxi trip data.

head(green_2016_01_sample)
green_2016_01_sample <- green_2016_01_sample %>%
  filter(is.na(pickup_longitude) == FALSE) %>%
  filter(is.na(pickup_latitude) == FALSE) %>%
  filter(is.na(pickup_longitude) == FALSE) %>%
  filter(is.na(pickup_longitude) == FALSE)

Visualization

We can use leaflet to visualize the pickup and dropoff locations of the first 100 trips in the green taxi trip dataset. There are also shapefiles of NYC Taxi zones attached to nyctaxi R package. Let's plot them:

green_2016_01_sample %>%
  head(100) %>%
  leaflet() %>% 
  addTiles() %>% 
  addCircleMarkers(~pickup_longitude, ~pickup_latitude, radius = 2)

Play with Datetime Variable

We can use lubridate to clean datetime variable:

clean_datetime <- green_2016_01_sample %>% 
  mutate(lpep_pickup_datetime = ymd_hms(lpep_pickup_datetime)) %>%
  mutate(lpep_dropoff_datetime = ymd_hms(lpep_dropoff_datetime)) %>% 
  mutate(weekday_pickup = weekdays(lpep_pickup_datetime)) %>%
  mutate(weekday_dropoff= weekdays(lpep_dropoff_datetime))

We can now analyze the number of trips occurred on each day of a week:

clean_datetime %>% 
  group_by(weekday_pickup) %>%
  summarize(N = n(), avg_dist = mean(trip_distance), 
            avg_passengers = mean(passenger_count), 
            avg_price = mean(total_amount))

It looks like on Friday and Saturday had the most trips.

Using Extract-Transform-Load Operations to Get Access Green Taxi data

To access data during wider time spans, make use of the 'etl' package to download the data and import it into a database. Please see the documentation for etl_extract for further details and examples.

help("etl_extract.etl_nyctaxi")

The code below creates a directory on your local desktop and downloads NYC taxicab trip data from Janaury, 2016 to your local directory. It also transforms/cleans the data and loads it to a sqlite database.

taxi <- etl("nyctaxi", dir = "~/Desktop/nyctaxi/")
taxi %>%
  etl_extract(years = 2016, months = 1, types = "green") %>% 
  etl_transform(years = 2016, months = 1, types = "green") %>% 
  etl_load(years = 2016, months = 1, types = "green")

Reference



beanumber/nyctaxi documentation built on May 12, 2019, 9:43 a.m.