New York City's Taxi and Limousine Commission (TLC) Trip Data

NYC Taxi

NYC's Taxi and Limousine Commission Trip Data is a collection of trip records including fields capturing pick-up and drop-off locations, times, trip distances, fares, rate types, and driver-reported passenger counts. The data was collected and provided to the NYC TLC by technology providers under the Taxicab & Livery Passenger Enhancement Programs.

Getting started

To start using the package, install it and then load it into your R session. Because the 'nyctaxi' package currently lives on GitHub rather than on CRAN, you have to install it using 'devtools'.

#install.packages("devtools")
#devtools::install_github("beanumber/nyctaxi")

Using the NYC Taxi Trip Data

Two data frames are included in this package: 'green_2016_01_sample' and 'yellow_2016_01_sample'. These are random samples of 1000 observations each, generated with the 'sample' function in base R from the January 2016 green and yellow taxi trip data.

library(nyctaxi)
data(green_2016_01_sample)
head(green_2016_01_sample)
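
The exact call used to build the bundled samples is not shown here. As a minimal sketch of this kind of row sampling with base R's 'sample', assuming a made-up data frame named 'full_month' in place of a real month of trip records:

```r
# Hypothetical stand-in for a full month of trip records (not real TLC data)
set.seed(1)
full_month <- data.frame(
  trip_id = seq_len(50000),
  Trip_distance = round(runif(50000, min = 0.1, max = 20), 2)
)

# Pick 1000 row indices at random, without replacement, then subset
rows <- sample(nrow(full_month), size = 1000)
trip_sample <- full_month[rows, ]

nrow(trip_sample)
```

Sampling row indices rather than values keeps every column of a selected trip together.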

Extracting, Transforming and Loading the data

To access data covering wider time spans, use the 'etl' package to download the data and import it into a database. Please see the documentation for 'etl_extract' for further details and examples.

help("etl_extract.etl_nyctaxi")

The code below creates a directory on your desktop and downloads the NYC green taxi trip data for January 2016 into it. It then transforms (cleans) the data and loads it into a SQLite database.

taxi <- etl("nyctaxi", dir = "~/Desktop/nyctaxi/")

taxi %>%
  etl_extract(years = 2016, months = 1, types = "green") %>%
  etl_transform(years = 2016, months = 1, types = "green") %>%
  etl_load(years = 2016, months = 1, types = "green")

Using the NYC Green Taxi Trip Data

library(dplyr)
library(leaflet)
library(lubridate)

We can use leaflet to visualize the pickup and dropoff locations of the 1000 trips in the green taxi trip dataset:

my_trips <- green_2016_01_sample

# drop trips whose pickup longitude is recorded as 0 (no usable location)
one_cab <- my_trips %>% 
  filter(Pickup_longitude != 0)

leaflet(data = one_cab) %>% 
  addTiles() %>% 
  addCircles(lng = ~Pickup_longitude, lat = ~Pickup_latitude) %>% 
  addCircles(lng = ~Dropoff_longitude, lat = ~Dropoff_latitude, color = "green")
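
The filter above checks only the pickup longitude. A slightly stricter variant, sketched here on made-up coordinates and assuming that a 0 marks a missing GPS reading in any of the four coordinate columns, keeps only trips where both endpoints can be mapped:

```r
library(dplyr)

# Made-up coordinates; a 0 stands in for a missing GPS reading (assumption)
toy_trips <- data.frame(
  Pickup_longitude  = c(-73.95, 0, -73.98, -73.92),
  Pickup_latitude   = c(40.71, 0, 40.75, 40.68),
  Dropoff_longitude = c(-73.99, -73.97, 0, -73.94),
  Dropoff_latitude  = c(40.73, 40.76, 0, 40.70)
)

# Keep only trips where both endpoints have non-zero coordinates
mappable <- toy_trips %>%
  filter(Pickup_longitude != 0, Pickup_latitude != 0,
         Dropoff_longitude != 0, Dropoff_latitude != 0)

nrow(mappable)
```

Whether this stricter filter is appropriate depends on how many trips it discards in the real sample.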

We can use lubridate to parse the pickup and dropoff datetime variables and derive the day of the week for each:

clean_datetime <- my_trips %>% 
  mutate(lpep_pickup_datetime = ymd_hms(lpep_pickup_datetime)) %>%
  mutate(Lpep_dropoff_datetime = ymd_hms(Lpep_dropoff_datetime)) %>% 
  mutate(weekday_pickup = weekdays(lpep_pickup_datetime)) %>%
  mutate(weekday_dropoff = weekdays(Lpep_dropoff_datetime))
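
Base R's 'weekdays' returns plain character strings, so grouped output sorts alphabetically. As an alternative sketch, lubridate's 'wday' can return an ordered factor, which keeps the days in calendar order. The timestamps below are made up and only mimic the format of the trip data:

```r
library(lubridate)

# Made-up timestamps in the same "YYYY-MM-DD HH:MM:SS" form as the trip data
pickups <- c("2016-01-01 00:29:24", "2016-01-02 13:05:10", "2016-01-04 08:15:00")

parsed <- ymd_hms(pickups)

# Ordered factor of day names; levels run Sunday through Saturday by default
pickup_day <- wday(parsed, label = TRUE, abbr = FALSE)

# Numeric day of week: 1 = Sunday, ..., 7 = Saturday by default
wday(parsed)
```

The day names produced with 'label = TRUE' depend on the session locale, as they do with 'weekdays'.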

We can now analyze the number of trips that occurred on each day of the week, along with the average distance, passenger count, and total fare:

clean_datetime %>% 
  group_by(weekday_pickup) %>%
  summarize(N = n(), avg_dist = mean(Trip_distance), 
            avg_passengers = mean(Passenger_count), 
            avg_price = mean(Total_amount))

It looks like Friday and Saturday had the most trips.
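
To make a ranking like this easier to read off, the summary can be sorted by trip count. The sketch below uses a small made-up data frame ('toy_trips' is hypothetical, not TLC data), so its counts say nothing about the real sample:

```r
library(dplyr)

# Made-up trips, only to illustrate the grouping; not real TLC data
toy_trips <- data.frame(
  weekday_pickup = c("Friday", "Friday", "Saturday", "Monday", "Friday", "Saturday"),
  Trip_distance  = c(1.2, 3.4, 2.0, 5.5, 0.8, 4.1)
)

# Count trips per day, then put the busiest day first
by_day <- toy_trips %>%
  group_by(weekday_pickup) %>%
  summarize(N = n(), avg_dist = mean(Trip_distance)) %>%
  arrange(desc(N))

by_day
```

Applying the same 'arrange(desc(N))' step to the summary of 'clean_datetime' would list the busiest pickup days at the top.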




nyctaxi documentation built on Nov. 17, 2017, 3:59 a.m.