library(knitr)
opts_chunk$set(fig.width = 6, fig.height = 4)

NYC 311 Open Data

3-1-1 is a telephone number used in the United States for citizens to get access to non-emergency municipal services. The nyc311 package provides an interface to NYC 311 Phone Call information from NYC OpenData. Because these are medium data, the nyc311 packages leverages the etl package for creating and maintaining an SQL database.

Getting started:

Install packages

The etl package provides the generic framework for the nyc311 package. Since the nyc311 package currently lives on GitHub and not on CRAN, you have to install it using devtools.

install.packages("devtools")
devtools::install_github("beanumber/nyc311")

Load the package

This command loads both etl and dplyr.

library(nyc311)

Note:In order to initialize an etl object above using a src_sqlite database connection, you need to have RSQLite installed in your System Library.

Examples

Create an etl object:

First, you must create an etl object.

The etl function has three key attributes -- object, dir, and db.

calls <- etl("nyc311")
summary(calls)

By default, a SQLite database is created in a temporary directory. Please read the etl doucumentation for more information.

help(etl)

Populate the database:

First, we will populate the database with data from January and February of 2010 and 2011. These data will be downloaded directly from the NYC OpenData portal.

calls %>%
  etl_extract(years = 2010:2011, months = 1:2, num_calls = 100) %>%
  etl_transform(years = 2010:2011, months = 1:2) %>%
  etl_load(years = 2010:2011, months = 1:2)

The SQLite database that has just been created is populated with 100 calls from each of these four months.

calls

Now you can pull the data that you are interested in into your R Environment.

my_calls <- calls %>%
  tbl("calls") %>%
  collect()

And take a look at it.

glimpse(my_calls)

Since all of the Date variables have been converted into character vectors, it would be helpful to transform the data types.

Transform data types:

library(lubridate)

calls_cleaned <- my_calls %>%
  mutate(closed_date = ymd_hms(closed_date)) %>%
  mutate(created_date = ymd_hms(created_date)) %>%
  mutate(resolution_action_updated_date =
           ymd_hms(resolution_action_updated_date))

Now, take a look at it again.

glimpse(calls_cleaned)

How long does it take for NYC agencies to solve an issue?

One obvious question is how many days it takes for NYC agencies to resolve each case. We can compute this by subtracting closed_date from created_date.

calls_new <- calls_cleaned %>%
  mutate(difference = as.numeric(closed_date-created_date))%>%
  arrange(difference)

Here, we get the number of seconds it takes to resolve each case under a colunmn called difference.

Let's take a look at the first 5 rows of calls_new.

calls_new %>%
  select(created_date, closed_date, difference) %>%
  head(5)

As you can see, some of the difference cells have negative entries. Since a case must be created before it is closed, some of the NYC agencies must have made some mistakes in the process of tracking and recording Phone Call information.

Therefore, we probably want to filter out the readings that have negative difference values. Then, we can the average number of days by dividing the average number of seconds by 24*60*60 seconds.

calls_new %>%
  filter(difference >= 0) %>%
  summarize(N = n(), mean_hrs = mean(difference) / (24 * 60 * 60))

For these calls, it took the agencies approximately 13 hours and 24 minutes to solve and close an issue.

How long does it take for different NYC agencies to solve an issue?

Does the average number of hours it takes to solve an issue vary among multiple NYC departments?

agency_diff <- calls_new %>%
  filter(difference >= 0) %>%
  group_by(agency_name) %>%
  summarize(number_hrs = mean(difference) / (60 * 60 * 24)) %>%
  arrange(number_hrs) 
agency_diff

It seems that Department of Housing Preservation and Development is doing a better job on quickly solving problems. It takes it on average less than 13 minutes to solve each issue.

How long does it take for NYC agencies to solve differnt types of issue?

Are there differences in the average number of hours it takes to solve each type of complaint?

complaint_diff <- calls_new %>%
  filter(difference >= 0) %>%
  group_by(complaint_type) %>%
  summarize(number_hrs = mean(difference)/(60*60*24)) %>%
  arrange(number_hrs) 
complaint_diff %>%
  head(10)

As you can see from the list above, HEATING takes the least time to resolve!

Where did General Construction issues occur in NYC?

Most of the calls have a latitude and longitude coordinate record, so we can contextualize these observations using a map. In this example, we illustrate where general construction issues occurred.

complaint_type <- calls_new %>%
  group_by(complaint_type) %>%
  summarize(N = n()) %>%
  arrange(desc(N)) 

general_construction_location <- my_calls %>%
  filter(complaint_type == "GENERAL CONSTRUCTION") %>%
  select(latitude, longitude) %>%
  mutate(latitude = as.numeric(latitude), longitude = as.numeric(longitude))
class(general_construction_location$latitude)
if (require(leaflet)) { 
  leaflet(data = general_construction_location) %>% 
    addTiles() %>%
    addCircles()
}

The blue dots shown in the map above represent NYC construction sites that had been complained about in the months specified.

More Examples: 2014 New York City New Year Celebration

The 2014 New York City New Year Celebration took place in Times Square, and let's take a look at what the issue that annoys people living in New York City the most is on New Year's Day.

Here, I am using etl_update which is equivalent to a combination of etl_extract, etl_transform, and etl_load. Instead of having different arguments for each one of the three functions, you can enter all arguments needed in etl_update.

calls2 <- etl("nyc311")
calls2 %>%
  etl_update(years = 2014, months = 1, num_calls = 100 )

calls2 <- calls2 %>%
  tbl("calls") %>%
  collect()
new_year <- calls2 %>%
  select(latitude, longitude, complaint_type)

new_year %>%
  group_by(complaint_type) %>%
  summarise(N = n()) %>%
  arrange(desc(N)) %>%
  head(3)

if (require(leaflet)) { 
  leaflet(data = new_year) %>% 
    addTiles() %>%
    addCircles()
}

As shown in the table and the map above, most of the first 100 calls reported HEATING problems happened NEW YORK country.

An Interesting Article about NYC311

Steven Johnson wrote an interesting article about NYC311 dataset, and this article can be found by clicking the link below.

Steven Johnson, "What A Hundred Million Calls to 311 Reveal About New York” November 10, 2010.

Steven started his article by introducing the "new aroma" event which took place in New York City about 15 years ago. The city officials realized that the NYC311 phone call tracking system data actually provided clues about where the arma was from.

By using this packge to analyze NYC311 data, you will be able to find more interesting "clues" about what is happening in New York City.

Acknowledgement

I want to thank Professor Benjamin Baumer, Assistant Professor in Statistical & Data Sciences at Smith College, for teaching me all the techniques that I need for the completion of this package.

I also want to thank Shuqi Mao, Connie Zhang, and Crystal Zang for helping with the testing of this package. They gave me very useful comments on how to make this vignette more straightforward.



beanumber/nyc311 documentation built on May 12, 2019, 9:43 a.m.