# This chunk ensures that the reedtemplates package is installed and loaded
# This reedtemplates package includes the template files for the thesis and also
# two functions used for labeling and referencing
if(!require(devtools))
  install.packages("devtools", repos = "http://cran.rstudio.com")

if(!require(reedtemplates)){
  library(devtools)
  devtools::install_github("ismayc/reedtemplates")
  }
library(reedtemplates)
library(arm)
library(ggplot2)
library(ggthemes)
library(dplyr)
library(gridExtra)
library(ggmap)

Data Sources {#data-sources}

We combine several data sources to do our analysis. Information about individual rides, including the GPS trace, the rider, and start timestamp were provided by Ride Report. Weather data were collected from Weather Underground's archive of the KPDX weather station and a Portland Fire Bureau station.

Our goal in this chapter is to discuss these data and what considerations we should have in mind before exploring it in depth. This includes how and by whom the data were collected, who and what this data are representative of, and what samples were taken of the data.

Some of these considerations, such as the limited demographics represented in the Ride Report data, pose serious limitations to how our inferences can be generalized. Others, such as the large number of missing responses in the Ride Report data, motivate the analysis we are doing in this thesis. Finally, there are other considerations which we will acknowledge here, but addressing them is out of the scope of this thesis. This data set contains an abundance of potential research questions, only a fraction of which could be reasonably addressed in one thesis.

Ride Report

Ride Report's data is the focus of this paper. Knock Software created the app to collect large amounts of information about urban cyclists' routes and experiences on those routes. The hope is that this information will be valuable to city planners.^[Knock's other project is making a cheaper bicycle counter for cities to monitor traffic flow, again intended to be sold to cities wishing to improve bike infrastructure.]

Ride Report's approach to crowd sourcing these data is particularly important to understand. The app automates every piece of the data collection process except for the rating given by the rider. Thus, the app casts aside nuanced and (somewhat) reliable human input in favor of increasing sample size: one could imagine a similar app where users have more control over how the route is recorded, have the ability to rate on a more fine-grained scale, and are given more direction in what they are rating for. This trade off causes two problems with the reliability with the data.

Before we get into the potential issues in the data collection, though, let's examine the data collection process itself. When installed on a person's phone, the Ride Report app attempts to automatically detect when the user starts riding their bicycle, based accelerometer data, when a user leaves a familiar Wi-Fi network, and some other pieces of information. When the app detects the start of a ride, it starts recording a GPS trace. At the end of the users ride, the app detects them getting of their bike (in a similar process to how it detected the start of a ride) and prompts them to give a rating of the ride. The ride data are saved then, even if the user does not provide a rating.

This automatic detection of when a ride starts and stops leads to two related and common errors in the dataset: first, one ride is often split into two or more rides at points, such as at a stoplight or a train crossing, where a cyclist stops for an extended period of time; second, car rides are sometimes misclassified as bicycle rides and vice-versa (car rides are not rated.) The app allows riders to correct the misclassification, but provides no way to join split rides back together.

label(path = "figure/ride_report_rating.jpg", 
      caption = "The Ride Report app's interface",
alt.cap = "The Ride Report app's interface has changed significantly between
versions, including the rating text displayed after a ride. This is the current
version as of Februrary 2015.", 
      label = "ride-rating", type = "figure", scale=.2,
options = "tbh")

The app only recently became publicly available and has undergone significant changes in the course of its life. In particular, while the ratings have always been binary, the labels have changed at various points in time. At one point the rating labels were "Stressful" and "Chill", while now they are labelled "Recommend" and "Avoid" (see r ref("ride-rating", type="figure")). Other fundamentals of the data collection process---such as the binary ratings, the automatic collection of GPS traces of routes---have remained constant.

The data collection method itself has some problems, but there also may be some biases in the population of riders using the app. The app is only available on iOS, so only iPhone owners could use this application which may imply a bias toward riders of higher socioeconomic status. At the time of the start of the thesis, the app was in private beta, meaning only people who actively sought out using the app were able to use it. Now the application is public and on the Apple App Store, making it more widely available. Due to these issues, many of the earlier rides may be people within the developer's personal network. Unfortunately, it's hard to make any solid conclusions about the users of the app because Ride Report doesn't collect any demographic data about their riders.

One other issue with the Ride Report data guided our analysis: privacy. Because the data involves time stamps and GPS locations of people's commutes, the data is very sensitive: one could easily infer someone's home and workplace based on their most common routes. In fact, this data is protected by an end-user license agreement (EULA) which prevents sharing of data, without the explicit permission of those involved. This presented a logistical challenge: how were we to do inference and data exploration without access to the data?

By agreement with Knock Software, identifying data must be kept private. With permission from five riders, Knock was able to give us a small subset consisting of all the rides from those five riders, to be kept confidential. That is the data set we used for prototyping models and some basic exploratory analysis. Knock also agreed to allow us to run models fitting scripts on larger samples of their data set, as long as they were performed on their computers, with no identifying data leaving their system.

While at first this set up seems like an inconvenience, it actually has some advantages. One of the pitfalls of having an entire data set, especially a high dimensional one, is that in performing exploratory analyses it is often too easy to find spurious "statistically significant" results. Instead, we must come up with our models before running them, greatly limiting the choices we can make in the "garden of forking paths."^["The garden of forking paths," a reference to the short story by Jorge Luis Borges, is a term coined by Andrew Gelman to refer to the infinite number of choices researchers have in analyzing a set of data, which often allows for enough flexibility to discover coincidences @gelman2013.]

Weather Data

Slippery roads and formidable winds are no fun for anyone balancing on a two-wheeled vehicle. Weather is, then, one of the most obvious family of predictors for ride rating, at least intuitively. We use the time of a ride to join in data about the weather conditions during the ride, including

We include the first two, temperature and precipitation, to account for rider comfort. A sweltering, frigid, or stormy day could make an unpleasant experience for a bicyclist and thus could lead to more negative ratings.

On the other hand, we include the last two, wet road and gust speed, as factors that impact safety. During and after storms, puddles often accumulate in bike lanes before the center of the road, pushing cyclists into lanes shared by cars, which are often more dangerous.

Gust speeds impact the aerodynamics of a ride, which are particular important for bicyclists. It's one of the main reasons cyclists care about getting into lower (and more aerodynamic) rider positions. Thus, high wind or gust speeds may affect rider rating.

weather_stations <- data.frame(lat = c(45.5887089, 45.5219685),
                               long = c(-122.5968694, -122.670755),
                               label = c("KPDX Weather", "PFB Rain Gauge"))
portland_map <- get_map("portland, oregon",
                        maptype="roadmap",
                        color="bw",
                        zoom=12)
weather_station_map <- ggmap(portland_map) + 
  geom_point(aes(x = long, y = lat), data = weather_stations) + 
  geom_label(aes(x = long - 0.03, y = lat - 0.005, label = label), 
             data = weather_stations,
             nudge_y = -0.007) +
  labs(x = "longitude", y = "latitude")

ggsave("figure/weather_station_map.pdf", weather_station_map, width = 4.5, height = 3.5)

label(path = "figure/weather_station_map.pdf", 
      caption = "Locations of weather data collection sites",
alt.cap = "Locations of weather data collection sites. Daily weather information
was collected at the KPDX weather station at Portland International Airport. 
Hourly precipitation data were collected at the Portland Fire Bureau's rain
gauge in downtown Portland.", 
      label = "weather-stations", type = "figure", scale=1,
options = "tbh")

We are limiting our study to rides in Portland, Oregon. Given this, we can first assume that it may be reasonable to expect that riders are used to the same climate, and thus have somewhat similar responses to weather. This also makes it reasonable to use data from one nearby weather station, rather than attempting to collect from several stations and creating a spatial model for weather.

For daily summaries of weather conditions, we used weather history from the KPDX weather station at Portland International Airport downloaded from Weather Underground.^[@wunderground] From this we were able to get daily weather data, including

We also got hourly rainfall data from a data stream at the Portland Fire Bureau Rain Gage at 55 SW Ash St.,^[@pdxrain] which is just about the geographic center of Portland. This just gives raw uncorrected rain gauge data, but gives us a fine grain look at how much rain there has been recently.

For daily weather data, such as temperature highs and average wind speed, we use information from the KPDX weather station. It is further from the geographic center of the rides we are examining, but because the weather is daily summary statistics, we don't expect closer weather stations to be much more informative. r ref("weather-stations", type="figure") shows the geographic positions of these two stations.

Notation for the Joined Data Set {#data-notation}

We combined the ride records with weather data by joining by start date stamp of the ride. We will denote our set of ride-level predictors, each of which is an $n$ by 1 column vector as,

We will often represent this set of predictors as the ride-level predictor matrix $X = (x^\text{length} \; x^\text{rain} \; x^\text{rain4h} \; x^\text{wind} \; x^\text{gust}\; x^\text{temp}).$ We also have the predictor $t \in [0, 24]^n$, representing the time of day of the ride, measured by hours since midnight. (Because it must be modelled in a different fashion than the other variables, we use the simple notation of a single letter for it.)

Let $y_i = 1$ if the $i$th ride received a negative rating, and $y_i = 0$ if it received a positive rating, for $i = 1, \ldots, n$. Choice of coding which events are 0's and which are 1's is arbitrary when making logistic regression models, though we made this choice because for the sake of our analysis, negatively rated rides are more interesting events. For urban planning applications, they define the areas that need attention.



wjones127/thesis documentation built on May 4, 2019, 7:34 a.m.