# This chunk ensures that the reedtemplates package is installed and loaded
# This reedtemplates package includes the template files for the thesis and also
# two functions used for labeling and referencing
if(!require(devtools))
  install.packages("devtools", repos = "http://cran.rstudio.com")

if(!require(reedtemplates)){
  library(devtools)
  devtools::install_github("ismayc/reedtemplates")
  }
library(reedtemplates)

Introduction {.unnumbered}

Knock Software's Ride Report app combines a simple thumbs-up/thumbs-down rating system with GPS traces to compile a crowdsourced data set of commuter bicycle rides. Knock's goal is to use this data to help cities identify the most problematic routes in their infrastructure and help cyclists identify the best routes in their area.

From the user's perspective, the app that collects the data is simple: Ride Report automatically detects when a user starts riding their bike, records the GPS trace of the route, and then prompts the user at the end of the ride to give either a thumbs-up or thumbs-down rating. From this, they were able to create a simple "stress map" of Portland, OR, which displays the average ride rating of rides going through each discrete ride segment.

label(path = "figure/stress_map.jpg", 
      caption = "Ride Report's Stress Map for Portland, OR.",
alt.cap = "Ride Report's Stress Map for Portland, OR. Greener road segments incidate less
stressful streets while more red segments indicate more stressful streets. 
``Stress'' is computed by taking the average rating for each segment.", 
      label = "stress-map", type = "figure", scale=0.75)

The app is designed to minimize barriers to response in order to maximize sample size, at the expense of ensuring quality and consistent responses. It automates all of the data collection except the rating, and the rating only requires (and allows) a binary response. There is no direct prompting from the app indicating what criteria cyclists should be using to evaluate their rides, other than the labels for the binary ratings. In addition, riders' rating their rides on \textit{Ride Report} are volunteers, so they are under no obligation to rate their rides. In fact, most of the rides lack ratings, and we have no guarantee that the pattern of missing ratings is ignorable.

The end goal of collecting and studying this data is to be able to accurately map which roads are not serving bicycle commuters well. This paper makes steps toward this goal by building models that predict ride rating based on information other than route. (We did not create models that involve routes, but we do propose ways of modeling routes in r ref("routes",type = "header").) The models we discuss, besides measuring the effect of weather, time of day, and ride length on the probability of a negative rating, address two nontrivial issues in this data: the variability in how different riders rate their rides and the problematic missing ratings.

We are not the first to worry about the issue of variability in riders interpretations of what a good or bad route is. As Meyers, a previous researcher examining the \textit{Ride Report} data observed, "everyone has different standards for what a 'good' or 'bad' ride is, and the data might benefit from randomized IDs attached to each cell device."^[@meyers2015] Thankfully, Ride Report does keep track of which rides belong to each to rider. We model the varying overall tendencies of each rider to rate a ride negatively with random intercepts in a multilevel model. For example, if we let $y_i$ be the rating of the $i$th ride and $X_i$ be the ride-level variables, then we can fit a regression:

$$\mathbb{P}(y_i = 1) = \text{logit}^{-1} \left( \alpha_{j[i]} + X_i \beta \right) ,$$ where $\alpha_j$ is an intercept specific to rider $j$. In addition, the rider intercepts come from a common distribution, $$\alpha_j \sim N (\mu_\alpha, \sigma^2_\alpha),$$ where $\mu_\alpha$ is the mean of all the $\alpha_j$s. Similar models have been used in situations when data consist of subjective ratings, including one study examining how people rate sexual attraction^[@mackaronis2013].

Missing ratings are another important problem in this data set. While we have the route they chose and all associated covariates, the response variable (rating) is missing for many rides. As we will discuss in r ref("missing-data", type="header"), the pattern of non-response is likely to be correlated with the rating the rider would have given, which may mean our parameter estimates are inaccurate. To address this problem, we implement a version of the expectation maximization algorithm for missing data, creating a model that simultaneously estimates the missing data mechanism and the ride rating model.



wjones127/thesis documentation built on May 4, 2019, 7:34 a.m.