# This chunk ensures that the reedtemplates package is installed and loaded # This reedtemplates package includes the template files for the thesis and also # two functions used for labeling and referencing if(!require(devtools)) install.packages("devtools", repos = "http://cran.rstudio.com") if(!require(reedtemplates)){ library(devtools) devtools::install_github("ismayc/reedtemplates") } library(reedtemplates)
Knock Software's Ride Report app combines a simple thumbs-up/thumbs-down rating system with GPS traces to compile a crowdsourced data set of commuter bicycle rides. Knock's goal is to use this data to help cities identify the most problematic routes in their infrastructure and help cyclists identify the best routes in their area.
From the user's perspective, the app that collects the data is simple: Ride Report automatically detects when a user starts riding their bike, records the GPS trace of the route, and then prompts the user at the end of the ride to give either a thumbs-up or thumbs-down rating. From this, they were able to create a simple "stress map" of Portland, OR, which displays the average ride rating of rides going through each discrete ride segment.
label(path = "figure/stress_map.jpg", caption = "Ride Report's Stress Map for Portland, OR.", alt.cap = "Ride Report's Stress Map for Portland, OR. Greener road segments incidate less stressful streets while more red segments indicate more stressful streets. ``Stress'' is computed by taking the average rating for each segment.", label = "stress-map", type = "figure", scale=0.75)
The app is designed to minimize barriers to response in order to maximize sample size, at the expense of ensuring quality and consistent responses. It automates all of the data collection except the rating, and the rating only requires (and allows) a binary response. There is no direct prompting from the app indicating what criteria cyclists should be using to evaluate their rides, other than the labels for the binary ratings. In addition, riders' rating their rides on \textit{Ride Report} are volunteers, so they are under no obligation to rate their rides. In fact, most of the rides lack ratings, and we have no guarantee that the pattern of missing ratings is ignorable.
The end goal of collecting and studying this data is to be able to accurately
map which roads are not serving bicycle commuters well. This paper makes steps
toward this goal by building models that predict ride rating based on information
other than route. (We did not create models that involve routes, but we do propose
ways of modeling routes in r ref("routes",type = "header")
.) The models we discuss,
besides measuring the effect of weather, time of day, and ride length on the
probability of a negative rating, address two nontrivial issues in this data:
the variability in how different riders rate their rides and the problematic
missing ratings.
We are not the first to worry about the issue of variability in riders interpretations of what a good or bad route is. As Meyers, a previous researcher examining the \textit{Ride Report} data observed, "everyone has different standards for what a 'good' or 'bad' ride is, and the data might benefit from randomized IDs attached to each cell device."^[@meyers2015] Thankfully, Ride Report does keep track of which rides belong to each to rider. We model the varying overall tendencies of each rider to rate a ride negatively with random intercepts in a multilevel model. For example, if we let $y_i$ be the rating of the $i$th ride and $X_i$ be the ride-level variables, then we can fit a regression:
$$\mathbb{P}(y_i = 1) = \text{logit}^{-1} \left( \alpha_{j[i]} + X_i \beta \right) ,$$ where $\alpha_j$ is an intercept specific to rider $j$. In addition, the rider intercepts come from a common distribution, $$\alpha_j \sim N (\mu_\alpha, \sigma^2_\alpha),$$ where $\mu_\alpha$ is the mean of all the $\alpha_j$s. Similar models have been used in situations when data consist of subjective ratings, including one study examining how people rate sexual attraction^[@mackaronis2013].
Missing ratings are another important problem in this data set.
While we have the route they chose and all associated covariates, the
response variable (rating) is missing for many rides. As
we will discuss in r ref("missing-data", type="header")
, the pattern
of non-response is likely to be correlated with the rating the rider would have
given, which may mean our parameter estimates are inaccurate. To address this problem, we
implement a version of the expectation maximization
algorithm for missing data, creating a model that simultaneously estimates the
missing data mechanism and the ride rating model.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.