# Conclusion {.unnumbered}

\setcounter{chapter}{7} \setcounter{section}{0}

By focusing on minimizing barriers to responding and automating as much of the data collection as possible, the designers of the Ride Report app created an infrastructure that could collect large numbers of ride ratings. But sample size isn't everything: the subjectivity of the ratings and the pattern of missing ratings make this a treacherous data set to model naively.

The subjective ratings pose a problem particularly when they are used to infer the quality of particular road segments: if most of the rides over a segment come from a single rider, then the typical rating for that segment reflects that rider's interpretation of the rating scale more than the segment itself. Our models from `r ref("model-chapter", type = "header")` confirm that modeling ride rating with rider intercepts is essential. Adding rider intercepts to a multivariate regression model increased the cross-validated AUC from 0.552---little better than the null model---to 0.797. These intercepts turned out to encode much more than a rider's baseline tendency to rate a ride negatively; `r ref("time-pred-plot")` showed how much information rider intercepts carried about riders' typical time of day, and riders' typical routes are likely encoded in these intercepts as well. Future research should pay special attention to how these intercepts change when routes are incorporated into the models.
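To make these rider intercepts concrete, the sketch below fits a logistic regression with a per-rider random intercept using lme4. This is an illustration under our own assumptions, not the exact model from `r ref("model-chapter", type = "header")`; the data frame `rides` and the columns `rating`, `hour`, `weekend`, and `rider` are hypothetical placeholders.

```r
# Hypothetical sketch: a ratings model with per-rider intercepts,
# encoded as random effects via lme4. The data frame `rides` and its
# columns are placeholders, not the thesis's actual variables.
library(lme4)

rider_model <- glmer(rating ~ hour + weekend + (1 | rider),
                     data = rides, family = binomial)

# The fitted per-rider intercepts, analogous to those discussed above:
ranef(rider_model)$rider
```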

Our missing data models showed some questionable results, though it is hard to know whether those issues stem from the quality of the unrated rides or from a flaw in the model itself. If many of the unrated rides are not actually bicycle rides---which we suspect is the case---then these missing data models will not be appropriate until the misclassification of non-bike rides as bike rides is resolved.

In some ways, this work is incomplete. To leverage its insights in creating a map of good and bad routes, models that use ride route information need to be developed and implemented. Both the theoretical development and the technical implementation are difficult problems in their own right. There are many ways one could model the relationship between ride rating and route, and it is difficult to find a good theoretical justification for any particular one. Even if a good theoretical model can be formulated, it will likely not be simple to implement, both because such models will probably not be supported by common model-fitting packages and because matching the GPS traces to the road network model is a difficult inference problem in itself.

\appendix

# A Code Sample of the EM Algorithm {#em-implementation}

Despite its attractive features, there are few explicit explanations of how to actually program the EM algorithm for a missing response using weights. The theoretical development is contained in `r ref("em-algorithm", type = "header")`, but for the benefit of the reader, we lay out the practical implementation here.

For this example, we present the same code used for the simulation in `r ref("em-algorithm", type = "header")`. The data can be simulated with:

```r
library(dplyr)  # for %>%, filter(), mutate(), and bind_rows()
library(mgcv)   # for gam()

inv_logit <- function(x) 1 / (1 + exp(-x))

# Simulate a covariate x, a binary response y, and a missingness
# indicator r whose probability depends on y (missing not at random).
n <- 1e3
x <- rnorm(n, 0, 1)
y <- rbinom(n, 1, prob = inv_logit(4 * x))
r <- rbinom(n, 1, prob = inv_logit(0.3 + 0.4 * y))

simulated_data <- data.frame(x, y, r)
simulated_data$y <- ifelse(simulated_data$r == 1, NA, simulated_data$y)
```

For convenience, we define the two models as functions so that we can refit them at each iteration.

```r
# Model for the missingness indicator r, given x and the (possibly
# imputed) response y_pred.
fit_r <- function(data, weights = NULL) {
  gam(r ~ x + y_pred, data = data, family = binomial, weights = weights)
}

# Model for the response itself.
fit_y <- function(data, weights = NULL) {
  gam(y_pred ~ x, data = data, family = binomial, weights = weights)
}
```

First, we separate the data into the portion with a complete response and the portion missing the response. To keep the code clear and simple, we use the dplyr package.

```r
data_complete <- simulated_data %>% filter(!is.na(y)) %>% mutate(weight = 1)
data_missing <- simulated_data %>% filter(is.na(y)) %>% mutate(weight = NA)
```

We then start the algorithm with our initial guesses at the models for y and r:

```r
# fit_y() models y_pred, so data_complete needs that column; seed it
# with the observed responses.
data_complete$y_pred <- data_complete$y
model_y <- fit_y(data_complete)

# Initialize y_pred on the full data set with hard 0/1 predictions so
# that the missingness model can be fit.
simulated_data$y_pred <- (predict(model_y,
                                  newdata = simulated_data,
                                  type = "response") > 0.5) %>% as.numeric()
model_r <- fit_r(simulated_data)
```

Finally, we run the main loop of EM iterations. There are two stopping conditions: the loop ends when the algorithm reaches the maximum number of iterations, or when the improvement in the response model's AIC over the previous iteration is less than the tolerance.

```r
last_aic <- AIC(model_y)

for (i in 1:1000) {
  # E step: probability that each missing response equals 1, given x.
  y_pred <- predict(model_y, newdata = data_missing, type = "response")

  # Probability of missingness under each possible value of y.
  pred_r_y1 <- predict(model_r,
                       newdata = mutate(data_missing, y_pred = 1),
                       type = "response")
  pred_r_y0 <- predict(model_r,
                       newdata = mutate(data_missing, y_pred = 0),
                       type = "response")

  # Posterior weights for y = 1 and y = 0 on the missing rows.
  denom <- (y_pred * pred_r_y1) + ((1 - y_pred) * pred_r_y0)
  w_y1 <- y_pred * pred_r_y1 / denom
  w_y0 <- (1 - y_pred) * pred_r_y0 / denom

  # Each missing row appears twice in the augmented data: once with
  # y_pred = 1 and once with y_pred = 0, weighted by its posterior.
  data_augmented <- bind_rows(data_complete,
                              mutate(data_missing,
                                     weight = w_y1,
                                     y_pred = 1),
                              mutate(data_missing,
                                     weight = w_y0,
                                     y_pred = 0))

  # M step: refit both models on the weighted, augmented data.
  model_y <- fit_y(data_augmented, data_augmented$weight)
  model_r <- fit_r(data_augmented, data_augmented$weight)

  # Check the stopping condition.
  current_aic <- AIC(model_y)
  print(current_aic)
  if ((i > 1) && (last_aic - current_aic < 0.0001)) break
  last_aic <- current_aic
}
```
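Once the loop exits, it is worth checking the fitted models against the data-generating process. A minimal check (our addition, not part of the original simulation code): the `x` coefficient in `model_y` should be near the true value of 4, while `model_r` should recover an intercept near 0.3, a `y_pred` coefficient near 0.4, and an `x` coefficient near zero.

```r
# Compare fitted coefficients to the data-generating values.
coef(model_y)  # expect: (Intercept) near 0, x near 4
coef(model_r)  # expect: (Intercept) near 0.3, x near 0, y_pred near 0.4
```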

