```r
# This chunk ensures that the reedtemplates package is installed and loaded.
# The reedtemplates package includes the template files for the thesis and
# two functions used for labeling and referencing.
if (!require(devtools)) install.packages("devtools", repos = "http://cran.rstudio.com")
if (!require(reedtemplates)) {
  library(devtools)
  devtools::install_github("ismayc/reedtemplates")
}
library(reedtemplates)
library(dplyr)
library(ggplot2)
library(lme4)
```
Complex statistical models can accurately model intricate processes, but they also run the risk of overfitting to the data. To avoid this, we build our models up from simple to complex, comparing the models with cross-validation to make sure the complexities introduced add real value.
In this chapter we focus on building models that incorporate information about
rider, weather conditions, time of day, and ride length. In brief, our models
start with a logistic regression model considering only ride-level variables,
and formulate more complex models by adding various terms.
`r ref("models", type = "table")` briefly describes each model.
| Model              | Description                                             |
|--------------------|---------------------------------------------------------|
| Model 1 (Baseline) | Logistic regression                                     |
| Model 2            | Add rider intercepts                                    |
| Model 3            | Add trigonometric terms for time of day                 |
| Model 4            | Additive model with cyclic cubic spline for time of day |
| Model 5            | Additive model with spline for ride length              |
| Model 6            | Model 5 without rider intercepts                        |

Table: Brief descriptions of Models 1--6 \label{tab:models}
Model 1, which we will use as the baseline for comparing further models, is a logistic regression model:
$$ \mathbb{P} (Y_i=1) = \text{logit}^{-1} (\alpha + X_i \beta),$$
where $\alpha \in \mathbb{R}$ and $\beta \in \mathbb{R}^p$ are parameters to be
estimated. ($X$ is the matrix of ride-level predictors specified at the end of
`r ref("data-notation", type = "header")`.)
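As a sketch, Model 1 could be fit with base R's `glm`. The data frame and column names below (`rides`, `rated_bad`, `log_length`, and so on) are assumed for illustration, not taken from the actual analysis code:

```r
# Hypothetical names; the real data set's columns may differ
model1 <- glm(
  rated_bad ~ log_length + mean_temp + mean_wind + gust_speed +
    rainfall + rainfall_4hr,
  data = rides, family = binomial  # logistic link by default
)
```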
```r
label(path = "figure/rider_avgs.pdf",
      caption = "Distribution of Average Rider Rating",
      alt.cap = "The overall rates at which each rider gives a negative rating for a ride varies greatly. This is our primary motivation for including rider intercepts and predictors.",
      label = "rider-avgs", type = "figure", scale = 1, options = "tbh")
```
Riders appear to differ in how often they rate rides negatively,
as we note in `r ref("rider-avgs")`. In fact, many riders give zero or nearly zero
negative ratings. For Model 2, we account for this variability by
adding intercepts that vary by rider:
\begin{equation} y_i \sim \text{Bernoulli} \left( \text{logit}^{-1} \left( \alpha + \alpha_{j[i]} + X_i \beta \right) \right), \end{equation}
for $i = 1, \ldots, n$.
Rider intercepts themselves aren't as interesting as how they deviate from the mean, so we keep a fixed intercept $\alpha$ and constrain the rider intercepts, $\alpha_j$, by specifying
$$\alpha_j \sim N(0, \sigma^2_\alpha).$$
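In lme4 (loaded in the setup chunk), rider-varying intercepts correspond to a `(1 | rider)` term. A sketch under the same assumed names as before:

```r
# Hypothetical names; `(1 | rider)` adds a normally distributed
# random intercept for each rider
model2 <- glmer(
  rated_bad ~ log_length + mean_temp + mean_wind + gust_speed +
    rainfall + rainfall_4hr + (1 | rider),
  data = rides, family = binomial
)
```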
Starting with Model 3, we address time of day, $t \in [0, 24)$, as a predictor. (We measure time of day in hours since midnight.) We use time of day to account for the daily trends that may affect ratings; in particular, it serves as a simple proxy for the overall traffic level, which is difficult to model on its own. These patterns are cyclic and highly non-linear, so we can't model time as a linear term. One approach is to add sinusoidal terms with a period of one day. We would be interested in fitting a term,
$$\beta \sin (T x^{\text{time}} + \phi).$$
Estimating $\beta$ wouldn't be hard: we can easily estimate coefficients of transformed terms; it's more difficult to estimate $T$ and $\phi$. But we know that we want to restrict our terms to fitting trends that happen over the course of one day, so we can set $T = 2 \pi / d$, where $d$ is 24 hours or some fraction of that.
As for $\phi$, a trigonometric transformation reframes the estimation of a phase shift parameter into the estimation of two coefficients for trigonometric functions with no phase shift:
\begin{align}
\beta \sin (T x + \phi) &= \beta \left( \sin (T x) \cos (\phi) + \cos (T x) \sin (\phi) \right)\\
&= \beta \cos (\phi) \sin (T x) + \beta \sin (\phi) \cos (T x)\\
&= \beta_1 \sin (T x) + \beta_2 \cos (T x),
\end{align}
where $\beta_1 = \beta \cos (\phi)$ and $\beta_2 = \beta \sin (\phi).$ At this point, we are just estimating the coefficients of a couple of transformed variables, which can easily be done in any package that does generalized linear regression.
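We can sanity-check this identity numerically in R (the values of $\beta$, $\phi$, and the grid of times here are arbitrary):

```r
T_ <- 2 * pi / 24   # one-day period, in hours
beta <- 1.3         # arbitrary amplitude
phi <- 0.7          # arbitrary phase shift
t <- seq(0, 24, by = 0.5)
lhs <- beta * sin(T_ * t + phi)
rhs <- beta * cos(phi) * sin(T_ * t) + beta * sin(phi) * cos(T_ * t)
all.equal(lhs, rhs)  # TRUE
```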
We also want to account for the possibility that weekday hourly patterns differ from weekend patterns. We use a variable $X^\text{weekend}$ that serves as a weekend indicator. For Model 3, we add two sets of sinusoidal terms: one set for weekdays and one for weekends. More explicitly, we define the model,
\begin{equation}
\begin{split}
\mathbb{P} (Y_i=1) = \text{logit}^{-1} \big( &\alpha + \alpha_{j[i]} + X_i \beta \\
&+ X^\text{weekend} \cdot [\beta^{t1} \sin(T \cdot t) + \beta^{t2} \cos (T \cdot t) \\
&\quad + \beta^{t3} \sin(2T \cdot t) + \beta^{t4} \cos (2T \cdot t)] \\
&+ (1 - X^\text{weekend}) \cdot [\beta^{t5} \sin(T \cdot t) + \beta^{t6} \cos (T \cdot t) \\
&\quad + \beta^{t7} \sin(2T \cdot t) + \beta^{t8} \cos (2T \cdot t)] \big).
\end{split}
\end{equation}
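Constructing these predictors amounts to a few transformed columns before fitting. This is a sketch with assumed names (`rides`, `time`, `weekend`), using a one-day sinusoid and its half-day harmonic:

```r
T_ <- 2 * pi / 24  # one-day period; 2 * T_ gives the half-day harmonic
rides <- rides %>%
  mutate(sin1 = sin(T_ * time),     cos1 = cos(T_ * time),
         sin2 = sin(2 * T_ * time), cos2 = cos(2 * T_ * time),
         weekday = 1 - weekend)     # weekend assumed to be a 0/1 indicator
model3 <- glmer(
  rated_bad ~ log_length + mean_temp + mean_wind + gust_speed +
    rainfall + rainfall_4hr + (1 | rider) +
    weekend:(sin1 + cos1 + sin2 + cos2) +  # weekend sinusoids
    weekday:(sin1 + cos1 + sin2 + cos2),   # weekday sinusoids
  data = rides, family = binomial
)
```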
For Model 4 we abandon parametric methods and use a cyclic non-parametric smoother to model time of day, making our model,
\begin{equation} y_i \sim \text{Bernoulli} \left( \text{logit}^{-1} \left( \alpha + \alpha_{j[i]} + X_i \beta + X^\text{weekend} \cdot f^\text{time.w} (t_i) + (1 - X^\text{weekend}) \cdot f^\text{time} (t_i) \right) \right), \end{equation} for $i = 1, \ldots, n$, where $\alpha_j$ is specified like Model 2 and $f^\text{time.w}$ and $f^\text{time}$ are cyclic cubic spline terms for weekend and weekday rides, respectively, with knots at 0, 3, 6, 9, 12, 15, 18, 21, and 24 (0, again) hours.
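The mgcv package (not loaded in the setup chunk) provides cyclic cubic splines via `bs = "cc"` and random-effect smooths via `bs = "re"`. A sketch of Model 4 under the same assumed names:

```r
library(mgcv)
# rider must be a factor for the random-effect smooth
model4 <- gam(
  rated_bad ~ log_length + mean_temp + mean_wind + gust_speed +
    rainfall + rainfall_4hr +
    s(rider, bs = "re") +               # rider intercepts
    s(time, bs = "cc", by = weekend) +  # weekend time-of-day spline
    s(time, bs = "cc", by = weekday),   # weekday time-of-day spline
  knots = list(time = seq(0, 24, by = 3)),  # knots at 0, 3, ..., 24 hours
  data = rides, family = binomial
)
```

Passing a 0/1 variable to `by` multiplies the smooth by that indicator, mirroring the $X^\text{weekend} \cdot f^\text{time.w}(t_i)$ terms.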
Model 5 extends Model 4 by adding a cubic spline for log ride length:
\begin{equation}
y_i \sim \text{Bernoulli} \left( \text{logit}^{-1} \left(
\begin{split}
&\alpha + \alpha_{j[i]} + X_i \beta + f^\text{length} (x^\text{log.length}_i) + \\
&X^\text{weekend} \cdot f^\text{time.w} (t_i) + (1 - X^\text{weekend}) \cdot f^\text{time} (t_i)
\end{split}
\right) \right),
\end{equation}
for $i = 1, \ldots, n$, where $f^\text{length}$ is a cubic spline smoother.
Finally, Model 6 is identical to Model 5, but without the rider intercepts:
\begin{equation} y_i \sim \text{Bernoulli} \left( \text{logit}^{-1} \left( \alpha + X_i \beta + f^\text{time} (t_i) \right) \right), \end{equation} for $i = 1, \ldots, n$, where $f^\text{time}$ is a cyclic cubic spline term, with the same knots as in Model 4.
\begin{table}[htb]
\centering
\caption{Fit summaries for Models 1--6.\label{tab:modelfits}}
\begin{tabular}{lm{4in}rrr}
\toprule
\textbf{Model} & \textbf{Separation Plot} & \textbf{$\log (\mathcal{L})$} & \textbf{AIC} & \textbf{AUC}$_{\text{CV}}$\footnotemark \\
\midrule
Model 1 & \includegraphics{figure/model1-sep.pdf} & -4,786 & 9,586 & 0.552 \\
Model 2 & \includegraphics{figure/model2-sep.pdf} & -3,971 & 7,957 & 0.797 \\
Model 3 & \includegraphics{figure/model3-sep.pdf} & -3,923 & 7,877 & 0.802 \\
Model 4 & \includegraphics{figure/model4-sep.pdf} & -3,930 & 7,870 & 0.802 \\
Model 5 & \includegraphics{figure/model5-sep.pdf} & -3,928 & 7,878 & 0.803 \\
Model 6 & \includegraphics{figure/model6-sep.pdf} & -4,713 & 9,455 & 0.601 \\
\bottomrule
\end{tabular}
\end{table}
\footnotetext{Area under ROC curve estimated with 10-fold cross-validation.}
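The cross-validated AUC column can be reproduced in spirit with a generic k-fold loop. This sketch uses the pROC package and assumed names (`rides`, `rated_bad`) rather than the actual analysis code:

```r
library(pROC)

# Out-of-fold predicted probabilities, scored with AUC
cv_auc <- function(df, formula, k = 10) {
  folds <- sample(rep(1:k, length.out = nrow(df)))
  preds <- numeric(nrow(df))
  for (fold in 1:k) {
    fit <- glm(formula, data = df[folds != fold, ], family = binomial)
    preds[folds == fold] <- predict(fit, newdata = df[folds == fold, ],
                                    type = "response")
  }
  as.numeric(auc(df$rated_bad, preds))
}
```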
To fit these models, we gathered all of the rides in Portland, OR, from December 3, 2014 to February 8, 2016 for riders that had over 20 rated rides. (We only look at riders with a minimum number of rides to make sure we can get estimates for rider-level parameters, like the rider random intercepts, that aren't too uncertain.) There were 25,397 rides, 14,032 of which were rated. Overall, 10.88 percent of these rides were given a negative rating. There were 138 riders in the data set.
The separation plots in `r ref("modelfits", type = "table")` give a clear initial
picture of how these model fits compare. Model 1 performs very poorly compared to
those that include rider intercepts, assigning the same probability to most
observations. Models that include the rider intercepts perform similarly to each other.
The log likelihoods and AIC,^[The Akaike Information Criterion (AIC) penalizes the
log likelihood $\log (\mathcal{L})$ by the number of parameters estimated, $k$,
with lower values being better. It is defined as $\text{AIC} = 2k - 2 \log(\mathcal{L})$.]
shown in `r ref("modelfits", type = "table")`, corroborate this. Adding time
dependency doesn't seem to improve predictive ability.
We will see later, however, that it gives a fascinating result to interpret.
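The footnote's definition matches R's built-in `AIC()`, as a quick self-contained check on a toy model (using the built-in `mtcars` data, not the ride data) confirms:

```r
toy <- glm(am ~ mpg, data = mtcars, family = binomial)
k <- attr(logLik(toy), "df")  # number of estimated parameters
all.equal(2 * k - 2 * as.numeric(logLik(toy)), AIC(toy))  # TRUE
```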
The gains from the rider intercepts are substantial, but we are compelled to ask: how
much of that gain could have been achieved with randomly chosen groups? In other
words, if riders were randomly assigned to rides, would the flexibility
in the model created by allowing intercepts to vary increase predictive performance
to the same degree?
To test this, we refit Model 4 after randomly assigning a rider to each ride by
randomly permuting the rider column. This quick test dispelled that skepticism,
as the resulting separation plots in `r ref("sep-plots-intercept-test")` show.
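The permutation itself is a one-liner; `rides` and `rider` are assumed names here:

```r
# Break the link between riders and their rides, keeping everything else fixed
rides_permuted <- rides %>% mutate(rider = sample(rider))
# ...then refit the model on rides_permuted and compare separation plots
```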
```r
label(path = "figure/intercept_test_plot.pdf",
      caption = "Separation plots for Model 2 compared to a similar model where riders are randomly assigned to rides",
      alt.cap = "Separation plots for Model 2 compared to a similar model where riders are randomly assigned to rides. This test demonstrates that the improvement in predictive performance provided by the rider intercepts was not a coincidence.",
      label = "sep-plots-intercept-test", type = "figure", scale = 1, options = "tbh")
```
\begin{table}[htb]
\centering
\def\arraystretch{1.3}
\caption[Regression coefficients for Models 1, 2, 4, and 6.]{Regression coefficients for Model 1, Model 2, Model 4, and Model 6. 95\% confidence intervals are given in parentheses.\label{tab:modelcoef}}
\begin{tabular}{lllll}
\toprule
Regression Term & Model 1 & Model 2 & Model 4 & Model 6 \\
\midrule
Log(length) & -0.122 & -0.100 & -0.092 & -0.114 \\
 & \footnotesize (-0.180, -0.063) & \footnotesize (-0.168, -0.032) & \footnotesize (-0.162, -0.022) & \footnotesize (-0.174, -0.054) \\
Mean Temp. & 0.053 & 0.076 & 0.075 & 0.069 \\
 & \footnotesize (-0.0004, 0.110) & \footnotesize (0.005, 0.147) & \footnotesize (0.003, 0.147) & \footnotesize (0.012, 0.127) \\
Mean Wind speed & 0.028 & 0.014 & 0.012 & 0.027 \\
 & \footnotesize (0.004, 0.052) & \footnotesize (-0.013, 0.041) & \footnotesize (-0.014, 0.039) & \footnotesize (0.002, 0.051) \\
Gust speed & -0.003 & 0.001 & 0.001 & -0.003 \\
 & \footnotesize (-0.015, 0.008) & \footnotesize (-0.012, 0.013) & \footnotesize (-0.012, 0.013) & \footnotesize (-0.014, 0.009) \\
Rainfall & 0.008 & 0.012 & 0.008 & 0.005 \\
 & \footnotesize (-0.015, 0.031) & \footnotesize (-0.015, 0.038) & \footnotesize (-0.019, 0.035) & \footnotesize (-0.019, 0.027) \\
Rainfall 4-Hour & 0.013 & 0.016 & 0.017 & 0.014 \\
 & \footnotesize (0.005, 0.021) & \footnotesize (0.007, 0.025) & \footnotesize (0.008, 0.027) & \footnotesize (0.006, 0.022) \\
Intercept & -2.2868 & -3.075 & -3.127 & -2.313 \\
 & \footnotesize (-2.428, -2.108) & \footnotesize (-3.386, -2.764) & \footnotesize (-3.436, -2.818) & \footnotesize (-2.475, -2.151) \\
\bottomrule
\end{tabular}
\end{table}
`r ref("modelcoef", type = "table")` presents the fixed-effect estimates for
our models. Length has a robust, strong negative effect. This makes sense if we
think of length as the only information we have about route in these models: it
seems routes that are longer tend to be less likely to be rated negatively.
Perhaps longer rides tend to be for leisure or sport rather than commuting,
and so are less likely to go along routes with high traffic and other dangers.
Temperature also seems to have a small effect. It could be the case that the
temperature itself is important, or perhaps season is a confounding factor. It
could be the case that the type of rides taken during the warmer months are
more likely to be rated negatively. Wind and gust speed don't seem to have
robust effects. These models suggest that four-hour cumulative rainfall is
more important than rainfall during the hour of the ride. This supports our
suspicion that weather effects more relevant to safety than to comfort---
like wet roads and puddles rather than rain at the time of a ride---are important
in determining a cyclist's rating of their ride. These coefficients, however,
aren't nearly as enlightening as the time of day fits.
\begin{figure}[htb] \centering \includegraphics{figure/time_fit_plot.pdf} \includegraphics{figure/time_fit_plot_6.pdf} \caption[Predicted probabilities of a negative rating by time for a typical ride]{ Predicted probabilities of a negative rating by time for a typical ride. The rider was chosen so the intercept was closest to the mean intercept for model 5. The median length and average mean temperature were used, and all other predictors were set to zero. The dotted lines show $\pm 2$ standard errors from the predictions.\label{fig:model-time-fit}} \end{figure}
The marginal fits for time of day, shown in `r ref("model-time-fit")`,
are predictable. On weekdays, the probability of a negative rating peaks in the
afternoon from 4--6 p.m., around when we expect rush hour traffic, and on weekends it stays
steady throughout the day. While Model 4 and Model 5 give similar fits for time
of day, Model 3's predictions peak at different times on weekdays and exhibit
much more variability on weekends. There are two probable reasons for these
differences: first, the sinusoidal terms are less flexible than the splines;
second, the splines, because they are non-parametric functions, penalize
complexity of the fit while the parametric sinusoidal form
does not, making the splines more conservative in their ``curviness.''
The former explains the discrepancies in the weekday fits while the latter
explains the discrepancies in the weekend fits. Given these differences,
fitting time of day with splines is preferable; there is no motivation to
constrain the fit to a strict parametric form.
But these marginal time-of-day fits don't just tell a story about our time terms;
they also reveal part of why the random rider intercepts are such powerful predictors.
Notice that, comparing Model 6 to the other models in `r ref("model-time-fit")`,
the fitted probabilities for Model 6 vary with time on a much larger scale than
the other models' predictions do. Without allowing for varying rider intercepts, the time terms
take on a significant role. Yet, interestingly, the time term has nowhere near
the amount of information that the rider intercepts seem to encode,
according to the separation plots in `r ref("modelfits", type = "table")`.
```r
label(path = "figure/time_pred_plot.pdf",
      caption = "Predicted probabilities of negative rating by time of day for Models 3--6",
      alt.cap = "Predicted probabilities of negative rating by time of day for Models 3--6. Notice how starting with Model 2, daily trends start to emerge. This indicates that the rider intercepts are picking up on time of day trends, which must be reflected in a rider's typical ride.",
      label = "time-pred-plot", type = "figure", scale = 1, options = "tbh")
```
`r ref("time-pred-plot")` paints a clearer picture of what is going on.
These models show two different ways to look at the time of day pattern in ride
rating: Model 2 suggests that who is riding determines these patterns, while Model 6
suggests there is something inherent in that time of day, such as traffic,
that determines them. The models in between fit a combination of the two,
but as we saw in `r ref("model-time-fit")`, the time dependence is more than an
order of magnitude weaker after accounting for the rider intercepts. The two
black pillars of rides in the predictions of Models 2--5 line up with when we
expect commuters to be riding, suggesting that the riders with high rider
intercepts are commuters. The converse, however, is not true: a great number
of rides during commuting times are predicted to have almost zero probability of
receiving a negative rating.
What explains this relationship between the temporal patterns and the rider
intercepts? We suspect ride route is a confounding factor here.
`r ref("time-pred-plot")` confirms our suspicion that riders have particular
schedules they stick to; given that these are mostly commuting cyclists, it is
also likely that most of their rides follow the same route as well. These
models ignore ride route, so we suspect the typical ride route is encoded in the
rider intercepts; i.e., a rider whose commuting route goes through many of the
most stressful intersections and streets in Portland will likely have a higher
intercept than most riders. This hypothesis can only be tested, however, when
future research develops models with both random rider intercepts and a model for
how routes affect ratings.
```r
# Simulation: compare a naive random-intercept fit to a "smart" fit that also
# includes the group mean of x as a predictor, to see how well each recovers
# a group's true intercept.
logistic <- function(x) {
  1 / (1 + exp(-x))
}

n_obs <- 1e4
n_groups <- 5
groups_intercepts <- rnorm(n_groups, 0, 1)
group_mu_x <- rnorm(n_groups, 0, 1.5)

corr_simulation <- data.frame(group = sample(1:n_groups, n_obs, replace = TRUE)) %>%
  mutate(x = rnorm(n_obs, group_mu_x[group], 2),
         p = logistic(-0.3 + 1.4 * x + groups_intercepts[group]),
         y = rbinom(n_obs, 1, p))

mean_lengths <- corr_simulation %>%
  group_by(group) %>%
  summarise(mean_x = mean(x))
mean_lengths$mean_x
group_mu_x  # These are accurate enough estimates

qplot(x = corr_simulation$p)

naive_model <- glm(y ~ x, data = corr_simulation, family = binomial)
ml_model <- glmer(y ~ 1 + x + (1 | group), data = corr_simulation, family = binomial)
summary(naive_model)
summary(ml_model)
coef(ml_model)$group$`(Intercept)`
groups_intercepts - 0.3

generate_data <- function(n_obs = 1000) {
  data.frame(group = sample(1:n_groups, n_obs, replace = TRUE)) %>%
    mutate(x = rnorm(n_obs, group_mu_x[group], 2),
           p = logistic(-0.3 + 1.4 * x + groups_intercepts[group]),
           y = rbinom(n_obs, 1, p))
}

fit_naive <- function(df) {
  model <- glmer(y ~ 1 + x + (1 | group), data = df, family = binomial)
  coef(model)$group[["(Intercept)"]][1]
}

fit_smart <- function(df) {
  # Add the group means of x as a predictor
  mean_lengths <- df %>%
    group_by(group) %>%
    summarise(mean_x = mean(x)) %>%
    arrange(group)  # ascending, so row 1 is group 1, matching coef()'s ordering
  new_df <- left_join(df, mean_lengths, by = "group")
  model <- glmer(y ~ 1 + mean_x + x + (1 | group), data = new_df, family = binomial)
  intercept <- coef(model)$group[["(Intercept)"]][1]
  slope <- coef(model)$group[["mean_x"]][1]
  # Recover group 1's total intercept from the two components
  intercept + mean_lengths$mean_x[1] * slope
}

n_bootstrap <- 50
model_test <- data.frame(iter = 1:n_bootstrap,
                         intercept_naive = numeric(n_bootstrap),
                         intercept_smart = numeric(n_bootstrap))
for (i in 1:n_bootstrap) {
  data <- generate_data()
  model_test$intercept_naive[i] <- fit_naive(data)
  model_test$intercept_smart[i] <- fit_smart(data)
}

ggplot(model_test) +
  geom_histogram(fill = "red", alpha = 0.5, aes(x = intercept_naive)) +
  geom_histogram(fill = "blue", alpha = 0.5, aes(x = intercept_smart)) +
  geom_vline(xintercept = groups_intercepts[1] - 0.3)  # group 1's true intercept
```