# This chunk ensures that the reedtemplates package is installed and loaded.
# The reedtemplates package includes the template files for the thesis and
# two functions used for labeling and referencing.
if (!require(devtools))
  install.packages("devtools", repos = "http://cran.rstudio.com")

if (!require(reedtemplates)) {
  devtools::install_github("ismayc/reedtemplates")
}
library(reedtemplates)

Characterizing Riders

Given the results from the last chapter, there is a clear need to understand what kinds of riders are in this data set. To do that, we identify variables that differentiate riders and then use them to identify clusters of riders. We then assess how well these variables perform as rider-level predictors of the rider intercepts.

We also compare models with random intercepts by rider to models with random intercepts by cluster. Ride Report, because they need to respect the privacy of their users, cannot identify individual riders in the data they provide to clients, yet our model results show that differentiating riders is crucial to getting good estimates from our models. If we can identify clusters of riders in the data set that give us nearly the same information as grouping by individuals did, Ride Report could provide these clusters without the same risk to user privacy that identifying individual riders would carry.

Extracting Features and Determining Clusters

We characterized riders based on their rides because Ride Report does not collect data about their users beyond their rides and email address. We limited our exploration to riders with more than 20 rated rides, in order to focus on those who had been using the app for some time and had an identifiable pattern of rides.

Cyclists' ride patterns are complex, particularly in time. Computing a rider's mean ride length for weekends is useful, for example, but the mean time of day of their rides does not capture anything meaningful. So we took care to select features that distinguished the different rider patterns we saw when exploring the data.

\begin{figure}[htb] \centering \includegraphics{figure/time-designations.pdf} \caption[Time intervals used for morning, lunchtime, and evening in rider feature extraction]{Time intervals used for morning, lunchtime, and evening in rider feature extraction. These intervals define the time designations we used in clustering. The proportion of each rider's rides in each of these time intervals made up three of our features.\label{fig:time-splits}} \end{figure}

We define the following for each rider $j$: frequency of rides^[We define the frequency of a cyclist's rides as the number of rides divided by the difference between the time of the most recent ride and the time of the first ride. (Units are arbitrary, because we standardized all of our rider-level variables.)] ($u^\text{freq}_j$), proportion of rides on weekends ($u^\text{weekend}_j$), median length of rides on weekdays ($u^\text{med.len}_j$) and weekends ($u^\text{med.len.w}_j$), variance of ride length on weekdays ($u^\text{var.len}_j$) and weekends ($u^\text{var.len.w}_j$), and proportion of weekday rides during the morning rush ($u^\text{morning}_j$), lunch rush ($u^\text{lunch}_j$), and evening rush ($u^\text{evening}_j$). The time intervals defining the morning, lunch, and evening rush are shown in r ref("time-splits").
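
As a rough illustration, the feature extraction can be sketched in a few lines of dplyr. The data frame and column names below (rides, rider_id, start_time, duration) are placeholders rather than the actual Ride Report schema, and the rush-hour cut-points are stand-ins for the intervals shown in r ref("time-splits").

# A sketch only: column names and rush-hour cut-points are placeholders,
# not the actual Ride Report data or the intervals used in the thesis.
library(dplyr)
library(lubridate)

rider_features <- rides %>%
  mutate(weekend = wday(start_time) %in% c(1, 7),   # Sunday or Saturday
         hour = hour(start_time)) %>%
  group_by(rider_id) %>%
  summarise(
    n_rides   = n(),
    freq      = n() / as.numeric(difftime(max(start_time), min(start_time),
                                          units = "days")),
    med.len   = median(duration[!weekend]),
    med.len.w = median(duration[weekend]),
    var.len   = var(duration[!weekend]),
    var.len.w = var(duration[weekend]),
    morning   = mean(hour[!weekend] >= 6  & hour[!weekend] < 10),
    lunch     = mean(hour[!weekend] >= 11 & hour[!weekend] < 14),
    evening   = mean(hour[!weekend] >= 16 & hour[!weekend] < 19),
    weekend   = mean(weekend)
  ) %>%
  filter(n_rides > 20)   # the thesis filters on rated rides specifically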

label(path = "figure/choice_k.pdf", 
      caption = "Comparing total within sum of squares for different $k$ for $k$-means clustering of riders",
alt.cap = "Comparing total within sum of squares for different $k$ for $k$-means clustering of riders.", 
      label = "choosing_k", type = "figure", scale=0.8,
options = "tbh")

Selecting variables for a cluster analysis is difficult because many of the choices about what to use are arbitrary. We have chosen these variables, but two other important choices remain: what scale and what transformation should each variable have? We chose to transform all variables to be approximately Gaussian---eliminating the right skew present in most of these features with log and square root transformations---and to standardize them by subtracting their mean and dividing by their standard deviation. When clustering, the scaling of the variables determines how much weight each has in determining the clusters, so standardizing puts the features on a common scale. Future research may find more appropriate ways to select and scale features for clustering, but here we stick to a simple, naive approach to see what we can learn.
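
A minimal sketch of this step, assuming the hypothetical rider_features data frame from the earlier sketch; which feature gets a log versus a square-root transform below is illustrative, not the exact choice made here.

# Illustrative only: the assignment of log versus square-root transforms to
# particular features is an assumption, not the exact choices used in the thesis.
transformed <- within(rider_features, {
  freq      <- log(freq)        # right-skewed rate
  med.len   <- log(med.len)     # right-skewed lengths and variances
  med.len.w <- log(med.len.w)
  var.len   <- log(var.len)
  var.len.w <- log(var.len.w)
  morning   <- sqrt(morning)    # mildly right-skewed proportions
  lunch     <- sqrt(lunch)
  evening   <- sqrt(evening)
})

feature_cols <- c("freq", "weekend", "med.len", "med.len.w", "var.len",
                  "var.len.w", "morning", "lunch", "evening")
# scale() subtracts each column's mean and divides by its standard deviation
rider_scaled <- scale(as.matrix(transformed[, feature_cols]))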

\begin{figure}[htb] \centering \includegraphics[width=.5\textwidth]{figure/cluster_scatter1.pdf}\hfill \includegraphics[width=.5\textwidth]{figure/cluster_scatter3.pdf} \caption[Rider clusters identified by $k$-means clustering]{ Rider clusters identified by $k$-means clustering. The triangles represent the centroids (computed as the mean) of the cluster members. \label{fig:cluster-scatter}} \end{figure}

With the rider-level predictors in hand, we clustered the riders using $k$-means clustering, which groups a set of points into the $k$ clusters that minimize the within-cluster sum of squared distances to the cluster centroids.^[$k$-means clustering is fit using a heuristic algorithm, which randomly assigns points to clusters and then repeatedly recomputes the cluster centroids and reassigns each point to the cluster with the nearest centroid, until the clusters stop changing. This algorithm is fast, but the result is sensitive to the initial random assignment, so it is run many times. See page 460 of @esl for more details on $k$-means clustering.] To choose the number of clusters, we assessed the total within-cluster sum of squares for different values of $k$, shown in r ref("choosing_k"), and selected $k = 4$ as the point after which we judged there was little value in adding more clusters.
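
A sketch of this step on the (hypothetical) standardized feature matrix rider_scaled from above; the simulated-data chunks later in this chapter show what the same elbow diagnostic looks like with and without real cluster structure.

# Elbow diagnostic and final fit on the standardized rider features.
# nstart reruns the algorithm from many random starts to avoid poor local optima.
withinss <- sapply(1:8, function(k)
  kmeans(rider_scaled, centers = k, nstart = 100)$tot.withinss)
plot(1:8, withinss, type = "b", xlab = "k",
     ylab = "Total within-cluster sum of squares")

rider_kmeans <- kmeans(rider_scaled, centers = 4, nstart = 100)
table(rider_kmeans$cluster)   # number of riders in each cluster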

label(path = "figure/cluster_patterns.pdf", 
caption = "Patterns of ride length and ride time of day for each cluster.", 
      label = "cluster-patterns", type = "figure", scale=1,
options = "bht")

Looking at r ref("cluster-scatter"), the clusters split the data into the four quadrants of the first two principal components of the rider data. These clusters appear to be less a set of distinct groups than a partition of the space. Regardless, they still provide some useful information about riders. Cluster 2 seems to pick out more casual riders, with fewer rides per week and more weekend rides than the other clusters, as visible in the right panel of r ref("cluster-scatter"). This is seen more clearly in r ref("cluster-patterns"), where there are no strong weekday commuting patterns for cluster 2, but there are for clusters 1 and 3. Clusters 1 and 3 seem to be the most consistent commuters, differentiated by the typical length of their weekend rides. Clusters 2 and 4 show much more variance in the timing of their weekday rides, with cluster 4 having more consistently long weekend rides.
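
The scatterplots project the riders onto their first two principal components. A sketch of that projection, assuming the rider_scaled matrix and rider_kmeans fit from the earlier sketches (not necessarily how the thesis figures were produced):

library(ggplot2)

# Principal components of the standardized features; colors show the k-means
# cluster assignments from the sketch above.
pca <- prcomp(rider_scaled)
pc_scores <- data.frame(pca$x[, 1:2],
                        cluster = factor(rider_kmeans$cluster))
ggplot(pc_scores, aes(x = PC1, y = PC2, color = cluster)) +
  geom_point(alpha = 0.6) +
  theme_bw()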

# Simulated data with no real cluster structure (a single Gaussian cloud):
# the total within-cluster sum of squares declines smoothly, with no clear elbow.
library(ggplot2)

test_data <- data.frame(x1 = rnorm(1000, 0, 4), x2 = rnorm(1000, 0, 4))
test_k <- function(k) kmeans(test_data, k, nstart = 100)$tot.withinss
ss <- sapply(1:8, test_k)

choice_k <- qplot(x = 1:8, y = ss) + geom_line() +
  labs(title = "Total Within SS by Choice of k",
       x = "k",
       y = "Total Within SS") +
  theme_bw(base_family = "CMU Serif")

choice_k

# Simulated data containing two separated clusters: the within-cluster sum of
# squares drops steeply up to k = 2 and flattens afterward, producing an elbow.
test_data2 <- data.frame(x1 = c(rnorm(800, 1, 1), rnorm(200, -1, 1)),
                         x2 = c(rnorm(800, 2.7, 1), rnorm(200, -2, 0.8)))
qplot(data = test_data2, x = x1, y = x2)   # visualize the simulated clusters
test_k <- function(k) kmeans(test_data2, k, nstart = 100)$tot.withinss
ss <- sapply(1:8, test_k)

choice_k <- qplot(x = 1:8, y = ss) + geom_line() +
  labs(title = "Total Within SS by Choice of k",
       x = "k",
       y = "Total Within SS") +
  theme_bw(base_family = "CMU Serif")

choice_k

Models with Rider-Level Predictors {#stan-model}

Having extracted several rider-level predictors, we set out to see how well they predict the rider intercepts. Let $U_j$ be the vector of rider-level variables for rider $j$. Then our model is

\begin{equation} Y_i \sim \text{Bernoulli} \left( \text{logit}^{-1} (\alpha_{j[i]} + X_i \beta) \right), \end{equation} where \begin{equation} \alpha_j \sim N(\gamma_0 + U_j \gamma, \sigma_\alpha), \end{equation}

with $(\gamma_0, \gamma)$ being the group-level parameters we estimate.

This model should be comparable to Model 2, though because we want estimates of the rider-level predictor parameters, we use \textit{Stan} instead of lme4. We would prefer a model similar to Model 4 from the previous chapter (the one with smoothing splines for time of day), but the current additive mixed model package, gamm4 (which uses lme4 to fit the mixed-model part), does not support estimating the variability in group-level estimates. Unfortunately, in \textit{Stan}, smoothing splines would have to be coded by hand, and we lacked the expertise to write those functions ourselves.
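
To see the structure of this model in code: because a group-level regression for $\alpha_j$ is equivalent to including the rider-level variables as predictors alongside a rider random intercept, an equivalent fit could be sketched with rstanarm (which runs Stan under the hood) rather than a hand-coded Stan program. This is a hedged sketch, not the model code used here, and all variable names below are placeholders.

# A sketch only: rstanarm's stan_glmer fits the same multilevel logistic
# regression. All column names (stressful, the ride-level predictors, the
# rider-level features, rider_id) are placeholders, not the actual data.
library(rstanarm)

fit <- stan_glmer(
  stressful ~ ride_length + weekend_ride +              # ride-level predictors X_i
    freq + weekend + morning + lunch + evening +        # rider-level predictors U_j
    med.len + med.len.w + var.len + var.len.w +
    (1 | rider_id),                                     # rider intercepts alpha_j
  family = binomial(link = "logit"),
  data = model_data,
  chains = 4, iter = 2000
)

# Posterior medians and 95% intervals for the coefficients
print(fit, digits = 2)
posterior_interval(fit, prob = 0.95)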

\begin{table}[htb] \caption{Estimates of rider-level predictors. \label{tab:rider-level-estimates}} \centering \begin{tabular}{lrrr} \toprule \textbf{Parameter} & \textbf{Estimate} & \textbf{2.5\% percentile} & \textbf{97.5\% percentile} \\ \midrule $\gamma^\text{freq}$ & 0.08 & -0.19 & 0.35 \\ $\gamma^\text{weekend}$ & -0.13 & -0.50 & 0.35 \\ $\gamma^\text{morning}$ & 0.06 & -0.22 & 0.34 \\ $\gamma^\text{afternoon}$ & 0.13 & -0.20 & 0.44 \\ $\gamma^\text{evening}$ & -0.02 & -0.31 & 0.27 \\ $\gamma^\text{med.len}$ & 0.01 & -0.25 & 0.29 \\ $\gamma^\text{med.len.w}$ & 0.08 & -0.19 & 0.36 \\ $\gamma^\text{var.len}$ & 0.07 & -0.15 & 0.31 \\ $\gamma^\text{var.len.w}$ & -0.15 & -0.47 & 0.17 \\ $\gamma_0$ & -2.99 & -3.29 & -2.69 \\ $\sigma_\alpha$ & 1.47 & 1.27 & 1.69 \\ \bottomrule \end{tabular} \end{table}

The rider-level predictor coefficients from the fitted model, however, are unimpressive. The variance in the rider intercepts not captured by the predictors, quantified by $\sigma_\alpha$, is high, and none of the rider-level predictors has a 95\% interval that excludes zero. (In fact, many are centered near zero.) These features may differentiate riders, but they do not tell us much about how riders rate their rides.

Cluster Intercepts Versus Rider Intercepts

Do these clusters provide information similar to what we gained from introducing rider intercepts? Because a model with only four random intercepts---as in the case of cluster intercepts---rather than several hundred---as in the case of rider intercepts---is much less flexible, we expect the cluster intercept model to perform much worse. Still, one might suspect there is a significant benefit over a fixed intercept model.

There isn't. We fit Model 7, which is identical to Model 4 from the previous chapter but has random intercepts by cluster rather than by rider. Model 7 performed slightly better than Model 6---which had only a fixed intercept---but nowhere near as well as Model 4. The separation plots, $\log (\mathcal{L})$, AIC, and AUC measures, shown in r ref("cluster-model-fits", type = "table"), all demonstrate this clearly. Given that the rider-level predictors did not seem to be predictive of the rider intercepts, this is not a surprise.
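
For reference, the sketch below shows one way such fit measures can be computed in R for a fitted model object; fit and the outcome column are placeholders, and this is not necessarily the exact code used to build the table.

# Placeholder names: `fit` is a fitted (g)lmer-style model and
# model_data$stressful the observed binary outcomes.
library(pROC)

logLik(fit)                                  # log-likelihood
AIC(fit)                                     # Akaike information criterion
pred <- predict(fit, type = "response")      # fitted probabilities
auc(roc(model_data$stressful, pred))         # AUC on the training data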

\begin{table}[htb] \centering \caption{Model fit summaries for the fixed intercept, rider random intercept, and cluster random intercept models.\label{tab:cluster-model-fits}} \begin{tabular}{lm{3in}rrr} \toprule \textbf{Model} & \textbf{Separation Plot} & \textbf{$\log (\mathcal{L})$} & \textbf{AIC} & \textbf{AUC}\footnotemark \\ \midrule Model 4 & \includegraphics[width=3.1in]{figure/rider-model1-sep.pdf} & -4,266 & 8,549 & 0.805 \\ Model 6 & \includegraphics[width=3.1in]{figure/rider-model0-sep.pdf} & -5,089 & 10,205 & 0.597 \\ Model 7 & \includegraphics[width=3.1in]{figure/rider-model2-sep.pdf} & -4,973 & 9,965 & 0.646 \\ \bottomrule \end{tabular} \end{table} \footnotetext{Area under the ROC curve for the training data.}


