# This chunk ensures that the reedtemplates package is installed and loaded # This reedtemplates package includes the template files for the thesis and also # two functions used for labeling and referencing if(!require(devtools)) install.packages("devtools", repos = "http://cran.rstudio.com") if(!require(reedtemplates)){ library(devtools) devtools::install_github("ismayc/reedtemplates") } library(reedtemplates)
Given the results from last chapter, there is a clear need to understand what kinds of riders are in this data set. To do that, we identify predictors that differentiate riders and then use these variables to identify clusters of riders. We assess how these predictors do as rider-level predictors for the rider intercepts.
We also compare random intercepts with riders to random intercept models done by cluster. Ride Report, because they need to respect the privacy of their users, cannot identify individual riders in the data they provide to clients, yet our model results show that differentiating riders is crucial to getting good estimates in our models. If we can identify clusters of riders in the data set that give us nearly the same information as grouping by individuals did, these could be provided by Ride Report without the same level of risks to user privacy as identifying individual riders.
We characterized riders based on their rides because Ride Report does not collect data about their users besides their rides and email address. We limited our exploration to riders that had over 20 rated rides, in order to focus on riders who had been using the app for some time and had an identifiable pattern of rides.
Cyclists' patterns in their rides are complex, particularly their time patterns. Computing their mean ride length for weekends was useful, but mean time of day for their rides does not capture anything meaningful. So we took care in selecting features that distinguished different rider patterns we saw when exploring the data.
\begin{figure}[htb] \centering \includegraphics{figure/time-designations.pdf} \caption[Time intervals used for morning, lunchtime, and evening in rider feature extraction]{Time intervals used for morning, lunchtime, and evening in rider feature extraction. These intervals define the time designations we used in clustering. The proportion of each riders rides in each of these time intervals made up three of our features.\label{fig:time-splits}} \end{figure}
We define the following for each rider $j$:
frequency of rides^[We define frequency of a cyclist's rides as the number of
rides divided by the difference between the time of the most recent ride and
time of the first ride. (Units are arbitrary, because we standardized all of our
rider-level variables.)] ($u^\text{freq}_j$), proportion of rides on weekdays
($u^\text{weekend}_j$), median length of rides on weekdays ($u^\text{med.len}_j$)
and weekends ($u^\text{med.len.w}_j$), variance of ride length on weekdays
($u^\text{var.len}_j$) and weekends ($u^\text{var.len.w}_j$), and proportion of
weekday rides during morning rush ($u^\text{morning}_j$), lunch rush
($u^\text{lunch}_j$), and evening rush ($u^\text{evening}_j$). The time intervals
that describe the morning, lunch, and evening rush are shown in r ref("time-splits")
.
label(path = "figure/choice_k.pdf", caption = "Comparing total within sum of squares for different $k$ for $k$-means clustering of riders", alt.cap = "Comparing total within sum of squares for different $k$ for $k$-means clustering of riders.", label = "choosing_k", type = "figure", scale=0.8, options = "tbh")
Selecting variables for a cluster analysis is difficult, for the reason that many choices about what to use are arbitrary. We have chosen these variables, but two important other choices remain: what scales and transformations should these variables have? We chose here to transform all variables to be approximately gaussian---eliminating the right skew that was present in most of these features with log and square root transformations---and standardizing them by subtracting their mean and dividing by their standard deviation. When clustering, the scaling of variables determines how much weight each of them has in determining the clusters. Future research may find more appropriate ways to select features for clustering, but in our approach here we stick to a naive and simple approach to see what we can learn.
\begin{figure}[htb] \centering \includegraphics[width=.5\textwidth]{figure/cluster_scatter1.pdf}\hfill \includegraphics[width=.5\textwidth]{figure/cluster_scatter3.pdf} \caption[Rider clusters identified by $k$-means clustering]{ Rider clusters identified by $k$-means clustering. The triangles represent the centroids (computed as the mean) of the cluster members. \label{fig:cluster-scatter}} \end{figure}
With the rider-level predictors in hand, we clustered the riders using $k$-means
clustering, which groups a set of points into the $k$ clusters that minimize the
within-cluster sum of squared distance to the cluster centroids.^[$k$-means clustering
is fit using a hueristic algorithm, which assigns points random to clusters, and
then repeatedly recomputes the cluster centroids and reassigns the points to the
cluster with the nearest centroid, until the clusters stop changing. This algorithm
is fast, but the result is sensitive to the initial random assignment, so it is
run many times. See page 460 of @esl for more details of $k$-means clustering.]
To choose the number of clusters,
we assessed the total of the sum of squares within each cluster for different
values of $k$, shown in r ref("choosing_k")
, and selected $k = 4$ as the point
where we thought after which there was little value in having more clusters.
label(path = "figure/cluster_patterns.pdf", caption = "Patterns of ride length and ride time of day for each cluster.", label = "cluster-patterns", type = "figure", scale=1, options = "bht")
Looking at r ref("cluster-scatter")
, the clusters split the data into the
four quadrants of the first two principal components of the rider data. These
clusters appear to be less distinct groups than a partition of the space.
Regardless, they still provide some useful information about riders. Cluster
2 seemed to pick out more casual riders, with fewer rides per week and more
weekend rides than the other clusters, as shown in the visible in the right
panel of r ref("cluster-scatter")
. This is seen more clearly in
r ref("cluster-patterns")
, where there aren't strong weekday commuting patterns
for cluster 2, but there are for clusters 1 and 3.
Clusters 1 and 3 seem to be the groups that are the most consistent
commuters, but are differentiated by the typical length of their weekend rides.
Clusters 2 and 4 show much more variance in the timing of their weekday rides, with
cluster 4 having more consistently long weekend rides.
test_data <- data.frame(x1 = rnorm(1000, 0, 4), x2 = rnorm(1000, 0, 4)) test_k <- function(k) kmeans(test_data, k, nstart = 100)$tot.withinss ss <- sapply(1:8, test_k) choice_k <- qplot(x = 1:8, y = ss) + geom_line() + labs(title = "Total Within SS by Choice of k", x = "k", y = "Total Within SS") + theme_bw(base_family="CMU Serif") choice_k test_data2 <- data.frame(x1 = c(rnorm(800, 1, 1), rnorm(200, -1, 1)), x2 = c(rnorm(800, 2.7, 1), rnorm(200, -2, 0.8))) qplot(data = test_data2, x = x1, y = x2) test_k <- function(k) kmeans(test_data2, k, nstart = 100)$tot.withinss ss <- sapply(1:8, test_k) choice_k <- qplot(x = 1:8, y = ss) + geom_line() + labs(title = "Total Within SS by Choice of k", x = "k", y = "Total Within SS") + theme_bw(base_family="CMU Serif") choice_k
Having several rider-level predictors, we set out to see how well they predict rider intercepts. Now let $U_j$ be the vector of rider-level variables. Then our model will be
\begin{equation} Y_i \sim \text{Bernoulli} \left( \text{logit}^{-1} (\alpha_{j[i]} + X_i \beta) \right), \end{equation} where, \begin{equation} \alpha_j \sim N(\gamma_0 + U_j \gamma, \sigma_\alpha), \end{equation}
with $(\gamma_0, \gamma)$ being the group-level parameters we estimate.
This model should be comparable to Model 2, though because we are trying to
get estimates of the rider-level predictor parameters, we make use of \textit{Stan}
instead of lme4
. Though we would prefer to use
a model similar to Model 4 from the previous chapter (the one with smoothing
splines for time of day) the current additive mixed models package gamm4
(which
uses lme4
to fit the mixed models part) does not support estimating the
variability in group-level estimates. Unfortunately, in \textit{Stan}, smoothing
splines would have to be coded by hand and we lacked the expertise to write
the functions to fit smoothing splines ourselves.
\begin{table}[htb] \caption{Estimates of rider level predictors. \label{tab:rider-level-estimates}} \centering \begin{tabular}{lrrrr} \toprule \textbf{Parameter} & \textbf{Estimate} & \textbf{2.5\% percentile} & \textbf{97.5\% percentile}\ \midrule $\gamma^\text{freq}$ & 0.08 & -0.19 & 0.35\ $\gamma^\text{weekend}$ & -0.13 & -0.50 & 0.35\ $\gamma^\text{morning}$ & 0.06 & -0.22 & 0.34\ $\gamma^\text{afternoon}$ & 0.13 & -0.20 & 0.44\ $\gamma^\text{evening}$ & -0.02 & -0.31 & 0.27\ $\gamma^\text{med.len}$ & 0.01 & -0.25 & 0.29\ $\gamma^\text{med.len.w}$ & 0.08 & -0.19 & 0.36\ $\gamma^\text{var.len}$ & 0.07 & -0.15 & 0.31\ $\gamma^\text{var.len.w}$ & -0.15 & -0.47 & 0.17\ $\gamma_0$ & -2.99 & -3.29 & -2.69\ $\sigma_\alpha$ & 1.47 & 1.27 & 1.69\ \bottomrule \end{tabular} \end{table}
The ride-level predictor coefficients from the fitted model, however, are unimpressive. The variance in the rider intercepts not captured by the predictors, quantified with $\sigma_\alpha$, is high, and not one of the rider-level predictors have a 95\% confidence interval that does not contain zero. (In fact, many are centered near zero.) These features may differentiate riders, but they don't give much information about how they rate their rides.
Do these clusters provide similarly useful information that we got from introducing rider intercepts? Because a model with only 4 random intercepts---as in the case of cluster intercepts---rather than several hundred---as in the case with rider intercepts---is much less flexible, we expect that the cluster intercept model will preform much worse. One still might suspect there is still a significant benefit over a fixed intercept model.
There isn't. We computed Model 7, which is identical to Model 4 from the
previous chapter, but has random intercepts by cluster rather than rider. Model 7
performed slightly better than Model 6---which only had a fixed intercept---but
nowhere near as well as Model 4. The separation plots, $\log (\mathcal{L})$,
AIC, and AUC measures, shown in r ref("cluster-model-fits", type = "table")
, all demonstrate
this clearly. Given that the rider-level predictors did not seem to be predictive
of the rider intercepts, this is not a surprise.
\begin{table}[htb] \centering \caption{Model fit summaries for fixed intercept, rider random intercept, and cluster random intercepts.\label{tab:cluster-model-fits}} \begin{tabular}{lm{3in}rrr} \toprule \textbf{Model} & \textbf{Separation Plot} & \textbf{$\log (\mathcal{L})$} & \textbf{AIC} & \textbf{AUC}\footnotemark\ \midrule Model 4 & \includegraphics[width=3.1in]{figure/rider-model1-sep.pdf} & -4,266 & 8,549 & 0.805\ Model 6 & \includegraphics[width=3.1in]{figure/rider-model0-sep.pdf} & -5,089 & 10,205 & 0.597\ Model 7 & \includegraphics[width=3.1in]{figure/rider-model2-sep.pdf} & -4,973 & 9,965 & 0.646\ \bottomrule \end{tabular} \end{table} \footnotetext{Area under ROC curve for training data.}
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.