knitr::opts_chunk$set(echo = FALSE)

The Data

Ride Report

Ride collection:

Three main issues in the data that concern us here:

  1. Subjective ratings
  2. Many ratings missing
  3. Many rides misclassified

Other issues:

Why study this data?

Ride Report's Goals

Our Approach

Weather Data Sources

We combined the ride data with

A note about reproducibility

This entire analysis is available as an R package in a GitHub repository.

Ride Rating Models

Notation

We have $n$ observations of rides.

\begin{equation}
y_i = \begin{cases} 1, & \text{if ride } i \text{ was given a negative rating;}\\ 0, & \text{otherwise.} \end{cases}
\end{equation}

Define predictors,

for $i = 1, \ldots, n$. All but the last we represent together as matrix $X$.

Six Models for Ride Rating

How well do they fit?

\begin{table}[htb]
\centering
\caption{Fit summaries for Models 1--6.\label{tab:modelfits}}
\begin{tabular}{lrrr}
\toprule
\textbf{Model} & \textbf{$\log (\mathcal{L})$} & \textbf{AIC} & \textbf{AUC}$_{\text{CV}}$\footnotemark\\
\midrule
\rowcolor{red} Model 1 & -4,786 & 9,586 & 0.552\\
Model 2 & -3,971 & 7,957 & 0.797\\
Model 3 & -3,923 & 7,877 & 0.802\\
Model 4 & -3,930 & 7,870 & 0.802\\
Model 5 & -3,928 & 7,878 & 0.803\\
\rowcolor{red} Model 6 & -4,713 & 9,455 & 0.601\\
\bottomrule
\end{tabular}
\end{table}
\footnotetext{Area under ROC curve estimated with 10-fold cross-validation.}
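As a rough illustration of how a 10-fold cross-validated AUC can be computed, the sketch below uses simulated data and a plain logistic regression. The data, the single predictor `x`, the helper `auc()`, and the choice to pool out-of-fold predictions before computing one AUC are all assumptions made for this example; they are not the models or the exact procedure behind the table.

```r
set.seed(3)

# Simulated stand-in data (hypothetical; not the Ride Report rides).
n <- 1000
d <- data.frame(x = rnorm(n))
d$y <- rbinom(n, 1, plogis(-1 + 1.2 * d$x))

# AUC from the Mann-Whitney rank statistic; tied scores get average ranks.
auc <- function(score, label) {
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(rank(score)[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Assign each observation to one of 10 folds at random.
fold <- sample(rep(1:10, length.out = n))

# Fit on nine folds, predict the held-out fold, and collect the predictions.
pred <- numeric(n)
for (k in 1:10) {
  fit <- glm(y ~ x, family = binomial, data = d[fold != k, ])
  pred[fold == k] <- predict(fit, newdata = d[fold == k, ], type = "response")
}

# One AUC computed from the pooled out-of-fold predictions.
auc_cv <- auc(pred, d$y)
auc_cv
```

Averaging the ten per-fold AUC values is an equally common convention and may give a slightly different number.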

Time of Day Trends

What do the intercepts encode?

Classifying Riders

What features can we use?

For riders $j = 1, \ldots, l$, we have rider-level predictors

Variables were standardized and clustered using $k$-means clustering
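A minimal sketch of this step in base R, assuming a hypothetical data frame `riders` with made-up rider-level columns. The column names, the simulated values, and the choice of $k = 3$ are illustrative only and are not taken from the analysis.

```r
set.seed(1)

# Hypothetical rider-level predictors (simulated for illustration).
riders <- data.frame(
  n_rides      = rpois(60, lambda = 40),
  mean_length  = rlnorm(60, meanlog = 1, sdlog = 0.5),
  prop_weekend = runif(60)
)

# Standardize each variable to mean 0 and standard deviation 1 so that
# no single variable dominates the Euclidean distances used by k-means.
riders_std <- scale(riders)

# k-means with an assumed k = 3 and several random starts.
fit <- kmeans(riders_std, centers = 3, nstart = 25)

# Cluster sizes and cluster centers on the standardized scale.
table(fit$cluster)
fit$centers
```

Using several random starts (`nstart`) makes the result less sensitive to the initial centers, which matters because k-means only finds a local optimum.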

What patterns do these riders exhibit?

What good are these as predictors for rider intercepts?

Rider clusters and rider-level predictors

Missing Data

Missing Ratings

Of $n = 25{,}397$ rides, $11{,}365$ were not rated.

Types of Missing Data

Let,

\begin{equation}
r_i = \begin{cases} 1, & \text{if ride } i \text{ is missing a rating;}\\ 0, & \text{otherwise.} \end{cases}
\end{equation}

Rubin classifies missing data into three situations^[@little1987 (page 14)]:

  1. Missing Completely at Random (MCAR), where $r$ is independent of $y$ and the predictors $X$, i.e. $\mathbb{P} (r = 1 | y, X) = \mathbb{P}(r = 1)$
  2. Missing at Random (MAR), where $r$ is independent of $y$ given $X$, but may depend on $X$, i.e. $\mathbb{P} (r = 1 | y, X) = \mathbb{P} (r = 1 | X )$
  3. Nonignorable, i.e. neither MCAR nor MAR, where $r$ depends on $y$ even after conditioning on $X$.

We believe the missing ratings are nonignorable.
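To see why the distinction matters, the following simulation sketch generates ratings under each of the three mechanisms and compares the share of negative ratings among rated rides with the true share. All coefficients, the sample size, and the single predictor `x` are invented for this illustration.

```r
set.seed(42)
n <- 20000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + x))   # y = 1 is a negative rating

# r = 1 means the rating is missing.
r_mcar  <- rbinom(n, 1, 0.4)                      # ignores both y and x
r_mar   <- rbinom(n, 1, plogis(-0.5 + x))         # depends on x only
r_nonig <- rbinom(n, 1, plogis(-0.5 + 1.5 * y))   # depends on y itself

# Share of negative ratings among the rides that were actually rated.
observed_rate <- function(r) mean(y[r == 0])

rates <- c(
  truth        = mean(y),
  mcar         = observed_rate(r_mcar),
  mar          = observed_rate(r_mar),
  nonignorable = observed_rate(r_nonig)
)
round(rates, 3)
```

In this setup the MCAR rate should track the truth closely, while the nonignorable rate is pulled well below it because negatively rated rides are more likely to go unrated. Under MAR the raw rate also shifts, but a model of $y$ given $x$ fitted to the rated rides remains valid, which is what makes that case ignorable.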

The EM Algorithm for Missing Data

The EM Algorithm: Setup

EM algorithm general procedure^[@little1987]:

  1. E-step: Compute expected loglikelihood, \begin{equation} Q(\alpha, \beta | \alpha^{(t)}, \beta^{(t)}) = \int l(\alpha, \beta | y) \cdot f(y_\text{mis} | \: y_\text{obs}, \alpha^{(t)}, \beta^{(t)}) \; dy_\text{mis} \end{equation}
  2. M-step: Maximize $Q(\alpha, \beta | \alpha^{(t)}, \beta^{(t)})$ to get $(\alpha^{(t + 1)}, \beta^{(t + 1)})$
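Because each missing rating is binary, the integral in the E-step reduces to a two-term sum for every unrated ride. The following is a sketch of that reduction, assuming rides are independent so that the complete-data loglikelihood is a sum of per-ride terms $l_i$ (notation introduced here for illustration), and writing $w_{iy}^{(t)}$ for the conditional probability that ride $i$ has rating $y$ given $x_i$, $r_i = 1$, and the current estimates:

\begin{equation}
Q(\alpha, \beta | \alpha^{(t)}, \beta^{(t)}) = \sum_{i:\, r_i = 0} l_i(\alpha, \beta | y_i) + \sum_{i:\, r_i = 1} \; \sum_{y \in \{0, 1\}} w_{iy}^{(t)} \, l_i(\alpha, \beta | y),
\end{equation}

where $l_i(\alpha, \beta | y) = \log f(y \;|\; x_i, \beta) + \log f(r_i \;|\; x_i, y, \alpha)$. Since $\beta$ and $\alpha$ appear in separate terms, the M-step splits into two weighted maximum likelihood fits.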

EM Algorithm: Weighting procedure

  1. Get initial estimates of $\alpha$ and $\beta$.
  2. Compute weights \begin{equation} w_{i\: y_i}^{(t)} = \frac{f(y_i \;|\; x_i, \beta^{(t)}) f(r_i \;|\; x_i, y_i, \alpha^{(t)})}{ \sum_{y_i \in \{0,1\}} f(y_i \;|\; x_i, \beta^{(t)}) f(r_i \;|\; x_i, y_i, \alpha^{(t)}) }. \end{equation}
  3. Create augmented data:

  4. Fit data model and missing data model separately using augmented data

  5. Repeat 2--4 until loglikelihood converges
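The five steps can be sketched in base R as follows. This is an illustration under stated assumptions rather than the code used in the analysis: the data are simulated, there is a single predictor `x`, both the data model and the nonresponse model are taken to be ordinary logistic regressions, and the starting values for $\alpha$ are an arbitrary choice.

```r
set.seed(2)

# Simulated data (hypothetical; all coefficients are invented).
n <- 2000
x <- rnorm(n)
y_full <- rbinom(n, 1, plogis(-1 + 0.8 * x))         # data model
r <- rbinom(n, 1, plogis(-0.5 + y_full + 0.3 * x))   # nonresponse model

obs   <- data.frame(y = y_full[r == 0], x = x[r == 0], r = 0, w = 1)
x_mis <- x[r == 1]   # ratings for these rides are treated as unknown

# Step 1: initial estimates. beta comes from the rated rides only;
# alpha = (intercept, y, x) starts with no dependence on y or x.
beta  <- unname(coef(glm(y ~ x, family = binomial, data = obs)))
alpha <- c(qlogis(mean(r)), 0, 0)
loglik <- numeric(0)

for (iter in 1:100) {
  # Step 2: weights for each unrated ride and each candidate y in {0, 1}.
  p_y1 <- plogis(beta[1] + beta[2] * x_mis)
  num1 <- p_y1 * plogis(alpha[1] + alpha[2] + alpha[3] * x_mis)
  num0 <- (1 - p_y1) * plogis(alpha[1] + alpha[3] * x_mis)
  w1 <- num1 / (num1 + num0)

  # Observed-data loglikelihood at the current estimates (used in step 5).
  p_obs_y <- plogis(beta[1] + beta[2] * obs$x)
  p_obs_r <- plogis(alpha[1] + alpha[2] * obs$y + alpha[3] * obs$x)
  ll <- sum(dbinom(obs$y, 1, p_obs_y, log = TRUE)) +
    sum(log(1 - p_obs_r)) + sum(log(num1 + num0))
  loglik <- c(loglik, ll)
  if (iter > 1 && abs(ll - loglik[iter - 1]) < 1e-6) break

  # Step 3: augmented data, two weighted rows per unrated ride.
  aug <- rbind(
    obs,
    data.frame(y = 1, x = x_mis, r = 1, w = w1),
    data.frame(y = 0, x = x_mis, r = 1, w = 1 - w1)
  )

  # Step 4: weighted fits of the two models. quasibinomial returns the same
  # coefficients as binomial without warnings about non-integer weights.
  beta  <- unname(coef(glm(y ~ x, family = quasibinomial, data = aug, weights = w)))
  alpha <- unname(coef(glm(r ~ y + x, family = quasibinomial, data = aug, weights = w)))
}

round(c(beta = beta, alpha = alpha), 3)
```

The loop is capped at 100 iterations because the coefficient on $y$ in the nonresponse model is identified only through the assumed parametric form, so the loglikelihood can be quite flat and convergence slow; estimates from a run that hits the cap should be read with that in mind.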

EM Algorithm: Augmented Data

\begin{figure}[htb]
\centering
\caption{Creation of augmented data set for the weighted method of the EM algorithm for missing response data. \label{fig:augmented-data}}
\begin{tabular}{lcl}
\toprule
\textbf{Original Data} & & \textbf{Augmented Data}\\
\midrule
\begin{tabular}{lll}
$y_i$ & $x_i$ & $r_i$\\
\midrule
1 & 2.4 & 0\\
0 & 1.3 & 0\\
NA & -0.4 & 1\\
 & &
\end{tabular}
& $\to$ &
\begin{tabular}{llll}
$y_i$ & $x_i$ & $r_i$ & $w_i$\\
\midrule
1 & 2.4 & 0 & 1\\
0 & 1.3 & 0 & 1\\
1 & -0.4 & 1 & 0.2\\
0 & -0.4 & 1 & 0.8
\end{tabular}\\
\bottomrule
\end{tabular}
\end{figure}
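The toy example in the figure can be reproduced with a few lines of base R. The weight of 0.2 on $y = 1$ is simply the illustrative value shown in the figure; in the algorithm it would come from the weight formula at the current parameter estimates. The object names are chosen for this sketch.

```r
original <- data.frame(
  y = c(1, 0, NA),
  x = c(2.4, 1.3, -0.4),
  r = c(0, 0, 1)
)

# Weight on y = 1 for the unrated ride (illustrative value from the figure).
w1 <- 0.2

rated   <- original[original$r == 0, ]
unrated <- original[original$r == 1, ]

# Rated rides keep weight 1; each unrated ride appears once with y = 1
# and once with y = 0, and its two weights sum to 1.
augmented <- rbind(
  cbind(rated, w = 1),
  cbind(transform(unrated, y = 1), w = w1),
  cbind(transform(unrated, y = 0), w = 1 - w1)
)
rownames(augmented) <- NULL
augmented
```

Because the weights for each original ride sum to 1, the total weight in the augmented data equals the original number of rides.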

Missing Data Model Results: Data Model

\begin{table}[htb]
\centering
\begin{tabular}{lrr}
\toprule
\textbf{Parameter} & \textbf{Model 4} & \textbf{EM Model}\\
\midrule
Log(Length) & -0.147 & 0.205\\
 & \footnotesize (-0.290, -0.005) & \footnotesize (0.106, 0.304)\\
Mean Temperature & 0.142 & 0.100\\
 & \footnotesize (0.004, 0.281) & \footnotesize (0.005, 0.196)\\
Mean Wind Speed & 0.002 & -0.026\\
 & \footnotesize (-0.054, 0.057) & \footnotesize (-0.069, 0.016)\\
Max Gust Speed & -0.005 & 0.020\\
 & \footnotesize (-0.031, 0.021) & \footnotesize (0.001, 0.039)\\
Rainfall & 0.050 & 0.051\\
 & \footnotesize (-0.017, 0.117) & \footnotesize (0.009, 0.093)\\
Rainfall 4-Hour & 0.022 & 0.017\\
 & \footnotesize (0.003, 0.041) & \footnotesize (0.003, 0.030)\\
Intercept & -2.792 & -3.144\\
 & \footnotesize (-3.334, -2.250) & \footnotesize (-3.604, -2.684)\\
\bottomrule
\end{tabular}
\end{table}

Missing Data Model Results: Nonresponse Model

\begin{table}[htb]
\centering
\begin{tabular}{lrr}
\toprule
\textbf{Parameter} & \textbf{Basic Model} & \textbf{EM Model}\\
\midrule
$y$ & 0.730 & 1.035\\
 & \footnotesize (0.235, 1.224) & \footnotesize (0.493, 1.577)\\
Log(Length) & -0.297 & -0.327\\
 & \footnotesize (-0.362, -0.232) & \footnotesize (-0.393, -0.262)\\
Mean Temperature & 0.200 & 0.139\\
 & \footnotesize (0.139, 0.262) & \footnotesize (0.077, -0.262)\\
Mean Wind Speed & 0.032 & 0.031\\
 & \footnotesize (0.003, 0.060) & \footnotesize (0.001, 0.061)\\
Max Gust Speed & -0.003 & -0.007\\
 & \footnotesize (-0.016, 0.010) & \footnotesize (-0.021, 0.006)\\
Rainfall & 0.007 & -0.024\\
 & \footnotesize (-0.028, 0.041) & \footnotesize (-0.057, 0.009)\\
Rainfall 4-Hour & -0.002 & 0.010\\
 & \footnotesize (-0.012, 0.009) & \footnotesize (-0.001, 0.021)\\
Intercept & -0.927 & -0.967\\
 & \footnotesize (-1.124, -0.729) & \footnotesize (-1.163, -0.771)\\
\bottomrule
\end{tabular}
\end{table}

Should we trust these results?

Conclusions

What we've learned

What remains to be researched

References {.allowframebreaks}


nocite: | @stan, @lme4, @gamm4, @Rlang, @wunderground, @pdxrain, @ridereport ...


