```{r}
# This chunk ensures that the reedtemplates package is installed and loaded
# This reedtemplates package includes the template files for the thesis and also
# two functions used for labeling and referencing
if (!require(devtools))
  install.packages("devtools", repos = "http://cran.rstudio.com")

if (!require(reedtemplates)) {
  library(devtools)
  devtools::install_github("ismayc/reedtemplates")
}

library(reedtemplates)
library(arm)
library(ggplot2)
library(dplyr)
library(gridExtra)
library(binom)
library(extrafont)
```

# Methods

We use many statistical methodologies in this paper. We outline here the central methods used, both to familiarize the reader and to establish the notation we use throughout. The models we present combine logistic regression, multilevel^[Multilevel models are often referred to as hierarchical models or mixed effects models.] models, additive models, and smoothing splines. We also make use of two recently developed graphical model evaluation tools: the separation plot and the heat map plot.

The one large methodology not covered here is the expectation-maximization algorithm we use to model the missing data mechanism for missing ride ratings. The theory of that algorithm is outlined in `r ref("missing-data", type = "header")` and an example of its implementation can be found in `r ref("em-implementation", type = "header")`.

## Logistic Regression

Statistical models are often split into regression models---models with a quantitative response---and classification models---models with a categorical response. Thus, it may seem odd that we are using a regression model when our response variable, ride rating, is a binary outcome.

But, when modeling a binary variable $Y$, we consider it a Bernoulli random variable, $$Y \sim \text{Bernoulli}(p),$$ where $p$ is the probability the response is 1 and $1-p$ is the probability the response is 0. So our response variable $Y$ may be binary, but the primary quantity of interest behind that outcome is $p$, a quantitative variable. This is why we consider logistic regression a regression model.

Logistic regression is one form of generalized linear model. Recall that in linear regression we use data with response variable $y_i$ and $j$ predictors $x_{i1}, \ldots, x_{ij}$ to find the best-fitting linear function

$$y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_j x_{ij} + \epsilon_i,$$ where $\epsilon_i \sim N(0, \sigma^2)$, by estimating $\beta_0, \ldots, \beta_j$. We can equivalently write,

$$y_i \sim N(\beta_0 + \beta_1 x_{i1} + \ldots + \beta_j x_{ij}, \sigma^2).$$

We could, in fact, try to predict $p$ with a linear regression, though such a model will always have the problem of predicting probabilities outside the range $[0, 1]$. That's not a recipe for simple interpretation or reliable predictions. A generalized linear model instead uses a ``link function,'' $g$, to transform the mean of the response so that the linear predictor can range over the whole real line; i.e., we model

$$g(\mathbb{E}[y_i]) = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_j x_{ij}.$$

In logistic regression, the "link" function is the logit function, $\text{logit}:[0,1] \to \mathbb{R}$,

$$\text{logit} (p) = \log \left(\frac{p}{1-p}\right). $$

This function is also known as the log-odds, because the odds of an event with probability $p$ are defined as $p/(1-p)$. As a quick numerical check in \textit{R} (base R's `qlogis` and `plogis` implement the logit and inverse logit, respectively):
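```{r, eval=FALSE}
# A probability of 0.75 corresponds to odds of 0.75 / 0.25 = 3,
# so its log-odds is log(3).
qlogis(0.75)    # logit(0.75) = log(3), about 1.0986
plogis(log(3))  # the inverse logit recovers the probability, 0.75
```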

So in logistic regression, we model the probability of $y_i = 1$ as,

$$\mathbb{P} (y_i = 1) = \text{logit}^{-1} (\beta_0 + \beta_1 x_{i1} + \ldots + \beta_j x_{ij}).$$

Notice that the inverse logit function^[sometimes called the logistic function] maps values from $\mathbb{R}$ to $[0,1]$. Thus, the function provides a convenient way to map linear combinations of other variables onto values that are valid probabilities. Other such functions exist and are also used for regressions with binary responses, such as the probit function. Logistic regression, however, is easier to interpret---because of the odds connection---and more efficient to compute.

```{r}
points <- data.frame(x = (1:1000 - 500) / 30)
points$y <- invlogit(points$x)

logit.plot <- ggplot(points, aes(x = x, y = y)) +
  geom_line() +
  labs(title = "Inverse Logit (Logistic) Function") +
  theme_bw(base_family = "CMU Serif")

# Pass the plot object explicitly rather than relying on last_plot()
ggsave('figure/logit_plot.pdf', logit.plot, width = 4, height = 2)

label(path = "figure/logit_plot.pdf",
      caption = "The inverse logit function",
      alt.cap = "The inverse logit function gives a convenient way to map linear
combinations of real numbers to valid probability values.",
      label = "logit-plot", type = "figure", scale = 1,
      options = "tbh")
```

Though coefficients from a logistic regression can't be interpreted the same way as in a linear model, they do have a convenient interpretation. Because the linear part of the model represents log odds, the coefficients are log odds ratios; that is, the exponentiated coefficients $e^{\beta_1}, \ldots, e^{\beta_j}$ represent the multiplicative effect a one-unit increase in the corresponding predictor has on the odds. For example, if we fit a simple logistic regression with $\mathbb{P}(Y = 1) = \text{logit}^{-1} (\alpha + \beta X)$, we could interpret $e^{\beta} = 1.1$ as meaning that a one-unit increase in $X$ gives a 10% increase in the odds that $Y = 1$. As a small illustration of this interpretation, here is a sketch on simulated data (the variable names are ours, not from the ride data):
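```{r, eval=FALSE}
set.seed(42)
n <- 1000
x <- rnorm(n)
# Simulate a binary response whose log-odds increase by 0.8 per unit of x
y <- rbinom(n, 1, plogis(-0.5 + 0.8 * x))

fit <- glm(y ~ x, family = binomial)
exp(coef(fit))  # multiplicative change in the odds per one-unit increase in x
```

The exponentiated slope should come out near $e^{0.8} \approx 2.2$, i.e. roughly a doubling of the odds per one-unit increase in $x$.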

## Hierarchical Models and Mixed Effects Models {#h-models}

Data often contain hierarchies. For example, a set of students' test scores may contain the hierarchy of districts and schools those students attend. Or a set of soil samples may have been taken at several distinct sites, thus having a hierarchy of sites and samples. In the bike ride data we examine, there is the hierarchy of riders and rides.

We will talk about different "levels" of variables corresponding to places in this hierarchy. By "ride-level variables" we mean variables that are specific to a single ride, whereas "rider-level variables" are those specific to the rider, and thus shared by all the rides that rider takes. For example, we consider length a ride-level variable and the total number of rides taken a rider-level variable.

We will also discuss road segment-level variables, which are variables that are specific to the road segments in the route of a ride (e.g. length, presence of bike lanes, etc.). But there isn't a clear road segment-ride hierarchy: each ride contains multiple road segments and each road segment is contained by multiple rides. Thus, this isn't a case where multilevel modeling is applicable. (The ideas behind it, though, may be fruitfully adapted, as we discuss in `r ref("routes", type = "header")`.)

Gelman describes two traditional ways of dealing with hierarchical data that multilevel models contrast with: "complete pooling" and "no pooling."^[@gelman, p. 7] In "complete pooling," we ignore the group-level variables and give identical parameter estimates for every group. In "no pooling," we run entirely separate regressions for each group. Multilevel models are a compromise between these extremes ("partial pooling," as Gelman calls it), where all the data are considered in a single regression with some parameters shared between groups and some allowed to differ between groups.

These multilevel models work for other forms of regression, but we will focus on logistic regression, as it is the method we use in this paper. We will be using notation adapted from Gelman and Hill's description of multilevel models.^[@gelman, p. 251--252] Consider a data set composed of $n$ observations, indexed by $i$, each belonging to one of $J$ groups. Let $j[i]$ denote the group containing observation $i$, let $X_i$ be a vector of observation-level predictors for observation $i$, and let $U_{j[i]}$ be a vector of group-level predictors for that group.

We could fit a model where the intercept varies by group:

\begin{equation} y_i \sim \text{Bernoulli} \left( \text{logit}^{-1} \left( \alpha_{j[i]} + X_i \beta \right) \right), \end{equation}
\begin{equation} \alpha_{j[i]} \sim N(\gamma_0 + U_{j[i]} \gamma, \sigma_{\alpha}^2), \end{equation}
where $\alpha_{j[i]}$ is the intercept for the group containing observation $i$, $\beta$ is a vector of coefficients for the observation-level predictors, $\gamma_0$ is the group-level intercept, and $\gamma$ is a vector of coefficients for the group-level predictors. We could also specify a similar model with no group-level predictors, so that we simply have a different intercept for each group,

\begin{equation} \alpha_{j[i]} \sim N(\gamma_0, \sigma_{\alpha}^2). \end{equation}
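In the formula syntax of the lme4 package (which we use to fit these models; see below), these varying-intercept models might be sketched as follows; the data frame `d`, with observation-level predictor `x`, group-level predictor `u`, and grouping factor `group`, is simulated here purely for illustration:

```{r, eval=FALSE}
library(lme4)

# Simulate grouped binary data (hypothetical names, illustration only)
set.seed(11)
J <- 30                      # number of groups
n.per <- 50                  # observations per group
u.j <- rnorm(J)              # one group-level predictor
alpha.j <- rnorm(J, mean = -1 + 0.5 * u.j, sd = 0.5)  # varying intercepts
d <- data.frame(group = rep(1:J, each = n.per),
                x = rnorm(J * n.per),
                u = rep(u.j, each = n.per))
d$y <- rbinom(nrow(d), 1, plogis(alpha.j[d$group] + 0.8 * d$x))

# Varying intercepts with a group-level predictor u
m1 <- glmer(y ~ x + u + (1 | group), family = binomial, data = d)
# Varying intercepts with no group-level predictors
m2 <- glmer(y ~ x + (1 | group), family = binomial, data = d)
```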

We can also consider a model that has slopes varying by group. For simplicity, let's consider just one observation-level predictor, $x_i$, that will have varying slopes $\beta_{j[i]}$ as well as one group-level predictor, $u_j$. We could specify the model as,

\begin{equation} y_i \sim \text{Bernoulli} \left( \text{logit}^{-1} (\alpha_{j[i]} + \beta_{j[i]} x_i) \right), \end{equation}

\begin{equation} \left( \begin{array}{c} \alpha_{j} \\ \beta_{j} \end{array} \right) \sim N \left( \left( \begin{array}{c} \gamma_0^\alpha + \gamma_1^\alpha u_j \\ \gamma_0^\beta + \gamma_1^\beta u_j \end{array} \right), \left( \begin{array}{cc} \sigma^2_\alpha & \rho \sigma_\alpha \sigma_\beta \\ \rho \sigma_\alpha \sigma_\beta & \sigma^2_\beta \end{array} \right) \right). \end{equation}

These models can be fit with maximum likelihood estimation using the lme4 package in \textit{R} [@lme4] or with Bayesian MCMC using \textit{Stan} [@stan]. The latter has the advantage of making it easy to estimate group-level uncertainty, at the expense of more computation. We fit models using lme4, but make use of \textit{Stan} when we have ride-level parameters we want to estimate, in `r ref("stan-model", type = "header")`. A varying-slope fit in lme4, continuing the simulated data frame `d` from the sketch above, might look like the following; the `x:u` interaction implied by `x * u` plays the role of $\gamma_1^\beta u_j$ in the group-level equation for the slopes:
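```{r, eval=FALSE}
# Varying intercepts and varying slopes for x.
# (1 + x | group) estimates sigma_alpha, sigma_beta, and their correlation rho.
m3 <- glmer(y ~ x * u + (1 + x | group), family = binomial, data = d)
summary(m3)
```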

## Additive Models and Smoothing Splines

Often, it is helpful to allow more flexibility in the functional forms of our models. While parametric models, like logistic regression, assume a particular form for the relationship between the predictors and the response, nonparametric models use the data to determine both the functional form and the values of the parameters. However, the curse of dimensionality (the more predictors in a model, the fewer observations that are similar to any given observation) can impair nonparametric models. Additive models retain much of the flexibility of nonparametric methods while avoiding the curse of dimensionality: they assume that a transformation of the response is a sum of functions of the individual predictors. In the logistic case,

$$\text{logit} (\mathbb{P}(y_i = 1)) = \alpha + \sum_{j = 1}^p f_j(x_{ij}).$$

These functions can be linear, so generalized linear regression is a special case of additive models. But more interestingly, these functions can be nonparametric.^[How are these models fit? Using what's known as the backfitting algorithm. We define the $k$th partial residuals $Y^{(k)} = Y - \left(\alpha + \sum_{j \neq k} f_j(x_j)\right)$; that is, the portion of $Y$ left over for $f_k(x_k)$ to fit after the other $f_j$'s have had their share. Then, we iteratively fit each of the functions $f_j$ on the partial residuals $Y^{(j)}$ until the functions converge. For a further quick look at additive models, check out Cosma Shalizi's lecture notes [@cosmaadditive].] One of the most common choices for these functions is the smoothing spline. To make the backfitting idea concrete, here is a minimal sketch on simulated data, using a Gaussian additive model with two predictors for simplicity rather than the logistic case above:
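```{r, eval=FALSE}
set.seed(1)
n <- 500
x1 <- runif(n)
x2 <- runif(n)
y  <- sin(2 * pi * x1) + (x2 - 0.5)^2 + rnorm(n, sd = 0.2)

alpha <- mean(y)
f1 <- rep(0, n)
f2 <- rep(0, n)
for (iter in 1:20) {
  # Fit f1 to the partial residuals left after removing alpha and f2
  f1 <- predict(smooth.spline(x1, y - alpha - f2), x1)$y
  f1 <- f1 - mean(f1)  # center each function for identifiability
  # Fit f2 to the partial residuals left after removing alpha and f1
  f2 <- predict(smooth.spline(x2, y - alpha - f1), x2)$y
  f2 <- f2 - mean(f2)
}
```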

Smoothing splines are essentially cubic functions stitched together at points called ``knots'' such that the full piecewise function is continuous and has continuous first and second derivatives. One can further define cyclic cubic splines, which add the constraint that the last knot and the first knot be treated as the same knot, allowing a continuous cyclic function to be fit.^[For a brief and entertaining introduction to smoothing splines, see @cosmasplines. For a more in-depth look at splines, check out @wood2006.]

Computation of multilevel additive models with splines is available in the gamm4 package [@gamm4].
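A multilevel additive logistic model might be specified roughly as in the sketch below, reusing the simulated data frame `d` from the multilevel sketches above; `s()` requests a smoothing spline, and a cyclic cubic spline can be requested with the `bs = "cc"` basis:

```{r, eval=FALSE}
library(gamm4)

# Smooth effect of x plus a group-level predictor and varying intercepts.
# A cyclic term such as s(hour, bs = "cc") could be added for a time-of-day
# variable (hypothetical here).
m4 <- gamm4(y ~ s(x) + u, random = ~ (1 | group),
            family = binomial, data = d)
summary(m4$gam)  # the additive (smooth) part
summary(m4$mer)  # the mixed-model part
```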

## Tools for Evaluating Models {#evaluate}

After fitting our models, we will want to know how they compare. Did adding a particular term enhance or diminish the accuracy of the model? Instead of focusing on one measure of fit, we use several. Log likelihood and AIC provide useful summaries of fit based on the likelihood function. Separation plots---which we discuss in the next section---allow us to assess the predictive ability of a model; in particular, can the model identify high- and low-probability events? We also use the area under the ROC curve (AUC), a measure popular for assessing logistic regressions. Specifically, we compute 10-fold cross-validated AUC statistics to detect whether models are overfitting the data. As an example of the cross-validated AUC computation, here is a sketch on simulated data for a simple (non-multilevel) logistic regression; it assumes the pROC package, which is not among the packages loaded above:
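```{r, eval=FALSE}
library(pROC)  # assumed available for computing AUC

set.seed(7)
n <- 1000
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- rbinom(n, 1, plogis(-0.5 + 0.7 * sim$x1 - 0.4 * sim$x2))

# 10-fold cross-validation: fit on nine folds, compute AUC on the held-out fold
folds <- sample(rep(1:10, length.out = n))
cv.auc <- sapply(1:10, function(k) {
  fit <- glm(y ~ x1 + x2, family = binomial, data = sim[folds != k, ])
  p.hat <- predict(fit, newdata = sim[folds == k, ], type = "response")
  as.numeric(auc(sim$y[folds == k], p.hat))
})
mean(cv.auc)  # the 10-fold cross-validated AUC
```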

### The Separation Plot

The separation plot, created by Greenhill, Ward, and Sacks^[@greenhill2011], is designed to show how well a logistic regression model can distinguish between high and low probability events.

source("../R/separation-plot.R")
# Example of poor model
example.data <- data.frame(y = (1==rbinom(500, 1, 0.5)), y.hat = runif(500, 0, 1))
p1<- separation_plot(example.data, "y", "y.hat")
# Example of better model
better.data <- data.frame(y.hat = invlogit(seq(-3,3, length.out=500)))
draw.bin.with.p <- function(p) { 1==rbinom(1,1,p) }
better.data$y <- sapply(better.data$y, draw.bin.with.p)
p2 <- separation_plot(better.data, "y", "y.hat") 
# Example of perfect model
example.data <- example.data %>% mutate(y.p = as.numeric(y))
p3 <- separation_plot(example.data, "y", "y.p")
p.grid <- arrangeGrob(p1, p2, p3, ncol=1)

ggsave('figure/sep_plot_examples.pdf', p.grid, width=5.5, height=1.5)
label(path = "figure/sep_plot_examples.pdf", 
      caption = "Examples of three separation plots",
alt.cap = "Examples of three separation plots. The first plot shows
what it looks like when $y$ and $\\hat{y}$ are uncorrelated. The second plot 
shows a fairly good model, where the $y$ are generated as Bernoulli($\\hat{y}$).
The third plot shows a model where the responses are fully separated.", 
      label = "sep-plot-examples", type = "figure", scale=1,
options = "tbh")

Let $y$ be a vector of observed binary responses and $\hat{y}$ a vector of probabilities, predicted by some model, that each observation is a 1. Then we can construct the plot as follows: we plot the data $(y, \hat{y})$ as a sequence of vertical stripes, colored according to the observed outcome $y$ and ordered from low to high predicted probability $\hat{y}$. A curve showing $\hat{y}$ as a line graph is superimposed upon the stripes. Finally, a small triangle is placed indicating the point at which the two colors of stripes would meet if all of the $y = 0$ observations were placed to the left of all of the $y = 1$ observations; \textit{i.e.}, showing where the boundary would be if the two classes were perfectly separated by the model.

Separation plots don't scale well to large sample sizes: if there are too many observations, the plot becomes difficult to read. There are several ways around this, but we choose to randomly sample the observations before plotting. For instance, one might subsample using dplyr (loaded above); in the sketch below, `fitted.data` is a hypothetical data frame holding the observed outcomes `y` and fitted probabilities `y.hat` for the full data set:
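```{r, eval=FALSE}
set.seed(123)
# Plot a random subset of 500 observations rather than the full data
separation_plot(fitted.data %>% sample_n(500), "y", "y.hat")
```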


