Part 1: Logistic Regression

library(ggplot2)

In this exercise we will use binary regression with a logit link to model the probability of successfully climbing a mountain given it's height and prominence as covariates. We assume the number of people who successfully climbs mountain $i$ $Y_i$, is binomial distributed, $Y_i \sim$ Bin$(n_i, \pi_i)$, where $n_i$ is the number of people attempting to climb the mountain, and $\pi_i$ is the probability for success. That is, a model on the form,

  1. Model for response: $Y_i \sim \text{Bin}(n_i, \pi_i) \quad \text{for} \ i = 1,\ldots,113$
  2. Linear predictor: $\eta_i = \mathbf{x}_i^T\boldsymbol{\beta}$
  3. Link function: $\eta_i = \ln \left(\frac{\pi_i}{1-\pi_i}\right)$

(a) Log-likelihood function for the parameters

The likelihood for this model is

$$ L(\beta) = \prod_{i=1}^n f(y_i ; \beta) = \prod_{i=1}^n \binom{n_i}{y_i} \pi_i^{y_i} (1-\pi_i)^{n_i - y_i}. $$

So the log-likelihood is given

$$ l(\beta) = \log L(\beta) = \sum_{i=1}^n \left( \log \binom{n_i}{y_i} + y_i \log(\pi_i) + (n_i - y_i) \log(1-pi_i) \right) $$ $$ = \sum_{i=1}^n \left( \log \binom{n_i}{y_i} + y_i \log\left(\frac{\exp(\beta x_i^T)}{1+\exp(\beta x_i^T)}\right) + (n_i - y_i) \log\left(\frac{1}{1+\exp(\beta x_i^T)} \right) \right). $$

To arrive at the maximum likelihood estimators for $\beta$, one would find the maximum of the log-likelihood by numerical optimization. In practice that would mean to find the score function, $S(\beta)$, by differentiating $l(\beta)$, and solving

$$ S(\beta) = \frac{\partial l(\beta)}{\partial \beta} = 0. $$

This is solved by Newton-Raphson or Fisher Scoring algorithm, which for an exponential distribution with the canonical link (like this model) are the same. It is an iterative method on the form

$$ \beta^{(t+1)} = \beta^{(t)} + F(\beta^{(t)})^{-1} S(\beta^{(t)}), $$

where $S$ is the score function, and $F$ is the expected Fisher information matrix. The expected Fisher information matrix is found either by differentiating the score function and taking the expectation, or calculating the covariance matrix of $S$,

$$ F(\beta) = E \left(- \frac{\partial l(\beta)}{\partial \beta \ \partial \beta^T}\right) = Cov(S(\beta)). $$

(b) Calculate and interpret coefficients

We fit the model, with height and prominence as predictors.

filepath <- "https://www.math.ntnu.no/emner/TMA4315/2018h/mountains"
mount <- read.table(file = filepath, header = TRUE, col.names = c("height", "prominence", "fail", "success"))
model <- glm(cbind(success, fail) ~ height + prominence, data = mount, family = "binomial")
summary(model)

Interpret the parameters

To interpret the coefficients we consider the odds, as

$$ \eta_i = \ln \left(\frac{\pi_i}{1-\pi_i}\right)\ \implies \quad \frac{\pi_i}{1-\pi_i} = \exp(\beta x_i^T) = \exp( \beta_0) \exp( x_{i,\text{height}} \beta_{\text{height}}) \exp(x_{i,\text{prominence}} \beta_{\text{prominence}}) $$

So if $x_{i,\text{height}}$ changes to $x_{i,\text{height}}+1$ the odds change with a factor $\exp(\beta_{\text{height}})$ as

$$ \text{odds}{\text{new}} = \exp( \beta_0) \exp( x{i,\text{height}} \beta_{\text{height}}) \exp((x_{i,\text{prominence}}+1) \beta_{\text{prominence}}) \ $$ $$ = \exp( \beta_0) \exp( x_{i,\text{height}} \beta_{\text{height}}) \exp(x_{i,\text{prominence}} \beta_{\text{prominence}}) \exp(\beta_{\text{height}}) = \text{odds}{\text{old}} \exp(\beta{\text{height}}). $$

Similarly if $x_{i,\text{prominence}}$ changes to $x_{i,\text{prominence}}+1$ the odds change with a factor $\exp(\beta_{\text{prominence}})$. Calculating these factors we get

exp(model$coefficients[2])
exp(model$coefficients[3])

Both coefficients are negative, meaning the exponential transform of the coefficients are less than 0. This means that the odds decrease. This means the probability of successfully climbing a mountain decreases with increasing values of height and prominence. The factor from a change in height is larger than from a change in prominence.

Discuss the significance of the parameters

By conducting a Wald test we can discuss the significance of the parameters. The null hypothesis is that the true parameter value is zero, and asymptotically we have that

$$ \frac{\hat{\beta_i} - 0}{SD(\hat{\beta}_i)} \sim N(0,1). $$

We calculate the p-value for each of the parameters.

standard_error <- sqrt(diag(vcov(model)))
pvalue_coeff1 <- 2*pnorm( abs((model$coefficients[1] - 0) / standard_error[1]), lower.tail = FALSE)
pvalue_coeff2 <- 2*pnorm( abs((model$coefficients[2] - 0) / standard_error[2]), lower.tail = FALSE)
pvalue_coeff3 <- 2*pnorm( abs((model$coefficients[3] - 0) / standard_error[3]), lower.tail = FALSE)
pvalue_coeff1
pvalue_coeff2
pvalue_coeff3

We see that all the coefficients have a small p-value, and we would conclude that all the parameters are significant with any reasonable significance level.

Confidence interval for the parameters

Using the fact that the parameters are asymptotically normally distributed we can construct an 95\% confidence interval by

$$ [\hat{\beta}i - z{0.025} SD(\hat{\beta}i), \hat{\beta}_i + z{0.025} SD(\hat{\beta}_i)], $$

where $z_{0.025}$ is the $2.5\%$ lower quantile of the standard normal distribution. For the three parameters in our model we get:

confint_coeff1 <- c(model$coefficient[1] - qnorm(0.975)*standard_error[1], model$coefficient[1] + qnorm(0.975)*standard_error[1])
confint_coeff2 <- c(model$coefficient[2] - qnorm(0.975)*standard_error[2], model$coefficient[2] + qnorm(0.975)*standard_error[2])
confint_coeff3 <- c(model$coefficient[3] - qnorm(0.975)*standard_error[3], model$coefficient[3] + qnorm(0.975)*standard_error[3])
confint_coeff1
confint_coeff2
confint_coeff3

We can see that the confidence interval for the coefficients $\beta_{\text{height}}$ and $\beta_{\text{prominence}}$ does not include numbers above 1, implying that we can with a confidence level of 95\% claim that the effect of both height and prominence have a negative effect of the odds.

Confidence interval for exponential transform

We calculate the confidence interval for the exponential transform of the parameters by simply taking the exponential transform of the upper and lower limit. This transform does not have an intuitive interpretation for the intercepts, like for the two coefficients for height and prominence - the multiplicative effect on the odds. We therefore only consider these confidence intervals.

exp_confint_coeff2 <- exp(confint_coeff2)
exp_confint_coeff3 <- exp(confint_coeff3)
exp_confint_coeff2
exp_confint_coeff3

We see that the confidence interval for both parameters are less than zero, indicating that both parameters with a confidence level of 95\% have a decreasing effect on the odds.

(c) Deviance residual plot and probability plot

Deviance residual plot

We have that the deviance is given by

$$ D = -2 (l(\hat{\beta}) - l(\tilde{\beta})) = -2 (l(\hat{\pi}) - l(\tilde{\pi})), $$

where $\tilde{\pi}$ denote the probability for the saturated model. This probability is simply $\tilde{\pi}_i = y_i / n_i$, such that the estimated number of successes is equal to the actual number for each mountain. The deviance for mountain $i$ is

$$ D_i = \text{sign}(y_i - \hat{y}_i) \cdot (-2) (l(\hat{\pi}_i) - l(\tilde{\pi}_i)) $$ $$ = \text{sign}(y_i - \hat{\pi}_i n_i) \cdot (-2) ( n_i (\log(\hat{\pi}_i) -\log(y_i /n_i) + (n_i -y_i) (\log (1-\hat{\pi}_i) - \log(1-(\frac{y_i}{n_i}))) ). $$

We calculate the deviance of each mountain with the built in R-function, and plot the deviance against height and prominence:

deviance_vector <- residuals(model, type="deviance")
gg_frame <- data.frame(height <- mount$height, prominence <- mount$prominence, deviance <- deviance_vector)
ggplot(data = gg_frame, aes(x = height, y=deviance)) + geom_point() + geom_hline(yintercept=0, linetype="dotted", colour="red")
ggplot(data = gg_frame, aes(x = prominence, y=deviance)) + geom_point() + geom_hline(yintercept=0, linetype="dotted", colour="red")

It seems that the residuals are reasonably randomly distributed around 0 for both "low" mountains and high mountains. However, it seems that the model underestimates the probability of successes for some of the lowest mountains, as there are some large residuals for these. The residuals for the somewhat larger mountains seems to in general be closer to zero. As a consequence of the "easy" mountains, it seems that the model often overestimates the probability of success for the more "average" mountains, as there seems to in general be more negative residuals than positive.

For the prominence we see pretty much the same pattern as for the height. The residuals seem to be reasonably randomly distributed around zero, expect for a few outliers for small prominence. It seems that the model also underestimates the probability of success in this case. For larger prominence the residuals are in general smaller. As with the previous plot it seems to in general be more extreme negative residuals than positive.

Estimated probabilities as function of the covariates

We proceed by plotting the estimated probability as a function of the covariates height and prominence. We have that

$$ \hat{\pi}_i = \frac{\exp(\hat\beta x_i^T)}{1 + \exp(\hat\beta x_i^T)}. $$

We calculate this value for a set of heights and prominences from the minimum value to the maximum value in the data set for both. We plot the corresponding estimated probabilities against the covariates.

N=50
height_vector <- seq(min(mount$height), max(mount$height), length.out = N)
prominence_vector <- seq(min(mount$prominence), max(mount$prominence), length.out = N)

gg_frame <- expand.grid(height=height_vector, prominence=prominence_vector)
gg_frame$prob <- vector("numeric", length(gg_frame$height))
for(i in 1:length(gg_frame$height)){
nu_temp <- as.numeric(model$coefficients[1] + model$coefficients[2]*gg_frame$height[i] + model$coefficients[3]*gg_frame$prominence[i])
gg_frame$prob[i] <- exp(nu_temp)/(1+exp(nu_temp))
}
ggplot(gg_frame, aes(height, prominence, z=prob, fill = prob)) + geom_raster() + scale_fill_gradientn(colours = terrain.colors(10)) #+ geom_contour()

We see that the probability for success is close to 1 for a low mountain with low prominence, while the probability is smaller for a high mountain with a high prominence. The probability decreases faster with increasing height than with increasing prominence, but is clearly decreasing in both.

(d) Prediction of success probability for Everest and Chogolisa

Mount Everest

We have that $x_{\text{everest, height}} = 8848$ and $x_{\text{everest, prominence}} = 8848$. We calculate the corresponding probability of success by the formula above. For the confidence interval we use a similar approach as for the coefficients, using the fact that the coefficients are asymptotically normal, meaning that the linear predictor is asymptotically normal

$$ \hat\eta_e = \hat \beta x_e^T = \hat\beta_0 + x_{e,\text{height}} \hat\beta_{\text{height}} + x_{e,\text{prominence}} \hat\beta_{\text{prominence}} \ \implies \quad \hat\eta_e \sim N(\mu_{\eta_e} = \beta x_e^T, \sigma_{\eta_e}^2 = x_e Cov(\mathbf{\beta}) x_e^T). $$

We construct a confidence interval for the linear predictor with the same approach as employed earlier, namely

$$ [\hat{\eta}e - z{0.025} \sigma_{\eta_e}, \hat{\eta}e + z{0.025} \sigma_{\eta_e}], $$

Finally, to get a 95\% confidence interval for the probability we transform the interval above by the equation for probability

$$ \left[ \frac{\exp(\hat{\eta}e - z{0.025} \sigma_{\eta_e})}{1+ \exp(\hat{\eta}e - z{0.025} \sigma_{\eta_e})}, \frac{\exp(\hat{\eta}e + z{0.025} \sigma_{\eta_e})}{1+ \exp(\hat{\eta}e + z{0.025} \sigma_{\eta_e})} \right] $$

height_everest <- 8848
prominence_everest <- 8848
nu_everest <- as.numeric(model$coefficients[1] + model$coefficients[2]*height_everest + model$coefficients[3]*prominence_everest)
pi_everest <- exp(nu_everest)/(1+exp(nu_everest))
pi_everest

sd_nu_everest <- sqrt(t(c(1, height_everest, prominence_everest)) %*% vcov(model) %*% c(1, height_everest, prominence_everest))
confint_nu_everest <- c(nu_everest - qnorm(0.975)*sd_nu_everest, nu_everest+ qnorm(0.975)*sd_nu_everest)
confint_pi_everest <- exp(confint_nu_everest)/(1 + exp(confint_nu_everest))
confint_pi_everest

max(mount$height)
mean(mount$height)
max(mount$prominence) 

We see that the estimated probability of successfully climbing mount Everest is quite low, around 9\%. The 95\% confidence interval is rather large, from just over 5\% to 14\%. We note that Mount Everest is rather extreme compared to the other mountains in the data set, being approx. 400 meters higher than the highest in the data set (more than 1000 above the average), and almost twice as large prominence as the largest in the data set (almost 6 times the average prominence). This means the model does not necessarily describe Mount Everest as good as the other mountains.

Chogolisa

We have that $x_{\text{chogolisa, height}} = 7665$ and $x_{\text{chogolisa, prominence}} = 1624$. We calculate the estimated probability for success, and corresponding 95\% confidence interval, in the same way as for Mount Everest. In addition we calculate the estimated $y_{\text{chogolisa}}$ by $\hat y_{\text{chogolisa}} = \hat\pi_{\text{chogolosa}} n_{\text{chogolisa}}$, and the confidence interval.

height_chogolisa <- 7665
prominence_chogolisa <- 1624
nu_chogolisa <- as.numeric(model$coefficients[1] + model$coefficients[2]*height_chogolisa + model$coefficients[3]*prominence_chogolisa)
pi_chogolisa <- exp(nu_chogolisa)/(1+exp(nu_chogolisa))
pi_chogolisa

sd_nu_chogolisa <- sqrt(t(c(1, height_chogolisa, prominence_chogolisa)) %*% vcov(model) %*% c(1, height_chogolisa, prominence_chogolisa))
confint_nu_chogolisa <- c(nu_chogolisa - qnorm(0.975)*sd_nu_chogolisa, nu_chogolisa+ qnorm(0.975)*sd_nu_chogolisa)
confint_pi_chogolisa <- exp(confint_nu_chogolisa)/(1 + exp(confint_nu_chogolisa))
y_chogolisa_hat <- pi_chogolisa*22
confint_y_chogolisa_hat <- confint_pi_chogolisa*22

We see that the estimated number of people successfully climbing the mountain is somewhat smaller than the true number. This is consistent with our observation that the model have a tendency to underestimate the probability for mountains with small prominence, which is the case here.



JakobGM/mylm documentation built on May 7, 2019, 2:27 p.m.