```{r}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

Homework 4

Problem 1: CASL Exercises 5.8, question 2.

We mentioned that the Hessian matrix in Equation 5.19 can be more ill-conditioned than the matrix $X^t X$ itself. Generate a matrix $X$ and probabilities $p$ such that the linear Hessian $(X^t X)$ is well-conditioned but the logistic variation is not.

The condition number of $X$ is $$\mathrm{cond}(X) = \frac{c_{max}}{c_{min}} = \frac{\underset{\delta : \lVert \delta \rVert = 1}{\max} {\lVert X \delta \rVert}}{\underset{\delta : \lVert \delta \rVert = 1}{\min} {\lVert X \delta \rVert}}.$$ Under the $\ell_2$-norm this is equal to $$\mathrm{cond}(X) = \frac{\sigma_{max}}{\sigma_{min}},$$ the ratio of the largest and smallest singular values of $X$.
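As a quick sanity check (an added illustration, not part of the original solution), base R's `kappa()` with `exact = TRUE` computes exactly this singular-value ratio:

```{r}
# Illustration: the 2-norm condition number as a singular-value ratio.
A <- matrix(c(2, 0, 0, 0.5), nrow = 2)   # singular values 2 and 0.5
s <- svd(A)$d
s[1] / s[length(s)]                      # ratio of extreme singular values: 4
kappa(A, exact = TRUE)                   # base R computes the same quantity
```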

Here we compare the condition numbers of the linear Hessian $X^\top X$ and the logistic Hessian $X^\top \cdot \mathrm{diag} (p \cdot (1 - p)) \cdot X$.

```{r}
# Linear Hessian X^t X and logistic Hessian X^t diag(p (1 - p)) X.
H_lin <- function(X) t(X) %*% X
H_log <- function(X, p) t(X) %*% diag(p * (1 - p)) %*% X

# Condition number from an svd() result: largest over smallest singular value.
cond <- function(s) s$d[1] / s$d[length(s$d)]

# The columns of X are orthogonal, so X^t X = 2 I is perfectly conditioned.
X <- matrix(c(1, 1, 1, -1), nrow = 2, ncol = 2)
# A probability very close to 1 makes the weight p (1 - p) nearly zero for
# that observation, which ruins the conditioning of the logistic Hessian.
p <- c(0.5, 1 - 1e-6)
cond(svd(H_lin(X)))
cond(svd(H_log(X, p)))
```

The linear Hessian is perfectly conditioned (condition number 1), while the logistic Hessian's condition number is on the order of $10^5$, so the logistic variation is far more ill-conditioned even though $X^\top X$ itself is not.
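To see how quickly the conditioning degrades, a small sweep (an added illustration using the functions defined above) pushes one probability toward the boundary:

```{r}
# As one probability approaches 1, its weight p (1 - p) vanishes and the
# logistic Hessian's condition number blows up; the linear Hessian is
# unaffected because it does not depend on p.
eps <- 10^-(1:6)
sapply(eps, function(e) cond(svd(H_log(X, c(0.5, 1 - e)))))
```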

Problem 2: CASL Exercises 5.8, question 4.

It is possible to incorporate a ridge penalty into the maximum likelihood estimator. Modify the function irwls_glm to include an $\ell_2$-norm penalty on the regression vector for a fixed tuning parameter $\lambda$.

Adding the ridge penalty to the maximum likelihood estimator replaces the log-likelihood $l(\beta)$ with the penalized objective $l(\beta) - \lambda \lVert \beta \rVert_2^2$. This subtracts $2 \lambda \beta$ from the gradient and $2 \lambda I$ from the Hessian, so the Newton update becomes $$\beta^{(k + 1)} = \beta^{(k)} - \left[ H (l) (\beta^{(k)}) - 2 \lambda I \right]^{-1} \left[ \nabla_\beta (l) (\beta^{(k)}) - 2 \lambda \beta^{(k)} \right],$$ with $H$ and $\nabla_\beta$ as defined in the text. Rearranging into iteratively reweighted least squares form, the working response $$z = X \beta^{(k)} + \frac{y - \mathbb{E}\, y^{(k)}}{\mathrm{diag}(\mathrm{Var}(y^{(k)}))}$$ is unchanged by the penalty; the only modification is that each update solves the ridge normal equations $$\beta^{(k + 1)} = \left( X^\top W X + 2 \lambda I \right)^{-1} X^\top W z, \qquad W = \mathrm{diag}(\mathrm{Var}(y^{(k)})).$$

The code is included in the package.
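As a rough sketch of that modification (following the update derived above; the function and argument names here are illustrative and may not match the package's exact implementation):

```{r}
# Sketch of ridge-penalized IRWLS for logistic regression. The name
# irwls_ridge_logistic and its signature are assumptions for illustration.
irwls_ridge_logistic <- function(X, y, lambda, maxit = 25, tol = 1e-10) {
  beta <- rep(0, ncol(X))
  for (i in seq_len(maxit)) {
    beta_old <- beta
    p <- as.numeric(1 / (1 + exp(-X %*% beta)))  # E y at the current iterate
    W <- p * (1 - p)                             # diag(Var(y)) for logistic
    z <- X %*% beta + (y - p) / W                # working response (unchanged)
    # Ridge normal equations: (X^t W X + 2 lambda I) beta = X^t W z
    beta <- solve(crossprod(X, W * X) + diag(2 * lambda, ncol(X)),
                  crossprod(X, W * z))
    if (sqrt(sum((beta - beta_old)^2)) < tol) break
  }
  beta
}
```

Setting `lambda = 0` recovers the unpenalized IRWLS update from the text.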

Problem 3: Sparse matrix


