```{r setup}
knitr::opts_chunk$set(echo = TRUE)
```

Part 1

CASL 5.8, Exercise 2: We mentioned that the Hessian matrix in Equation 5.19 can be more ill-conditioned than the matrix $X^tX$ itself. Generate a matrix $X$ and probabilities $p$ such that the linear Hessian ($X^tX$) is well-conditioned but the logistic variation is not.

The condition number of a matrix $A$ is $k(A) = \frac{\sigma_{max}}{\sigma_{min}}$, where $\sigma_{max}$ and $\sigma_{min}$ are the maximum and minimum singular values of $A$. If the condition number is not much larger than one, the matrix is well-conditioned, meaning its inverse can be computed with good accuracy; if the condition number is very large, the matrix is ill-conditioned. In this question, the linear Hessian is $X^tX$ and the logistic variation is $X^tDX$, where $D$ is a diagonal matrix with elements $D_{i,i} = p_i(1-p_i)$.
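As a small illustration of this definition (the toy matrix below is only for illustration), the SVD-based ratio agrees with base R's kappa(..., exact = TRUE):

```{r}
A <- diag(c(10, 1e-6))           # singular values are 10 and 1e-6
max(svd(A)$d) / min(svd(A)$d)    # condition number 1e7: ill-conditioned
kappa(A, exact = TRUE)           # base R computes the same ratio
```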

The matrix $X$ and the vector $p$ are chosen as follows:

```{r}
X <- matrix(c(1, 10, 10, 1), 2, 2)
p <- c(0.6, 1e-4)                 # one probability very close to 0
X
p
# condition number of the linear Hessian X^t X
svd_XX <- svd(t(X) %*% X)$d
cd_XX <- max(svd_XX) / min(svd_XX)
# condition number of the logistic Hessian X^t D X, with D_ii = p_i (1 - p_i)
H <- t(X) %*% diag(p * (1 - p)) %*% X
svd_H <- svd(H)$d
cd_H <- max(svd_H) / min(svd_H)
```

The condition number of the linear Hessian is `r cd_XX`, which indicates that it is well-conditioned, while the condition number of the logistic variation is `r cd_H`, which indicates that it is ill-conditioned.

Part 2

Describe and implement a first-order solution for the GLM maximum likelihood problem using only gradient information, avoiding the Hessian matrix. Include both a constant step size and an adaptive one. You may use a standard adaptive update (Momentum, Nesterov, AdaGrad, Adam) or your own. Make sure to explain your approach and compare its performance with a constant step size.

Let $f_{\theta}$ denote an exponential family parametrized by the scalar parameter $\theta$. Assume $y_i \sim f_{\theta}$ and, for a link function $g$, $E(y_i) = \mu_i = g^{-1}(\eta_i) = g^{-1}(x_i^t\beta)$.
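For example, for the cauchit link used in the simulation below, the inverse link $g^{-1}$ is the standard Cauchy CDF:

```{r}
# g^{-1} for the cauchit link is pcauchy (away from the numerical boundaries
# where binomial()'s linkinv clamps eta)
fam <- binomial(link = "cauchit")
all.equal(fam$linkinv(c(-1, 0, 2)), pcauchy(c(-1, 0, 2)))
```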

Then the score function (the gradient of the log-likelihood) is $\triangledown_{\beta}l(\beta) = X^t(y - E(y))$.

With a constant step size, gradient ascent on the log-likelihood uses the update $$\beta_t = \beta_{t-1} + \lambda \triangledown_{\beta} l(\beta_{t-1}),$$

where $\lambda$ is the step size; equivalently, this is gradient descent on the negative log-likelihood.

If we use the momentum method to make the step adaptive, the update becomes $$ v_t = \gamma v_{t-1} + \lambda \triangledown_{\beta} l(\beta_{t-1}),$$ $$\beta_t = \beta_{t-1} + v_t,$$ where $\gamma$ is the momentum coefficient; setting $\gamma = 0$ recovers the constant-step update. A code sketch of both updates is given below.
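The following is a minimal sketch of these updates (the helper first_order_glm and its arguments are hypothetical, not the bis557 implementation of my_glm), using the score expression above together with the family's inverse link:

```{r}
# Minimal sketch of first-order GLM fitting by gradient ascent on the
# log-likelihood; gamma = 0 gives the constant-step update.
# (Hypothetical helper, not the internals of bis557::my_glm.)
first_order_glm <- function(X, y, family, lambda = 1e-3, gamma = 0,
                            maxit = 10000) {
  beta <- rep(0, ncol(X))                       # start at zero
  v    <- rep(0, ncol(X))                       # momentum accumulator
  for (i in seq_len(maxit)) {
    mu   <- family$linkinv(drop(X %*% beta))    # E(y) under the current beta
    grad <- drop(crossprod(X, y - mu))          # score: X^t (y - E(y))
    v    <- gamma * v + lambda * grad
    beta <- beta + v
  }
  beta
}
# e.g. first_order_glm(X, Y, binomial(link = "cauchit"), lambda = 0.001, gamma = 0.2)
```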

Below, the my_glm function from the bis557 package is applied to simulated cauchit-link data with both the constant and the momentum (adaptive) step sizes:

```{r}
library(bis557)
set.seed(123)
n <- 1000; p <- 3
beta <- c(0.2, 2, 1)
X <- cbind(1, matrix(rnorm(n * (p - 1)), ncol = p - 1))
mu <- 1 - pcauchy(X %*% beta)
Y <- as.numeric(runif(n) > mu)   # so P(Y = 1) = 1 - mu = pcauchy(X beta)

# constant step size
fit1 <- my_glm(X, Y, family = binomial(link = "cauchit"),
               step_option = "constant", lambda = 0.001)

# adaptive (momentum) step size
fit2 <- my_glm(X, Y, family = binomial(link = "cauchit"),
               step_option = "momentum", lambda = 0.001, gamma = 0.2)

cbind(fit1$coefficients, fit2$coefficients)
```

From this comparison, we can see that both step-size choices produce similar coefficient estimates.
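For reference, the same simulated data can also be fit with the built-in glm() as a sanity check on the estimates:

```{r}
# Reference fit with the built-in glm(); X already contains an intercept
# column, so the formula drops the automatic intercept.
fit_ref <- glm(Y ~ X - 1, family = binomial(link = "cauchit"))
coef(fit_ref)
```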

```{r}
# timing: constant step size
system.time(
  for (i in 1:50) {
    my_glm(X, Y, family = binomial(link = "cauchit"),
           step_option = "constant", lambda = 0.001)
  }
)
# timing: momentum step size
system.time(
  for (i in 1:50) {
    my_glm(X, Y, family = binomial(link = "cauchit"),
           step_option = "momentum", lambda = 0.001, gamma = 0.2)
  }
)
```

From the timing results above, we can see that the algorithm with the momentum step is more efficient than the algorithm with a constant step size.

Part 3

Describe and implement a classification model generalizing logistic regression to accommodate more than two classes.

The one-vs-all approach is used here. For a $K$-class problem, $K$ binary models are fitted; the $k$th model predicts whether an observation belongs to class $k$ or not. We use the my_glm function with family = binomial(link = "logit") to fit each model and compute its coefficients, giving $K$ coefficient vectors $\beta_1, \dots, \beta_K$, each of length $p$. To predict new values, apply all $K$ models and compute $P(\hat{Y} = k) = \frac{\exp(X\beta_k)}{1 + \exp(X\beta_k)}$ for the $k$th class; the class with the largest probability is the predicted class. A sketch of this scheme is given below.
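A minimal sketch of this one-vs-all scheme is given here; the helpers ova_fit and ova_predict are hypothetical and use base R's glm.fit rather than my_glm, and $X$ is assumed to already contain an intercept column:

```{r}
# Minimal one-vs-all sketch with base R's glm.fit(); not the bis557
# my_multiclass implementation used below. glm.fit may warn about perfect
# separation on easy data such as iris; the sketch ignores this.
ova_fit <- function(X, y) {
  classes <- levels(factor(y))
  # one binary logistic fit per class: "is the observation in class k?"
  betas <- sapply(classes, function(k) {
    glm.fit(X, as.numeric(y == k), family = binomial(link = "logit"))$coefficients
  })
  list(betas = betas, classes = classes)   # betas is a p x K matrix
}

ova_predict <- function(fit, X) {
  eta  <- X %*% fit$betas                  # n x K linear predictors
  prob <- exp(eta) / (1 + exp(eta))        # P(class k) from each binary model
  fit$classes[max.col(prob)]               # class with the largest probability
}
```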

Here is an example using iris data:

```{r}
data(iris)
X <- model.matrix(Species ~ ., data = iris)   # design matrix with intercept
Y <- iris$Species
fit_m <- my_multiclass(X, Y)
predit_accu <- mean(predict(fit_m, X)$Y_hat == Y)   # training accuracy
```

With the my_multiclass function, the prediction accuracy on the training data is `r predit_accu`.
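For context, a true multinomial logistic model fit with nnet::multinom (assuming the recommended nnet package is available) provides a reference point for this training accuracy:

```{r}
# Reference: multinomial logistic regression from the nnet package
library(nnet)
fit_ref2 <- multinom(Species ~ ., data = iris, trace = FALSE)
mean(predict(fit_ref2, iris) == iris$Species)   # training accuracy for comparison
```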


