```r
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
library(bis557)
```
For logistic regression, the Hessian of the log-likelihood $l$ is: $$ H(l) = X^{T}DX, $$
where $D_{i,i}=p_{i} (1-p_{i})$.
For the linear (least-squares) Hessian, $D=I$ (the identity matrix), so $H = X^{T}X$.
To make the logistic Hessian matrix ill-conditioned, we want many of the
diagonal elements of $D$ to be close to 0:
$$
D_{i,i} \rightarrow 0 \text{ for many } i.
$$
This means that most of the probabilities should be very close to 0 or 1 ($p_{i} \rightarrow 0$ or $p_{i} \rightarrow 1$); for example, $p_{i} = 0.999$ gives $D_{i,i} \approx 0.001$, compared with the maximum value $D_{i,i} = 0.25$ at $p_{i} = 0.5$.
Let's create a $120 \times 8$ matrix $X$ using the RNG seed `set.seed(2020)`.
To make the probabilities really close to $0$ or $1$, let
$\beta=(200,200,...,200)$.
We will check the condition number $\kappa$ for the linear Hessian and the
logistic Hessian; a high condition number indicates ill-conditioning.
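For reference, the (2-norm) condition number is the ratio of the largest to the smallest singular value. Here is a minimal sketch of that computation; base R's `kappa(..., exact = TRUE)` computes the same quantity, while the default `kappa()` returns a fast estimate of it.

```r
# Condition number as the ratio of extreme singular values
# (the quantity kappa(..., exact = TRUE) computes).
cond_number <- function(A) {
  s <- svd(A)$d     # singular values of A
  max(s) / min(s)   # kappa(A) = sigma_max / sigma_min
}

cond_number(diag(3))  # identity matrix: perfectly conditioned, kappa = 1
```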
```r
# Dimensions
set.seed(2020)
n <- 120; p <- 8

# Matrix X, coefficient betas, and probabilities
beta <- rep(200, p)
X <- cbind(1, matrix(data = rnorm(n * (p - 1)), nrow = n, ncol = p - 1))
probs <- 1 / (1 + exp(-X %*% beta))  # logistic
print(quantile(probs, c(0.1, 0.25, 0.5, 0.75, 0.9)))  # mostly near 0 or 1

# Linear Hessian and logistic Hessian
print(Hessian_linear <- kappa(t(X) %*% X))  # Answer: 2.33
D <- diag(as.vector(probs) * (1 - as.vector(probs)), nrow = n, ncol = n)
print(Hessian_logis <- kappa(t(X) %*% D %*% X))  # Answer: 3.01e+13
```
Here, we see that $\kappa(X^{T}X) = 2.33$ (a low condition number) and $\kappa(X^{T}DX) = 3.01 \times 10^{13}$ (a high condition number).
Thus, the linear Hessian is well-conditioned, while the logistic Hessian is ill-conditioned.
The function is called `bis557::glm_hw3b()`.
Here, we use only the first-order condition to solve for the coefficients of a
generalized linear model (GLM) with the gradient descent algorithm.
For each observation $i$, we let: $$ \mathbb{E}[y_i] = g^{-1}(\langle x_i, \beta \rangle), $$ where $g$ is the link function.
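In R, the inverse link $g^{-1}$ is available from the family object as `linkinv`; a quick illustration for the Poisson log link (used in the example below):

```r
# The family object carries the inverse link g^{-1} as $linkinv;
# for the log link, g^{-1}(eta) = exp(eta).
fam <- poisson(link = "log")
eta <- c(-1, 0, 1)    # example linear predictors <x_i, beta>
fam$linkinv(eta)      # exp(eta) = E[y_i]
```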
The gradient (score) of the log-likelihood $l$ under the canonical link is: $$ \nabla_{\beta} l(\beta) = X^{T}(y - \mathbb{E}[y]). $$
Since we maximize the log-likelihood (equivalently, run gradient descent on $-l$), the update is: $$ \beta \leftarrow \beta + \alpha \nabla_{\beta} l(\beta), $$ where $\alpha$ is the step size.
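To make the update concrete, here is a minimal sketch of such a solver; it illustrates the update rule above and is not the package's actual `glm_hw3b` implementation:

```r
# Gradient-based GLM solver (sketch only; not the package's glm_hw3b).
glm_gd_sketch <- function(X, y, family, steps, maxiter) {
  beta <- rep(0, ncol(X))               # initialize coefficients at zero
  for (t in seq_len(maxiter)) {
    mu   <- family$linkinv(X %*% beta)  # E[y] = g^{-1}(X beta)
    grad <- crossprod(X, y - mu)        # score: X^T (y - E[y])
    beta <- beta + steps[t] * grad      # ascent step on the log-likelihood
  }
  list(coefficients = beta)
}
```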
The example below solves for the coefficients of a GLM with a Poisson-distributed response variable, first using a constant step size.
```r
# Givens (with constant step size)
set.seed(2020)
n <- 3000; p <- 4; maxiter <- 500; steps <- rep(1e-3, maxiter)

# Create covariates X, response variable y, and the true beta coefficients
X <- cbind(1, matrix(rnorm(n * (p - 1)), ncol = p - 1))
beta <- c(-1, 0.2, 0.1, 0.3)
y <- rpois(n, lambda = exp(X %*% beta))

# Use GLM to solve for coefficients
b_hat <- glm_hw3b(X, y, family = poisson(link = "log"), steps, maxiter)

# Compare true vs estimated coefficients
print(cbind(beta, b_hat$coefficients))
```
The estimated coefficients $\hat\beta$ are very close to the actual coefficients $\beta$ for constant step size.
Second, we will use an adaptive step size $\alpha_{t} = O\left( \frac{1}{t} \right)$, for the same data.
```r
# Adaptive step size
set.seed(2020)
steps2 <- 5e-3 / (1:maxiter)

# Use GLM to solve for coefficients
b_hat2 <- glm_hw3b(X, y, family = poisson(link = "log"), steps = steps2, maxiter)

# Compare true vs estimated coefficients
print(cbind(beta, b_hat2$coefficients))
```
The estimated coefficients $\hat\beta$ are also very close to the actual coefficients $\beta$ for adaptive step size.
Therefore, the constant and adaptive step sizes perform very similarly here; both converge well, provided the user chooses the step size appropriately.
The function is called `bis557::multiclass_hw3c()`.
This is generalized logistic regression for predicting multiple classes.
The approach is one-vs-all: for each of the $K$ classes, a separate logistic
regression model is fit (with its own coefficients and probabilities) that
predicts whether an observation is in class $k$ or not, for
$k \in \{1, 2, \ldots, K\}$.
We use the Newton-Raphson method (second-order), whose per-class update is: $$ \beta \leftarrow \beta + (X^{T}DX)^{-1}X^{T}(y - p), $$ where $y$ is the binary indicator for class $k$, $p$ is the vector of fitted probabilities, and $D_{i,i} = p_{i}(1 - p_{i})$ as above.
Then, for each observation, the class with the highest probability is chosen.
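As an illustration of the approach (a sketch, not the package's `multiclass_hw3c` implementation), a one-vs-all fit with Newton-Raphson updates might look like the following; a tiny ridge term is added to keep the Hessian invertible when probabilities saturate:

```r
# One-vs-all logistic regression via Newton-Raphson (sketch only;
# not the package's multiclass_hw3c code).
one_vs_all_sketch <- function(X, y, maxiter = 10) {
  X <- as.matrix(cbind(1, X))            # add an intercept column
  classes <- levels(factor(y))
  probs <- sapply(classes, function(k) {
    z <- as.numeric(y == k)              # 1 if in class k, 0 otherwise
    beta <- rep(0, ncol(X))
    for (i in seq_len(maxiter)) {
      p <- as.vector(1 / (1 + exp(-X %*% beta)))      # logistic probabilities
      W <- p * (1 - p)                                # diagonal of D
      H <- crossprod(X, X * W) + diag(1e-8, ncol(X))  # X^T D X, lightly ridged
      beta <- beta + solve(H, crossprod(X, z - p))    # Newton step
    }
    p                                    # P(class k | x_i) for each observation
  })
  classes[max.col(probs)]                # choose the most probable class
}
```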
The example below predicts the species in the `iris` dataset using this function, with 96% accuracy. Notice that we can use fewer iterations thanks to the second-order method (compared with the first-order gradient descent in Question 2).
```r
data(iris)
species_pred <- multiclass_hw3c(X = iris[, -5], y = iris$Species, maxiter = 60)
print(paste0("Prediction Accuracy = ", mean(iris$Species == species_pred)))
```