```{r setup}
knitr::opts_chunk$set(echo = TRUE)
```
The condition number of a matrix $A$ is $k(A) = \frac{\sigma_{max}}{\sigma_{min}}$, where $\sigma_{max}$ and $\sigma_{min}$ are the maximum and minimum singular values of $A$. If the condition number is not much larger than one, the matrix is well-conditioned, meaning its inverse can be computed with good accuracy. If the condition number is very large, the matrix is said to be ill-conditioned. In this question, the linear Hessian is $X^tX$ and the logistic variation is $X^tDX$, where $D$ is a diagonal matrix with elements $D_{i,i} = p_i(1-p_i)$.
The $X$ matrix and $p$ vector are as follows:
```{r}
X <- matrix(c(1, 10, 10, 1), 2, 2)
p <- c(0.6, 1e-4)
X
p
# condition number of the linear Hessian X'X
svd_XX <- svd(t(X) %*% X)$d
cd_XX <- max(svd_XX) / min(svd_XX)
# condition number of the logistic Hessian X'DX, with D = diag(p(1 - p))
H <- t(X) %*% diag(p * (1 - p)) %*% X
svd_H <- svd(H)$d
cd_H <- max(svd_H) / min(svd_H)
```
Then the condition number of the linear Hessian is `r cd_XX`, which indicates that $X^tX$ is well-conditioned, while the condition number of the logistic variation is `r cd_H`, which indicates that $X^tDX$ is ill-conditioned.
Let $f_{\theta}$ denote an exponential family parametrized by the scalar parameter $\theta$. Define the link function $g$ and assume $E(y_i) = \mu_i = g^{-1}(\eta_i) = g^{-1}(x_i^t\beta)$, where $y_i \sim f_{\theta}$.
Then the score function is $\triangledown_{\beta} l = X^t(y - E(y))$.
In the gradient descent algorithm, if we use a constant step size, the update step is: $$\beta_t = \beta_{t-1} - \lambda \triangledown_{\beta} l(\beta_{t-1}),$$
where $\lambda$ is the step size.
If we use the momentum method to obtain an adaptive step size, the update step becomes: $$v_t = \gamma v_{t-1} + \lambda \triangledown_{\beta} l(\beta_{t-1}),$$ $$\beta_t = \beta_{t-1} - v_t.$$
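For illustration, here is a minimal sketch of gradient descent with momentum for a logistic regression. This is only a sketch of the update rule above, not the actual `my_glm` implementation; the function name and arguments are hypothetical, and $l$ is treated as the negative log-likelihood.

```{r}
# Minimal sketch (not the actual my_glm implementation): gradient descent with
# momentum for logistic regression, treating l as the negative log-likelihood
# so that its gradient is -X'(y - mu).
gd_momentum_logistic <- function(X, y, lambda = 0.001, gamma = 0.2, maxit = 1000) {
  beta <- rep(0, ncol(X))
  v    <- rep(0, ncol(X))
  for (it in seq_len(maxit)) {
    mu   <- 1 / (1 + exp(-X %*% beta))   # E(y) under the logit link
    grad <- -t(X) %*% (y - mu)           # gradient of the negative log-likelihood
    v    <- gamma * v + lambda * grad    # v_t = gamma * v_{t-1} + lambda * grad
    beta <- beta - v                     # beta_t = beta_{t-1} - v_t
  }
  beta
}
```

Setting $\gamma = 0$ in this sketch recovers the constant step size update.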
Here are the code implementations using the `my_glm` function (an analogue of the built-in `glm`) with constant and adaptive step sizes:
```{r}
library(bis557)
set.seed(123)
# simulate a binary response with a cauchit link
n <- 1000; p <- 3
beta <- c(0.2, 2, 1)
X <- cbind(1, matrix(rnorm(n * (p - 1)), ncol = p - 1))
mu <- 1 - pcauchy(X %*% beta)
Y <- as.numeric(runif(n) > mu)
# constant step size
fit1 <- my_glm(X, Y, family = binomial(link = "cauchit"),
               step_option = "constant", lambda = 0.001)
# adaptive step size (momentum)
fit2 <- my_glm(X, Y, family = binomial(link = "cauchit"),
               step_option = "momentum", lambda = 0.001, gamma = 0.2)
cbind(fit1$coefficients, fit2$coefficients)
```
From the comparison of the results, we can see that the two methods produce similar coefficient estimates.
```{r}
# timing: constant step size
system.time(
  for (i in 1:50) {
    my_glm(X, Y, family = binomial(link = "cauchit"),
           step_option = "constant", lambda = 0.001)
  }
)
```
```{r}
# timing: momentum (adaptive) step size
system.time(
  for (i in 1:50) {
    my_glm(X, Y, family = binomial(link = "cauchit"),
           step_option = "momentum", lambda = 0.001, gamma = 0.2)
  }
)
```
From the above results, we can see that the algorithm with the momentum step size is more efficient than the algorithm with a constant step size.
The one-vs-all approach is used here. For a $K$-class problem, $K$ binary models are fitted; each model predicts whether an observation belongs to the $k$th class or not. We use the `my_glm` function with `family = binomial(link = "logit")` to fit each model and obtain its coefficients. The $K$ models give $K$ sets of coefficients $\beta_1, \dots, \beta_K$, where each $\beta_k$ is a $p$-vector. To predict new values, we apply all $K$ models and compute the probability $P(\hat{Y} = k) = \frac{\exp(X\beta_k)}{1 + \exp(X\beta_k)}$ for the $k$th class. The class with the largest probability is chosen as the predicted class.
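As a rough sketch of the one-vs-all scheme just described (the packaged implementation is `my_multiclass`; the helper names below are hypothetical, and the `my_glm` call assumes the same interface used earlier in this vignette):

```{r}
# Sketch only: fit one "class k vs. rest" logistic model per class
# and predict the class with the largest fitted probability.
one_vs_all_fit <- function(X, Y) {
  classes <- levels(factor(Y))
  sapply(classes, function(k) {
    y_k <- as.numeric(Y == k)            # 1 if the observation is in class k, 0 otherwise
    my_glm(X, y_k, family = binomial(link = "logit"),
           step_option = "constant", lambda = 0.001)$coefficients
  })                                     # p x K matrix, one column of betas per class
}

one_vs_all_predict <- function(betas, X_new) {
  eta  <- X_new %*% betas                # n x K matrix of linear predictors
  prob <- exp(eta) / (1 + exp(eta))      # P(Y = k) for each class
  colnames(betas)[max.col(prob)]         # class with the largest probability
}
```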
Here is an example using the `iris` data:
```{r}
data(iris)
X <- model.matrix(Species ~ ., data = iris)
Y <- iris$Species
fit_m <- my_multiclass(X, Y)
# proportion of correctly classified observations
predict_accu <- mean(predict(fit_m, X)$Y_hat == Y)
```
With the `my_multiclass` function, the prediction accuracy is `r predict_accu`.