library(bis557)
my_glm <- function(form, data, mu_fun, method, gamma = 0.0001, maxit = 1000, tol = 1e-6){
  #Extract the independent variables
  X <- model.matrix(form, data)
  #Create a dataset dropping NAs in regressors, preparing for extracting a Y vector with proper length
  data_no_na <- model.frame(form, data)
  #Identify the name of the dependent variable
  y_name <- as.character(form)[2]
  #Extract a vector of the response variable
  Y <- as.matrix(data_no_na[, y_name], ncol = 1)
  beta <- matrix(0, ncol(X), ncol = 1)
  beta_diff <- c()
  count <- 0
  #Method 1: constant step size
  if (method == 1) {
    for (i in 1:maxit){
      beta_old <- beta
      eta <- X %*% beta_old
      mu <- mu_fun(eta)
      grad <- t(X) %*% (Y - mu)
      beta <- beta_old + gamma * grad
      beta_diff <- c(beta_diff, sum((beta - beta_old)^2))
      #print(tail(beta_diff, 1))
      count <- count + 1
      if (tail(beta_diff, 1) <= tol){
        break
      }
    }
  }
  #Method 2: adaptive step size (momentum)
  if (method == 2) {
    #eta.0 <- X %*% beta
    #mu.0 <- mu_fun(eta.0)
    #grad.0 <- t(X) %*% (Y-mu.0)
    #v_old <- grad.0
    v_old <- 0
    frac <- 0.9
    for (i in 1:maxit){
      beta_old <- beta
      eta <- X %*% beta_old
      mu <- mu_fun(eta)
      grad <- t(X) %*% (Y - mu)
      v_new <- frac * v_old + gamma * grad
      beta <- beta_old + v_new
      beta_diff <- c(beta_diff, sum((beta - beta_old)^2))
      v_old <- v_new
      count <- count + 1
      #print(tail(beta_diff, 1))
      if (tail(beta_diff, 1) <= tol){
        break
      }
    }
  }
  list(coefficients = beta, beta_diff = beta_diff, count = count)
}

multiclass_logistic <- function(X, Y, maxit = 100){
  #Number of classes
  type <- length(unique(Y))
  #Initialize beta
  beta <- matrix(0, nrow = type, ncol = ncol(X))
  beta_old <- matrix(0, nrow = type, ncol = ncol(X))
  for (i in 1:maxit){
    for (j in 1:type){
      #One-vs-rest Newton update for class j
      beta_old[j, ] <- beta[j, ]
      p <- 1 / (1 + exp(-X %*% beta[j, ]))
      D <- diag(as.vector(p * (1 - p)))
      XtDX <- crossprod(X, D %*% X)
      grad <- t(X) %*% (1 * (Y == unique(Y)[j]) - matrix(p, ncol = 1))
      beta[j, ] <- beta_old[j, ] + solve(XtDX, grad)
    }
  }
  list(coefficients = beta)
}

predict_logistic <- function(X, Y, object){
  #Number of classes
  type <- length(unique(Y))
  #Coefficients
  beta <- object$coefficients
  p <- matrix(0, nrow = nrow(X), ncol = type)
  #Denominator used to normalize the class probabilities
  denom <- 0
  for (i in 1:type) {
    denom <- denom + exp(X %*% beta[i, ])
  }
  for (j in 1:type){
    p[, j] <- exp(X %*% beta[j, ]) / denom
  }
  #A crude classification that picks the class with the highest probability
  pick <- apply(p, 1, which.max)
  fitted.Y <- unique(Y)[pick]
  return(list(fitted.Y = fitted.Y, p = p))
}

library(palmerpenguins)
library(missForest)
#Impute missing values in the penguins data with missForest
penguinsi <- penguins
penguinsi <- data.frame(missForest(as.data.frame(penguinsi))$ximp)
From the question, our goal is to give an example of a vector $p$ and a design matrix $X$ such that the condition number of the linear Hessian $X^{T}X$ is small (well-conditioned) but the condition number of $X^{T}DX$ is large (ill-conditioned), where $D$ is a diagonal matrix with diagonal entries $p_i(1-p_i)$. To create a large gap between the two condition numbers, we need to force the singular values of $X^{T}DX$ to span a wide range. This is easily achieved by making some diagonal entries of $D$ close to 0, i.e., by making the corresponding entries of $p$ approach $0$ or $1$. An example is shown below.
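For reference (a standard definition, not part of the original write-up), the condition number reported by `kappa()` is, up to the approximation it uses by default, the ratio of the largest to the smallest singular value, so a $D$ with some near-zero diagonal entries can inflate it dramatically even when $X^{T}X$ itself is well-conditioned:
$$\kappa(A)=\frac{\sigma_{\max}(A)}{\sigma_{\min}(A)}, \qquad \kappa(X^{T}DX)\gg\kappa(X^{T}X) \ \text{ when some } \ p_i(1-p_i)\approx 0.$$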
set.seed(1222222)
X <- matrix(rnorm(30, 0, 1), ncol = 3, nrow = 10)
X <- cbind(rep(1, 10), X)
#Extreme coefficients push p very close to 0 or 1
beta <- rep(200, 4)
p <- as.vector(1 / (1 + exp(-X %*% beta)))
D <- diag(p * (1 - p))
#Compare the condition numbers
#XtX
kappa(t(X) %*% X)
#XtDX
kappa(t(X) %*% D %*% X)
It is clear that in this case the linear Hessian is well-conditioned while the logistic variation is not.
In this question, I write a function called "my_glm" that implements first-order gradient descent in two variants. The first uses a constant step size and the second uses an adaptive step size based on the momentum optimization algorithm described at https://ruder.io/optimizing-gradient-descent/. For the adaptive (momentum) algorithm, the update is $$v_t = 0.9\, v_{t-1} + \gamma \nabla_{\beta} J(\beta)$$ $$\beta = \beta + v_t$$ Because the effective step is adjusted by the accumulated velocity at every iteration, given each new set of coefficient estimates, the adaptive variant should be more efficient.
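To make the update rule concrete, here is a minimal standalone sketch of the same momentum update on a toy least-squares problem (the toy data and the values of `gamma` and `frac` are illustrative assumptions; "my_glm" with method 2 applies the identical velocity update to the GLM score).

#Minimal sketch: momentum gradient descent on a toy least-squares objective
set.seed(1)
X_toy <- cbind(1, rnorm(50))
y_toy <- X_toy %*% c(1, 2) + rnorm(50)
beta <- rep(0, 2)
v <- 0
gamma <- 0.001
frac <- 0.9
for (i in 1:500) {
  grad <- t(X_toy) %*% (y_toy - X_toy %*% beta)   #gradient (score) of the least-squares objective
  v    <- frac * v + gamma * grad                 #v_t = 0.9 * v_{t-1} + gamma * grad
  beta <- beta + v                                #beta = beta + v_t
}
beta

The accumulated velocity `v` keeps the iterate moving in directions where the gradient is consistent across iterations, which is why this variant typically needs fewer iterations than the constant-step version.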
In some experiments, I noticed that the constant step size seemed to do a poor job of estimating coefficients in logistic models. To make a better comparison, I instead simulate data whose outcomes are drawn from a Poisson distribution, imitating the setting in which we have count data and want to estimate the model coefficients with a GLM. The link function and its inverse are specified in the function calls. The comparison is made between "my_glm" and the built-in function "glm".
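For the Poisson regression case specifically (a standard GLM result, stated here only to motivate the choice `mu_fun = function(eta) exp(eta)`), the log link gives $\mu = \exp(X\beta)$ and the gradient used inside "my_glm" is the score of the Poisson log-likelihood:
$$\nabla_{\beta}\,\ell(\beta) = X^{T}\big(Y - \mu\big), \qquad \mu = \exp(X\beta).$$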
#Check constant step size
#Generate some data
set.seed(11111)
n <- 1000
X <- cbind(rep(1, n), matrix(rnorm(n * 3), ncol = 3))
beta <- c(-5, 0.6, 0.3, 1.3)
Y <- rpois(n, lambda = exp(X %*% beta))
data.pois <- data.frame(cbind(Y, X[, -1]))
form <- Y ~ .
my_beta1 <- my_glm(form, data.pois, mu_fun = function(eta) exp(eta), method = 1)$coefficients
beta_glm <- coef(glm(form, family = "poisson", data.pois))
#Compare
table.1 <- data.frame(cbind(beta_glm, my_beta1))
colnames(table.1) <- c("GLM beta", "beta under constant step size")
table.1
#Check adaptive step size
my_beta2 <- my_glm(form, data.pois, mu_fun = function(eta) exp(eta), method = 2)$coefficients
#Compare
table.2 <- data.frame(cbind(beta_glm, my_beta2))
colnames(table.2) <- c("GLM beta", "beta under adaptive step size")
table.2
#Number of iterations used by each method
my_glm(form, data.pois, mu_fun = function(eta) exp(eta), method = 1)$count
my_glm(form, data.pois, mu_fun = function(eta) exp(eta), method = 2)$count
From the two tables and the two iteration counts above, it is clear that the constant step size needs more iterations before the change in the coefficients falls below the threshold. Moreover, even though both methods "converge" with respect to the stopping criterion (the tolerance), the constant-step version does not give a good estimate of the true coefficients. The problem likely comes from the drawbacks of a constant step size itself, or from the simulated data. From this comparison, the algorithm with the adaptive step size is better.
In this question, I use two functions to carry out the multi-class classification. First, the function "multiclass_logistic" estimates the coefficients for each comparison group, where a comparison group consists of one specific class of the outcome of interest versus all observations that do not belong to that class (one-vs-rest). Under this setting, the function produces a matrix whose rows record the corresponding estimates for the comparison groups. Although this may differ from what formal functions such as "lme" do, these estimated coefficients can easily be used to calculate the predicted probabilities conditional on each observation's covariates (more details later). Because it is awkward to check convergence of a whole coefficient matrix inside the function, I check convergence manually in the code below. In fact, because the Hessian is used to scale each step (a Newton-type update), the betas converge to specific values quickly. The second function, "predict_logistic", converts the estimates into a series of crude predictions. Other high-level assessments of the prediction accuracy are also conducted below.
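Written out explicitly (these formulas simply restate what "multiclass_logistic" and "predict_logistic" compute), the one-vs-rest Newton update for class $j$ and the subsequent probability normalization are:
$$\beta_j \leftarrow \beta_j + \big(X^{T}D_jX\big)^{-1} X^{T}\big(\mathbf{1}\{Y = j\} - p_j\big), \qquad D_j = \mathrm{diag}\big(p_{ij}(1 - p_{ij})\big),$$
$$\hat{P}(Y_i = j) = \frac{\exp(x_i^{T}\beta_j)}{\sum_{k}\exp(x_i^{T}\beta_k)}.$$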
#Call example data from the package
data(penguinsi)
test <- penguinsi
form <- species ~ bill_length_mm + bill_depth_mm
X <- model.matrix(form, test)
Y <- penguinsi$species
beta <- multiclass_logistic(X, Y, maxit = 20)$coefficients
beta
beta2 <- multiclass_logistic(X, Y, maxit = 30)$coefficients
beta2
From the output above, it is clear that the estimates converge in fewer than 20 iterations.
#Prediction accuracy
mean(predict_logistic(X, Y, multiclass_logistic(X, Y, maxit = 23))$fitted.Y == Y)
Using the crude prediction rule (assigning each observation to the class with the highest predicted probability), we actually get a fairly high accuracy.
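As an additional check (not part of the original write-up), the crude predictions could also be cross-tabulated against the observed species; this uses only the functions defined above.

#Optional check: confusion matrix of crude predictions vs. observed species
fit <- predict_logistic(X, Y, multiclass_logistic(X, Y, maxit = 23))
table(predicted = fit$fitted.Y, observed = Y)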
#Marginal probabilities
mean(predict_logistic(X, Y, multiclass_logistic(X, Y, maxit = 23))$p[, 1])
mean(predict_logistic(X, Y, multiclass_logistic(X, Y, maxit = 23))$p[, 2])
mean(predict_logistic(X, Y, multiclass_logistic(X, Y, maxit = 23))$p[, 3])
table(penguinsi$species)
The marginal probabilities computed from the predictions are close to the true proportions of each species.