library(bis557)
#Fit a GLM by first-order gradient descent.
#  method = 1: constant step size; method = 2: adaptive step size (momentum)
my_glm <- function(form, data, mu_fun, method, gamma=0.0001, maxit = 1000, tol = 1e-6){

  #Extract the independent variables
  X <- model.matrix(form, data)
  #Create a dataset dropping NAs in regressors, preparing for extracting a Y vector with proper length
  data_no_na <- model.frame(form, data)
  #Identify the name of dependent variable
  y_name <- as.character(form)[2]
  #Extract a vector of response variable
  Y <- matrix(data_no_na[[y_name]], ncol = 1)

  #Initialize coefficients, the record of squared step sizes, and the iteration counter
  beta <- matrix(0, nrow = ncol(X), ncol = 1)
  beta_diff <- c()
  count <- 0
  #Method 1: constant step size
  if (method==1) {
    for (i in 1:maxit){
      beta_old <- beta
      eta <- X %*% beta_old
      mu <- mu_fun(eta)

      #Gradient (score) of the log-likelihood
      grad <- t(X) %*% (Y-mu)
      beta <- beta_old + gamma*grad
      beta_diff <- c(beta_diff, sum((beta - beta_old)^2))
      count <- count+1

      #Stop once the squared update falls below the tolerance
      if (tail(beta_diff, 1) <= tol){
        break
      }
    }
  }
  #Method 2: adaptive step size (momentum)
  if (method==2) {
    #Initialize the velocity term and set the momentum fraction
    v_old <- 0
    frac <- 0.9

    for (i in 1:maxit){
      beta_old <- beta
      eta <- X %*% beta_old
      mu <- mu_fun(eta)

      grad <- t(X) %*% (Y-mu)

      v_new <- frac*v_old + gamma*grad
      beta <- beta_old + v_new
      beta_diff <- c(beta_diff, sum((beta - beta_old)^2))
      v_old <- v_new
      count <- count+1

      if (tail(beta_diff, 1) <= tol){
        break
      }
    }
  }
  list(coefficients=beta, beta_diff=beta_diff, count=count)
}
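A minimal usage sketch on simulated logistic data (the data frame, seed, and coefficients below are made up purely for illustration; the logistic inverse link is passed in as mu_fun):

set.seed(1)
df <- data.frame(x = rnorm(200))
df$y <- rbinom(200, 1, 1/(1 + exp(-(0.5 + 1.5*df$x))))
fit <- my_glm(y ~ x, df, mu_fun = function(eta) 1/(1 + exp(-eta)), method=2)
fit$coefficients  #should be roughly c(0.5, 1.5), up to sampling noise
fit$count         #iterations used before the tolerance was met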


#One-vs-rest multiclass logistic regression fit by Newton-Raphson updates
multiclass_logistic <- function(X, Y, maxit=100){
  #Number of classes
  type <- length(unique(Y))
  #Initialize coefficient matrices (one row per class)
  beta <- matrix(0, nrow = type, ncol = ncol(X))
  beta_old <- matrix(0, nrow = type, ncol = ncol(X))

  for(i in 1:maxit){
    for (j in 1:type){
      #Updates
      beta_old[j,] <- beta[j,]
      #Fitted probabilities for the one-vs-rest problem of class j
      p <- 1 / (1 + exp(- X %*% beta[j,]))
      #Weight matrix and Hessian for the Newton-Raphson step
      D <- diag(as.vector(p*(1-p)))
      XtDX <- crossprod(X, D%*%X)
      #Score: indicator of class j minus the fitted probability
      grad <- t(X) %*% (1*(Y==unique(Y)[j]) - matrix(p, ncol = 1))
      beta[j,] <- beta_old[j,] + solve(XtDX, grad)
    }
  }
  list(coefficients=beta)
}

#Convert one-vs-rest estimates into class probabilities and crude predictions
predict_logistic <- function(X, Y, object){
  #Number of classes
  type <- length(unique(Y))
  #coefficients
  beta <- object$coefficients
  p <- matrix(0, nrow = nrow(X), ncol = type)
  #Shared denominator that normalizes the class probabilities
  denom <- 0
  for (i in 1:type) {
    denom <- denom + exp(X %*% beta[i,])
  }
  for (j in 1:type){
    p[,j] <- exp(X %*% beta[j,]) / denom
  }

  #A crude classification that picks the class with the highest probability
  pick <- apply(p, 1, which.max)
  fitted.Y <- unique(Y)[pick]

  return(list(fitted.Y=fitted.Y, p=p))
}
library(palmerpenguins)
library(missForest)
#Impute the missing values in the penguins data with missForest
penguinsi <- data.frame(missForest(as.data.frame(penguins))$ximp)

Question 1

Our goal is to give an example of a probability vector $p$ and a design matrix $X$ such that the condition number of the linear Hessian $X^TX$ is small (well-conditioned) while the condition number of the logistic Hessian $X^TDX$ is large (ill-conditioned), where $D$ is a diagonal matrix with diagonal entries $p(1-p)$.

To create a large jump in the condition number, we need to force the singular values of the matrix to span a wide range. This is easily achieved by making the diagonal entries of $D$ close to $0$, that is, by pushing the entries of $p$ toward $0$ or $1$. An example follows.

set.seed(1222222)
X <- matrix(rnorm(30,0,1),ncol=3, nrow=10)
X <- cbind(rep(1,10),X)
#Extreme coefficients push the fitted probabilities toward 0 or 1
beta <- rep(200,4)
p <- as.vector(1/(1+exp(-X%*%beta)))

D <- diag(p*(1-p))

#Compare the condition numbers
#XtX
kappa(t(X)%*%X)

#XtDX
kappa(t(X)%*%D%*%X)

In this case the linear Hessian is clearly well-conditioned while the logistic variation is severely ill-conditioned.
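To see where the large condition number comes from, we can inspect the singular values directly; the 2-norm condition number is the ratio of the largest to the smallest singular value. This check reuses the X and D defined above.

#Singular values of the two Hessians; note the far wider range for XtDX
svd(t(X)%*%X)$d
svd(t(X)%*%D%*%X)$d
#The diagonal entries of D are nearly 0 because p is pushed toward 0 or 1
range(diag(D))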

Question 2

In this question, I write a function called "my_glm" that implements first-order gradient descent in two variants. The first uses a constant step size; the second uses an adaptive step size based on the momentum optimization algorithm described at https://ruder.io/optimizing-gradient-descent/. For the adaptive algorithm (momentum), the updates are $$v_t = 0.9\times v_{t-1} + \gamma\nabla_\beta J(\beta)$$ $$\beta = \beta + v_t$$ Because the step is rescaled by the accumulated gradient history as the coefficient estimates change from iteration to iteration, the adaptive variant should converge more efficiently.
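As a standalone illustration of the momentum update, here is a minimal sketch on a toy one-dimensional objective (the function and all settings are my own choices for demonstration, not part of "my_glm"). Note the sign: "my_glm" ascends the log-likelihood, so it adds the gradient term; this toy example minimizes, so it subtracts.

#Minimize f(b) = (b - 3)^2, whose gradient is 2*(b - 3)
grad_f <- function(b) 2*(b - 3)
b <- 0; v <- 0
gamma <- 0.01  #step size
frac <- 0.9    #momentum fraction, as in the update above
for (i in 1:500){
  v <- frac*v - gamma*grad_f(b)  #decaying accumulation of past gradients
  b <- b + v
}
b  #close to the minimizer, 3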

Through some experimentation, I noticed that the constant step size seemed to do a poor job of estimating coefficients in logistic models. To make a better comparison, I simulate covariates and draw the outcomes from a Poisson generator, imitating the setting where we have count data and want to estimate the model coefficients with a GLM. The inverse link function is supplied to "my_glm" through the mu_fun argument. The comparison is made between "my_glm" and the built-in function "glm".
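For the Poisson model with the canonical log link, $\mu = \exp(\eta) = \exp(X\beta)$, so the gradient of the log-likelihood takes exactly the form used in "my_glm": $$\nabla_\beta \ell(\beta) = X^T(Y-\mu)$$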

#Check constant step size
#Generate some data
set.seed(11111)
n <- 1000
X <- cbind(rep(1,n),matrix(rnorm(n*3), ncol = 3))
beta <- c(-5, 0.6, 0.3, 1.3)
Y <- rpois(n, lambda = exp(X%*%beta))
data.pois <- data.frame(cbind(Y,X[,-1]))
form <- Y~.

my_beta1 <- my_glm(form, data.pois, mu_fun = function(eta) exp(eta), method=1)$coefficients
beta_glm <- coef(glm(form, family = "poisson", data.pois))
#Compare
table.1<-data.frame(cbind(beta_glm, my_beta1))
colnames(table.1)<-c("GLM beta", "beta under constant step size")
table.1
#Check adaptive step size
my_beta2 <- my_glm(form, data.pois, mu_fun = function(eta) exp(eta), method=2)$coefficients
#Compare
table.2<-data.frame(cbind(beta_glm, my_beta2))
colnames(table.2)<-c("GLM beta", "beta under adaptive step size")
table.2
#Iteration counts for each method
my_glm(form, data.pois, mu_fun = function(eta) exp(eta), method=1)$count
my_glm(form, data.pois, mu_fun = function(eta) exp(eta), method=2)$count

The two tables and the two iteration counters above show that the constant step size spends more iterations before the update drops below the threshold. Moreover, even though both methods "converge" with respect to the stopping criterion (tolerance), the constant-step variant does not give a good estimate of the true coefficients. I suspect the problem comes from the drawbacks of a constant step size itself, or from the simulated data: with a constant step the stopping rule fires when $\gamma^2\|\nabla_\beta J\|^2 \le$ tol, i.e. while the gradient norm can still be as large as $\sqrt{\text{tol}}/\gamma$, so the iterates can halt far from the optimum. By this comparison, the algorithm with the adaptive step size is better.
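One way to probe this (a hypothetical check, not run above) is to tighten the tolerance and enlarge the iteration budget so the constant-step variant cannot stop while the gradient is still large; if the poor estimates come from early stopping rather than instability, this run should land much closer to the glm coefficients.

my_glm(form, data.pois, mu_fun = function(eta) exp(eta), method=1,
       maxit=1e5, tol=1e-12)$coefficients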

Question 3

In this question, I use two functions to conduct the multi-class classification. First, the function "multiclass_logistic" estimates the coefficients for each comparison group, where a comparison group consists of one specific class of the outcome of interest versus all observations that do not belong to that class (one-vs-rest). Under this setting, the function produces a matrix whose rows record the estimates for the corresponding comparison groups. Although this differs from what formal multinomial routines such as "nnet::multinom" do, these estimates can easily be used to calculate predicted probabilities conditional on each observation's covariates (more details later). Because it is awkward to check convergence for a whole coefficient matrix inside the function, I check convergence manually in the code below; in fact, since the Hessian is used to rescale each update, the betas converge to fixed values quickly. A second function, "predict_logistic", then converts the estimates into a series of crude predictions. Some high-level assessments of the prediction accuracy are also conducted below.
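Concretely, for class $j$ each pass of the inner loop performs one Newton-Raphson update of the one-vs-rest logistic regression, exactly as coded above: $$\beta_j \leftarrow \beta_j + (X^TDX)^{-1}X^T\left(\mathbf{1}\{Y=c_j\} - p\right)$$ where $p$ is the vector of fitted one-vs-rest probabilities for class $c_j$ and $D = \text{diag}\{p(1-p)\}$.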

#Use the imputed penguin data created above
test <- penguinsi
form <- species ~ bill_length_mm + bill_depth_mm
X <- model.matrix(form, test)
Y <- penguinsi$species
beta <- multiclass_logistic(X,Y,maxit=20)$coefficients
beta
beta2 <- multiclass_logistic(X,Y,maxit=30)$coefficients
beta2

From the output above, the estimates are identical after 20 and 30 iterations, so the algorithm converges in fewer than 20 iterations.

#Prediction accuracy
mean(predict_logistic(X,Y,multiclass_logistic(X,Y,maxit=23))$fitted.Y==Y)

Using the crude prediction rule (assigning each observation to the class with the highest predicted probability), we actually get a fairly high accuracy.
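Beyond the overall accuracy, a cross-tabulation of predicted against true species (reusing one fitted object) shows where any misclassifications occur:

#Confusion table for the crude classifier
fit <- multiclass_logistic(X, Y, maxit=23)
pred <- predict_logistic(X, Y, fit)$fitted.Y
table(predicted = pred, truth = Y)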

#Marginal probabilities
mean(predict_logistic(X,Y,multiclass_logistic(X,Y,maxit=23))$p[,1])
mean(predict_logistic(X,Y,multiclass_logistic(X,Y,maxit=23))$p[,2])
mean(predict_logistic(X,Y,multiclass_logistic(X,Y,maxit=23))$p[,3])
#True class proportions for comparison
prop.table(table(penguinsi$species))

The marginal probabilities computed from the predictions are close to the true proportions of each species.


