knitr::opts_chunk$set(echo = TRUE)
library(bench)
library(ggplot2)
library(tidyr)
library(matlib)
# Install the course package from GitHub if it is not already available
is_available <- requireNamespace("stats230.Rpackage", quietly = TRUE)
if (!is_available) {
  library(devtools)
  install_github("jcorrett/stats230.Rpackage")
}
library(stats230.Rpackage)

Problem 1

For this problem, we will analyze heart disease data via logistic regression. To do so, we will use the gradient descent and Newton-Raphson algorithms to fit the logistic regression model to our data set. The functions $\texttt{gd_logistic_regression}$ and $\texttt{nr_logistic_regression}$ that implement these two methods can be found in the GitHub repository for the package "jcorrett/stats230.Rpackage".
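
For reference, a minimal sketch of the two update rules is given below. The helper functions here ($\texttt{p_hat}$, $\texttt{gd_step}$, $\texttt{nr_step}$) are illustrative only and are not the package's implementation:

# Minimal sketch (not the package implementation) of one update of each method
# for maximizing the logistic log-likelihood, with theta = X %*% beta.
p_hat <- function(X, beta) 1 / (1 + exp(-X %*% beta))  # fitted probabilities

gd_step <- function(X, y, beta, alpha) {
  grad <- t(X) %*% (y - p_hat(X, beta))  # score (gradient of the log likelihood)
  beta + alpha * grad                    # fixed step size alpha
}

nr_step <- function(X, y, beta) {
  pr <- as.vector(p_hat(X, beta))
  W <- diag(pr * (1 - pr))               # weights from the observed information
  grad <- t(X) %*% (y - pr)
  beta + solve(t(X) %*% W %*% X, grad)   # Newton-Raphson update
}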

To begin, we load our data set. We omit the non-numeric column "famhist" for simplicity. We also rescale the $X$ data so that each column has mean 0 and variance 1, avoiding scaling-related issues in the logistic regression fit. The code that reads in the data is as follows:

data <- read.table("http://www-stat.stanford.edu/~tibs/ElemStatLearn/datasets/SAheart.data",
                   sep = ",", header = TRUE, row.names = 1)
DF <- data.frame(data)
X <- apply(as.matrix.noquote(DF), 2, as.numeric)  # coerce all columns to numeric
ystr <- "chd"
xstr <- c("sbp","tobacco","ldl","adiposity","typea","obesity","alcohol","age")
#xstr <- c("ldl","age")
y <- X[, ystr]
X <- X[, xstr]
X <- scale(X) #standardize to have 0 mean columns w/ var = 1

With the above data, we then run our models to produce our logistic regression fit, reported as the parameter vector $\beta$, where $\theta = X\beta$ and $$p = \dfrac{1}{1+\exp(-\theta)}$$ in our regression model. For each method, we produce 95% confidence intervals for $\beta$ as follows:

alpha <- 1e-3  # gradient descent step size
tol <- 1e-10   # convergence tolerance

gd_df <- gd_logistic_regression(X, y, alpha, tol)
gd_beta <- gd_df[[1]]     # fitted coefficients
gd_llhood <- gd_df[[2]]   # log likelihood at each iteration
gd_beta_CI <- gd_df[[3]]  # 95% confidence interval half-widths

nr_df <- nr_logistic_regression(X, y, tol)
nr_beta <- nr_df[[1]]
nr_llhood <- nr_df[[2]]
nr_beta_CI <- nr_df[[3]]


gd_results <- cbind(gd_beta - gd_beta_CI, gd_beta + gd_beta_CI)
nr_results <- cbind(nr_beta - nr_beta_CI, nr_beta + nr_beta_CI)
rownames(gd_results) <- xstr
rownames(nr_results) <- xstr
colnames(gd_results) <- colnames(nr_results) <- c("2.5%", "97.5%")
print("Gradient Descent Beta 95% CI:")
gd_results
print("Newton-Raphson Beta 95% CI:")
nr_results
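
The half-widths used above are consistent with Wald-type intervals. A minimal sketch of how such half-widths could be computed from the observed Fisher information at a fitted $\hat\beta$ is given below; whether the package functions use exactly this construction is an assumption, and $\texttt{wald_halfwidth}$ is a hypothetical helper:

# Sketch: 95% Wald half-widths from the inverse observed Fisher information
# (assumed construction; the package functions may differ).
wald_halfwidth <- function(X, beta_hat) {
  pr <- as.vector(1 / (1 + exp(-X %*% beta_hat)))
  I_obs <- t(X) %*% diag(pr * (1 - pr)) %*% X  # observed Fisher information
  qnorm(0.975) * sqrt(diag(solve(I_obs)))      # z * standard errors
}

If this matches the package's construction, $\texttt{wald_halfwidth(X, nr_beta)}$ should be close to $\texttt{nr_beta_CI}$.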

From the printed intervals above, we see nearly identical results, implying that both algorithms converge to the same optimal $\beta$ values. We also see that all confidence intervals are of the same order of magnitude, which is expected given our scaling of the data. To verify convergence, we plot the log likelihood as a function of algorithm iteration as follows:

plot(gd_llhood, type = "b", xlab = "Iteration", ylab = "Log Likelihood",
     main = "Gradient Descent", xlim = c(1, length(gd_llhood)), ylim = c(-350, -270))
plot(nr_llhood, type = "b", xlab = "Iteration", ylab = "Log Likelihood",
     main = "Newton-Raphson", xlim = c(1, length(nr_llhood)), ylim = c(-350, -270))

From the above, we see that both algorithms reach the same maximum log likelihood. However, the Newton-Raphson method converges in significantly fewer iterations than gradient descent. This can be attributed to Newton-Raphson's quadratic local convergence rate, compared to the linear rate of gradient descent. To further see this difference in computational speed, we benchmark each algorithm using $\texttt{bench::mark}$ as follows:

lb_GD <- mark(gd_logistic_regression(X, y, alpha, tol), max_iterations = 100, min_time = Inf)
lb_NR <- mark(nr_logistic_regression(X, y, tol), max_iterations = 100, min_time = Inf)
p1 <- autoplot(lb_GD) + theme(legend.position = "none") +
  labs(color = "Legend", title = "Logistic Regression Benchmarking") +
  theme(plot.title = element_text(hjust = 0.5))
p2 <- autoplot(lb_NR) + theme(legend.position = "none")
gridExtra::grid.arrange(p1, p2, nrow = 2)

We can see from the above nearly an order of magnitude difference in computation time, demonstrating the speed-up attained by using a quadratically convergent algorithm.
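
To quantify this, we can also compute the ratio of median run times directly from the $\texttt{bench}$ results:

# Ratio of median run times (gradient descent relative to Newton-Raphson)
as.numeric(lb_GD$median) / as.numeric(lb_NR$median)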

Problem 2

Next, we are interested in implementing the EM algorithm for the ABO blood type problem. The code for this problem, $\texttt{em_blood}$, can be found in the same GitHub repository as the code for Problem 1. For our initial conditions, we use a randomly chosen vector of probabilities $p = (p_A, p_B, p_O)$ and the input sample vector $$n = (n_A, n_B, n_{AB}, n_O) = (6, 55, 4, 35).$$ We expect our code to converge approximately to the probability distribution $p = (0.1, 0.3, 0.6)$. We run our EM algorithm and print the results as follows:

eps <- runif(2)
eps <- c(eps[1] * eps[2], (1 - eps[1]) * eps[2], 1 - eps[2])  # random probabilities summing to 1
p <- sample(eps)        # random initial guess for (pA, pB, pO)
n <- c(6, 55, 4, 35)    # observed counts (nA, nB, nAB, nO)
tol <- 1e-16
blood_data <- em_blood(n, p, tol)
blood_data[[1]]         # estimated (pA, pB, pO)

From the above, we see that our algorithm is close to our expected result, with slight discrepancies. With a larger sample, we would expect the estimates to be even closer to the true allele frequencies.
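
For reference, a minimal sketch of one EM iteration for this problem under Hardy-Weinberg genotype frequencies is given below; this is illustrative only, the helper $\texttt{em_step}$ is hypothetical, and $\texttt{em_blood}$ may differ in detail:

# Sketch of one EM iteration for ABO allele frequencies (pA, pB, pO), assuming
# Hardy-Weinberg proportions; em_blood may differ in detail.
em_step <- function(n, p) {  # n = (nA, nB, nAB, nO), p = (pA, pB, pO)
  pA <- p[1]; pB <- p[2]; pO <- p[3]
  N <- sum(n)
  # E-step: expected genotype counts given the current allele frequencies
  nAA <- n[1] * pA^2 / (pA^2 + 2 * pA * pO)
  nAO <- n[1] - nAA
  nBB <- n[2] * pB^2 / (pB^2 + 2 * pB * pO)
  nBO <- n[2] - nBB
  # M-step: update allele frequencies from the expected allele counts
  pA_new <- (2 * nAA + nAO + n[3]) / (2 * N)
  pB_new <- (2 * nBB + nBO + n[3]) / (2 * N)
  c(pA_new, pB_new, 1 - pA_new - pB_new)
}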

Problem 3

For this problem, we consider the occasionally dishonest casino hidden Markov model. We wish to implement the Baum-Welch algorithm to estimate the true parameters of this HMM, given as $$P = \begin{bmatrix}0.98 & 0.02 \\ 0.05 & 0.95\end{bmatrix}, \quad E = \begin{bmatrix}\frac{1}{6}&\frac{1}{6}&\frac{1}{6}&\frac{1}{6}&\frac{1}{6}&\frac{1}{6} \\ \frac{1}{10}&\frac{1}{10}&\frac{1}{2}&\frac{1}{10}&\frac{1}{10}&\frac{1}{10}\end{bmatrix},$$ with the initial distribution $v^T = (\frac{1}{2},\frac{1}{2})$. First, we simulate a sample hidden and observed state time series with 100 samples, plotted as follows:

set.seed(1000)  # fix seed for a reproducible example
P <- rbind(c(0.98, 0.02), c(0.05, 0.95))         # transition matrix (state 1 = fair, state 2 = loaded)
E <- rbind(c(1/6, 1/6, 1/6, 1/6, 1/6, 1/6),      # emission probabilities for the fair die
           c(1/10, 1/10, 1/2, 1/10, 1/10, 1/10)) # emission probabilities for the loaded die
v <- c(1/2, 1/2)                                 # initial hidden state distribution

sim_data <- sim_HMM(P, E, v, 100)
x <- sim_data[[1]]  # hidden states
y <- sim_data[[2]]  # observed die rolls
plot(x - 1, xlab = "t", ylab = "Hidden State", main = "HMM Hidden States")
plot(y, xlab = "t", ylab = "Observed State", main = "HMM Observed States")

In the above, the upper state is the loaded die and the lower state is the fair die. We see that we start with a loaded die and then switch two more times. This is harder to see in the observed states, but we do note a higher frequency of 3's during the times when the die is loaded, as expected.
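
We can check this directly by tabulating the simulated rolls within each hidden state (assuming, as in the rows of $E$ above, that state 1 is the fair die and state 2 is the loaded die):

# Observed face frequencies while fair (x == 1) versus loaded (x == 2);
# the loaded die should show face 3 roughly half the time.
prop.table(table(y[x == 1]))
prop.table(table(y[x == 2]))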

Next, before implementing Baum-Welch, we must first implement the forward and backward HMM algorithms to compute marginal probabilities. Code for this is in the GitHub repository. We produce a plot of our marginal probabilities overlaid with our hidden states as follows:

f_data <- forward_HMM(P, E, v, y, length(v))
b_data <- backward_HMM(P, E, v, y, length(v))
a <- f_data[[2]]  # forward probabilities (states x time)
b <- b_data[[2]]  # backward probabilities (states x time)

# Marginal probabilities P(x_t | y): normalize forward*backward column-wise
p_x <- sweep(a * b, 2, colSums(a * b), "/")
plot(x - 1, col = "black", xlab = "t", ylab = "Hidden State", main = "HMM Marginal Probabilities")
lines(p_x[1,], type = "l", col = "blue")
lines(p_x[2,], type = "l", col = "red")

From the above, we see that the blue curve, $P(x_t = \text{Fair}\mid y)$, initially alternates when the die is switched. However, in this example, the curves also switch while the die is loaded, possibly because few 3's were rolled between $t \approx 30$ and $t \approx 50$.
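
A quick check of this explanation is to count how many 3's were actually rolled in that window:

# Number of 3's rolled between t = 30 and t = 50
sum(y[30:50] == 3)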

Now that the forward/backward HMM algorithms are implemented and producing the expected results, we can implement the Baum-Welch algorithm. Code is given in the function $\texttt{bw_HMM}$ in the GitHub repository. To begin, we choose a random $P$ with a minimum probability of $0.9$ on the diagonal, an initial guess reflecting the assumption that the casino does not change dice too regularly. We also use an initial guess of $E_{ij} = \frac{1}{6}$ for all $i,j$, which assumes that all dice are fair. Lastly, we use the true initial distribution as our initial guess, since it is not unreasonable to assume that the initial hidden states are evenly distributed. With these initial conditions, we produce the following results with the Baum-Welch algorithm:

eps <- 0.1*runif(2) #random off-diagonal transition probabilities, at most 0.1

P_guess <- rbind(c(1-eps[1],eps[1]),c(eps[2],1-eps[2]))
E_guess <- matrix(1/6,nrow=2,ncol=6) #initial guess: both dice fair
v_guess <- c(0.5,0.5)
bw_data <- bw_HMM(P_guess,E_guess,v_guess,y,2,6,1e-14)
P_est <- bw_data[[1]]
E_est <- bw_data[[2]]
v_est <- bw_data[[3]]
P_est
E_est
v_est

From the above, we see that our code does a reasonable job of estimating $P$ and $E$, and the estimated $v$ correctly identifies the initial hidden state. However, if we run the algorithm again with a new random initial guess, we see a different result:

eps <- 0.1*runif(2)

P_guess <- rbind(c(1-eps[1],eps[1]),c(eps[2],1-eps[2]))
E_guess <- matrix(1/6,nrow=2,ncol=6)
v_guess <- c(0.5,0.5)
bw_data <- bw_HMM(P_guess,E_guess,v_guess,y,2,6,1e-14)
P_est <- bw_data[[1]]
E_est <- bw_data[[2]]
v_est <- bw_data[[3]]
P_est
E_est
v_est

With the above results, we see that the algorithm converged to a different, incorrect local optimum, in which the fair die is estimated to be loaded towards rolling a 3. This shows that our algorithm is very sensitive to the initial conditions and can easily converge to a local optimum. We can check whether this is corrected by using the true values of $P$, $E$, and $v$ as our initial guesses as follows:

P_guess <- P
E_guess <- E
v_guess <- v
bw_data <- bw_HMM(P_guess,E_guess,v_guess,y,2,6,1e-14)
P_est <- bw_data[[1]]
E_est <- bw_data[[2]]
v_est <- bw_data[[3]]
P_est
E_est
v_est

From the above results, we see that our algorithm comes very close to producing the right answer, verifying that the choice of initial conditions in the previous two runs is what led to the differing results.
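
One way to quantify how close the estimates are is to compare them to the true parameters directly, for example via the largest absolute entrywise differences:

# Largest absolute entrywise differences between the estimates and the truth
max(abs(P_est - P))
max(abs(E_est - E))
max(abs(v_est - v))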

Appendix - GitHub Repository Link

As stated throughout the assignment, all code for this R package can be found in the GitHub repository jcorrett/stats230.Rpackage, available online at https://github.com/jcorrett/stats230.Rpackage.


