knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  dev = "png"
)
library(bis557)

Part 1

CASL 2.11 Exercises problem number 5: Consider the simple regression model with only a scalar $x$ and intercept: $y = \beta_0 + \beta_1x$. Using the explicit formula for the inverse of a 2-by-2 matrix, write down the least squares estimators for $\hat{\beta_0}$ and $\hat{\beta_1}$.

Assume there are $n$ observations. Let $\boldsymbol{\beta} = \begin{bmatrix}\beta_0 \\ \beta_1 \end{bmatrix}$, $\boldsymbol{X} = \begin{bmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}$, $\boldsymbol{Y} = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}$. Then we can rewrite the model as $\boldsymbol{Y} = \boldsymbol{X} \boldsymbol{\beta}$. As we know from lecture, $\boldsymbol{\hat{\beta}} = (\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{Y}$.

$$
\begin{aligned}
\boldsymbol{X}^T\boldsymbol{X} &= \begin{bmatrix} 1 & \dots & 1 \\ x_1 & \dots & x_n \end{bmatrix} \begin{bmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \\
&= \begin{bmatrix} n & \sum_{i=1}^{n}x_i \\ \sum_{i=1}^{n}x_i & \sum_{i=1}^{n}x_i^2 \end{bmatrix}
\end{aligned}
$$
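For reference, the next step uses the explicit inverse of a general 2-by-2 matrix:

$$
\begin{bmatrix} a & b \\ c & d \end{bmatrix}^{-1} = \frac{1}{ad - bc}\begin{bmatrix} d & -b \\ -c & a \end{bmatrix}, \qquad ad - bc \neq 0.
$$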

Then
$$
\begin{aligned}
(\boldsymbol{X}^T\boldsymbol{X})^{-1} &= \frac{1}{n\sum_{i=1}^{n}x_i^2 - (\sum_{i=1}^{n}x_i)^2} \begin{bmatrix} \sum_{i=1}^{n}x_i^2 & -\sum_{i=1}^{n}x_i \\ -\sum_{i=1}^{n}x_i & n \end{bmatrix}
\end{aligned}
$$

Since
$$
\begin{aligned}
\boldsymbol{X}^T \boldsymbol{Y} &= \begin{bmatrix} 1 & \dots & 1 \\ x_1 & \dots & x_n \end{bmatrix} \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} \\
&= \begin{bmatrix} \sum_{i=1}^{n}y_i \\ \sum_{i=1}^{n}x_iy_i \end{bmatrix}
\end{aligned}
$$

then
$$
\begin{aligned}
\begin{bmatrix} \hat{\beta_0} \\ \hat{\beta_1} \end{bmatrix} &= \frac{1}{n\sum_{i=1}^{n}x_i^2 - (\sum_{i=1}^{n}x_i)^2} \begin{bmatrix} \sum_{i=1}^{n}x_i^2 & -\sum_{i=1}^{n}x_i \\ -\sum_{i=1}^{n}x_i & n \end{bmatrix} \begin{bmatrix} \sum_{i=1}^{n}y_i \\ \sum_{i=1}^{n}x_iy_i \end{bmatrix} \\
&= \frac{1}{n\sum_{i=1}^{n}x_i^2 - (\sum_{i=1}^{n}x_i)^2} \begin{bmatrix} \sum_{i=1}^{n}x_i^2\sum_{i=1}^{n}y_i - \sum_{i=1}^{n}x_i \sum_{i=1}^{n}x_iy_i \\ n\sum_{i=1}^{n}x_iy_i - \sum_{i=1}^{n}x_i\sum_{i=1}^{n}y_i \end{bmatrix}
\end{aligned}
$$

So $\hat{\beta_1} = \frac{n\sum_{i=1}^{n}x_iy_i - \sum_{i=1}^{n}x_i\sum_{i=1}^{n}y_i}{n\sum_{i=1}^{n}x_i^2 - (\sum_{i=1}^{n}x_i)^2} = \frac{\sum_{i=1}^{n}x_iy_i - \bar{y}\sum_{i=1}^{n}x_i}{\sum_{i=1}^{n}x_i^2 - \bar{x}\sum_{i=1}^{n}x_i}$ and

$\hat{\beta_0} = \frac{\bar{y}\sum_{i=1}^{n}x_i^2 -\bar{x}\sum_{i=1}^{n}x_iy_i}{\sum_{i=1}^{n}x_i^2 - \bar{x}\sum_{i=1}^{n}x_i}$
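As a quick sanity check of these closed-form expressions (an illustrative snippet, not part of the package; the simulated x and y below are made up for this check), they can be compared against coef(lm(y ~ x)):

set.seed(1)
x <- runif(20)
y <- 2 + 3 * x + rnorm(20, sd = 0.1)
# closed-form estimators derived above
b1 <- (sum(x * y) - mean(y) * sum(x)) / (sum(x^2) - mean(x) * sum(x))
b0 <- (mean(y) * sum(x^2) - mean(x) * sum(x * y)) / (sum(x^2) - mean(x) * sum(x))
c(b0, b1)
coef(lm(y ~ x))  # should agree up to floating-point error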

Part 2

Implement a new function fitting the OLS model using gradient descent that calculates the penalty based on the out-of-sample accuracy. Create test code. How does it compare to the OLS model? Include the comparison in your "homework-2" vignette.

We use k-fold cross-validation to measure out-of-sample accuracy: my_grad_descent() is trained on $k-1$ folds of the data and then predicts on the remaining fold (the test data). The sum of squared errors (SSE) is computed for each fold, and the mean of the fold SSEs (reported below as MSE) is treated as the out-of-sample accuracy. Here is the test code for this cross-validation function:

data(iris)

cv_fit <- cv_my_grad_descent(form = Sepal.Length ~ ., data = iris, 
                             nfolds = 10, lambda = 0.0001)
cv_fit

Here are the results from the same 10-fold cross-validation procedure applied to the OLS model fit with lm():

library(rsample) # folds
library(foreach)

folds <- vfold_cv(iris, v = 10)

form <- Sepal.Length ~ .

SSE <- foreach(fold = folds$splits, .combine = c) %do% {
    # fit OLS on the analysis (training) portion of the fold
    fit <- lm(formula = form, data = analysis(fold))
    # out-of-sample residuals on the assessment (test) portion
    resid <- assessment(fold)[, as.character(form)[2]] -
      as.vector(predict(fit, assessment(fold)))
    sum(resid^2)
}
lm_rslt <- list(SSE = SSE, MSE = mean(SSE))
lm_rslt

Comparing the two approaches above, the out-of-sample accuracies (mean fold SSE) are very similar.
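A side-by-side print makes the comparison concrete. This assumes cv_my_grad_descent() returns its mean fold SSE in an element named MSE, mirroring lm_rslt above; if the package stores it under a different name, adjust accordingly:

# side-by-side out-of-sample error estimates
# (cv_fit$MSE is an assumed element name; see note above)
c(gradient_descent = cv_fit$MSE, ols = lm_rslt$MSE)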

Part 3

Implement a ridge regression function taking into account collinear (or nearly collinear) regression variables. Create test code for it. Show that it works in your homework-2 vignette.

dt <- iris
dt$new_Petal.Width <- 5*dt$Petal.Width # create a perfectly collinear variable

ridge_fit <- my_lm_ridge(form = Sepal.Length ~ Species + Sepal.Width + Petal.Width + 
                            Petal.Length + new_Petal.Width, data = dt, 
                         lambda = 1.5)

ridge_fit$coefficients
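As an illustrative check (not part of the package tests), the same formula can be passed to lm(): ordinary least squares drops the perfectly collinear new_Petal.Width term and reports NA for it, whereas the ridge fit above returns a coefficient for every term:

# OLS on the same design: the aliased term comes back as NA
ols_fit <- lm(Sepal.Length ~ Species + Sepal.Width + Petal.Width +
                Petal.Length + new_Petal.Width, data = dt)
coef(ols_fit)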

Part 4

Implement your own method and testing for optimizing the ridge parameter $\lambda$. Show that it works in your homework-2 vignette.

library(purrr)
library(tibble)
library(ggplot2)

data <- iris
form <- Sepal.Length ~ Species + Sepal.Width + Petal.Width + 
                       Petal.Length
lambdas <- seq(0, 2, by = 0.01)
folds <- vfold_cv(data)

# For each candidate lambda, collect the out-of-sample residuals
# across all cross-validation folds.
resids <- foreach(i = seq_along(lambdas)) %do% {
  foreach(fold = folds$splits, .combine = c) %do% {
    fit <- my_lm_ridge(form, analysis(fold), lambdas[i])
    as.vector(assessment(fold)[, as.character(form)[2]] - 
                as.vector(predict(fit, assessment(fold))))
  }
}


MSE <- unlist(lapply(resids, function(x) mean(x^2)))
rslt <- data.frame(lambda = lambdas, MSE = MSE)
ggplot(rslt, aes(x = lambda, y = MSE)) + 
  geom_line()

optimal_lambda <- lambdas[which.min(MSE)]
optimal_lambda

I use cross-validation to estimate the out-of-sample MSE for each candidate lambda and take the lambda with the smallest MSE as the optimal value; the plot of MSE against lambda above shows where this minimum occurs.
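As a natural follow-up (a sketch, not part of the original code), the ridge model can be refit on the full data at the selected lambda:

# refit on all of the data using the cross-validated lambda
final_fit <- my_lm_ridge(form, data, lambda = optimal_lambda)
final_fit$coefficients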

Part 5

Consider the LASSO penalty $\frac{1}{2n} ||Y - X \beta||_2^2 + \lambda ||\beta||_1.$ Show that if $|X_j^TY| \leq n \lambda$, then $\widehat \beta_j^{\text{LASSO}}$ must be zero.

We assume there are $p$ covariates and $X^TX = I_p$. We can rewrite the loss function as:
$$
\begin{aligned}
L &= \frac{1}{2n} ||Y - X \beta||_2^2 + \lambda ||\beta||_1 \\
&= \frac{1}{2n}\left(Y^TY + \beta^TX^TX\beta - 2\beta^TX^TY\right) + \lambda ||\beta||_1 \\
&= \frac{1}{2n}Y^TY + \frac{1}{2n}\sum_{j=1}^{p}\left(\beta_j^2 - 2\beta_jX_j^TY + 2n\lambda |\beta_j|\right) \\
&= \frac{1}{2n}Y^TY + \frac{1}{2n}\sum_{j=1}^{p}f(\beta_j)
\end{aligned}
$$

So we can minimize each $f(\beta_j)$ independently.

1) If $\beta_j = 0$, then $f(\beta_j) = 0$.

2) If $\beta_j > 0$, then $f(\beta_j) = \beta_j^2-2\beta_jX_j^TY+2n\lambda \beta_j$. Setting $\frac{\partial f}{\partial \beta_j} = 2\beta_j-2X_j^TY+2n\lambda = 0$ gives $\beta_j = X_j^TY - n\lambda$.

3) If $\beta_j < 0$, then $f(\beta_j) = \beta_j^2-2\beta_jX_j^TY-2n\lambda \beta_j$. Setting $\frac{\partial f}{\partial \beta_j} = 2\beta_j-2X_j^TY-2n\lambda = 0$ gives $\beta_j = X_j^TY + n\lambda$.

Since $f(\beta_j)$ is to be minimized: in case 2 the stationary point $\beta_j = X_j^TY - n\lambda$ is only positive when $X_j^TY > n\lambda$, so if $X_j^TY \le n\lambda$ the minimum over $\beta_j \ge 0$ is attained at the boundary $\beta_j = 0$. Similarly, in case 3 the stationary point $\beta_j = X_j^TY + n\lambda$ is only negative when $X_j^TY < -n\lambda$, so if $X_j^TY \ge -n\lambda$ the minimum over $\beta_j \le 0$ is again at $\beta_j = 0$. As a result, if $|X_j^TY| \le n\lambda$, then $\hat{\beta_j}^{\text{LASSO}} = 0$.
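Equivalently, the three cases combine into the familiar soft-thresholding form (a standard summary under the orthogonality assumption $X^TX = I_p$, stated here for completeness):

$$
\hat{\beta_j}^{\text{LASSO}} = \operatorname{sign}(X_j^TY)\,\max\left(|X_j^TY| - n\lambda,\ 0\right),
$$

which is zero precisely when $|X_j^TY| \le n\lambda$.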


