```r
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
library(bis557)
```
Assume there are $n$ observations. Then, let $Y$ be an $n \times 1$ matrix, $X$ be an $n \times 2$ matrix, and $\beta$ be a $2 \times 1$ matrix.
Then, we have:
$$
\begin{split}
\hat\beta &= \begin{pmatrix} \hat\beta_0 \\ \hat\beta_1 \end{pmatrix} \\
&= (X^T X)^{-1} X^T Y \\
&= \begin{pmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{pmatrix}^{-1} \begin{pmatrix} \sum y_i \\ \sum x_i y_i \end{pmatrix} \\
&= \frac{1}{n \sum x_i^2 - n^2 \bar{X}^2} \begin{pmatrix} \sum x_i^2 & -\sum x_i \\ -\sum x_i & n \end{pmatrix} \begin{pmatrix} \sum y_i \\ \sum x_i y_i \end{pmatrix} \\
&= \frac{1}{n \sum x_i^2 - n^2 \bar{X}^2} \begin{pmatrix} n \overline{X^2} & -n \bar{X} \\ -n \bar{X} & n \end{pmatrix} \begin{pmatrix} n \bar{Y} \\ \sum x_i y_i \end{pmatrix} \\
&= \frac{1}{\sum x_i^2 - n \bar{X}^2} \begin{pmatrix} \overline{X^2} & -\bar{X} \\ -\bar{X} & 1 \end{pmatrix} \begin{pmatrix} n \bar{Y} \\ \sum x_i y_i \end{pmatrix} \\
&= \frac{1}{\sum x_i^2 - 2\bar{X} \sum x_i + n \bar{X}^2} \begin{pmatrix} n \overline{X^2} \bar{Y} - \bar{X} \sum x_i y_i \\ -n \bar{X} \bar{Y} + \sum x_i y_i \end{pmatrix} \\
&= \frac{1}{\sum (x_i - \bar{X})^2} \begin{pmatrix} \bar{Y}\sum x_i^2 - n\bar{X}^2\bar{Y} + n\bar{X}^2\bar{Y} - \bar{X} \sum x_i y_i \\ \sum x_i y_i - \bar{Y}\sum x_i - \bar{X}\sum y_i + n\bar{X}\bar{Y} \end{pmatrix} \\
&= \frac{1}{\sum (x_i - \bar{X})^2} \begin{pmatrix} \bar{Y}(\sum x_i^2 - n\bar{X}^2) - \bar{X}(-n\bar{X}\bar{Y} + \sum x_i y_i) \\ \sum (x_i - \bar{X})(y_i - \bar{Y}) \end{pmatrix} \\
&= \frac{1}{\sum (x_i - \bar{X})^2} \begin{pmatrix} \bar{Y} \sum (x_i - \bar{X})^2 - \bar{X}\sum (x_i - \bar{X})(y_i - \bar{Y}) \\ \sum (x_i - \bar{X})(y_i - \bar{Y}) \end{pmatrix} \\
&= \begin{pmatrix} \frac{\bar{Y} \sum (x_i - \bar{X})^2}{\sum (x_i - \bar{X})^2} - \bar{X}\frac{\sum (x_i - \bar{X})(y_i - \bar{Y})}{\sum (x_i - \bar{X})^2} \\ \frac{\sum (x_i - \bar{X})(y_i - \bar{Y})}{\sum (x_i - \bar{X})^2} \end{pmatrix} \\
&= \begin{pmatrix} \bar{Y} - \hat\beta_1 \bar{X} \\ \frac{\sum (x_i - \bar{X})(y_i - \bar{Y})}{\sum (x_i - \bar{X})^2} \end{pmatrix}.
\end{split}
$$
Therefore, we have: $$ \hat\beta_1 = \frac{\sum (x_i - \bar{X})(y_i - \bar{Y})}{\sum (x_i - \bar{X})^2}; \hat\beta_0 = \bar{Y} - \hat\beta_1 \bar{X}. $$
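As a sanity check, these closed-form expressions can be computed directly and compared with `lm()`; the choice of predictor and response from `iris` below is purely illustrative.

```r
# Check the closed-form simple-regression estimates against lm()
x <- iris$Petal.Length
y <- iris$Sepal.Length

beta_1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta_0 <- mean(y) - beta_1 * mean(x)

c(beta_0 = beta_0, beta_1 = beta_1)
coef(lm(y ~ x))  # should agree up to floating-point error
```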
The function is called `bis557::ols_gd_hw2b()`. Here, we compare it to the OLS fit from `lm()`.
```r
# Use formula to calculate coefficients and penalty
data(iris)
b_gd <- bis557::ols_gd_hw2b(form = Sepal.Length ~ ., d = iris[, -5],
                            b_0 = rep(1e-9, 4), learn_rate = 2, max_iter = 2e4)
print(cbind("lm()" = lm(Sepal.Length ~ ., iris[, -5])$coefficients,
            "ols_gd_hw2b()" = b_gd$coefficients))
cat("\n")
print(paste0("CV_penalty_MSE = ", sprintf("%.4f", b_gd$CV_penalty_MSE)))
```
Here, we see that the out-of-sample 10-fold CV MSE (reported above as `CV_penalty_MSE`) is estimated to be around $0.16$.
The coefficients estimated by gradient descent differ slightly from the closed-form `lm()` coefficients. Since the least-squares objective is convex with a single global minimum (for a full-rank design), the gap comes from the finite step size and iteration count rather than from local minima: the iterates simply have not fully converged.
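For intuition, a bare-bones gradient-descent loop on the least-squares loss might look like the sketch below; the step size and iteration count are arbitrary illustrations, and this is not the implementation inside `ols_gd_hw2b()`.

```r
# Minimal gradient descent on the least-squares loss (illustrative sketch)
X <- model.matrix(Sepal.Length ~ ., iris[, -5])
y <- iris$Sepal.Length
n <- nrow(X)

beta <- setNames(rep(0, ncol(X)), colnames(X))
learn_rate <- 0.05
for (iter in seq_len(1e5)) {
  grad <- drop(crossprod(X, X %*% beta - y)) / n  # gradient of (1/2n) ||y - X beta||^2
  beta <- beta - learn_rate * grad
}

# With a small enough step size and enough iterations, the iterates
# approach the closed-form OLS solution
cbind(gd = beta, "lm()" = coef(lm(Sepal.Length ~ ., iris[, -5])))
```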
We will use the function `bis557::ridge_hw2c()` for ridge regression, where the penalized objective $L$ is:
$$
L = \frac{1}{2n}||Y - X\beta||_2^2 + \lambda ||\beta||_2^2.
$$
From the textbook, we solve using the formula (which minimizes the objective above up to a rescaling of $\lambda$): $$ \hat\beta_{ridge} = (X^T X + \lambda I_p)^{-1} X^T Y $$
Remember that for SVD, we have $X = U \Sigma V^T$. Then (from the textbook), another way to write the estimated coefficients is: $$ \hat\beta_{ridge} = V \cdot \text{Diag}\left(\frac{\sigma_1}{\sigma_1^2 + \lambda}, \cdots \right) U^T Y $$
We can see that as $\lambda \rightarrow \infty$, $\hat\beta_{ridge} \rightarrow 0$, since each factor $\frac{\sigma_j}{\sigma_j^2 + \lambda} \rightarrow 0$.
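Both expressions for $\hat\beta_{ridge}$, and the shrinkage toward zero, can be checked numerically. The sketch below is a standalone illustration on standardized predictors with a centered response, not the computation performed by `ridge_hw2c()`.

```r
# Ridge coefficients via the direct formula and via the SVD (illustrative sketch)
X <- scale(model.matrix(Sepal.Length ~ . - 1, iris[, -5]))
y <- iris$Sepal.Length - mean(iris$Sepal.Length)
lambda <- 10

# Direct formula: (X'X + lambda * I)^{-1} X'Y
b_direct <- solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))

# SVD formula: V diag(sigma_j / (sigma_j^2 + lambda)) U'Y
s <- svd(X)
b_svd <- s$v %*% ((s$d / (s$d^2 + lambda)) * crossprod(s$u, y))

cbind(direct = drop(b_direct), svd = drop(b_svd))

# As lambda grows, every coefficient shrinks toward zero
sapply(c(0, 1, 100, 1e6), function(lam)
  drop(solve(crossprod(X) + lam * diag(ncol(X)), crossprod(X, y))))
```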
```r
# Show ridge regularization
data(iris)
b_ridge <- bis557::ridge_hw2c(form = Sepal.Length ~ ., d = iris, lambda_val = 10)
print(cbind("lm()" = lm(Sepal.Length ~ ., iris)$coefficients,
            "lam=10" = b_ridge$coefficients))
cat("\n")

# Show that ridge regression works for collinear regression variables
data(lm_patho)
b_patho <- bis557::ridge_hw2c(form = y ~ ., d = lm_patho, lambda_val = 1)
print(cbind("lm()" = lm(y ~ ., lm_patho)$coefficients,
            "lam=1" = b_patho$coefficients))
```
The method for optimizing the ridge parameter $\lambda$ is shown below, using the function `bis557::ridge_hw2d()`.
```r
# Find the optimal lambda over a grid
data(iris)
b_ridge <- bis557::ridge_hw2d(form = Sepal.Length ~ ., d = iris[, -5],
                              lambda_vals = 10^seq(-3, 2, length = 200))
print(b_ridge)
cat("\n")
b_best <- bis557::ridge_hw2c(form = Sepal.Length ~ ., d = iris[, -5],
                             lambda_val = b_ridge$min_lambda)
print(cbind("lm()" = lm(Sepal.Length ~ ., iris[, -5])$coefficients,
            "best_lambda" = b_best$coefficients))
```
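A common way to pick $\lambda$ is to minimize out-of-sample prediction error over a grid. The sketch below uses a single train/test split for illustration; it is not necessarily how `ridge_hw2d()` computes its `min_lambda`.

```r
# Illustrative lambda selection by held-out MSE over a grid
set.seed(1)
d <- iris[, -5]
train <- sample(nrow(d), 100)
X_tr <- model.matrix(Sepal.Length ~ ., d[train, ])
X_te <- model.matrix(Sepal.Length ~ ., d[-train, ])
y_tr <- d$Sepal.Length[train]
y_te <- d$Sepal.Length[-train]

lambdas <- 10^seq(-3, 2, length.out = 50)
mse <- sapply(lambdas, function(lam) {
  b <- solve(crossprod(X_tr) + lam * diag(ncol(X_tr)), crossprod(X_tr, y_tr))
  mean((y_te - X_te %*% b)^2)
})
lambdas[which.min(mse)]  # grid value with the smallest held-out MSE
```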
Let $j$ index a predictor. Here, $Y$ is a column vector of length $n$, $\beta$ is a column vector of length $p$, and $X$ is an $n \times p$ matrix. For notation purposes, let $X_j$ be the $j$-th column of $X$.
Then, we would minimize (using the gradient w.r.t. $\beta$): $$ L(\beta) = \frac{1}{2n} ||Y - X\beta||_2^2 + \lambda ||\beta||_1 = \frac{1}{2n} \left|\left|Y - \sum_{j=1}^{p} \beta_j X_j \right|\right|_2^2 + \lambda \sum_{j=1}^{p} |\beta_j|. $$
So, setting each partial derivative to zero, we have (with separate cases for $\beta_j>0$, $\beta_j<0$, and $\beta_j=0$): $$ 0 = \frac{\partial L}{\partial \beta_j} = \frac{1}{n} \left(X_j^T X_j \beta_j - X_j^T Y\right) + \lambda\cdot\text{sign}(\beta_j). $$
For $\beta_j>0$, we have: $$ \hat\beta_j^{LASSO} = \left(\frac{X_j^T X_j}{n}\right)^{-1} \left(\frac{X_j^T Y}{n} - \lambda\right). $$
For $\beta_j<0$, we have: $$ \hat\beta_j^{LASSO} = \left(\frac{X_j^T X_j}{n}\right)^{-1} \left(\frac{X_j^T Y}{n} + \lambda\right). $$
These both work only when $\left|\frac{X_j^T Y}{n}\right|>\lambda$, so that the signs are consistent (recall that $X_j^T X_j = ||X_j||_2^2 \geq 0$). Otherwise, we have $\hat\beta_j=0$ when $\left|\frac{X_j^T Y}{n}\right|\leq\lambda$.
Therefore, we can say that the solution is: $$ \hat\beta_j^{LASSO} = \text{sign} \left(\frac{X_j^T Y}{n}\right) \cdot \left(\frac{X_j^T X_j}{n}\right)^{-1} \cdot \max\left(\left|\frac{X_j^T Y}{n}\right| - \lambda, 0 \right). $$
Thus, the statement "$\hat\beta_j^{LASSO}=0$ if $|X_j^T Y| \leq n\lambda$" is true.
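As a quick numeric check of this soft-thresholding solution, the helper below (a standalone illustration, not a package function) computes $\hat\beta_j^{LASSO}$ for a single predictor and shows that the estimate is exactly zero once $|X_j^T Y| \leq n\lambda$.

```r
# Soft-thresholding solution for a single predictor (illustrative helper)
lasso_beta_j <- function(x_j, y, lambda) {
  n <- length(y)
  z <- drop(crossprod(x_j, y)) / n                  # X_j' Y / n
  sign(z) * (n / drop(crossprod(x_j))) * max(abs(z) - lambda, 0)
}

set.seed(1)
x <- rnorm(100)
y <- 0.3 * x + rnorm(100)

lasso_beta_j(x, y, lambda = 0.05)  # small penalty: nonzero estimate
lasso_beta_j(x, y, lambda = 10)    # |X_j' Y| <= n * lambda: estimate is exactly 0
```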