```{r}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
```{r}
library(bis557)
devtools::load_all()
```
Denote the data points as $(x_i, y_i)$ for $i = 1, \dots, n$. The design matrix $X$ is then

$$
X = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix},
$$

so that

$$
X^TX = \begin{pmatrix} n & \sum_i x_i \\ \sum_i x_i & \sum_i x_i^2 \end{pmatrix}
$$

with inverse

$$
(X^TX)^{-1} = \frac{1}{n\sum_i x_i^2 - (\sum_i x_i)^2}\begin{pmatrix} \sum_i x_i^2 & -\sum_i x_i \\ -\sum_i x_i & n \end{pmatrix}.
$$

The estimator $\hat{\beta}$ is then given by

$$
\begin{split}
\hat{\beta} = (X^TX)^{-1}X^Ty
&= \frac{1}{n\sum_i x_i^2 - (\sum_i x_i)^2}
\begin{pmatrix} \sum_i x_i^2 & -\sum_i x_i \\ -\sum_i x_i & n \end{pmatrix}
\begin{pmatrix} 1 & \dots & 1 \\ x_1 & \dots & x_n \end{pmatrix}
\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} \\
&= \frac{1}{n\sum_i x_i^2 - (\sum_i x_i)^2}
\begin{pmatrix} \sum_i x_i^2 & -\sum_i x_i \\ -\sum_i x_i & n \end{pmatrix}
\begin{pmatrix} \sum_i y_i \\ \sum_i x_iy_i \end{pmatrix} \\
&= \frac{1}{n\sum_i x_i^2 - (\sum_i x_i)^2}
\begin{pmatrix} (\sum_i x_i^2)(\sum_i y_i) - (\sum_i x_i)(\sum_i x_iy_i) \\ -(\sum_i x_i)(\sum_i y_i) + n\sum_i x_iy_i \end{pmatrix} \\
&= \frac{1}{n\sum_i (x_i-\bar{x})^2}
\begin{pmatrix} (\sum_i x_i^2)(\sum_i y_i) - (\sum_i x_i)(\sum_i x_iy_i) \\ n\sum_i (x_i-\bar{x})(y_i-\bar{y}) \end{pmatrix},
\end{split}
$$

where the last step uses $n\sum_i x_i^2 - (\sum_i x_i)^2 = n\sum_i (x_i-\bar{x})^2$ and $n\sum_i x_iy_i - (\sum_i x_i)(\sum_i y_i) = n\sum_i (x_i-\bar{x})(y_i-\bar{y})$. We therefore have

$$
\hat{\beta}_0 = \frac{(\sum_i x_i^2)(\sum_i y_i) - (\sum_i x_i)(\sum_i x_iy_i)}{n\sum_i (x_i-\bar{x})^2}
\qquad\text{and}\qquad
\hat{\beta}_1 = \frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{\sum_i (x_i-\bar{x})^2}.
$$
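As a quick sanity check, these closed-form expressions can be evaluated directly on a small simulated dataset and compared with lm(); the snippet below is purely illustrative and not part of the package.

```r
# Illustrative check of the closed-form estimates derived above (not part of
# the package): compare them with the coefficients returned by lm().
set.seed(1)
x <- rnorm(20)
y <- 2 + 3 * x + rnorm(20)
n <- length(x)

beta1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0_hat <- (sum(x^2) * sum(y) - sum(x) * sum(x * y)) / (n * sum((x - mean(x))^2))

c(beta0_hat, beta1_hat)
coef(lm(y ~ x))  # should agree with the values above
```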
Here we compare the parameters fitted by gradient_descent.R and gradient_descent_outofsample.R, and notice that the results are close but not identical:
```{r}
data(iris)
fit_gradient_descent <- gradient_descent(Sepal.Length ~ Petal.Width, iris)
print(fit_gradient_descent$coefficients)
fit_gradient_descent_outofsample <- gradient_descent_outofsample(Sepal.Length ~ Petal.Width, iris)
print(fit_gradient_descent_outofsample$coefficients)
```
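For intuition, a bare-bones gradient-descent fit of the same model might look like the sketch below. The step size and iteration count are assumptions chosen for illustration; they are not the settings used inside gradient_descent() or gradient_descent_outofsample().

```r
# A minimal gradient-descent sketch for the same regression (illustrative only;
# the step size and iteration count are assumptions, not the package's settings).
X <- model.matrix(Sepal.Length ~ Petal.Width, iris)
y <- iris$Sepal.Length
beta <- rep(0, ncol(X))
eta <- 0.01                                            # assumed step size
for (i in seq_len(10000)) {
  grad <- -2 * crossprod(X, y - X %*% beta) / nrow(X)  # gradient of the MSE
  beta <- beta - eta * grad
}
drop(beta)  # close to coef(lm(Sepal.Length ~ Petal.Width, iris))
```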
Here a toy dataset is created, and near-collinearity between the first two predictors is added by hand.
```{r}
n <- 500
p <- 5
beta <- c(1, 2, -1, 0, -2)                # the actual coefficients
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
X[, 1] <- X[, 1] * 0.05 + X[, 2] * 0.95   # make the 1st and 2nd columns nearly collinear
y <- X %*% beta + rnorm(n)
fit_my_ridge <- my_Ridge(X, y, lambda = 1.0)
print(fit_my_ridge$coefficients)          # print the fitted coefficients
```
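For reference, ridge regression has the closed-form solution $(X^TX + \lambda I)^{-1}X^Ty$, so the output of my_Ridge() can be compared against a direct computation. The sketch below assumes that my_Ridge() penalizes all coefficients equally, fits no intercept, and uses the same scaling of lambda; these are assumptions made for illustration, not facts about the package.

```r
# Closed-form ridge solution for comparison (a sketch; assumes my_Ridge()
# penalizes all coefficients equally, fits no intercept, and scales lambda
# the same way -- assumptions, not facts about the package).
ridge_closed_form <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}
ridge_closed_form(X, y, lambda = 1.0)
```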
Here cross-validation is used to select the best lambda for the my_Ridge function:
```{r}
n <- 500
p <- 5
beta <- c(1, 2, -1, 0, -2)
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
X[, 1] <- X[, 1] * 0.1 + X[, 2] * 0.9     # again make the 1st and 2nd columns nearly collinear
y <- X %*% beta + rnorm(n)
lambda_optim <- optimal_lambda(X, y, lambda = seq(0.1, 10, by = 0.5), n = 5)
print(lambda_optim)
```
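The sketch below shows what a k-fold cross-validation over a grid of lambda values might look like; cv_ridge_lambda() is a hypothetical helper written only for this illustration and is not the optimal_lambda() implementation.

```r
# A hypothetical k-fold cross-validation over lambda (illustrative sketch;
# not the implementation used by optimal_lambda()).
cv_ridge_lambda <- function(X, y, lambdas, k = 5) {
  folds <- sample(rep(seq_len(k), length.out = nrow(X)))
  cv_mse <- sapply(lambdas, function(lam) {
    mean(sapply(seq_len(k), function(f) {
      train <- folds != f
      b <- solve(crossprod(X[train, ]) + lam * diag(ncol(X)),
                 crossprod(X[train, ], y[train]))
      mean((y[!train] - X[!train, ] %*% b)^2)  # held-out mean squared error
    }))
  })
  lambdas[which.min(cv_mse)]                   # lambda with the smallest CV error
}
cv_ridge_lambda(X, y, lambdas = seq(0.1, 10, by = 0.5))
```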
Assume that the columns of the design matrix are orthonormal, i.e. $X^TX = I$. The LASSO objective can then be expanded into a sum of cost functions, one for each predictor:

$$
\begin{aligned}
f(\beta) &= \frac{1}{2n}\|Y - X\beta\|_2^2 + \lambda\|\beta\|_1 \\
&= \frac{1}{2n}\left(Y^TY + \beta^TX^TX\beta - 2\beta^TX^TY\right) + \lambda\|\beta\|_1 \\
&= \frac{1}{2n}Y^TY + \sum_j\left(\frac{1}{2n}\beta_j^2 - \frac{1}{n}\beta_jX_j^TY + \lambda|\beta_j|\right).
\end{aligned}
$$

When $\beta_j \geq 0$, the cost function associated with predictor $j$ is given by
$$ f_j(\beta_j)=\frac{1}{2n}\beta_j^2+\beta_j(\lambda-\frac{1}{n}X_j^TY) $$
Setting its derivative to zero gives $\beta_j = X_j^TY - n\lambda$, which is nonnegative exactly when $\lambda \leq \frac{1}{n}X_j^TY$. If instead $\lambda \geq \frac{1}{n}X_j^TY$, the minimum over $\beta_j \geq 0$ is attained at $\beta_j = 0$.
Similarly, when $\beta_j \leq 0$ the cost function is $f_j(\beta_j) = \frac{1}{2n}\beta_j^2 + \beta_j(-\lambda - \frac{1}{n}X_j^TY)$, and setting its derivative to zero gives $\beta_j = X_j^TY + n\lambda$, which is nonpositive exactly when $\lambda \leq -\frac{1}{n}X_j^TY$. If instead $\lambda \geq -\frac{1}{n}X_j^TY$, the minimum over $\beta_j \leq 0$ is attained at $\beta_j = 0$.
Taking these two scenarios together, we conclude that if $|X_j^TY| \leq n\lambda$, then $\widehat{\beta}_j^{\text{LASSO}}$ must be zero.
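To see this numerically, the sketch below soft-thresholds $X_j^TY$ at $n\lambda$ under the orthonormal-design assumption above; soft_threshold() is a helper defined only for this illustration and is not a bis557 function.

```r
# Numerical illustration of the conclusion above, assuming an orthonormal
# design (X^T X = I); soft_threshold() is defined here only for illustration.
set.seed(42)
n <- 100
p <- 4
X <- qr.Q(qr(matrix(rnorm(n * p), n, p)))  # columns of X are orthonormal
beta <- c(3, 0.05, -2, 0)
y <- X %*% beta + rnorm(n, sd = 0.1)

lambda <- 0.005
soft_threshold <- function(z, t) sign(z) * pmax(abs(z) - t, 0)
beta_lasso <- soft_threshold(crossprod(X, y), n * lambda)

# Coefficients with |X_j^T Y| <= n * lambda come out exactly zero:
cbind(abs(crossprod(X, y)), n * lambda, beta_lasso)
```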