```{r}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
Our aim is to minimize the function
$$S=\sum_{i=1}^{n} (y_{i}-\beta_{0}-\beta_{1}x_{i})^2$$ with respect to $\beta_{0}$ and $\beta_{1}$.
Taking derivatives with respect to $\beta_{0}$ and $\beta_{1}$, and setting the derivatives to zero,
$$\frac{\partial S(\beta_{0},\beta_{1})}{\partial\beta_{0}}=-2\sum(y_{i}-\beta_{0}-\beta_{1}x_{i})=0$$
and
$$\frac{\partial S(\beta_{0},\beta_{1})}{\partial\beta_{1}}=-2\sum(y_{i}-\beta_{0}-\beta_{1}x_{i})x_{i}=0$$
we obtain the following two equations (the normal equations):
$$n\beta_{0}+(\sum x_{i})\beta_{1}=\sum y_{i}$$ $$(\sum x_{i})\beta_{0}+(\sum x_{i}^2)\beta_{1}=\sum x_{i}y_{i}$$
Let $\hat{\beta_{0}}$ and $\hat{\beta_{1}}$ denote the solutions for $\beta_{0}$ and $\beta_{1}$ in the normal equations. These solutions are given by,
$$\hat{\beta_{1}}=\frac{\sum(x_{i}-\overline{x})(y_{i}-\overline{y})}{\sum(x_{i}-\overline{x})^2}$$
$$\hat{\beta_{0}}=\overline{y}-\hat{\beta_{1}}\overline{x}$$
where $\overline{y}=\frac{\sum y_{i}}{n}$ and $\overline{x}=\frac{\sum x_{i}}{n}$.
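As a quick numerical illustration of these formulas (a minimal sketch using simulated data that is not part of the derivation above), we can solve the normal equations directly, evaluate the closed-form expressions, and compare both with `lm()`:

```{r}
# Minimal sketch: verify the closed-form solutions on simulated data
# (the data-generating values below are illustrative assumptions)
set.seed(1)
x <- rnorm(50)
y <- 2 + 3 * x + rnorm(50)
n_obs <- length(x)

# Solve the 2x2 normal-equations system
A <- matrix(c(n_obs, sum(x), sum(x), sum(x^2)), nrow = 2)
b <- c(sum(y), sum(x * y))
solve(A, b)

# Closed-form expressions for the solutions
beta1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0_hat <- mean(y) - beta1_hat * mean(x)
c(beta0_hat, beta1_hat)

# Both agree with lm()
coef(lm(y ~ x))
```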
```{r}
# Ordinary least squares via the singular value decomposition of X
casl_ols_svd <- function(X, y) {
  svd_output <- svd(X)
  r <- sum(svd_output$d > .Machine$double.eps)  # numerical rank
  U <- svd_output$u[, 1:r]
  V <- svd_output$v[, 1:r]
  beta <- V %*% (t(U) %*% y / svd_output$d[1:r])
  beta
}

# Simulate a well-conditioned design matrix
set.seed(4)
n <- 1000; p <- 25
beta <- c(1, rep(0, p - 1))
X <- matrix(rnorm(n * p), ncol = p)

# Condition number of X
svals <- svd(X)$d
max(svals) / min(svals)

# Average l2 error of the OLS estimate over N simulated responses
N <- 1e4; l2_errors <- rep(0, N)
for (k in 1:N) {
  y <- X %*% beta + rnorm(n)
  betahat <- casl_ols_svd(X, y)
  l2_errors[k] <- sqrt(sum((betahat - beta)^2))
}
mean(l2_errors)

# Make the first column nearly collinear with the second,
# which greatly increases the condition number
alpha <- 0.001
X[, 1] <- X[, 1] * alpha + X[, 2] * (1 - alpha)
svals <- svd(X)$d
max(svals) / min(svals)

# Repeat the simulation, now solving the normal equations directly
N <- 1e4; l2_errors <- rep(0, N)
for (k in 1:N) {
  y <- X %*% beta + rnorm(n)
  betahat <- solve(crossprod(X), crossprod(X, y))
  l2_errors[k] <- sqrt(sum((betahat - beta)^2))
}
mean(l2_errors)
```
```{r}
# Ridge regression estimate with penalty lambda = 1:
# betahat = (X^T X + I)^{-1} X^T y

# Regenerate the well-conditioned design from before
set.seed(4)
n <- 1000; p <- 25
beta <- c(1, rep(0, p - 1))
X <- matrix(rnorm(n * p), ncol = p)
svals <- svd(X)$d
max(svals) / min(svals)

# Average l2 error of the ridge estimate over N simulated responses
N <- 1e4; l2_errors <- rep(0, N)
for (k in 1:N) {
  y <- X %*% beta + rnorm(n)
  betahat <- solve(crossprod(X) + diag(rep(1, ncol(X)))) %*% t(X) %*% y
  l2_errors[k] <- sqrt(sum((betahat - beta)^2))
}
mean(l2_errors)

# Make the first column nearly collinear with the second again
alpha <- 0.001
X[, 1] <- X[, 1] * alpha + X[, 2] * (1 - alpha)
svals <- svd(X)$d
max(svals) / min(svals)

# Ridge estimate on the ill-conditioned design
N <- 1e4; l2_errors <- rep(0, N)
for (k in 1:N) {
  y <- X %*% beta + rnorm(n)
  betahat <- solve(crossprod(X) + diag(rep(1, ncol(X)))) %*% t(X) %*% y
  l2_errors[k] <- sqrt(sum((betahat - beta)^2))
}
mean(l2_errors)
```
Conclusion Table:
| Regression method | Condition number (original -> collinear $X$) | Mean $\ell_{2}$ error of $\hat{\beta}$ (original -> collinear $X$) |
|-------------------|------------------------------|------------------------------|
| Simple OLS        | 1.319 -> 2121.32             | 0.158 -> 37.284              |
| Ridge regression  | 1.319 -> 2121.32             | 0.158 -> 0.724               |
As the table shows, when the condition number of the design matrix increases from about 1.3 to over 2000, the estimation error of simple OLS increases dramatically, while the error of ridge regression increases only slightly. This demonstrates that ridge regression can improve numerical stability and reduce statistical error.
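A small supporting sketch (not part of the original simulation) looks at the conditioning of the matrix that is actually inverted: ridge regression works with $X^{T}X+\lambda I$ rather than $X^{T}X$. Reusing the ill-conditioned $X$ from the chunk above, and with $\lambda$ values chosen arbitrarily for illustration:

```{r}
# Condition number of the system matrix with and without a ridge penalty
# (lambda values are arbitrary illustrative choices)
XtX <- crossprod(X)  # X is the nearly collinear design from the previous chunk
sapply(c(0, 0.1, 1, 10), function(lambda) {
  kappa(XtX + diag(lambda, ncol(X)), exact = TRUE)
})
```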
Minimizing the given function is equivalent to minimizing
$$f(\beta)=(y-X\beta)^{T}(y-X\beta)+2n\lambda||\beta||_{1}.$$
Assuming the columns of $X$ are orthonormal, so that $X^{T}X=I$, we can expand the function as:
$$
\begin{aligned}
f(\beta) &= y^{T}y-2\beta^{T}X^{T}y+\beta^{T}X^{T}X\beta+2n\lambda||\beta||_{1}\\
&= y^{T}y+\sum_{j}\left(\beta_{j}^{2}-2\beta_{j}X_{j}^{T}y+2n\lambda|\beta_{j}|\right)\\
&= y^{T}y+\sum_{j}f_{j}(\beta_{j}),
\end{aligned}
$$
where $f_{j}(\beta_{j})=\beta_{j}^{2}-2\beta_{j}X_{j}^{T}y+2n\lambda|\beta_{j}|$.
To minimize $f(\beta)$, we can minimize each $f_{j}(\beta_{j})$ independently. The question is therefore: which value of $\beta_{j}$ minimizes $f_{j}(\beta_{j})$?
If $\beta_{j}\ge 0$: $f_{j}(\beta_{j})=\beta_{j}^{2}+\beta_{j}(2n\lambda-2X_{j}^{T}y)$.

If $\beta_{j}\le 0$: $f_{j}(\beta_{j})=\beta_{j}^{2}-\beta_{j}(2n\lambda+2X_{j}^{T}y)$.
Given that $|X_{j}^{T}y|\le n\lambda$, both $2n\lambda-2X_{j}^{T}y$ and $2n\lambda+2X_{j}^{T}y$ are non-negative, so on each side of zero $f_{j}(\beta_{j})$ is an upward-opening parabola whose value increases as $\beta_{j}$ moves away from $0$. A rough plot of $f_{j}(\beta_{j})$ therefore shows the minimum is achieved at $\beta_{j}=0$.
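To make the rough plot concrete, the sketch below plots $f_{j}(\beta_{j})$ for one illustrative pair of values satisfying $|X_{j}^{T}y|\le n\lambda$ (the numbers are assumptions, not taken from any data above):

```{r}
# Plot f_j(beta_j) = beta_j^2 - 2 * beta_j * xty + 2 * nlambda * abs(beta_j)
# for illustrative values with |xty| <= nlambda (both values are assumptions)
xty <- 0.5       # plays the role of X_j^T y
nlambda <- 1     # plays the role of n * lambda
beta_j <- seq(-2, 2, length.out = 401)
f_vals <- beta_j^2 - 2 * beta_j * xty + 2 * nlambda * abs(beta_j)
plot(beta_j, f_vals, type = "l",
     xlab = expression(beta[j]), ylab = expression(f[j](beta[j])))
abline(v = 0, lty = 2)  # the minimum is attained at beta_j = 0
```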
Thus, if $|X_{j}^{T}y|\le n\lambda$, the lasso estimate must satisfy $\hat{\beta_{j}}=0$.
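As a final numerical sanity check (again with assumed values rather than the data above), minimizing $f_{j}$ directly shows that the minimizer sits at $0$ whenever $|X_{j}^{T}y|\le n\lambda$, and moves away from $0$ once that threshold is exceeded:

```{r}
# Numerically minimize f_j for several assumed values of X_j^T y;
# the minimizer is (numerically close to) 0 whenever |X_j^T y| <= n * lambda
nlambda <- 1  # plays the role of n * lambda
f_j <- function(b, xty) b^2 - 2 * b * xty + 2 * nlambda * abs(b)
sapply(c(-0.9, 0.3, 0.9, 1.5), function(xty) {
  optimize(f_j, interval = c(-5, 5), xty = xty)$minimum
})
```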