```r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  dev = "png"
)
library(bis557)
```
Assume there are $n$ observations. Let
$$
\boldsymbol{\beta} = \begin{bmatrix}\beta_0 \\ \beta_1 \end{bmatrix},\quad
\boldsymbol{X} = \begin{bmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix},\quad
\boldsymbol{Y} = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}.
$$
Then we can rewrite the model as $\boldsymbol{Y} = \boldsymbol{X} \boldsymbol{\beta}$. As we know from lecture, $\boldsymbol{\hat{\beta}} = (\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{Y}$.

$$
\begin{aligned}
\boldsymbol{X}^T\boldsymbol{X}
&= \begin{bmatrix} 1 & \dots & 1 \\ x_1 & \dots & x_n \end{bmatrix}
   \begin{bmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \\
&= \begin{bmatrix} n & \sum_{i=1}^{n}x_i \\ \sum_{i=1}^{n}x_i & \sum_{i=1}^{n}x_i^2 \end{bmatrix}
\end{aligned}
$$
Then
$$
\begin{aligned}
(\boldsymbol{X}^T\boldsymbol{X})^{-1}
&= \frac{1}{n\sum_{i=1}^{n}x_i^2 - (\sum_{i=1}^{n}x_i)^2}
\begin{bmatrix}
\sum_{i=1}^{n}x_i^2 & -\sum_{i=1}^{n}x_i \\
-\sum_{i=1}^{n}x_i & n
\end{bmatrix}
\end{aligned}
$$
Since
$$
\begin{aligned}
\boldsymbol{X}^T \boldsymbol{Y}
&= \begin{bmatrix} 1 & \dots & 1 \\ x_1 & \dots & x_n \end{bmatrix}
   \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} \\
&= \begin{bmatrix} \sum_{i=1}^{n}y_i \\ \sum_{i=1}^{n}x_iy_i \end{bmatrix}
\end{aligned}
$$
then
$$
\begin{aligned}
\begin{bmatrix}\hat{\beta}_0 \\ \hat{\beta}_1 \end{bmatrix}
&= \frac{1}{n\sum_{i=1}^{n}x_i^2 - (\sum_{i=1}^{n}x_i)^2}
\begin{bmatrix}
\sum_{i=1}^{n}x_i^2 & -\sum_{i=1}^{n}x_i \\
-\sum_{i=1}^{n}x_i & n
\end{bmatrix}
\begin{bmatrix} \sum_{i=1}^{n}y_i \\ \sum_{i=1}^{n}x_iy_i \end{bmatrix} \\
&= \frac{1}{n\sum_{i=1}^{n}x_i^2 - (\sum_{i=1}^{n}x_i)^2}
\begin{bmatrix}
\sum_{i=1}^{n}x_i^2\sum_{i=1}^{n}y_i - \sum_{i=1}^{n}x_i \sum_{i=1}^{n}x_iy_i \\
n\sum_{i=1}^{n}x_iy_i - \sum_{i=1}^{n}x_i\sum_{i=1}^{n}y_i
\end{bmatrix}
\end{aligned}
$$
So
$$\hat{\beta}_1 = \frac{n\sum_{i=1}^{n}x_iy_i - \sum_{i=1}^{n}x_i\sum_{i=1}^{n}y_i}{n\sum_{i=1}^{n}x_i^2 - (\sum_{i=1}^{n}x_i)^2} = \frac{\sum_{i=1}^{n}x_iy_i - \bar{y}\sum_{i=1}^{n}x_i}{\sum_{i=1}^{n}x_i^2 - \bar{x}\sum_{i=1}^{n}x_i}$$
and
$$\hat{\beta}_0 = \frac{\bar{y}\sum_{i=1}^{n}x_i^2 - \bar{x}\sum_{i=1}^{n}x_iy_i}{\sum_{i=1}^{n}x_i^2 - \bar{x}\sum_{i=1}^{n}x_i}.$$
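As a quick sanity check (not part of the original derivation), the closed-form expressions above can be compared against lm() on a small simulated data set; the simulation below is purely illustrative:

```r
# Numerical check of the closed-form estimates (illustrative only)
set.seed(42)
n <- 50
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)

# Slope and intercept from the formulas derived above
beta1_hat <- (sum(x * y) - mean(y) * sum(x)) / (sum(x^2) - mean(x) * sum(x))
beta0_hat <- (mean(y) * sum(x^2) - mean(x) * sum(x * y)) /
  (sum(x^2) - mean(x) * sum(x))

# Should agree with lm() up to numerical precision
c(beta0_hat, beta1_hat)
coef(lm(y ~ x))
```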
We use k-fold cross-validation to assess the out-of-sample accuracy. my_grad_descent() is trained on $k-1$ folds of the data and then predicts on the remaining fold (i.e. the test data). The sum of squared errors (SSE) is calculated for each fold, and the mean of these fold-wise errors (MSE) is calculated across the k folds. The MSE is treated as the out-of-sample accuracy. Here is the test code for this cross-validation function:
```r
data(iris)
cv_fit <- cv_my_grad_descent(form = Sepal.Length ~ ., data = iris,
                             nfolds = 10, lambda = 0.0001)
cv_fit
```
Here are the results from the same k-fold cross-validation of the OLS model:
```r
library(rsample)  # folds
library(foreach)

folds <- vfold_cv(iris, v = 10)
form <- Sepal.Length ~ .

SSE <- foreach(fold = folds$splits, .combine = c) %do% {
  fit <- lm(formula = form, data = analysis(fold))
  sum(as.vector(assessment(fold)[, as.character(form)[2]] -
                  as.vector(predict(fit, assessment(fold))))^2)
}

lm_rslt <- list(SSE = SSE, MSE = mean(SSE))
lm_rslt
```
Comparing the two methods above, we find that the out-of-sample accuracies are very similar.
```r
dt <- iris
dt$new_Petal.Width <- 5 * dt$Petal.Width  # create a collinear variable
ridge_fit <- my_lm_ridge(form = Sepal.Length ~ Species + Sepal.Width + Petal.Width +
                           Petal.Length + new_Petal.Width,
                         data = dt, lambda = 1.5)
ridge_fit$coefficients
```
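For comparison (this check is not part of the original analysis), an ordinary least squares fit on the same design is rank-deficient because new_Petal.Width is an exact multiple of Petal.Width, so lm() drops the aliased column and reports an NA coefficient for it, whereas the ridge fit above returns finite, shrunken coefficients for all terms:

```r
# OLS on the same design: the perfectly collinear column is aliased,
# so lm() reports an NA coefficient for new_Petal.Width
ols_fit <- lm(Sepal.Length ~ Species + Sepal.Width + Petal.Width +
                Petal.Length + new_Petal.Width, data = dt)
coef(ols_fit)
```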
```r
library(purrr)
library(tibble)
library(ggplot2)

data <- iris
form <- Sepal.Length ~ Species + Sepal.Width + Petal.Width + Petal.Length
lambdas <- seq(0, 2, by = 0.01)
folds <- vfold_cv(data)

resids <- foreach(i = seq_along(lambdas)) %do% {
  foreach(fold = folds$splits, .combine = c) %do% {
    fit <- my_lm_ridge(form, analysis(fold), lambdas[i])
    as.vector(assessment(fold)[, as.character(form)[2]] -
                as.vector(predict(fit, assessment(fold))))
  }
}

MSE <- unlist(lapply(resids, function(x) mean(x^2)))
rslt <- data.frame(lambda = lambdas, MSE = MSE)

ggplot(rslt, aes(x = lambda, y = MSE)) +
  geom_line()

optimal_lambda <- lambdas[which.min(MSE)]
optimal_lambda
```
I use cross-validation to compute the MSE for each value of lambda, and the lambda with the smallest MSE is taken as the optimal lambda. The plot of MSE versus lambda above shows where this minimum occurs.
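As a possible follow-up (not part of the original code), the ridge model could be refit on the full data at the selected lambda; this sketch reuses the same my_lm_ridge() call as above:

```r
# Refit the ridge model on the full data at the cross-validated lambda
# (illustrative sketch; same function and objects as in the chunk above)
final_fit <- my_lm_ridge(form, data = data, lambda = optimal_lambda)
final_fit$coefficients
```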
We assume there are $p$ covariates and $X^TX = I_p$. We can rewrite the loss function as
$$
\begin{aligned}
L &= \frac{1}{2n} \|Y - X\beta\|_2^2 + \lambda \|\beta\|_1 \\
&= \frac{1}{2n}\left(Y^TY + \beta^TX^TX\beta - 2\beta^TX^TY\right) + \lambda \|\beta\|_1 \\
&= \frac{1}{2n}Y^TY + \frac{1}{2n}\sum_{j=1}^{p}\left(\beta_j^2 - 2\beta_jX_j^TY + 2n\lambda |\beta_j|\right) \\
&= \frac{1}{2n}Y^TY + \frac{1}{2n}\sum_{j=1}^{p}f(\beta_j)
\end{aligned}
$$
So we can minimize each $f(\beta_j)$ independently.
1) If $\beta_j = 0$, then $f(\beta_j) = 0$.
2) If $\beta_j > 0$, then $f(\beta_j) = \beta_j^2 - 2\beta_jX_j^TY + 2n\lambda \beta_j$, and setting $\frac{\partial f}{\partial \beta_j} = 2\beta_j - 2X_j^TY + 2n\lambda = 0$ gives $\beta_j = X_j^TY - n\lambda$.
3) If $\beta_j < 0$, then $f(\beta_j) = \beta_j^2 - 2\beta_jX_j^TY - 2n\lambda \beta_j$, and setting $\frac{\partial f}{\partial \beta_j} = 2\beta_j - 2X_j^TY - 2n\lambda = 0$ gives $\beta_j = X_j^TY + n\lambda$.
Since $f(\beta_j)$ should be minimized: when $\beta_j \ge 0$, if $X_j^TY < n\lambda$ then $\beta_j$ must be 0; similarly, when $\beta_j \le 0$, if $X_j^TY > -n\lambda$ then $\beta_j$ must be 0. As a result, if $|X_j^TY| < n\lambda$, then $\hat{\beta}_j = 0$.
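A small numerical check of this result (not part of the original derivation): minimizing $f(\beta_j)$ directly with optimize() should return 0 whenever $|X_j^TY| < n\lambda$, and $X_j^TY \mp n\lambda$ otherwise. The values of $n$, $\lambda$, and $X_j^TY$ below are arbitrary choices for illustration.

```r
# Numerical check of the soft-thresholding result (illustrative only)
# f(b) = b^2 - 2*b*xty + 2*n*lambda*|b|, minimized over b
n <- 100
lambda <- 0.1  # so n * lambda = 10

check_case <- function(xty) {
  f <- function(b) b^2 - 2 * b * xty + 2 * n * lambda * abs(b)
  numeric_min <- optimize(f, interval = c(-100, 100))$minimum
  closed_form <- sign(xty) * max(abs(xty) - n * lambda, 0)
  # the two entries should match up to the tolerance of optimize()
  c(numeric = numeric_min, closed_form = closed_form)
}

check_case(xty = 5)    # |X_j^T Y| < n*lambda, so beta_j = 0
check_case(xty = 25)   # beta_j = 25 - 10 = 15
check_case(xty = -25)  # beta_j = -25 + 10 = -15
```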