```r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
Using the explicit formula for the inverse of a 2-by-2 matrix, write down the least squares estimators for $\hat{\beta_0}$ and $\hat{\beta_1}$.
We assume that we have $n$ pairs of observations $(x_i, y_i)$; then our design matrix is:
$$
X=\begin{bmatrix}
1 & x_1 \\
1 & x_2 \\
\vdots & \vdots \\
1 & x_n
\end{bmatrix}
$$
Then we know:
$$
X^TX=\begin{bmatrix}
n & \Sigma x_i \\
\Sigma x_i & \Sigma x_i^2
\end{bmatrix}
$$
By the formula for the inverse of a 2-by-2 matrix:
$$
(X^TX)^{-1}=\frac{1}{n\Sigma x_i^2-(\Sigma x_i)^2}\begin{bmatrix}
\Sigma x_i^2 & -\Sigma x_i \\
-\Sigma x_i & n
\end{bmatrix}
$$
Then we have:
$$
\hat{\beta} = (X^TX)^{-1}X^TY=\frac{1}{n\Sigma x_i^2-(\Sigma x_i)^2}\begin{bmatrix}
y_1(\Sigma x_i^2-nx_1\bar{X})+y_2(\Sigma x_i^2-nx_2\bar{X})+\dots+y_n(\Sigma x_i^2-nx_n\bar{X}) \\
n(x_1y_1+\dots+x_ny_n)-n\bar{X}(y_1+\dots+y_n)
\end{bmatrix}
$$
By simplifying the vector elements, we have:
$$
\hat{\beta_0}=\frac{\Sigma y_i\Sigma x_i^2-n\bar{X}\Sigma x_iy_i}{n\Sigma x_i^2-(\Sigma x_i)^2}
$$
$$
\hat{\beta_1}=\frac{n\Sigma x_i y_i-\Sigma x_i\Sigma y_i}{n\Sigma x_i^2-(\Sigma x_i)^2}
$$
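As a quick numerical sanity check (not part of the original derivation), the closed-form estimators above can be compared with `lm()` on a single predictor from `iris`; the choice of `Sepal.Width` as the predictor is purely for illustration.

```r
# Check the closed-form estimators derived above against lm() on one predictor.
x <- iris$Sepal.Width
y <- iris$Sepal.Length
n <- length(x)

denom <- n * sum(x^2) - sum(x)^2
beta1_hat <- (n * sum(x * y) - sum(x) * sum(y)) / denom
beta0_hat <- (sum(y) * sum(x^2) - n * mean(x) * sum(x * y)) / denom

c(beta0_hat, beta1_hat)
coef(lm(Sepal.Length ~ Sepal.Width, data = iris))  # should agree
```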
2. Here we implement OLS with gradient descent and compute out-of-sample prediction errors. We first randomly split the data into 80% training and 20% testing sets, then fit OLS on the training data by gradient descent and measure the prediction error on the testing data. The corresponding function is `my_lm_gd_out_of_sample()`:
```r
bis557::my_lm_gd_out_of_sample(Sepal.Length ~ ., iris, eta = 0.05, iters = 100000, seed = 1234)
```
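For reference, the sketch below spells out the procedure described above using a plain fixed-step gradient descent on the mean squared error. It is only an illustration under those assumptions: `my_lm_gd_out_of_sample()` may use a different update rule, step size, or stopping criterion, and the step size here is chosen smaller than the `eta = 0.05` used above so that this simple update converges.

```r
# Minimal sketch: 80/20 split, gradient descent on the training MSE,
# then out-of-sample prediction error on the test set.
set.seed(1234)
form <- Sepal.Length ~ .
n <- nrow(iris)
train_id <- sample(seq_len(n), size = floor(0.8 * n))   # 80% training rows

d_train <- model.frame(form, iris[train_id, ])
d_test  <- model.frame(form, iris[-train_id, ])
X_train <- model.matrix(form, d_train)
X_test  <- model.matrix(form, d_test)
y_train <- d_train[[1]]
y_test  <- d_test[[1]]

beta <- rep(0, ncol(X_train))
eta <- 0.01                                             # illustrative step size
for (i in seq_len(1e5)) {
  grad <- crossprod(X_train, X_train %*% beta - y_train) / length(y_train)
  beta <- beta - eta * grad
}
mean((y_test - X_test %*% beta)^2)                      # out-of-sample MSE
```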
If instead we use the whole dataset to estimate $\beta$ and compute the in-sample prediction error, we have:
```r
betahat <- bis557::my_gd(Sepal.Length ~ ., iris)   # fit on the full dataset
form <- Sepal.Length ~ .
d1 <- model.frame(form, iris)
X <- model.matrix(form, d1)
y_name <- as.character(form)[2]
Y <- matrix(d1[, y_name], ncol = 1)
Yhat <- X %*% betahat                              # fitted values
mean((Y - Yhat)^2)                                 # in-sample prediction error
```
We can see that the in-sample prediction error is smaller than the out-of-sample prediction error, which makes sense because the model is evaluated on the same data it was fit to.
`my_ridge()` works on the iris regression data both with and without collinear variables. We compare our function with the gold standard `lm()`.

(1) When $\lambda=0$, the solutions of `my_ridge()` and `lm()` should be the same:
```r
library(bis557)
bis557::my_ridge(Sepal.Length ~ ., iris[-5], lambda = 0, tol = 1e-8)
lm(Sepal.Length ~ ., iris[-5])$coefficients
```
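For intuition, here is a minimal sketch of the standard closed-form ridge estimator $\hat{\beta}=(X^TX+\lambda I)^{-1}X^TY$, which reduces to OLS at $\lambda=0$. This is only an assumption about what `my_ridge()` computes (the package may, for example, treat the intercept or scaling differently), and the helper name `ridge_sketch()` is purely illustrative.

```r
# Sketch of the closed-form ridge solution beta = (X'X + lambda * I)^{-1} X'y.
ridge_sketch <- function(form, data, lambda) {
  X <- model.matrix(form, model.frame(form, data))
  y <- model.frame(form, data)[[1]]
  beta <- solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
  drop(beta)
}
ridge_sketch(Sepal.Length ~ ., iris[-5], lambda = 0)  # matches lm() when lambda = 0
```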
We can see that they are exactly the same.
(2) As $\lambda$ increases, we expect all the coefficients to shrink towards 0:
```r
lambda1 <- 10^(-3:6)
for (i in seq_along(lambda1)) {
  res <- bis557::my_ridge(Sepal.Length ~ ., iris[-5], lambda = lambda1[i], tol = 1e-8)
  print(res$coefficients)
}
```
We can see that all coefficients shrink towards 0 as $\lambda$ increases from $10^{-3}$ to $10^{6}$.
(3) We add a collinear variable to the dataset and check whether `my_ridge()` still runs:
```r
iris_new <- cbind(iris[, -5], 0.1 * iris$Petal.Width)
bis557::my_ridge(Sepal.Length ~ ., iris_new, lambda = 0, tol = 1e-8)
lm(Sepal.Length ~ ., iris_new)$coefficients
```
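With an exactly collinear column, $X^TX$ is singular at $\lambda=0$, so a plain solve would fail. One common remedy, sketched below, is to work with the SVD of $X$ and drop singular values below a tolerance. This is only a guess at the role of `my_ridge()`'s `tol` argument; note that this approach returns the minimum-norm solution rather than dropping an aliased column as `lm()` does, and the helper name `ridge_svd_sketch()` is hypothetical.

```r
# SVD-based ridge solve that tolerates exact collinearity:
# beta = V diag(d / (d^2 + lambda)) U'y, keeping only singular values above tol.
ridge_svd_sketch <- function(form, data, lambda, tol = 1e-8) {
  X <- model.matrix(form, model.frame(form, data))
  y <- model.frame(form, data)[[1]]
  s <- svd(X)
  keep <- s$d > tol * max(s$d)                       # drop near-zero singular values
  d <- s$d[keep]
  beta <- s$v[, keep, drop = FALSE] %*%
    ((d / (d^2 + lambda)) * crossprod(s$u[, keep, drop = FALSE], y))
  setNames(drop(beta), colnames(X))
}
ridge_svd_sketch(Sepal.Length ~ ., iris_new, lambda = 0)
```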
We can see that the coefficients are quite close to those from `lm()`.
The function `cv_ridge()` implements 10-fold cross-validation and finds the optimal $\lambda$ that minimizes the mean out-of-sample prediction error. I also write an S3 method `predict.my_ridge()` so that `predict()` dispatches on my object class. We can see the plot of out-of-sample error vs. $\lambda$:

```r
bis557::cv_ridge(Sepal.Length ~ ., iris, lambdas = seq(0, 0.1, by = 0.001), seed = 13)
```
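For reference, here is a compact sketch of 10-fold cross-validation over a $\lambda$ grid, written against the plain closed-form ridge solve rather than the package internals. `cv_ridge()` itself may differ in how folds are drawn, how predictions are produced via `predict.my_ridge()`, and in producing a plot; the helper name `cv_ridge_sketch()` is purely illustrative.

```r
# 10-fold CV over a lambda grid; returns the lambda with the smallest
# mean out-of-sample prediction error.
cv_ridge_sketch <- function(form, data, lambdas, k = 10, seed = 13) {
  set.seed(seed)
  X <- model.matrix(form, model.frame(form, data))
  y <- model.frame(form, data)[[1]]
  folds <- sample(rep(seq_len(k), length.out = nrow(X)))   # random fold labels
  cv_err <- sapply(lambdas, function(lambda) {
    mean(sapply(seq_len(k), function(f) {
      tr <- folds != f
      beta <- solve(crossprod(X[tr, ]) + lambda * diag(ncol(X)),
                    crossprod(X[tr, ], y[tr]))
      mean((y[!tr] - X[!tr, , drop = FALSE] %*% beta)^2)   # held-out fold MSE
    }))
  })
  lambdas[which.min(cv_err)]
}
cv_ridge_sketch(Sepal.Length ~ ., iris, lambdas = seq(0, 0.1, by = 0.001))
```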
Suppose $\hat{\beta}_j>0$. Taking the derivative of the LASSO objective $\frac{1}{2n}\lVert Y-X\beta\rVert_2^2+\lambda\lVert\beta\rVert_1$ with respect to $\beta_j$ (the columns of $X$ are taken to be orthonormal here, so that $X_j^TX\beta=\beta_j$) and setting it to zero gives:
$$
\frac{1}{n}\hat{\beta}_{j}-\frac{1}{n}X_{j}^TY+\lambda=0 \Rightarrow \hat{\beta}_{j}=X_{j}^TY-n\lambda > 0
$$
Suppose instead $\hat{\beta}_j<0$. Taking the derivative in the same way gives:
$$
\frac{1}{n}\hat{\beta}_{j}-\frac{1}{n}X_{j}^TY-\lambda=0 \Rightarrow \hat{\beta}_{j}=n\lambda+X_{j}^TY < 0
$$
If $|X_j^TY| \leq n\lambda$, then $X_{j}^TY-n\lambda \leq 0$ must hold, which contradicts the case $\hat{\beta}_j > 0$.
Similarly, if $|X_j^TY| \leq n\lambda$, we have $-n\lambda \leq X_j^TY \leq n\lambda$, so $n\lambda+X_j^TY \geq 0$ must hold, which contradicts the case $\hat{\beta}_j < 0$.
Based on these two contradictions, the only remaining possibility is $\hat{\beta}_j=0$. Hence, we have proved that when $|X_j^TY| \leq n\lambda$, $\hat{\beta}_{j}^{LASSO}$ must be 0.
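Combining the three cases, and under the same assumptions as above, the coordinate-wise LASSO solution takes the familiar soft-thresholding form, which makes the zero region $|X_j^TY| \leq n\lambda$ explicit:
$$
\hat{\beta}_{j}^{LASSO}=\operatorname{sign}(X_j^TY)\,\big(|X_j^TY|-n\lambda\big)_{+}
$$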