```r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(bis557)
```
1. CASL 2.11 Exercise 5. For the simple linear regression model $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, the design matrix and response vector are

$$
X=\left[\begin{array}{cc} 1 & x_{1} \\ 1 & x_{2} \\ \vdots & \vdots \\ 1 & x_{n} \end{array}\right]_{n \times 2},
\qquad
Y=\left[\begin{array}{c} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \end{array}\right]_{n \times 1}
$$

And then

$$
\begin{aligned}
\left[\begin{array}{l} \hat{\beta}_{0} \\ \hat{\beta}_{1} \end{array}\right]
&= \left(X^{\prime} X\right)^{-1} X^{\prime} Y \\
&= \frac{1}{n \sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}}
\left[\begin{array}{cc} \sum_{i=1}^{n} x_{i}^{2} & -n \bar{x} \\ -n \bar{x} & n \end{array}\right]
\left[\begin{array}{c} n \bar{y} \\ \sum_{i=1}^{n} x_{i} y_{i} \end{array}\right] \\
&= \frac{1}{n \sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}}
\left[\begin{array}{c} n \bar{y} \sum_{i=1}^{n} x_{i}^{2}-n \bar{x} \sum_{i=1}^{n} x_{i} y_{i} \\ -n^{2} \bar{x} \bar{y}+n \sum_{i=1}^{n} x_{i} y_{i} \end{array}\right] \\
&= \frac{1}{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}}
\left[\begin{array}{c} \bar{y} \sum_{i=1}^{n} x_{i}^{2}-n \bar{x}^{2} \bar{y}+n \bar{x}^{2} \bar{y}-\bar{x} \sum_{i=1}^{n} x_{i} y_{i} \\ \sum_{i=1}^{n}\left(x_{i} y_{i}-\bar{x} \bar{y}\right) \end{array}\right] \\
&= \frac{1}{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}}
\left[\begin{array}{c} \bar{y} \sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}-\bar{x} \sum_{i=1}^{n}\left(x_{i} y_{i}-\bar{x} \bar{y}\right) \\ \sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right) \end{array}\right] \\
&= \left[\begin{array}{c} \bar{y}-\hat{\beta}_{1} \bar{x} \\ \dfrac{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right)}{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}} \end{array}\right]
\end{aligned}
$$

where the last two steps use $\sum_{i=1}^{n} x_{i}^{2}-n \bar{x}^{2}=\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}$ and $\sum_{i=1}^{n}\left(x_{i} y_{i}-\bar{x} \bar{y}\right)=\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right)$.
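As a quick numerical check of these closed-form expressions, the intercept and slope can be computed directly and compared with `lm()`; the simulated data below is illustrative and not part of the original solution.

```r
# Compare the closed-form simple-regression estimates with lm()
x <- seq(2, 20, 2) + rnorm(10)
y <- seq(10, 100, 10) + rnorm(10)
beta1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0 <- mean(y) - beta1 * mean(x)
c(beta0, beta1)
coef(lm(y ~ x))
```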

2. Implement a new function fitting the OLS model using gradient descent that calculates the penalty based on the out-of-sample accuracy. Create test code. How does it compare to the OLS model? Include the comparison in your "homework-2" vignette.
```r
data(iris)
gradient_descent(Sepal.Length ~ ., iris)
lm(Sepal.Length ~ ., iris)$coefficients
```

The output shows that the gradient-descent coefficients are very close to the OLS coefficients from `lm()`, but not identical: the small differences reflect that gradient descent stops at an approximate solution rather than the exact closed form.
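For reference, here is a minimal sketch of how an OLS fit by gradient descent can be implemented. The packaged `gradient_descent()` may differ (for instance, the assignment asks it to also evaluate out-of-sample accuracy), and the step size `gamma` and iteration count below are illustrative choices, not the package's defaults.

```r
# Minimal gradient-descent OLS sketch (illustrative; not the packaged version)
gd_ols_sketch <- function(formula, data, gamma = 1e-4, n_iter = 1e5) {
  mf <- model.frame(formula, data)
  y  <- model.response(mf)
  X  <- model.matrix(formula, mf)
  beta <- rep(0, ncol(X))
  for (k in seq_len(n_iter)) {
    grad <- -2 * t(X) %*% (y - X %*% beta) / nrow(X)  # gradient of the MSE
    beta <- beta - gamma * grad
  }
  setNames(as.vector(beta), colnames(X))
}
```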

3. Implement a ridge regression function taking into account collinear (or nearly collinear) regression variables. Create test code for it. Show that it works in your homework-2 vignette.
```r
# Note: the packaged version of this function could not be found when
# knitting the vignette, so its code is included here directly.
ridge <- function(X, Y, lambda) {
  # The SVD of X keeps the computation stable for (nearly) collinear columns
  svd_x <- svd(X)
  Sigma <- diag(svd_x$d)
  lambda_I <- diag(rep(lambda, length(svd_x$d)))

  # Ridge solution: beta = V (Sigma^2 + lambda I)^{-1} Sigma U' Y
  beta <- svd_x$v %*% solve(Sigma^2 + lambda_I) %*% Sigma %*% t(svd_x$u) %*% Y

  list(coefficients = beta, lambda = lambda)
}
```

```r
X <- matrix(c(rep(1, 10), seq(2, 20, 2) + rnorm(10)), ncol = 2)
y <- seq(10, 100, 10) + rnorm(10)
ridge(X, y, lambda = 0.01)
```
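To exercise the (nearly) collinear case the question asks about, a check along the following lines could also be run; the duplicated-column design here is illustrative.

```r
# Nearly collinear design: the third column is almost a copy of the second.
# The SVD-based ridge solution stays finite and stable even though
# t(X) %*% X is close to singular.
x2 <- seq(2, 20, 2) + rnorm(10)
Xc <- cbind(1, x2, x2 + rnorm(10, sd = 1e-6))
yc <- seq(10, 100, 10) + rnorm(10)
ridge(Xc, yc, lambda = 0.1)$coefficients
```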
4. Implement your own method and testing for optimizing the ridge parameter $\lambda$. Show that it works in your homework-2 vignette.
```r
optim_lamb <- function(X, y, lambdas, folds) {
  # Randomly assign each observation to one of the folds
  fold <- sample(rep(1:folds, length.out = length(y)))

  # mse[i, j] holds the out-of-sample mean squared error
  # for fold i and penalty lambdas[j]
  mse <- matrix(NA, nrow = folds, ncol = length(lambdas))

  for (i in 1:folds) {
    # Split into training and test data for this fold
    trainX <- X[fold != i, , drop = FALSE]
    trainy <- y[fold != i]
    testX  <- X[fold == i, , drop = FALSE]
    testy  <- y[fold == i]

    # Fit a ridge model on the training fold for every candidate lambda
    b <- sapply(lambdas, function(l) ridge(trainX, trainy, l)$coefficients)

    # Evaluate every fit on the held-out fold
    pred_y <- testX %*% b
    mse[i, ] <- colMeans((pred_y - testy)^2)
  }

  # Return the lambda with the smallest average out-of-sample error
  final_lamb <- lambdas[which.min(colMeans(mse))]
  final_lamb
}
```

```r
X <- matrix(c(rep(1, 10), seq(2, 20, 2) + rnorm(10)), ncol = 2)
y <- seq(10, 100, 10) + rnorm(10)
optim_lamb(X, y, lambdas = 5:10, folds = 5)
```
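The selected value can then be plugged back into the ridge fit on the full data; a possible usage (illustrative) is:

```r
# Refit on all of the data using the cross-validated lambda
best_lambda <- optim_lamb(X, y, lambdas = 5:10, folds = 5)
ridge(X, y, lambda = best_lambda)$coefficients
```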
5. Consider the LASSO penalty
    $$ \frac{1}{2n} ||Y - X \beta||_2^2 + \lambda ||\beta||_1. $$ Show that if $|X_j^TY| \leq n \lambda$, then $\widehat \beta_j^{\text{LASSO}}$ must be zero:

$$ f(\beta)=\frac{1}{2n} \sum_{i=1}^{n}(y_{i}-\sum_{j=1}^{p} x_{ij} \beta_{j})^2+ \lambda \sum_{j=1}^{p}|\beta_{j}| $$

Since $\widehat \beta^{\text{LASSO}}$ minimizes $f(\beta)$, each coordinate $\widehat \beta_j^{\text{LASSO}}$ must satisfy (treating the $j$-th coordinate separately, wherever the derivative exists):

$$ \frac{d f(\beta)}{d \beta_{j}}=0 $$

$$ -\frac{1}{n}\sum_{i=1}^{n} (y_{i}-\beta_{j} x_{ij}) x_{ij}+\lambda \frac{d}{d \beta_{j}}|\beta_{j}|=0 $$

Scenario 1: when $\beta_{j} \geq 0$, the equation above becomes

$$ -\frac{1}{n}\sum_{i=1}^{n}(y_{i}-\beta_{j} x_{ij}) x_{ij}+\lambda = 0 $$

$$ \widehat \beta_j^{\text{LASSO}}= \frac{ \sum_{i=1}^{n} x_{ij}y_i -n\lambda }{\sum_{i=1}^{n}x_{ij}^2} =\frac{X_{j}^{T} Y - n\lambda}{X_{j}^{T} X_{j}} \geq 0 $$ This requires $X_{j}^{T} Y \geq n\lambda$; combined with the assumption $|X_{j}^{T}Y| \leq n\lambda$, we can only have $X_{j}^{T}Y = n\lambda$, and

$$ \widehat \beta_j^{\text{LASSO}}= \frac{X_{j}^{T} Y - n\lambda}{X_{j}^{T} X_{j}} = 0 $$

Scenario 2: when $\beta_{j} < 0$, the stationarity condition becomes

$$ \frac{1}{n}\sum_{i=1}^{n}(y_{i}-\beta_{j} x_{ij}) x_{ij} + \lambda = 0 $$

$$ \widehat \beta_j^{\text{LASSO}}= \frac{ \sum_{i=1}^{n} x_{ij}y_i +n\lambda }{\sum_{i=1}^{n}x_{ij}^2}=\frac{X_{j}^{T} Y + n\lambda}{X_{j}^{T} X_{j}} \leq 0 $$

For this to hold we need $X_{j}^{T} Y + n\lambda \leq 0$, i.e. $-X_{j}^{T} Y \geq n\lambda$, so that

$$ \left|X_{j}^{T} Y\right| \geq -X_{j}^{T} Y \geq n\lambda. $$

Together with the assumption $|X_{j}^{T} Y| \leq n\lambda$, this can only hold with equality:

$$ \left|X_{j}^{T} Y\right| = -X_{j}^{T} Y = n\lambda, \qquad \widehat \beta_j^{\text{LASSO}}= \frac{X_{j}^{T} Y + n\lambda}{X_{j}^{T} X_{j}} = 0. $$

So in both scenarios $\widehat \beta_j^{\text{LASSO}} = 0$ whenever $|X_{j}^{T} Y| \leq n\lambda$.
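The two scenarios can also be restated compactly with the subgradient of the absolute value (a rephrasing of the argument above, not part of the original write-up): $\beta_j = 0$ is optimal for the $j$-th coordinate exactly when

$$ 0 \in -\frac{1}{n} X_{j}^{T} Y + \lambda\,[-1, 1] \quad\Longleftrightarrow\quad \left|X_{j}^{T} Y\right| \leq n\lambda, $$

which is precisely the stated condition.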


