knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Question 1

Our aim is to minimize the function

$$S=\sum_{i=1}^{n} (y_{i}-\beta_{0}-\beta_{1}x_{i})^2$$ with respect to $\beta_{0}$ and $\beta_{1}$.

Taking partial derivatives with respect to $\beta_{0}$ and $\beta_{1}$ and setting them to zero gives

$$\frac{\partial S(\beta_{0},\beta_{1})}{\partial\beta_{0}}=-2\sum(y_{i}-\beta_{0}-\beta_{1}x_{i})=0$$

and

$$\frac{\partial S(\beta_{0},\beta_{1})}{\partial\beta_{1}}=-2\sum(y_{i}-\beta_{0}-\beta_{1}x_{i})x_{i}=0$$

we obtain the following two equations (the normal equations):

$$n\beta_{0}+(\sum x_{i})\beta_{1}=\sum y_{i}$$ $$(\sum x_{i})\beta_{0}+(\sum x_{i}^2)\beta_{1}=\sum x_{i}y_{i}$$
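
To get from these equations to the solutions quoted below, a standard intermediate step (filled in here for completeness) is to solve the first equation for $\beta_{0}$ and substitute into the second:

$$\beta_{0}=\overline{y}-\beta_{1}\overline{x},\qquad \beta_{1}\left(\sum x_{i}^{2}-n\overline{x}^{2}\right)=\sum x_{i}y_{i}-n\overline{x}\,\overline{y},$$

and since $\sum x_{i}^{2}-n\overline{x}^{2}=\sum(x_{i}-\overline{x})^{2}$ and $\sum x_{i}y_{i}-n\overline{x}\,\overline{y}=\sum(x_{i}-\overline{x})(y_{i}-\overline{y})$, the closed-form solutions follow.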

Let $\hat{\beta_{0}}$ and $\hat{\beta_{1}}$ denote the solutions for $\beta_{0}$ and $\beta_{1}$ in the normal equations. These solutions are given by,

$$\hat{\beta_{1}}=\frac{\sum(x_{i}-\overline{x})(y_{i}-\overline{y})}{\sum(x_{i}-\overline{x})^2}$$

$$\hat{\beta_{0}}=\overline{y}-\hat{\beta_{1}}\overline{x}$$

where $\overline{y}=\frac{\sum y_{i}}{n}$ and $\overline{x}=\frac{\sum x_{i}}{n}$.
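
As a quick numerical check (not part of the derivation), the closed-form estimates can be computed directly in R and compared with `lm()`; the simulated data here are purely illustrative.

# Illustrative data to verify the closed-form estimates against lm()
set.seed(1)
x <- rnorm(50)
y <- 2 + 3 * x + rnorm(50)

beta1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0_hat <- mean(y) - beta1_hat * mean(x)

c(beta0_hat, beta1_hat)
coef(lm(y ~ x))  # agrees with the closed-form values up to numerical precision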

Question 4

Reproduce the result

casl_ols_svd <- function(X, y) {
  # least-squares coefficients computed from the SVD of X
  svd_output <- svd(X)
  # keep only the singular values that are numerically nonzero
  r <- sum(svd_output$d > .Machine$double.eps)
  U <- svd_output$u[, 1:r]
  V <- svd_output$v[, 1:r]
  beta <- V %*% (t(U) %*% y / svd_output$d[1:r])
  beta
}

set.seed(4)
n <- 1000; p <- 25
beta <- c(1, rep(0, p - 1))
X <- matrix(rnorm(n * p), ncol = p)

svals <- svd(X)$d
max(svals) / min(svals)

N <- 1e4; l2_errors <- rep(0, N)
for (k in 1:N) {
  y <- X %*% beta + rnorm(n)
  betahat <- casl_ols_svd(X, y)
  l2_errors[k] <- sqrt(sum((betahat - beta)^2))
}
mean(l2_errors)

alpha <- 0.001
# make the first column nearly collinear with the second to inflate the condition number
X[,1] <- X[,1] * alpha + X[,2] * (1 - alpha)
svals <- svd(X)$d
max(svals) / min(svals)

N <- 1e4; l2_errors <- rep(0, N)
for (k in 1:N) {
  y <- X %*% beta + rnorm(n)
  betahat <- solve(crossprod(X), crossprod(X, y))
  l2_errors[k] <- sqrt(sum((betahat - beta)^2))
}
mean(l2_errors)

Use ridge regression to increase numerical stability

set.seed(4)
n <- 1000; p <- 25
beta <- c(1, rep(0, p - 1))
X <- matrix(rnorm(n * p), ncol = p)

svals <- svd(X)$d
max(svals) / min(svals)

N <- 1e4; l2_errors <- rep(0, N)
for (k in 1:N) {
  y <- X %*% beta + rnorm(n)
  # ridge estimate with penalty lambda = 1: (X'X + I)^{-1} X'y
  betahat <- solve(crossprod(X) + diag(rep(1, ncol(X)))) %*% t(X) %*% y
  l2_errors[k] <- sqrt(sum((betahat - beta)^2))
}
mean(l2_errors)

alpha <- 0.001
# introduce the same near-collinearity in the first column as before
X[,1] <- X[,1] * alpha + X[,2] * (1 - alpha)
svals <- svd(X)$d
max(svals) / min(svals)

N <- 1e4; l2_errors <- rep(0, N)
for (k in 1:N) {
  y <- X %*% beta + rnorm(n)
  betahat <- solve(crossprod(X) + diag(rep(1, ncol(X)))) %*% t(X) %*% y
  l2_errors[k] <- sqrt(sum((betahat - beta)^2))
}
mean(l2_errors)

Conclusion Table:

Regression Method | Condition Number | Mean L2 Error
------------------|------------------|----------------
Simple OLS        | 1.319 -> 2121.32 | 0.158 -> 37.284
Ridge Regression  | 1.319 -> 2121.32 | 0.158 -> 0.724

As the table shows, when the condition number of the modified matrix is roughly 2000 times that of the original, the mean L2 error of simple OLS increases dramatically, while the error of ridge regression increases only slightly. This demonstrates that ridge regression improves numerical stability and reduces the statistical error of the estimates.
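
To see more explicitly why the penalty helps, the ridge estimate can also be written through the SVD: each singular value $d$ enters the solution as $d/(d^{2}+\lambda)$ rather than $1/d$, so tiny singular values no longer blow up the inverse. The function below is my own sketch of this idea (mirroring the $\lambda=1$ penalty used above), not code from the original solution.

# Sketch: ridge regression via the SVD, beta_hat = V diag(d / (d^2 + lambda)) t(U) y
casl_ridge_svd <- function(X, y, lambda = 1) {
  svd_output <- svd(X)
  d <- svd_output$d
  # small singular values are shrunk rather than inverted directly
  shrink <- d / (d^2 + lambda)
  svd_output$v %*% (shrink * (t(svd_output$u) %*% y))
}

# should agree (up to numerical error) with the direct computation used above:
# solve(crossprod(X) + diag(rep(1, ncol(X)))) %*% t(X) %*% y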

Question 5

Minimizing the given function is equivalent to minimizing

$$f(\beta)=(y-X\beta)^{T}(y-X\beta)+2n\lambda||\beta||_{1}.$$

Expanding this objective gives

$$
\begin{aligned}
f(\beta) &= y^{T}y-2\beta^{T}X^{T}y+\beta^{T}X^{T}X\beta+2n\lambda||\beta||_{1} \\
         &= y^{T}y+\sum_{j}\left(\beta_{j}^{2}-2\beta_{j}X_{j}^{T}y+2n\lambda|\beta_{j}|\right) \\
         &= y^{T}y+\sum_{j}f_{j}(\beta_{j}),
\end{aligned}
$$

where $f_{j}(\beta_{j})=\beta_{j}^{2}-2\beta_{j}X_{j}^{T}y+2n\lambda|\beta_{j}|$.

To minimize $f(\beta)$, we can therefore minimize each $f_{j}(\beta_{j})$ independently. The question becomes: which value of $\beta_{j}$ minimizes $f_{j}(\beta_{j})$?

If $\beta_{j}\ge0$: $f_{j}(\beta_{j})=\beta_{j}^{2}+\beta_{j}(2n\lambda-2X_{j}^{T}y)$.

If $\beta_{j}\le0$: $f_{j}(\beta_{j})=\beta_{j}^{2}-\beta_{j}(2n\lambda+2X_{j}^{T}y)$.

Given that $|X_{j}^{T}y|\le n\lambda$, the linear term is nonnegative on each branch: for $\beta_{j}\ge0$ the coefficient satisfies $2n\lambda-2X_{j}^{T}y\ge0$, and for $\beta_{j}\le0$ the term $-\beta_{j}(2n\lambda+2X_{j}^{T}y)\ge0$. On each half-line, $f_{j}(\beta_{j})$ is therefore the sum of $\beta_{j}^{2}$ and a nonnegative term, so $f_{j}(\beta_{j})\ge f_{j}(0)=0$ and the minimum is attained at $\beta_{j}=0$.

Thus, if $|X_{j}^{T}y|\le n\lambda$, the lasso estimate $\hat{\beta}_{j}$ must be 0.
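
As a small numerical illustration (my own sketch, not part of the original argument), the minimizer of $f_{j}$ is the soft-thresholding of $X_{j}^{T}y$ at $n\lambda$, which is exactly zero whenever $|X_{j}^{T}y|\le n\lambda$; the values below are hypothetical.

# f_j(b) as defined above, with xty playing the role of X_j^T y
f_j <- function(b, xty, n, lambda) b^2 - 2 * b * xty + 2 * n * lambda * abs(b)

# its minimizer is the soft-threshold of xty at n * lambda
soft_threshold <- function(xty, n, lambda) sign(xty) * max(abs(xty) - n * lambda, 0)

# hypothetical values with threshold n * lambda = 5
soft_threshold(xty = 3, n = 100, lambda = 0.05)  # 0, since |xty| <= n * lambda
soft_threshold(xty = 8, n = 100, lambda = 0.05)  # 3, nonzero once the bound fails
optimize(f_j, interval = c(-10, 10), xty = 3, n = 100, lambda = 0.05)$minimum  # approximately 0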


