We start by loading in the powopt
package from GitHub and set a seed because we'll be doing some random number generation.
library(devtools) install_github("maryclare/powopt") library(powopt) set.seed(100)
powThresh
The powThresh
function takes the arguments z
, lambda
$> 0$ and q
$> 0$ and returns the solution to the following scalar optimization problem:
$$ \text{minimize}_\beta\frac{1}{2}\left(\beta - z\right)^2 + \lambda \left|\beta\right|^q. $$
We can use powThresh
to replicate the second panel of Figure 2 of Mazumder, Friedman and Hastie (2011),^[Mazumder, Rahul, Jerome H. Friedman, and Trevor Hastie. "Sparsenet: Coordinate descent with nonconvex penalties." Journal of the American Statistical Association 106.495 (2011): 1125-1138.] which plots values of the threshold function computed by powThresh
over a range of $z$ for $\lambda = 1$ for several values of $q$.
qs <- c(0.001, 0.3, 0.7, 1) zs <- seq(0, 3, by = 0.001) plot(zs, powThresh(zs, lambda = 1, q = qs[1]), type = "l", ylim = c(0, 3), main = "Threshold Functions", xlab = expression(beta), ylab = "") for (i in 2:4) { lines(zs, powThresh(zs, lambda = 1, q = qs[i]), lty = i, col = i) } legend("topleft", col = 1:4, lty = 1:4, legend = qs, cex = 0.75)
This plot is isn't very interpretable if we recall that a penalty can be interpreted as a distribution for $\beta$. Specifically, we can interpret a power penalty of the form $\lambda\left|\beta\right|^q$ as a generalized normal distribution for $\beta$ with variance $\Gamma\left(3/q\right)\lambda^{-2/q}/\Gamma\left(1/q\right)$. This means that in the previous plot, we are comparing the optimal values of $\beta$ prior distributions with different variances and different penalty powers.
Another more interpretable plot we can make using powThresh
compares threshold functions for values of $q$ that hold the variance of the distribution of elements of $\beta$ constant at $1$ by setting $\lambda = \left(\Gamma\left(1/q\right)/\Gamma\left(3/q\right)\right)^{-q/2}$ for each value of $q$.
qs <- c(0.25, 0.5, 1, 1.25, 2, 3) sigma.sq.beta <- 1 zs <- seq(0, 5, by = 0.001) plot(zs, powThresh(zs, lambda = (sigma.sq.beta*gamma(1/qs[1])/gamma(3/qs[1]))^(-qs[1]/2), q = qs[1]), type = "l", xlab = "z", ylab = expression(beta)) for (i in 2:length(qs)) { q <- qs[i] lines(zs, powThresh(zs, lambda = (sigma.sq.beta*gamma(1/q)/gamma(3/q))^(-q/2), q = q), lty = i, col = i) } legend("topleft", legend = qs, col = 1:length(qs), lty = 1:length(qs), cex = 0.75)
powThresh
DoesWhen $q \leq 1$, the solution is computed using the method described in Marjanovic and Solo (2014).^[Marjanovic, Goran, and Victor Solo. "$l_{q}$ Sparsity Penalized Linear Regression With Cyclic Descent." IEEE Transactions on Signal Processing 62.6 (2014): 1464-1475.] When $q > 1$, the solution is computed as follows. Like in Marjanovic and Solo (2014), we assume that the optimal value of $\gamma$ will have the same sign as $z$. This allows us to reduce the original problem to the simpler problem: $$ \text{minimize}_{\gamma > 0}\frac{1}{2}\left(\gamma - \left|z\right|\right)^2 + \lambda \gamma^q. $$ The optimal $\gamma$ solves: $$ \gamma + \lambda q \gamma^{q - 1} = \left|z\right|. $$ $\gamma + \lambda q \gamma^{q - 1}$ is monotonically increasing for $\gamma > 0$. Accordingly, we can find the optimal value of $\gamma$ by bisection. Because $\gamma + \lambda q \gamma^{q - 1} \geq 0$ and $\gamma + \lambda q \gamma^{q - 1} \geq \gamma$ for all $\gamma \geq 0$, we can be sure that the optimal value of $\gamma$ is in the interval $\left[0,\left|z\right|\right]$. Taking $l^{\left(0\right)} = 0$, $u^{\left(0\right)} = \left|z\right|$ and $\gamma^{\left(0\right)} = l^{\left(0\right)} + \left(u^{\left(0\right)} - l^{\left(0\right)}\right)/2$ and setting $k = 0$, the bisection algorithm is as follows:
powCD
Generally, we won't be using powThresh
on its own, rather we'll be using it as part of a coordinate descent algorithm that solves a penalized regression problem. powCD
implements coordinate descent, using powThresh
, to solve the following optimization problem over a $p \times 1$ vector $\boldsymbol \beta$:
$$
\text{minimize}_{\boldsymbol \beta}\frac{1}{2\sigma^2}\left|\left|\boldsymbol y - \boldsymbol X \boldsymbol \beta\right|\right|^2_2 + \lambda \left|\left|\boldsymbol \beta\right|\right|_q^q.
$$
We can try it out on some simulated data, using an orthogonal design matrix $\boldsymbol X$ to start and setting $\lambda$ given $q = 2$ to ensure that the variance implied by the penalty is equal to $1$, the true variance of our simulated elements of $\boldsymbol \beta$. We use the default specifications of the arguments of powCD
; they should be sufficient when $\boldsymbol X$ is orthogonal. As a sanity check, we compare our estimates to the closed form solution that is available when $q = 2$ (the ridge regression solution).
n <- 5 p <- 3 X <- svd(matrix(rnorm(n*p), nrow = n, ncol = p))$u beta <- rnorm(p) y <- X%*%beta + rnorm(n) q <- 2 lambda <- (gamma(1/q)/gamma(3/q))^(-q/2) fit <- powCD(X = X, y = y, sigma.sq = 1, lambda = lambda, q = q) plot(fit, solve(crossprod(X) + diag(p))%*%crossprod(X, y), xlab = "Coordinate Descent", ylab = "Closed Form", main = expression(paste("Estimate of ", beta, sep = ""))) abline(a = 0, b = 1)
Using the same data but setting $q = 1000$ and resetting $\lambda$ given $q = 1000$ to ensure that the variance implied by the penalty is still equal to $1$, we use powCD
again with the default specifications. For such large $q$ but the same variance, our estimates should be converging to the unpenalized OLS estimates. We compare our coordinate descent estimates to the closed form OLS solution and see that they are in fact quite similar, as we would hope.
q <- 1000 lambda <- (gamma(1/q)/gamma(3/q))^(-q/2) fit <- powCD(X = X, y = y, sigma.sq = 1, lambda = lambda, q = q) plot(fit, solve(crossprod(X))%*%crossprod(X, y), xlab = "Coordinate Descent", ylab = "Closed Form", main = expression(paste("Estimate of ", beta, sep = ""))) abline(a = 0, b = 1)
As a last example with an orthogonal $\boldsymbol X$ matrix, we'll plot solution paths for $q = 1$ and $q = 0.5$ over different values of $\lambda$.
qs <- c(1, 0.5) lambdas <- seq(0.1, 4, length.out = 100) fits <- array(dim = c(p, length(lambdas), length(qs))) for (i in 1:length(lambdas)) { for (j in 1:length(qs)) { fits[, i, j] <- powCD(X = X, y = y, sigma.sq = 1, lambda = lambdas[i], q = qs[j]) } } for (j in 1:length(qs)) { plot(log(lambdas), fits[1, , j], ylim = range(fits), type = "l", xlab = expression(paste("log(", lambda, ")", sep = "")), ylab = expression(beta), main = paste("q=", q, "\n")) for (i in 2:p) { lines(log(lambdas), fits[i, , j], col = i) } } legend("topleft", c(expression(beta[1]), expression(beta[2]), expression(beta[3])), col = 1:3, lty = 1, cex = 0.75)
In these four examples, we used the default specifications of powCD
. However, we can adjust various aspects of the coordinate descent algorithm as follows:
max.iter
is $10000$. This refers to the number of iterations of coordinate descent performed (where a single iteration involves cycling through all of the elements of $\boldsymbol \beta$). Increasing this value allows us to perform more iterations of coordinate descent.print.iter
is FALSE
. If we set it to TRUE
, the powCD
will print an update each iteration. Generally the updates are a nuisance, but they can be helpful for monitoring progress when the dimensions of the data are large and coordinate descent is slow.tol
is $10^{-7}$. tol
determines how small the difference in objective function values must be for us to conclude the coordinate descent algorithm has converged. Note that convergence of the objective function does not guarantee that the optimality conditions have been satisfied; optimality conditions need to be checked separately if needed.ridge.eps
. The default, ridge.eps
$=0$, indicates that we start at the OLS solution. Otherwise, we start at $\tilde{\boldsymbol \beta} = \left(\boldsymbol X'\boldsymbol X + \texttt{ridge.eps} \boldsymbol I_p\right)^{-1}\boldsymbol X'\boldsymbol y$. If $\boldsymbol X$ is not full rank, the default value of ridge.eps
may need to be rather large for $\boldsymbol X'\boldsymbol X'$ to be invertible. In this case, we may want to increase ridge.eps
. If $\boldsymbol X'\boldsymbol X$ is diagonal and a value of ridge.eps
$\neq 0$ is given, powCD
resets ridge.eps
to zero.rand.restart
to an integer greater than zero. When rand.restart
is set to some nonzero value, e.g. $K$, coordinate descent will be repeated $K + 1$ times, where the first time is from the ridge solution and the latter $K$ are from random starting values of $\boldsymbol \beta$. powCD
returns the value of $\boldsymbol \beta$ that gives the smallest value of the objective function. The last point is worth demonstrating with an example. Let's simulate some new data with a non-orthogonal $\boldsymbol X$ and fit the penalized regression problem with $q = 0.5$ and $\lambda = \left(\Gamma\left(2\right)/\Gamma\left(6\right)\right)^{-1/4}$.
n <- 10 p <- 8 X <- matrix(rnorm(n*p), nrow = n, ncol = p) beta <- rnorm(p) y <- X%*%beta + rnorm(n) q <- 0.5 lambda <- (gamma(1/q)/gamma(3/q))^(-q/2) fit1 <- powCD(X = X, y = y, sigma.sq = 1, lambda = lambda, q = q, rand.restart = 1) fit2 <- powCD(X = X, y = y, sigma.sq = 1, lambda = lambda, q = q, rand.restart = 1) plot(fit1, fit2, xlab = "Coordinate Descent 1", ylab = "Coordinate Descent 2", main = expression(paste("Estimate of ", beta, sep = ""))) abline(a = 0, b = 1)
We see that despite $\boldsymbol X$ being full rank, we do not get convergence to the global minimum for this value of $\boldsymbol X$. In cases like this, we might be able to get to the value of $\boldsymbol \beta$ that yields the global minimum by using many random restarts. However, it may not be clear how many random restarts to use. We consider solutions obtained using $K_k = 1, \dots, 10$ random restarts. Letting $\hat{\boldsymbol \beta}^{\left(k\right)}$ be the estimate of $\boldsymbol \beta$ obtained from using $K_k$ random restarts, we plot $\left(K_k, \left|\left|\hat{\boldsymbol \beta}^{\left(k\right)} - \hat{\boldsymbol \beta}^{\left(k - 1\right)}\right|\right|^2_2/p\right)$ for $K_k = 2,\dots, 10$.
n <- 10 p <- 8 X <- matrix(rnorm(n*p), nrow = n, ncol = p) beta <- rnorm(p) y <- X%*%beta + rnorm(n) q <- 0.5 lambda <- (gamma(1/q)/gamma(3/q))^(-q/2) rand.restarts <- seq(1, 10, by = 1) fits <- matrix(nrow = p, ncol = length(rand.restarts)) diff <- numeric(length(rand.restarts) - 1) for (i in 1:length(rand.restarts)) { fits[, i] <- powCD(X = X, y = y, sigma.sq = 1, lambda = lambda, q = q, rand.restart = rand.restarts[i]) if (i > 1) { diff[i - 1] <- mean((fits[, i] - fits[, i - 1])^2) } } plot(rand.restarts[2:length(rand.restarts)], diff, xlab = "# Random Restarts", ylab = "MSE Compared to Previous Estimate")
It looks like using 10 random restarts is probably more than enough, although in these situations we can never be sure we've reached the global minimum.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.