```r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "##",
  fig.width = 7,
  fig.height = 6
)
```
The `adelie` package provides extremely efficient procedures for fitting the entire group lasso and group elastic net regularization path for GLMs, multinomial, the Cox model, and multi-task Gaussian models. `adelie` is similar to the R package `glmnet` in its scope of models and in computational speed. In fact, if there are no groups, `adelie` fits the same class of models as `glmnet`, with similar computational speed. The R package `adelie` is built from the same C++ code as used in the corresponding `adelie` Python package. See @yang2024adelie for details.
The theory and algorithms in this implementation are inspired by
@glmnet, @coxnet, @strongrules and @block.
In this notebook, we give a brief overview of the group elastic net problem that `adelie` solves, and describe some of the features of this implementation. Then we develop a detailed example for fitting pairwise interactions, mimicking the `glinternet` R package (@glinternet).
The single-response group elastic net problem is given by
$$
\begin{align}
\mathrm{minimize}_{\beta,\; \beta_0} \quad&
\ell(\eta) + \lambda \sum\limits_{g=1}^G \omega_g \left(
\alpha \|\beta_g\|_2 + \frac{1-\alpha}{2} \|\beta_g\|_2^2
\right)
\\
\text{subject to}\quad&
\eta = \eta^0 + \beta_0 \mathbf{1} + X \beta
\end{align}
$$
where $\ell(\cdot)$ is the loss function defined by the GLM, $\eta^0$ is a fixed offset vector, $\lambda \geq 0$ is the regularization parameter, $G$ is the number of groups, $\omega_g \geq 0$ are the per-group penalty factors, $\alpha \in [0,1]$ is the elastic net proportion, and $\beta_g$ denotes the coefficients of group $g$.
As an example, the Gaussian GLM (`glm.gaussian()`) defines the loss function as
$$
\begin{align}
\ell(\eta)
&=
\sum\limits_{i=1}^n w_i \left(
-y_i \eta_i + \frac{\eta_i^2}{2}
\right)
\end{align}
$$
where
$w \geq 0$ is the observation weight vector,
$y$ is the response vector,
and $\eta$ is the linear prediction vector as in the optimization
problem above. This is equivalent to the weighted sum-of-squared errors
$$
\begin{align}
\mbox{RSS}
&=
\frac12\sum\limits_{i=1}^n w_i \left(
y_i -\eta_i \right)^2,
\end{align}
$$
since the term $\frac12\sum_i w_iy_i^2$ is a constant for the optimization.
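To see this explicitly, expand the square: the cross term and the quadratic term together reproduce $\ell(\eta)$, and what remains does not involve $\eta$:
$$
\begin{align}
\frac12\sum\limits_{i=1}^n w_i \left( y_i - \eta_i \right)^2
&=
\frac12\sum\limits_{i=1}^n w_i y_i^2
+ \sum\limits_{i=1}^n w_i \left( -y_i \eta_i + \frac{\eta_i^2}{2} \right)
= \mathrm{const} + \ell(\eta).
\end{align}
$$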
By default we set the weights $w_i=1/n$; if supplied, they are renormalized to sum to 1.
Specifically for the Gaussian GLM, we employ a specialized optimizer based on coordinate descent to solve the group elastic net problem.
The Gaussian GLM is written in "exponential family" form, with $\eta_i$ the natural parameter. Other GLMs such as the binomial (`glm.binomial()`) and Poisson (`glm.poisson()`) have similar expressions for the negative log-likelihood.
For other general GLMs as well as the Cox model (`glm.cox()`), we use a proximal Newton method, which leads to an iteratively reweighted least squares (IRLS) algorithm. That is, we iteratively form a quadratic approximation to $\ell(\cdot)$, which yields a sequence of Gaussian GLM group elastic net problems that we solve using our specialized solver based on coordinate descent.
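Schematically, one IRLS iteration replaces $\ell$ by its second-order expansion around the current linear predictor $\tilde\eta$ (generic proximal Newton notation, not `adelie`'s internal API):
$$
\begin{align}
\ell(\eta) \;\approx\; \frac12 \sum\limits_{i=1}^n \tilde w_i \left( z_i - \eta_i \right)^2 + \mathrm{const},
\qquad
\tilde w_i = \left.\frac{\partial^2 \ell}{\partial \eta_i^2}\right|_{\tilde\eta},
\quad
z_i = \tilde\eta_i - \frac{1}{\tilde w_i}\left.\frac{\partial \ell}{\partial \eta_i}\right|_{\tilde\eta},
\end{align}
$$
so each iteration is a weighted Gaussian group elastic net problem with working response $z$.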
The multi-response group elastic net problem is given by
$$
\begin{align}
\mathrm{minimize}_{B,\; \beta_0} \quad&
\ell(\eta) + \lambda \sum\limits_{g=1}^G \omega_g \left(
\alpha \|B_g\|_F + \frac{1-\alpha}{2} \|B_g\|_F^2
\right)
\\
\text{subject to}\quad&
\eta = \eta^0 + \mathbf{1}\beta_0^\top + X B
\end{align}
$$
This model arises in "multitask" learning (`glm.multigaussian()`), as well as in multinomial (multiclass) logistic regression (`glm.multinomial()`). Here we have a possibly different linear prediction for each of $K$ responses, and hence $\eta$ is $n\times K$. The coefficient "matrices" $B_g$ for each group are penalized using a Frobenius norm. Note that if an intercept is included in the model, an intercept is added for each response.
We use an alternative but equivalent representation in the `adelie` software, by "flattening" the coefficient matrices. Specifically, we solve
$$
\begin{align}
\mathrm{minimize}_{\beta,\; \beta_0} \quad&
\ell(\eta) + \lambda \sum\limits_{g=1}^G \omega_g \left(
\alpha \|\beta_g\|_2 + \frac{1-\alpha}{2} \|\beta_g\|_2^2
\right)
\\
\text{subject to}\quad&
\mathrm{vec}(\eta^\top) = \mathrm{vec}(\eta^{0\top}) + (\mathbf{1}
\otimes I_K) \beta_0 + (X \otimes I_K) \beta
\end{align}
$$
where $\mathrm{vec}(\cdot)$ is the operator that flattens a column-major matrix into a vector,
and $A \otimes B$ is the Kronecker product operator. Hence $\beta \equiv \mathrm{vec}(B^\top)$.
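The flattening identity can be checked numerically in base R (this is just an illustration of the algebra, not `adelie` code):

```r
# Check that vec((X B)^T) = (X kron I_K) vec(B^T); with vec() the
# column-major flattening, vec(M^T) is as.vector(t(M)) in R.
set.seed(1)
n <- 4; p <- 3; K <- 2
X <- matrix(rnorm(n * p), n, p)
B <- matrix(rnorm(p * K), p, K)
eta <- X %*% B
all.equal(as.vector(t(eta)),
          as.vector((X %x% diag(K)) %*% as.vector(t(B))))  # TRUE
```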
As indicated above, the multi-response group elastic net problem is technically of the same form as the single-response group elastic net problem. In fact, `adelie` reuses the single-response solver for multi-response problems by modifying the inputs appropriately (e.g. using `matrix.kronecker_eye()` to represent $X \otimes I_K$). For the MultiGaussian family, we wrap the specialized single-response Gaussian solver; otherwise, for general multi-response GLMs, we wrap the single-response GLM solver.
In this section, we cover the basic usage of `adelie` using very simple examples.
```r
library(adelie)
```
Before using `adelie`, we assume that the user has a feature matrix `X` and a response vector `y`. For simplicity, we assume for now that `X` is a dense matrix, although we will later see that `X` can be substituted with a custom matrix as well. For demonstration, we will randomly generate our data.
```r
n = 100    # number of observations
p = 1000   # number of features
set.seed(5)
X = matrix(rnorm(n * p), n, p)
y = X[, 1:10] %*% rnorm(10) + rnorm(n) * sqrt(10)  # makes SNR = 1
```
The most basic call to `grpnet` simply supplies an `X` matrix and a `glm` object. For example, `glm.gaussian(y=y)` returns a Gaussian `glm` object with response variable `y`, which corresponds to the usual least squares loss function. By default, `grpnet` will fit the lasso, treating each feature as its own group. `adelie` is written to treat groups of size 1 in a more optimized manner, so it is a competitive lasso solver with computation times similar to those of the `glmnet` package.
Like `glmnet`, `grpnet` will generate a sequence of 100 values for the regularization parameter $\lambda$, evenly spaced on the log scale, if the user does not provide a path. The largest value of $\lambda$ is the smallest value such that all the terms above with positive $\omega_g$ are zero in the solution. Since `grpnet` is a path-solver, it warm-starts at the next $\lambda$ using the current solution on the path. For this reason, we recommend that users supply a sufficiently fine grid of $\lambda$ values, or use the default path!
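For instance, a user-supplied path might look like the following (a hedged sketch: we assume here that `grpnet()` accepts the path via a `lambda` argument, matching the `lambda` naming used by `predict()` below):

```r
# A hypothetical user-supplied path, decreasing and evenly spaced on the log scale:
user_path = exp(seq(log(1), log(0.01), length.out = 50))
# fit_path = grpnet(X = X, glm = glm.gaussian(y = y), lambda = user_path)
```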
```r
fit = grpnet(X = X, glm = glm.gaussian(y = y))
print(fit)
```
The `print()` method gives a nice summary of the fit. Note that the solver finished early in this example: by default, it terminates once the fraction of deviance explained reaches $90\%$, a simple heuristic to avoid overfitting. This threshold can be controlled via the argument `adev_tol`.
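For example, to let the path run further before this early stopping kicks in, one could raise the threshold (0.99 here is just an illustrative value):

```r
# Raise the deviance-explained threshold used for early stopping:
fit99 = grpnet(X = X, glm = glm.gaussian(y = y), adev_tol = 0.99)
print(fit99)
```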
The output of `grpnet` is an object of class `"grpnet"`, which contains some simple descriptors of the model, as well as a state object that represents the state of the optimizer. For most use cases, users do not need to inspect the internals of a state object, but rather can extract the useful information via the methods `print()`, `plot()`, `predict()` and `coef()`.
```r
plot(fit)
```
Here we plot the coefficient profiles as a function of $-\log(\lambda)$. By default `grpnet` standardizes the features internally, and these coefficients are on the scale of the standardized features. One can avoid standardization via `standardize = FALSE` in the call to `grpnet()`.
We can make predictions at new values for the features --- here we use a subset of the original:
```r
pred = predict(fit, newx = X[1:5, ], lambda = c(1.5, 1))
pred
```
If you used `standardize = TRUE`, the default, you do not have to standardize `newx`; `adelie` will do it for you automatically.
Finally, we would like to choose a good value for $\lambda$, and for that we use cross-validation. We include a `progress_bar` argument, which can be helpful for big problems. The `plot` method for a `cv.grpnet` object displays the average cross-validated deviance of the model, with approximate standard error bars for the average.
```r
fitcv = cv.grpnet(X, glm.gaussian(y), progress_bar = TRUE)
plot(fitcv)
```
Two vertical dashed lines are included: one at the value of $\lambda$ corresponding to the minimum value, and another at a larger (more conservative) value of $\lambda$, such that the mean error is within one standard error of the minimum.
One can print the fitted `cv.grpnet` object:
```r
fitcv
```
One can also predict directly from it:
```r
pred = predict(fitcv, newx = X)
pred = predict(fitcv, newx = X, lambda = "lambda.min")
```
In the first case the prediction is at the default value of $\lambda$, which is `lambda = "lambda.1se"`. Alternatively, any numeric values of $\lambda$ can be supplied.
So far this has been a simple Gaussian lasso fit, and the results would be identical to those produced by `glmnet`.
To fit the group lasso, the user simply needs to supply a `groups` argument that defines the starting column index of $X$ for each group. For example, if there are 4 features with two (contiguous) groups of sizes 3 and 1, respectively, then `groups = c(1, 4)`.
For demonstration, we take the same data as before and group every 10
features.
We then fit group lasso using the same function.
```r
fitg = grpnet(
  X = X,
  glm = glm.gaussian(y = y),
  groups = seq(from = 1, to = p, by = 10)
)
print(fitg)
plot(fitg)
```
Notice in the printout that the active set size (df) increases in steps of 10, as expected. Also, the coefficient plot colors all coefficients for a group with the same color.
As before, we can use cross-validation to select the tuning parameter.
```r
fitgcv = cv.grpnet(
  X,
  glm.gaussian(y),
  groups = seq(from = 1, to = p, by = 10),
  progress_bar = TRUE
)
plot(fitgcv)
```
In the previous section, we covered how to solve penalized least-squares regression. Nearly all of the content remains the same for fitting penalized GLM regression; the only difference is in the choice of the `glm` object. For brevity, we discuss only logistic regression as our non-trivial GLM example; however, the following discussion applies to any GLM.
Let's modify our example and generate a binomial response.
```r
eta = X[, 1:5] %*% rnorm(5) / sqrt(5)
mu = 1 / (1 + exp(-eta))
y = rbinom(n, size = 1, prob = mu)
```
To solve the group elastic net problem using the logistic loss, we simply provide the binomial GLM object. For simplicity, we fit the lasso problem below.
```r
fitb = grpnet(X, glm.binomial(y))
plot(fitb)
```
Cross-validation works as before:
```r
fitbcv = cv.grpnet(X, glm.binomial(y))
plot(fitbcv)
```
We can make predictions from the fitted objects as before.
```r
predict(fitb, newx = X[1:5, ], lambda = c(0.13, 0.07))
```
For GLMs, the predictions are by default on the link ($\eta$) scale. However, one can also make predictions on the inverse link or response scale ($\mu$):
```r
predict(fitb, newx = X[1:5, ], lambda = c(0.13, 0.07), type = "response")
```
In this case we get the estimated probability of the response being in class 1. We can specify coefficient groups just as before.
`grpnet` is also able to fit multi-response GLM elastic nets. Currently, we only support MultiGaussian (least squares) and multinomial GLMs.
The following code shows an example of fitting a multinomial regression problem. We use the `glm.multinomial()` family, which expects a matrix response $Y$ with $K$ columns (the number of classes). Each row of $Y$ is a proportion vector which sums to 1. In the example below, $Y$ is an indicator matrix, with a single 1 in each row, the rest being zeros. If instead the data were grouped, and each row originally represented the number out of $n_i$ in category $k$, we would first divide each of the counts by $n_i$, turning them into proportions. Then we would supply weights $w_i=n_i$, normalized to sum to 1 overall; a small sketch of this conversion follows.
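Here is that conversion in code (a hedged sketch: the `counts` matrix is hypothetical, and we assume the family accepts a `weights` argument, as the description above suggests):

```r
# Hypothetical grouped data: an n x K matrix of class counts.
counts <- matrix(c(3, 1, 0, 2,
                   0, 4, 1, 1), nrow = 2, byrow = TRUE)
ni <- rowSums(counts)      # n_i for each row
Yprop <- counts / ni       # each row now sums to 1
w <- ni / sum(ni)          # weights proportional to n_i, summing to 1
# fitw <- grpnet(X, glm.multinomial(Yprop, weights = w))  # assumed weights argument
```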
```r
n = 300   # number of observations
p = 100   # number of features
K = 4     # number of classes
set.seed(7)
X = matrix(rnorm(n * p), n, p)
eta = X[, 1:5] %*% matrix(rnorm(K * 5), 5, K) / sqrt(5)
probs = exp(eta)
probs = probs / rowSums(probs)
Y = t(apply(probs, 1, function(prob) rmultinom(1, 1, prob)))
Y[1:5, ]
```
The fitting proceeds as before. We will directly fit a group lasso, with groups of 5 chosen similarly to before.
```r
grps = seq(from = 1, to = p, by = 5)
fitm = grpnet(X, glm.multinomial(Y), groups = grps)
```
```r
print(fitm)
plot(fitm)
```
The coefficient for each feature is a vector (one per class), so rather than show multiple coefficients, the plot shows the progress of the 2-norm of this vector as a function of $\lambda$. Once again, because of the grouping (into groups of 5 here), the coefficients in a group share the same color. In this case, the first group has all the action, so it appears much earlier than the rest.
```r
fitmcv = cv.grpnet(X, glm.multinomial(Y), groups = grps)
plot(fitmcv)
```
Another important difference from the single-response case is that the user must be aware of the shape of the coefficients returned by `coef(fitm)` or `coef(fitmcv)`.
```r
names(coef(fitm))
```
We see that `coef()` returns a list. For multi-response GLMs, the `intercepts` component is included by default for each response, and hence is an $L \times K$ matrix, where $L$ is the number of $\lambda$s in the path. The `betas` component is an $L \times pK$ sparse matrix, where every successive $K$ columns correspond to the coefficients associated with one feature across all responses.
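For example, one can recover a $p \times K$ coefficient matrix at a single $\lambda$ by reshaping one row of `betas` (a sketch; the component names follow the `names(coef(fitm))` printout above, and the 50th path index is just illustrative):

```r
# Reshape the flattened coefficients at, say, the 50th lambda:
b50 <- as.numeric(coef(fitm)$betas[50, ])   # length p*K, K entries per feature
B50 <- t(matrix(b50, nrow = K))             # p x K: row = feature, column = class
dim(B50)
```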
Fortunately the `predict()` method understands this structure, and behaves as intended.
```r
predict(fitmcv, newx = X[1:3, ])
```
Here by default the prediction is made at `lambda = "lambda.1se"`.
Typically for the multinomial we would be more interested in the predicted probabilities:
```r
predict(fitmcv, newx = X[1:3, ], lambda = "lambda.min", type = "response")
```
Each of the rows sums to 1.
Our final example is for censored survival data, for which we fit a Cox proportional hazards model. We create an example with a $500 \times 100$ sparse $X$ matrix. In this example we have right censoring: the "subject" exits at the times in `y`, and the binary `status` is 1 if the exit is a "death", else 0 if censored.
```r
set.seed(9)
n <- 500
p <- 100
X <- matrix(rnorm(n * p), n, p)
X[sample.int(n * p, size = 0.5 * n * p)] <- 0
X_sparse <- Matrix::Matrix(X, sparse = TRUE)
nzc <- p / 4
beta <- rnorm(nzc)
fx <- X[, seq(nzc)] %*% beta / 3
hx <- exp(fx)
y <- rexp(n, hx)
status <- rbinom(n = n, prob = 0.3, size = 1)
```
We now fit an elastic-net model using the `glm.cox()` family, with every set of 5 coefficients in a group.
```r
groups = seq(from = 1, to = p, by = 5)
fitcv <- cv.grpnet(X_sparse, glm.cox(stop = y, status = status),
                   alpha = 0.5, groups = groups)
par(mfrow = c(1, 2))
plot(fitcv)
plot(fitcv$grpnet.fit)
```
The family `glm.cox()` accommodates start-stop data and strata as well. It handles ties by default using the "Efron" method. See `?glm.cox` for further details.
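For example, with left-truncated (start-stop) data the call might look as follows (a hedged sketch: we assume a `start` argument alongside `stop` and `status`, as suggested by the start-stop support mentioned above; the entry times here are hypothetical):

```r
# Hypothetical entry times preceding the exit times in y:
# start <- pmax(y - rexp(n), 0)
# fit_ss <- grpnet(X_sparse, glm.cox(start = start, stop = y, status = status))
```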
Adelie supports box constraints (lower and upper bounds on coefficients). Here we demonstrate with a few examples. Currently constraints only work with scalar-response models.
For our first example we will fit a non-negative lasso, using randomly generated data. Note that even though it seems the constraint object is the same for each singleton group, they need to be created separately; i.e. we cannot make one, and replicate it. This is somewhat counter-intuitive, but these are the rules.
```r
n <- 100
p <- 30
set.seed(1)
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1:5] %*% rnorm(5) / 3 + rnorm(n)
fit <- grpnet(X, glm.gaussian(y))
constrs = lapply(1:p, function(i) constraint.box(lower = 0, upper = Inf))
fit.constr = grpnet(X, glm.gaussian(y), constraints = constrs)
par(mfrow = c(1, 2))
plot(fit)
plot(fit.constr)
```
Next we create a group of the first four variables. We first fit the model unconstrained, and observe the final coefficients for the first four variables. Then we put very specific constraints on each of the four coefficients in that first group, and leave the rest of the variables alone. We add blue broken lines to show these constraints. Note that we can replicate NULL constraints.
```r
fit <- grpnet(X, glm.gaussian(y), groups = c(1, 5:30))
beta <- coef(fit)$beta[100, 1:4]
print(beta)
lower = c(-Inf, -Inf, -Inf, -0.2)
upper = c(0.2, 0.05, 0.4, Inf)
constrs = rep(list(NULL), 27)  # there are 27 groups
constrs[[1]] = constraint.box(lower = lower, upper = upper)
fit.constr = grpnet(X, glm.gaussian(y), groups = c(1, 5:30),
                    constraints = constrs)
par(mfrow = c(1, 2))
plot(fit)
plot(fit.constr)
abline(h = lower, col = "blue", lty = 2)
abline(h = upper, col = "blue", lty = 2)
```
In group elastic net problems, the matrix object plays a crucial role in the performance of the solver. It becomes apparent in our optimization algorithm (and our benchmark analysis) that most of the runtime lies in interacting with the matrix object, e.g. computing inner-products. Hence, a highly performant matrix object implementing a select set of methods that the solver requires will yield tremendous speed gains overall. In addition, we have found numerous examples where a matrix class admits some special structure that can be exploited for further speed and memory gains. One simple example is a large sparse matrix, which cannot fit in memory as a dense matrix. Another example is genomics datasets which are not only sparse, but only take on 3 possible integer values (see [Examples], to come), and generally have over 160 billion entries with 30\% non-zero entries.
For these reasons, we found it fruitful to abstract out the matrix class, and `adelie` provides a few examples. We discuss below some ways a user may interact with these classes.
We will go through a few matrix methods that will be useful in group lasso modeling. The simplest example of such a matrix is a dense matrix. Let us first construct a dense matrix and wrap it using `matrix.dense()`.
```r
n = 4
p = 2
set.seed(0)
X_dense = matrix(rnorm(n * p), n, p)
print(X_dense)
Xd = matrix.dense(X_dense)
print(Xd)
```
If this is all we are going to pass to `grpnet()`, then we could have saved ourselves the trouble, because it will do the wrapping anyway in the call to `grpnet()`. But we will see that it can still become useful when we combine matrices.
Similarly, we can wrap a sparse matrix using `matrix.sparse()`. We will make a simple sparse matrix using the function `Matrix::sparseMatrix()`. We will use the matrix market format, where we supply `(i, j, x)` triples.
```r
d = data.frame(
  i = c(2, 1, 4, 1, 2, 4),
  j = c(1, 2, 2, 3, 3, 3),
  x = c(0.184, 0.330, 0.738, 0.576, -0.305, 0.390)
)
print(d)
require(Matrix)
X_sparse = with(d, sparseMatrix(i, j, x = x, dims = c(4, 3)))
print(X_sparse)
Xs = matrix.sparse(X_sparse)
print(Xs)
```
Likewise, if this was all we were to pass to `grpnet()`, we again could have saved the trouble, because it would convert an S4 sparse matrix automatically.
Suppose we have both a dense matrix and a sparse matrix. We can concatenate matrix classes together in `adelie`. This enables the software to use the most efficient methods for each when performing matrix multiplies.
```r
Xds = matrix.concatenate(list(Xd, Xs))
Xds$rows; Xds$cols
print(Xds)
```
By default we concatenated by columns (as in `cbind()`). We could alternatively concatenate by rows, using the `axis = 1` argument, but that would fail here since the dimensions are not compatible.
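For matrices with matching column counts, row-wise concatenation would look like this (a sketch, relying on the `axis = 1` convention just described; `Xd2` is an extra block created for the illustration):

```r
# Two 4 x 2 dense blocks stacked row-wise into an 8 x 2 matrix:
Xd2 = matrix.dense(matrix(rnorm(8), 4, 2))
Xstack = matrix.concatenate(list(Xd, Xd2), axis = 1)
Xstack$rows; Xstack$cols  # 8 and 2
```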
We currently do not have code that allows one to view these matrix objects; that may come in future versions of the package. Each matrix object has `attributes`, which typically contain the original matrix, or matrices if it was concatenated.
Sometimes we have a mixture of categorical and quantitative variables. In this case we typically expand the categorical variables using one-hot encoding (a.k.a. dummy variables). Consider the simple example where we have `X_dense` as above, along with a factor with 3 levels.
```r
X_mixed = cbind(X_dense, c(1, 0, 2, 1))
print(X_mixed)
levels = c(1, 1, 3)
```
Along with this matrix we have an integer vector of levels, which gives the number of levels in each column, with quantitative variables coded as 1. Notice that in `adelie` the values of a factor start at 0.
```r
Xoh = matrix.one_hot(X_mixed, levels = levels)
Xoh$cols
```
We would like to see what is inside `Xoh`. Since we don't have a print method yet, we will use a bit of trickery to do that, by multiplying it by the identity. Since only row-sparse matrices are allowed (they are meant to be `adelie` coefficient matrices), we will make one.
```r
eye = as(as(diag(5), "generalMatrix"), "RsparseMatrix")
Xoh$sp_tmul(eye)
```
We typically wish to standardize the features before fitting an `adelie` model, with a few exceptions. If some columns represent the one-hot vectors for a factor variable, we generally do not standardize them. Let's use that case as an example. Each matrix class has its own standardize method.
```r
Xohs = matrix.standardize(Xoh)
Xohs$sp_tmul(eye)
```
The standardization information is stored in the attributes of the matrix object.
```r
print(attributes(Xohs))
```
We now go through some applications where we use `adelie` and its group lasso facilities as a building block. Currently there is just one example, but more will follow.
In regression settings, we may want to include pairwise interaction terms amongst a subset of features to capture some non-linearity.
Moreover, we would like to perform feature selection on the interaction terms as well.
However, to achieve an interpretable model, we would like to also impose a hierarchy such that interaction terms are only included in the model if the main effects are included.
Michael Lim and Trevor Hastie provide a formulation of this problem using the group lasso, where the group structure imposes the hierarchy and the group lasso penalty allows for feature selection. For further details see their R package `glinternet` and the reference: *Learning interactions via hierarchical group-lasso regularization*.
We implement their approach in `adelie`, with the advantage that it works with all the `adelie` response families; the `glinternet` package handles two cases only, Gaussian and binomial. We first demonstrate the functions that are provided in the `adelie` package for fitting these models. Then we develop our approach from scratch using `grpnet` as the engine, which helps to explain how the model works.
We will work under a simulation setting. We draw $n$ independent samples $Z_i \in \mathbb{R}^d$, where the continuous features are sampled from a standard normal and the discrete features are sampled uniformly. Remember that factors start at 0, so the maximum value is `levels - 1`.
```r
require(adelie)
n = 1000
d_cont = 10   # number of continuous features
d_disc = 10   # number of discrete features
set.seed(3)   # random seed
Z_cont = matrix(rnorm(n * d_cont), n, d_cont)
levels = sample(2:10, d_disc, replace = TRUE)
Z_disc = matrix(0, n, d_disc)
for (i in seq(d_disc)) Z_disc[, i] = sample(0:(levels[i] - 1), n, replace = TRUE)
```
It is customary to first center and scale the continuous features so that they have mean 0 and standard deviation 1.
```r
sd0 = function(x) sd(x) * sqrt(1 - 1/length(x))  # SD formula with division by n rather than n-1
Z_cont_means = apply(Z_cont, 2, mean)
Z_cont_stds = apply(Z_cont, 2, sd0)
Z_cont = scale(Z_cont, Z_cont_means, Z_cont_stds)
```
We now combine these matrices to form one big matrix, and a vector of levels that describes them all.
```r
Z = cbind(Z_cont, Z_disc)
levels = c(rep(1, d_cont), levels)
print(levels)
```
We construct a response that has a main effect for one continuous feature, a main effect for one discrete feature, and their interaction.
```r
xmat = model.matrix(~ Z_cont[, 1] * factor(Z_disc[, 2]))
nc = ncol(xmat)
print(nc)
set.seed(4)
beta = rnorm(nc)
y = xmat %*% beta + rnorm(n) * 2.5
```
We now demonstrate the tools we have constructed to fit interaction models. In the next section, we describe a manual construction, in order to get some insight of what is under the hood.
We provide a function `glintnet`, deliberately omitting the "er" in the middle to avoid confusion.
In the following we consider a model with main effects, and possible interactions between the first variable and all the others.
```r
set.seed(2)
fit = glintnet(Z, glm.gaussian(y), levels = levels, intr_keys = 1)
cvfit = cv.glintnet(Z, glm.gaussian(y), levels = levels, intr_keys = 1)
par(mfrow = c(1, 2))
plot(fit)
plot(cvfit)
```
The generating model has 12 non-zero coefficients (including the intercept), and the more conservative `lambda.1se` appears to have found these.
We have some tools for poking in and seeing what was fit. The true model has main effects and interactions between variables 1 (continuous) and 12 (6-level factor).
```r
predict(cvfit, type = "nonzero")
```
There is a main effect for variable 12 (6 coefficients), and an interaction term between variable 12 and variable 1. The latter actually accounts for 12 coefficients: 6 for the interaction, and 6 again for the main effect. The algorithm uses the overlap group lasso to ensure that when interactions are present, so are the main effects. Details can be found in the reference.
Although the model has found the correct interactions, digging in to see the coefficients is less helpful. The following printout is truncated.
```r
coef(cvfit)
```
The coefficient matrix is quite wide, and seeing what goes with what is a little cumbersome. We will provide helpful tools in future to make these more manageable.
Perhaps we made it too easy by specifying that the only interactions could be with variable 1 (`intr_keys = 1`). Now we repeat the process giving no hints to `glintnet`.
```r
set.seed(2)
fit2 = glintnet(Z, glm.gaussian(y), levels = levels)
cvfit2 = cv.glintnet(Z, glm.gaussian(y), levels = levels)
par(mfrow = c(1, 2))
plot(fit2)
plot(cvfit2)
```
And the nonzero variables:
```r
predict(cvfit2, type = "nonzero")
```
It still found them! We do not expect this to happen all the time, but it is encouraging that it found the true signals here.
The structure of `glintnet` and `cv.glintnet` mirrors that of `grpnet` and `cv.grpnet`; they have a few more arguments to specify the interactions, if needed. Importantly, they have all the methods such as `print()`, `coef()`, and `predict()`.
```r
predict(cvfit, newx = Z[1:3, ])
predict(cvfit, newx = Z[1:3, ], lambda = "lambda.min")
print(cvfit)
```
The print method for a `glintnet` fit adds some detail:
```r
print(fit)
```
Here we go through details that are encapsulated in the `glintnet` function. Readers not interested in the innards should skip this section.
The model starts off with a one-hot encoded version of `Z`, which it then augments with additional blocks of interaction terms.
We are now in a position to construct the full feature matrix to fit a group lasso model. As a demonstration, suppose we (correctly) believe that there is a true interaction term containing the first continuous feature, but we do not know the other feature. We therefore wish to construct interactions between the first continuous feature and all other features. It is easy to specify this pairing, as shown below via `intr_keys` and `intr_values`. The following code constructs the interaction matrix `X_int`.
```r
X_int = matrix.interaction(Z, intr_keys = 1, intr_values = list(NULL), levels = levels)
```
This function creates, for each element of `intr_keys`, Hadamard products with each of the variables listed in `intr_values`.
In linear model terminology, these matrices are exactly what is needed to fit a main effect + interaction model for each pair of variables. Read the reference for further understanding of this particular construction.
To put all groups of features on the same relative "scale", we must further center and scale all interaction terms between two continuous features. Then, it can be shown that an interaction block between two discrete features induces a (Frobenius) norm of 1, one between a discrete and a continuous feature induces a norm of $\sqrt{2}$, and one between two continuous features induces a norm of $\sqrt{3}$. These values will be used as penalty factors later when we call the group lasso solver. We first compute the necessary centers and scales.
```r
pairs = attributes(X_int)[["_pairs"]]                 # base 0
pair_levels = apply(pairs, 1, function(i) levels[i + 1])  # result is transposed: one column per pair
is_cont_cont = apply(pair_levels, 2, prod) == 1
cont_cont_pairs = pairs[is_cont_cont, ]
cont_cont = Z[, cont_cont_pairs[, 1] + 1] * Z[, cont_cont_pairs[, 2] + 1]  # base 0
centers = rep(0, X_int$cols)
scales = rep(1, X_int$cols)
cont_cont_indices = X_int$groups[is_cont_cont] + 2    # base 0
centers[cont_cont_indices + 1] = apply(cont_cont, 2, mean)  # base-0 index, so add 1
scales[cont_cont_indices + 1] = apply(cont_cont, 2, sd)     # base-0 index, so add 1
```
So far we have just focused on the interaction terms. Glinternet requires additional main-effect terms in case the model chooses to ignore the interactions.
Now, we construct the full feature matrix, including the one-hot encoded main effects as well as the standardized version of the interaction terms using the centers and scales defined above. Note that we have already standardized the quantitative features in the original matrix `Z`.
```r
X_one_hot = matrix.one_hot(Z, levels)
X_int_std = matrix.standardize(X_int, centers = centers, scales = scales)
X = matrix.concatenate(list(X_one_hot, X_int_std))
X$cols
```
Before calling `grpnet` we must prepare the grouping and penalty factor information. Once again, we must remember the base-0 indexing issue.
```r
groups = c(
  as.vector(X_one_hot$groups),
  X_one_hot$cols + as.vector(X_int$groups)
) + 1  # the +1 adjusts for base-0 indexing
is_cont_disc = apply(pair_levels - 1, 2, function(x) xor(x[1], x[2]))
penalty_int = rep(1, length(X_int$groups))
penalty_int[is_cont_cont] = sqrt(3)
penalty_int[is_cont_disc] = sqrt(2)
penalty = c(rep(1, length(X_one_hot$groups)), penalty_int)
```
Finally we can call the group-lasso solver with our prepared inputs.
```r
set.seed(2)
fit = grpnet(X, glm.gaussian(y), groups = groups, penalty = penalty,
             standardize = FALSE)
cv.fit = cv.grpnet(X, glm.gaussian(y), groups = groups, penalty = penalty,
                   standardize = FALSE)
par(mfrow = c(1, 2))
plot(fit)
plot(cv.fit)
```