Introduction to the `seagull` package"

knitr::opts_chunk$set(fig.width = 10, fig.height = 5, 
                      echo = FALSE, collapse = TRUE, comment = "#>")

Introduction

seagull is a package that fits a mixed model via (fitted) sparse-group lasso (SGL) regularization. Limiting cases of this penalty are the lasso and the (fitted) group lasso, which are also available in this package. Solutions are obtained via \textit{proximal gradient descent} on a grid of values for the regularization parameter $\lambda$. The grid search for this value is implemented using \textit{warm starts}. And since proximal gradient descent is an iterative procedure, the step size between consecutive iterations is a crucial parameter for convergence. In order to determine this step size, \textit{backtracking line search} is implemented. The only exception to this is fitted SGL, as it is implemented via proximal-averaged gradient descent. Thus, the step size for this particular algorithm needs determination prior to the call.

We assume the following underlying mixed model:

$$ y = X b + Z u + e, $$

where $b$ is a vector of \textit{fixed effects} and $u$ a vector of \textit{random effects}. The inclusion of fixed effects is not mandatory in order for the algorithms to work.

Let $n$, $p$, and $L$ be the number of observations, random effects, and groups, respectively. seagull solves the following SGL problem:

$$ \min_{(b, u)} \frac{1}{2n} \left\Vert y - X b - Z u \right\Vert^2_2 + \alpha \lambda \left\Vert \mathrm{diag} \left{ \omega^F_{1} , \ldots , \omega^F_{p} \right} u \right\Vert_1 + (1 - \alpha) \lambda \sum^L_{l=1} \omega^G_l \left\Vert u^{(l)} \right\Vert_2. $$

Each $\omega$ represents a weight, either for a particular feature ($F$) or group ($G$). The \textit{mixing parameter} $\alpha$ may range between 0 and 1. And as can be seen in the above formula, $\alpha = 1$ leads to the lasso, whereas $\alpha = 0$ leads to the group lasso. However, the fitted variants of the group lasso and SGL differ by substituting the expression $\left\Vert u^{(l)} \right\Vert_2$ with $\left\Vert Z^{(l)} u^{(l)} \right\Vert_2$. Where $Z^{(l)}$ are the columns of $Z$ which correspond to $u^{(l)}$. The expression $Z^{(l)} u^{(l)}$ is also known as \textit{fitted values}.

Example

Once installed, the package shall be loaded in the common way. A simulated example data set is available and loaded just as easily:

library("seagull")
data("seagull_data")

The data consists of a response variable $y$ named phenotypes, a design matrix $Z$ named genotypes, and a variable called group. $Z$ consists of r dim(genotypes)[1] observations and r dim(genotypes)[2] explanatory variables. These variables consist of genotype data from single nucleotide polymorphism marker. The variable $y$ is another matrix, which harbors information of r dim(phenotypes)[2] traits. (One trait per column.) The variable group stores information about group assignments of each explanatory variable.

To fit the SGL, we can just call:

fit_sgl1 <- seagull(y = phenotypes[, 1], Z = genotypes, groups = groups)

The created output is a list with the following attributes:

attributes(fit_sgl1)

The last attribute of this list (loops_lambda) determines the number of different values for the penalty parameter $\lambda$ that were used throughout the grid search. Since we didn't specialize this value when calling the function seagull, it was set to its default, i.e. loops_lambda = r fit_sgl1$loops_lambda. This value determines the number of rows of the variables random_effects and iterations (and also fixed_effects if present). So, if we want to visualize the results from a certain $\lambda$ along the grid, say the very last solution, where $\lambda$ is at its smallest, we could do this as follows:

last_solution <- fit_sgl1$loops_lambda
plot(x = seq(1, dim(genotypes)[2], 1),
     y = fit_sgl1$random_effects[last_solution,],
     xlab = "position", ylab = "effect estimate",
     col = "gray80", pch = 16)

In total, there were r dim(genotypes)[2] features to be estimated. We can check the number of features of this last solution which remained to be exactly equal to zero:

print(c("The number of ZEROS in the last solution is: ", length(which(fit_sgl1$random_effects[last_solution,]==0))))

As this is the result based on the smallest $\lambda$, the solution is noisy. (By that we mean: The number of zeros among the estimates is low compared to the total number of estimates.) We will compare this to a solution that was obtained with a larger value:

last_solution <- fit_sgl1$loops_lambda
plot(x = seq(1, dim(genotypes)[2], 1),
     y = fit_sgl1$random_effects[last_solution,],
     xlab = "position", ylab = "effect estimate",
     col = "gray80", pch = 16)
points(x = seq(1, dim(genotypes)[2], 1),
       y = fit_sgl1$random_effects[20,],
       pch = 16)
print(c("The number of ZEROS in the solution in line 20 is: ", length(which(fit_sgl1$random_effects[20,]==0))))

This is a significantly larger number of zeros among the estimates than what we saw previously. In fact, for this value of $\lambda$ the solution is sparse. Which we expect from a selection operator such as these lasso variants.

Input parameters

In this section we present a comprehensive list of input parameters: We will describe these variables and give additional comments about their use.

Note: The above mentioned variables will only be centered or standardized prior to the calculations, if specified by the user via standardize = TRUE (see below).

Note: If standardize = TRUE, the results will be transformed back and presented on the original scale of the problem.

Further notes

We want to point out a correlation to another lasso variant, called \textit{Integrative lasso with Penalty Factors, or short IPF-LASSO}. If we go back to the implemented SGL problem and its limiting case for the lasso, i.e., $\alpha = 1$, we obtain:

$$ \min_{(b, u)} \frac{1}{2n} \left\Vert y - X b - Z u \right\Vert^2_2 + \lambda \left\Vert \mathrm{diag} \left{ \omega^F_{1} , \ldots , \omega^F_{p} \right} u \right\Vert_1. $$

If we suppose that no fixed effects are present in the model (i.e., $b=0$) and we multiply the entire expression by the factor $2n$, we get:

$$ \min_{(b, u)} \left\Vert y - Z u \right\Vert^2_2 + 2 n \lambda \left\Vert \mathrm{diag} \left{ \omega^F_{1} , \ldots , \omega^F_{p} \right} u \right\Vert_1, $$

or equivalently:

$$ \min_{(b, u)} \left\Vert y - Z u \right\Vert^2_2 + 2 n \lambda \sum^p_{j=1} \left\vert \omega^F_{j} u_j \right\vert. $$

The weights for features $\omega^F$ are assumed to be positive. So, we can simplify the last expression by introducing $\lambda_j = 2 n \lambda \omega^F_j$:

$$ \min_{(b, u)} \left\Vert y - Z u \right\Vert^2_2 + \sum^p_{j=1} \lambda_j \left\vert u_j \right\vert. $$

As a last step, we assume that $u$ is obtained from $M$ different sources ("modalities") and we let all $\lambda$'s which belong to the same modality have the same value. Then the last term can be written as a sum over modalities $m$:

$$ \sum^p_{j=1} \lambda_j \left\vert u_j \right\vert = \sum^M_{m=1} \lambda_m \left\Vert u^{(m)} \right\Vert_1. $$

And this immediately leads to the IPF-LASSO. So in the seagull package, this particular lasso variant is implicitly included for mixed models. The weights for features $\omega^F$ (weights_u) just need to be set accordingly, i.e., the same weight for features that belong to the same modality.



Try the seagull package in your browser

Any scripts or data that you put into this service are public.

seagull documentation built on April 20, 2021, 5:06 p.m.