In Marie-PerrotDockes/VariSel: Perform variable selection for different model types

knitr::opts_chunk$set(echo = TRUE, cache = TRUE, warning = FALSE, message = FALSE,
                      dev = "cairo_pdf")
require(VariSel)
require(tidyverse)
require(tidymodels)
require(viridis)
require(furrr)
require(rlang)
require(knitr)
require(ggsci)
require(BlockCov)
require(rsample)
theme_set(theme_bw() + 
            theme(strip.background = element_rect(fill = "white", color ="gray"), 
                  strip.text = element_text(color = "black"),
                  text = element_text(face="bold", family="LM Roman 10"),
                  plot.title = element_text(hjust = 0.5, size =32)))
scale_fill_continuous <- function(...) scale_fill_viridis()
scale_fill_discrete <- function(...) scale_fill_viridis(discrete = TRUE)
scale_colour_continuous <- function(...) scale_color_viridis()
scale_colour_discrete <- function(...) scale_color_viridis(discrete = TRUE)
require(dplyr)

Introduction

Statistical modelling

\begin{itemize}
\item \textbf{Dataset description:}
\begin{itemize}
\item $\boldsymbol{X}$: $n\times p$ design matrix
\item $\boldsymbol{Y}$: $n\times q$ response matrix
\end{itemize}
\item \textbf{Question:} Which variables influence the responses?
\item \textbf{Approach:}
\begin{itemize}
\item Variable selection in
$$ \boldsymbol{Y}=\boldsymbol{XB}+\boldsymbol{E}, $$
where
\begin{itemize}
\item $\boldsymbol{B}$: $p\times q$ \textbf{sparse} coefficient matrix
\item $\boldsymbol{E}$: $n\times q$ error matrix with
$$\forall i\in\{1,\dots, n\}, \; (E_{i,1},\dots,E_{i,q})\stackrel{iid}{\sim}\mathcal{N}(0,\boldsymbol{\Sigma}_q)$$
\end{itemize}
\item We take the dependence into account by estimating $\boldsymbol{\Sigma}_q$.
\end{itemize}
\end{itemize}
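To make the model concrete, here is a minimal simulation sketch in base R. The dimensions, the nonzero entries of `B`, and the covariance `Sigma_q` are arbitrary illustrative choices, not values from the package or from any dataset.

```r
# Simulate Y = X B + E with a sparse B and correlated errors.
# All numerical values below are illustrative assumptions.
set.seed(42)
n <- 500  # observations
p <- 5    # explanatory variables
q <- 3    # responses

X <- matrix(rnorm(n * p), nrow = n, ncol = p)

# Sparse p x q coefficient matrix: only 3 of its 15 entries are nonzero.
B <- matrix(0, nrow = p, ncol = q)
B[1, 1] <- 2
B[2, 3] <- -1.5
B[4, 2] <- 1

# Covariance of one row of E: entry (i, j) equals 0.7^|i - j|.
Sigma_q <- 0.7^abs(outer(1:q, 1:q, "-"))

# chol() returns an upper-triangular R with t(R) %*% R = Sigma_q, so each
# row of Z %*% R is N(0, Sigma_q) when the rows of Z are standard normal.
Z <- matrix(rnorm(n * q), nrow = n, ncol = q)
E <- Z %*% chol(Sigma_q)

Y <- X %*% B + E

dim(Y)            # n x q response matrix
round(cov(E), 2)  # empirical covariance, to compare with Sigma_q
```

The rows of `E` are independent while its columns are correlated, which is the dependence that estimating $\boldsymbol{\Sigma}_q$ is meant to capture.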

Different penalties for different points of view

\begin{itemize}
\item \textbf{Lasso}: \textit{selects variables without taking potential links between them into account.}
\begin{equation}
\widehat{b}_L = \operatorname*{arg\,min}_{b} \left\lbrace \|y-\mathcal{X} b\|_2^2 + \lambda \|b\|_1 \right\rbrace,
\end{equation}
\item \textbf{Group-Lasso}: \textit{selects whole groups of variables.}
\begin{equation} \label{grouplasso}
\widehat{b}_G = \operatorname*{arg\,min}_{b_{(1)}, \dots, b_{(L)}} \left\lbrace \Big\|y-\sum_{1 \leq \ell \leq L}\mathcal{X}_{(\ell)} b_{(\ell)}\Big\|_2^2 + \lambda\sum_{1 \leq \ell \leq L}\sqrt{p_\ell}\, \|b_{(\ell)}\|_2\right\rbrace,
\end{equation}
\item \textbf{Fused-Lasso}: \textit{encourages the variables of a group to share the same coefficient.}
\begin{equation}\label{fuselasso}
\widehat{b}_F = \operatorname*{arg\,min}_{b} \left\lbrace \|y-\mathcal{X} b\|_2^2 + \lambda_1\sum_{(i,j) \in \mathcal{G}} |b_i - b_j|+ \lambda_2 \|b\|_1\right\rbrace.
\end{equation}
\end{itemize}
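The three penalty terms can be compared on a toy coefficient vector. The sketch below uses base R only. The vector `b`, the group assignment `groups`, the edge list `edges` (standing in for $\mathcal{G}$), and the tuning parameters set to 1 are all hypothetical choices made for illustration.

```r
# Evaluate the lasso, group-lasso and fused-lasso penalties on a toy vector.
# b, groups, edges and the lambdas are illustrative assumptions.
b <- c(1, 1, 0, -2)
lambda <- 1

# Lasso: lambda * ||b||_1
lasso_pen <- lambda * sum(abs(b))

# Group lasso: lambda * sum over groups of sqrt(group size) * ||b_group||_2
groups <- c(1, 1, 2, 2)
group_norms <- tapply(b, groups, function(v) sqrt(sum(v^2)))
group_sizes <- tapply(b, groups, length)
group_pen <- lambda * sum(sqrt(group_sizes) * group_norms)

# Fused lasso: lambda1 * sum over edges |b_i - b_j| + lambda2 * ||b||_1
edges <- rbind(c(1, 2), c(3, 4))
lambda1 <- 1
lambda2 <- 1
fused_pen <- lambda1 * sum(abs(b[edges[, 1]] - b[edges[, 2]])) +
  lambda2 * sum(abs(b))

lasso_pen  # 4
group_pen  # 2 + 2 * sqrt(2), about 4.83
fused_pen  # 6
```

The first edge contributes nothing to the fused term because `b[1]` and `b[2]` are equal. This is how the fused penalty rewards linked variables for sharing a coefficient.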

These penalties are stated here for a univariate response. To use them in the multivariate model, and to take into account the dependence that may exist among the response variables, we propose the following transformation:

$\boldsymbol{Y}\widehat{\boldsymbol{\Sigma}}_q^{-1/2}
=\boldsymbol{X}\boldsymbol{B}\widehat{\boldsymbol{\Sigma}}_q^{-1/2} + \boldsymbol{E}\widehat{\boldsymbol{\Sigma}}_q^{-1/2}$
\begin{align}
{\mathcal{Y}} &= \operatorname{vec}(\boldsymbol{Y}\widehat{\boldsymbol{\Sigma}}_q^{-1/2})
= \operatorname{vec}(\boldsymbol{X}\boldsymbol{B}\widehat{\boldsymbol{\Sigma}}_q^{-1/2}) + \operatorname{vec}(\boldsymbol{E}\widehat{\boldsymbol{\Sigma}}_q^{-1/2}) \\
&= \textcolor{blue}{((\widehat{\boldsymbol{\Sigma}}_q^{-1/2})'\otimes \boldsymbol{X})}\textcolor{orange}{\operatorname{vec}(\boldsymbol{B})} + \textcolor{darkred}{\operatorname{vec}(\boldsymbol{E}\widehat{\boldsymbol{\Sigma}}_q^{-1/2})} \\
&= \textcolor{blue}{\mathcal{X}}\textcolor{orange}{\mathcal{B}} + \textcolor{darkred}{\mathcal{E}}.
\end{align}

This transformation requires an estimate of the inverse square root of the covariance matrix $\boldsymbol{\Sigma}_q$.
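The vectorisation step relies on the identity $\operatorname{vec}(\boldsymbol{A}\boldsymbol{B}\boldsymbol{C}) = (\boldsymbol{C}'\otimes\boldsymbol{A})\operatorname{vec}(\boldsymbol{B})$. The following base R sketch checks it numerically on small matrices. As a simplifying assumption, it uses a known covariance matrix `Sigma` and its exact inverse square root in place of an estimate. It does not reproduce how VariSel computes $\widehat{\boldsymbol{\Sigma}}_q^{-1/2}$.

```r
# Numerical check of vec(X B S) = (S' %x% X) vec(B), with S = Sigma^{-1/2}.
# Sigma is assumed known here; in practice it would be estimated.
set.seed(1)
n <- 6
p <- 3
q <- 2

X <- matrix(rnorm(n * p), nrow = n, ncol = p)
B <- matrix(c(1, 0, -2, 0, 0.5, 0), nrow = p, ncol = q)
Sigma <- matrix(c(1, 0.6, 0.6, 1), nrow = q, ncol = q)

# Inverse square root of a symmetric positive definite matrix
# through its eigendecomposition.
eig <- eigen(Sigma, symmetric = TRUE)
Sigma_inv_sqrt <- eig$vectors %*% diag(1 / sqrt(eig$values)) %*% t(eig$vectors)

# Left-hand side: vec() stacks columns, which as.vector() does for a matrix.
lhs <- as.vector(X %*% B %*% Sigma_inv_sqrt)

# Right-hand side: the (n q) x (p q) design matrix times vec(B).
X_cal <- kronecker(t(Sigma_inv_sqrt), X)
rhs <- as.vector(X_cal %*% as.vector(B))

all.equal(lhs, rhs)  # TRUE up to numerical tolerance
dim(X_cal)           # 12 x 6
```

After the transformation, the multivariate problem is an ordinary regression of a vector of length $nq$ on a design matrix with $pq$ columns, so the univariate penalties apply directly.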

Package Installation

devtools::install_github("Marie-PerrotDockes/VariSel")

First examples: the Iris dataset

library(car)
iris %>% 
  head() %>% 
  kable()

Association between sepal characteristics and petal characteristics

The aim of this example is to select associations between the sepal characteristics (length and width) and the petal characteristics (length and width).

Construction of the matrices:

Y <- iris %>% select( starts_with("Sepal"))
Petal_char <- iris %>% select( starts_with("Petal"))

Coefficient estimation using a lasso criterion:

mod <-  train_VariSel( Y = Y, 
                       X = Petal_char, 
                       type ="lasso_univ")

plot(mod)

It starts by selecting an association between the petal length and the sepal length, then an association between the petal length and the sepal width, and so on.

Let us now see what happens when we use the group lasso to force each association to link one sepal characteristic with all petal characteristics.

mod_group <-  train_VariSel( Y = Y, 
                       X = Petal_char, 
                       type ="group_multi_regr", 
                       sepx = "\\.")

The argument sepx = "\\." means that each column name of X is split at the dot into a group name and a characteristic name. Two variables with the same group name are selected together. Here both Petal.Length and Petal.Width start with Petal, hence they are selected together.
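The effect of the separator can be previewed with base R. This sketch only illustrates how the regular expression "\\." splits the column names. It is not VariSel's internal code, and the variable names `group_name` and `char_name` are hypothetical.

```r
# Split column names at the literal dot, as the regular expression "\\." does.
# Illustration only: this is not the package's internal implementation.
petal_names <- c("Petal.Length", "Petal.Width")
parts <- strsplit(petal_names, split = "\\.")

# First piece: group name; second piece: characteristic name.
group_name <- vapply(parts, `[`, character(1), 1)
char_name <- vapply(parts, `[`, character(1), 2)

group_name  # "Petal" "Petal"
char_name   # "Length" "Width"
```

The dot has to be escaped because an unescaped "." matches any character in a regular expression.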

plot(mod_group)

Let us now compare different model types, which either group the petal characteristics or not, and either encourage them to share the same coefficients or not.

col <- pal_uchicago()(6)
m2 <-  train_VariSel( Y = Y, 
                       X = Petal_char, 
                       type ="group_multi_both", 
                       sepx = "\\.")
m3 <-  train_VariSel( Y = Y, 
                       X = Petal_char, 
                       type ="fused_multi_both", 
                       sepx = "\\.")
m4 <-  train_VariSel( Y = Y, 
                       X = Petal_char, 
                       type ="fused_multi_regr", 
                       sepx = "\\.")

mods = list(mod_group,m2,m3,m4)
p2 <- compar_path(mods)

We can also propose different associations between petal characteristics and sepal characteristics depending on the species.

mod_group_species <-  train_VariSel( Y = Y, 
                       regressors =  Petal_char, 
                       group = iris$Species,
                       type ="group_multi_regr")

plot(mod_group_species)

col <- pal_uchicago()(6)

mod_group_resp_species <-  train_VariSel( Y = Y,
                       regressors = Petal_char, 
                       group = iris$Species, 
                       type ="group_multi_both")

mod_fused_multi_species <-  train_VariSel( Y = Y,
                       regressors = Petal_char, 
                       group = iris$Species, 
                       type ="fused_multi_regr")
mod_fused_multi_species_resp <-  train_VariSel( Y = Y,
                       regressors = Petal_char, 
                       group = iris$Species, 
                       type ="fused_multi_both")

compar_path(mods = list(mod_group_species,
                        mod_group_resp_species,
                        mod_fused_multi_species,
                        mod_fused_multi_species_resp))

mod_lasso <-   train_VariSel( Y = Y,
                       regressors = Petal_char, 
                       group = iris$Species, 
                       type ="lasso_multi")