Network-regularized regression models

Introduction {-}

This vignette covers an introduction to network-regularized generalized linear regression models (GLMs). Network-regularization makes use of graphs to model interactions of covariables and responses in generalized linear regression models and penalizes the coefficients in GLMs using this information.

Edgenet

edgenet uses a $(p \times p)$-dimensional affinity matrix $\mathbf{G}X \in \mathbb{R}+^{p \times p}$ to model the interaction of $p$ covariables in $\mathbf{X} \in \mathbb{R}^{n \times p}$ and analogously a matrix $\mathbf{G}Y \in \mathbb{R}+^{q \times q}$ to model interactions of $q$ response variables in $\mathbf{Y} \in \mathbb{R}^{n \times q}$. The affinity matrices are used for regularization of regression coefficients like this:

\begin{equation} \small \begin{split} \hat{\mathbf{B}} = \arg \min_{\mathbf{B}} \, & -\ell(\mathbf{B}) + \lambda ||\mathbf{B} ||1 \ & + \frac{\psi_1}{2} \sum{i=1}^p \sum_{j=1}^p || \boldsymbol \beta_{i,} - \boldsymbol \beta_{j,} ||2^2 (\mathbf{G}_X){i,j} \ & + \frac{\psi_2}{2} \sum_{i=1}^q \sum_{j=1}^q || \boldsymbol \beta_{,i} - \boldsymbol \beta_{,j} ||2^2 (\mathbf{G}_Y){i,j} \ = \arg \min_{\mathbf{B}} \, & -\ell(\mathbf{B}) + \lambda ||\mathbf{B} ||1 \ & + \psi_1 \text{tr}\left( \mathbf{B}^T \left( \mathbf{D}{\mathbf{G}X} - \mathbf{G}_X \right) \mathbf{B} \right) \ & + \psi_2 \text{tr}\left( \mathbf{B} \left( \mathbf{D}{\mathbf{G}_Y} - \mathbf{G}_Y \right) \mathbf{B}^T \right) \end{split} \label{equ:graphreg} \end{equation}

Matrix $\mathbf{B} \in \mathbb{R}^{(p + 1) \times q}$ is the matrix of regression coefficients to be estimated (including an intercept). Vectors $\boldsymbol \beta_{i, }$ and $\boldsymbol \beta_{, i}$ are the $i$-th row or column of $\mathbf{B}$, respectively. Shrinkage parameters $\lambda$, $\psi_1$ and $\psi_2$ are fixed or need to be estimated (e.g., using cross-validation). The matrices $\mathbf{D}_{\mathbf{G}} - \mathbf{G}$ are the combinatorial (or normalized) graph Laplacians of an affinity matrix $\mathbf{G}$ [@chung1997spectral].

Group Lasso

TODO [@yuan2006model] and [@meier2008group]

Families

So far netReg supports the following exponential family random variables:

The log-likelihood function $\ell(\mathbf{B})$ over $q$ response variables is defined as:

\begin{equation} \small \ell(\mathbf{B}) = \sum_{j}^q \sum_i^n \log \ \mathbb{P}\left({y}{i, j} \mid h\left(\mathbf{x}{i,} \cdot \beta_{,j}\right), \phi \right) \end{equation}

where $h$ is the inverse of a link function, such as the logarithm, and $\phi$ is a dispersion parameter.

Fitting a model to data

The following section shows how to use network regularization models. We shall use edgenet, but the calls for the other methods are analogous and examples can be found in the method documentation. We first load some libraries:

library(tensorflow)
library(tfprobability)
library(netReg)

Set some parameters and create affinity matrices:

# parameters
n <- 100
p <- 10
q <- 10

# affinity matrices
G.X <- abs(rWishart(1, 10, diag(p))[,,1])
G.Y <- abs(rWishart(1, 10, diag(q))[,,1])

We created the affinity matrix absolutely random, although in practice a real (biological) observed affinity matrix should be used, because in the end the affinity matrix decides the shrinkage of the coefficients.

The actual fit is straightforward. We create Gaussian data first and then fit the model:

# data
X <- matrix(rnorm(n * p), n)
B <- matrix(rnorm(p * q), p)
Y <- X %*% B + matrix(rnorm(n * q, 0, 0.1), n)

fit <- edgenet(X=X, Y=Y, G.X=G.X, G.Y=G.Y, family=gaussian, maxit=10)
summary(fit)

The fit object contains information about coefficients, intercepts etc.

coef(fit)[,1:5]

Having the coefficients estimated we are able to predict novel data-sets:

pred  <- predict(fit, X)
pred[1:5, 1:5]

The binomial case is the same only with a different family argument:

try({
  edgenet(X=X, Y=Y, G.X=G.X, G.Y=G.Y, family=binomial, maxit=11)  
})
# data
X <- matrix(rnorm(n * p), n)
B <- matrix(rnorm(p * q), p)

eta <- 1 / (1 + exp(-X %*% B))
Y.binom <- do.call("cbind", lapply(seq(10), function(.) rbinom(n, 1, eta[,.])))

fit <- edgenet(X=X, Y=Y, G.X=G.X, G.Y=G.Y, family=binomial, maxit=10)
summary(fit)

The Poisson case of course works analogously:

# data
X <- matrix(rnorm(n * p), n)
B <- matrix(rnorm(p * q), p)

eta <- exp(-X %*% B)
Y.pois <- do.call("cbind", lapply(seq(10), function(.) rpois(n, eta[,.])))

fit <- edgenet(X=X, Y=Y.pois, G.X=G.X, G.Y=G.Y, family=poisson, maxit=10)
summary(fit)

Model selection

In most cases we do not have the optimal shrinkage parameters $\lambda$, $\psi_{1}$ and $\psi_{2}$. For these settings you can use netReg's included model selection. We use Powell's BOBYQA algorithm [@powell2009bobyqa] for gradient-free optimization in a coss-validation framework. Doing the model selection only requires calling cv.edgenet:

cv <- cv.edgenet(X=X, Y=Y, G.X=G.Y, G.Y, family=gaussian, optim.maxit=10, maxit=10)
summary(cv)

The cross-validation and BOBYQA is quite computationally intensive, so we set optim.maxit to $10$. I recommend using TensorFlow on a GPU for that, since edgenet needs to compute many matrix multiplications.

The cv.edgenet object also contains a fit with the optimal parameters.

summary(cv$fit)

You can obtain the coefficients like this:

coef(cv)[,1:5]

Furthermore, you can also directly predict using a cv.edgenet object:

pred  <- predict(cv, X)
pred[1:5, 1:5]

A biological example

This section explains how to fit a linear model and do parameter estimation.

At first we load the library and some data:

library(netReg)
data("yeast")

ls(yeast)

X <- yeast$X
Y <- yeast$Y
G.Y <- yeast$GY

The yeast data $\mathbf{X}$ and $\mathbf{Y}$ set is taken and adopted from [@brem2005genetic], [@storey2005multiple], and [@cheng2014graph]. $\mathbf{GY}$ is taken from BioGRID.

$\mathbf{X}$ is a $(n \times p)$ matrix of genetic markers where $n$ is the number of samples (112) and $p$ is the number of markers. $\mathbf{Y}$ is a $(n \times q)$ matrix of expression values for $q$ yeast genes. $n$ is again the numer of samples (112). $\mathbf{GY}$ is a $(q \times q)$ adjacency matrix representing protein-protein interactions for the $q$ response variables.

We only use a smaller network in order to be able to print the results here.

fit <- edgenet(X=X, Y=Y, G.Y=G.Y, lambda=5, family=gaussian, maxit=10, thresh=1e-3)
summary(fit)

For the response variables we use an affinity matrix that represents biological relationships with $\mathbf{GY}$. We promote sparsity by setting $\lambda = 5$ and put a weak (default) penalty on similar coefficients by setting $\psi_{gy} = 1$. Other than that we used standard parameters in this case.

The fit object contains information about coefficients and intercepts. Having the coefficients estimated we are able to predict novel data-sets:

X.new <- matrix(rnorm(10 * ncol(X)), 10)
pred  <- predict(fit, X.new)
pred[1:10, 1:5]

References



Try the netReg package in your browser

Any scripts or data that you put into this service are public.

netReg documentation built on Nov. 8, 2020, 5:17 p.m.