bllim: EM Algorithm for Block diagonal Gaussian Locally Linear...
In xLLiM: High Dimensional Locally-Linear Mapping

bllim

R Documentation

EM Algorithm for Block diagonal Gaussian Locally Linear Mapping

Description

EM Algorithm for Block diagonal Gaussian Locally Linear Mapping

Usage

bllim(tapp,yapp,in_K,in_r=NULL,ninit=20,maxiter=100,verb=0,in_theta=NULL,plot=TRUE)

Arguments

`tapp`	An `L x N` matrix of training responses with variables in rows and subjects in columns
`yapp`	An `D x N` matrix of training covariates with variables in rows and subjects in columns
`in_K`	Initial number of components or number of clusters
`in_r`	Initial assignments (default NULL). If NULL, the model is initialized with the best initialisation among 20, computed by a joint Gaussian mixture model on both response and covariates.
`ninit`	Number of random initializations. Not used of `in_r` is specified. Default is 20 and the random initialization which maximizes the likelihood is retained.
`maxiter`	Maximum number of iterations (default 100). The algorithm stops if the number of iterations exceeds `maxiter` or if the difference of likelihood between two iterations is smaller than a threshold fixed to `0.001 (max(LL)-min(LL))` where `LL` is the vector of log-likelihoods at the successive iterations.
`verb`	Verbosity: print out the progression of the algorithm. If `verb=0`, there is no print, if `verb=1`, the progression is printed out. Default is 0.
`in_theta`	Initial parameters (default NULL), same structure as the output of this function. The EM algorithm can be initialized either with initial assignments or initial parameters values.
`plot`	Displays plots to allow user to check that the slope heuristics can be applied confidently to select the conditional block structure of predictors, as in the `capushe-package` package. Default is TRUE.

Details

The BLLiM model implemented in this function adresses the following non-linear mapping issue:

E(Y | X=x) = g(x),

where Y is a L-vector of multivariate responses and X is a large D-vector of covariates' profiles such that D \gg L. As gllim and sllim, the bllim function aims at estimating the non linear regression function g.

First, the methods of this package are based on an inverse regression strategy. The inverse conditional relation p(X | Y) is specified in a way that the forward relation of interest p(Y | X) can be deduced in closed-from. Under some hypothesis on covariance structures, the large number D of covariates is handled by this inverse regression trick, which acts as a dimension reduction technique. The number of parameters to estimate is therefore drastically reduced. Second, we propose to approximate the non linear g regression function by a piecewise affine function. Therefore, a hidden discrete variable Z is introduced, in order to divide the space into K regions such that an affine model holds between responses Y and variables X in each region k:

X = \sum_{k=1}^K I_{Z=k} (A_k Y + b_k + E_k)

where A_k is a D \times L matrix of coeffcients for regression k, b_k is a D-vector of intercepts and E_k is a Gaussian noise with covariance matrix \Sigma_k.

BLLiM is defined as the following hierarchical Gaussian mixture model for the inverse conditional density (X | Y):

p(X | Y=y,Z=k;\theta) = N(X; A_kx+b_k,\Sigma_k)

p(Y | Z=k; \theta) = N(Y; c_k,\Gamma_k)

p(Z=k)=\pi_k

where \Sigma_k is a D \times D block diagonal covariance structure automatically learnt from data. \theta is the set of parameters \theta=(\pi_k,c_k,\Gamma_k,A_k,b_k,\Sigma_k)_{k=1}^K. The forward conditional density of interest p(Y | X) is deduced from these equations and is also a Gaussian mixture of regression model.

For a given number of affine components (or clusters) K and a given block structure, the number of parameters to estimate is:

(K-1)+ K(DL+D+L+ nbpar_{\Sigma}+L(L+1)/2)

where L is the dimension of the response, D is the dimension of covariates and nbpar_{\Sigma} is the total number of parameters in the large covariance matrix \Sigma_k in each cluster. This number of parameters depends on the number and size of blocks in each matrices.

Two hyperparameters must be estimated to run BLLiM:

Number of mixtures components (or clusters) K: we propose to use BIC criterion or slope heuristics as implemented in capushe-package
For a given number of clusters K, the block structure of large covariance matrices specific of each cluster: the size and the number of blocks of each \Sigma_k matrix is automatically learnt from data, using an extension of the shock procedure (see shock-package). This procedure is based on a successive thresholding of sample conditional covariance matrix within clusters, building a collection of block structure candidates. The final block structure is retained using slope heuristics.

BLLiM is not only a prediction model but also an interpretable tool. For example, it is useful for the analysis of transcriptomic data. Indeed, if covariates are genes and response is a phenotype, the model provides clusters of individuals based on the relation between gene expression data and the phenotype, and also leads to infer a gene regulatory network specific for each cluster of individuals.

Value

Returns a list with the following elements:

`LLf`	Final log-likelihood
`LL`	Log-likelihood value at each iteration of the EM algorithm
`pi`	A vector of length `K` of mixture weights i.e. prior probabilities for each component
`c`	An `(L x K)` matrix of means of responses (Y)
`Gamma`	An `(L x L x K)` array of `K` matrices of covariances of responses (Y)
`A`	An `(D x L x K)` array of `K` matrices of linear transformation matrices
`b`	An `(D x K)` matrix in which affine transformation vectors are in columns
`Sigma`	An `(D x D x K)` array of covariances of `X`
`r`	An `(N x K)` matrix of posterior probabilities
`nbpar`	The number of parameters estimated in the model

Author(s)

Emeline Perthame (emeline.perthame@pasteur.fr), Emilie Devijver (emilie.devijver@kuleuven.be), Melina Gallopin (melina.gallopin@u-psud.fr)

References

[1] E. Devijver, M. Gallopin, E. Perthame. Nonlinear network-based quantitative trait prediction from transcriptomic data. Submitted, 2017, available at https://arxiv.org/abs/1701.07899.

Examples

data(data.xllim)

## Setting 5 components in the model
K = 5

## the model can be initialized by running an EM algorithm for Gaussian Mixtures (EMGM)
r = emgm(data.xllim, init=K); 
## and then the gllim model is estimated
responses = data.xllim[1:2,] # 2 responses in rows and 100 observations in columns
covariates = data.xllim[3:52,] # 50 covariates in rows and 100 observations in columns

## if initialization is not specified, the model is automatically initialized by EMGM
# mod = bllim(responses,covariates,in_K=K)

## Prediction can be performed using prediction function gllim_inverse_map
# pred = gllim_inverse_map(covariates,mod)$x_exp

xLLiM documentation built on Nov. 2, 2023, 5:17 p.m.