| bllim | R Documentation |
EM Algorithm for Block diagonal Gaussian Locally Linear Mapping
bllim(tapp,yapp,in_K,in_r=NULL,ninit=20,maxiter=100,verb=0,in_theta=NULL,plot=TRUE)
tapp | An L \times N matrix of training responses, with variables in rows and observations in columns.
yapp | A D \times N matrix of training covariates, with variables in rows and observations in columns.
in_K | Initial number of components, or number of clusters.
in_r | Initial assignments (default NULL). If NULL, the model is initialized with the best of ninit initialisations, computed by a joint Gaussian mixture model on both responses and covariates.
ninit | Number of random initializations (default 20). Not used if in_r is specified.
maxiter | Maximum number of iterations (default 100). The algorithm stops if the number of iterations exceeds maxiter.
verb | Verbosity: print out the progression of the algorithm. If verb=0 (default), nothing is printed; if verb=1, the progression is printed out.
in_theta | Initial parameters (default NULL), with the same structure as the output of this function. The EM algorithm can be initialized either with initial assignments or with initial parameter values.
plot | Displays plots so the user can check that the slope heuristics can be applied confidently to select the conditional block structure of predictors, as in the capushe package (default TRUE).
The BLLiM model implemented in this function addresses the following non-linear mapping issue:
E(Y | X=x) = g(x),
where Y is an L-vector of multivariate responses and X is a large D-vector of covariate profiles such that D \gg L. Like gllim and sllim, the bllim function aims at estimating the non-linear regression function g.
First, the methods of this package are based on an inverse regression strategy. The inverse conditional relation p(X | Y) is specified in such a way that the forward relation of interest p(Y | X) can be deduced in closed form. Under hypotheses on the covariance structures, the large number D of covariates is handled by this inverse regression trick, which acts as a dimension reduction technique. The number of parameters to estimate is therefore drastically reduced. Second, the non-linear regression function g is approximated by a piecewise affine function. A hidden discrete variable Z is therefore introduced, in order to divide the space into K regions such that an affine model holds between the responses Y and the variables X in each region k:
X = \sum_{k=1}^K I_{Z=k} (A_k Y + b_k + E_k)
where A_k is a D \times L matrix of coefficients for regression k, b_k is a D-vector of intercepts and E_k is a Gaussian noise with covariance matrix \Sigma_k.
BLLiM is defined as the following hierarchical Gaussian mixture model for the inverse conditional density (X | Y):
p(X | Y=y, Z=k; \theta) = N(X; A_k y + b_k, \Sigma_k)
p(Y | Z=k; \theta) = N(Y; c_k,\Gamma_k)
p(Z=k)=\pi_k
where \Sigma_k is a D \times D block diagonal covariance structure automatically learnt from data. \theta is the set of parameters \theta=(\pi_k,c_k,\Gamma_k,A_k,b_k,\Sigma_k)_{k=1}^K.
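As a hedged illustration of this hierarchical model, the following sketch simulates data from the inverse relations p(Y | Z) and p(X | Y, Z) with small, hypothetical dimensions and diagonal \Sigma_k (blocks of size one); it does not use the package and is not the estimation code.

## Minimal simulation of the inverse hierarchical model (all values hypothetical).
set.seed(1)
K <- 2; L <- 2; D <- 6; N <- 100
pi_k  <- c(0.4, 0.6)                          # mixture weights pi_k
c_k   <- list(rep(0, L), rep(2, L))           # means c_k of Y per cluster
Gam_k <- list(diag(L), 0.5 * diag(L))         # covariances Gamma_k of Y per cluster
A_k   <- list(matrix(rnorm(D * L), D, L),
              matrix(rnorm(D * L), D, L))     # D x L regression matrices A_k
b_k   <- list(rnorm(D), rnorm(D))             # D-vectors of intercepts b_k
Sig_k <- list(0.1 * diag(D), 0.2 * diag(D))   # Sigma_k, here diagonal (blocks of size 1)
Z <- sample(1:K, N, replace = TRUE, prob = pi_k)
Y <- matrix(NA, L, N); X <- matrix(NA, D, N)
for (n in 1:N) {
  k <- Z[n]
  Y[, n] <- c_k[[k]] + t(chol(Gam_k[[k]])) %*% rnorm(L)    # draw Y | Z = k
  X[, n] <- A_k[[k]] %*% Y[, n] + b_k[[k]] +
    t(chol(Sig_k[[k]])) %*% rnorm(D)                       # draw X | Y, Z = k
}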
The forward conditional density of interest p(Y | X) is deduced from these equations and is also a Gaussian mixture of regression model.
For a given number of affine components (or clusters) K and a given block structure, the number of parameters to estimate is:
(K-1) + K(DL + D + L + nbpar_{\Sigma} + L(L+1)/2)
where L is the dimension of the response, D is the dimension of the covariates and nbpar_{\Sigma} is the number of parameters in the large covariance matrix \Sigma_k of each cluster. This number of parameters depends on the number and size of the blocks in each matrix.
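As a rough illustration, the hypothetical helper below (nb_param_bllim is not a function of the package) computes this count from candidate block sizes, allowing the blocks, and hence nbpar_{\Sigma}, to differ across clusters.

## Hypothetical helper: free-parameter count for given K, L, D and a list of
## block-size vectors, one vector per cluster. A block of size d in \Sigma_k
## contributes d*(d+1)/2 parameters.
nb_param_bllim <- function(K, L, D, blocks) {
  stopifnot(length(blocks) == K, all(sapply(blocks, sum) == D))
  nbpar_Sigma <- sapply(blocks, function(d) sum(d * (d + 1) / 2))
  (K - 1) + K * (D * L + D + L + L * (L + 1) / 2) + sum(nbpar_Sigma)
}
## Example: K = 2, L = 2, D = 50, a fully diagonal structure in cluster 1 and
## two blocks of 25 covariates in cluster 2.
nb_param_bllim(K = 2, L = 2, D = 50, blocks = list(rep(1, 50), c(25, 25)))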
Two hyperparameters must be estimated to run BLLiM:
Number of mixture components (or clusters) K: we propose to use the BIC criterion or the slope heuristics as implemented in the capushe package (see the sketch after this list).
For a given number of clusters K, the block structure of the large covariance matrices specific to each cluster: the size and the number of blocks of each \Sigma_k matrix are automatically learnt from the data, using an extension of the shock procedure (see the shock package). This procedure is based on successive thresholding of the sample conditional covariance matrices within clusters, building a collection of block structure candidates. The final block structure is retained using the slope heuristics.
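As an illustrative sketch for the first point, K can be compared through BIC computed from the LLf and nbpar elements returned by bllim; this is not the package's built-in selection routine, and the slope heuristics of capushe remain an alternative. The objects responses and covariates are those of the Examples section.

## Illustrative sketch (not run): choose K by BIC from several fitted models,
## using the LLf and nbpar elements returned by bllim.
# N <- ncol(responses)
# Ks <- 2:6
# fits <- lapply(Ks, function(k) bllim(responses, covariates, in_K = k, verb = 0, plot = FALSE))
# bic <- sapply(fits, function(m) -2 * m$LLf + log(N) * m$nbpar)
# K_best <- Ks[which.min(bic)]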
BLLiM is not only a prediction model but also an interpretable tool. For example, it is useful for the analysis of transcriptomic data: if the covariates are genes and the response is a phenotype, the model provides clusters of individuals based on the relation between gene expression and the phenotype, and it also makes it possible to infer a gene regulatory network specific to each cluster of individuals.
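As a hedged sketch of such an analysis, the cluster of each individual can be read from the posterior probabilities r returned by the function, and the learnt block structure of each \Sigma_k can be inspected from the Sigma element (names and shapes as in the list of returned elements below; mod denotes a fitted model as in the Examples).

## Illustrative sketch (not run), for a fitted model `mod` as in the Examples,
## assuming Sigma is returned as a D x D x K array and r as an N x K matrix:
# cluster <- apply(mod$r, 1, which.max)   # hard cluster assignment per observation
# table(cluster)                          # cluster sizes
## non-zero off-diagonal entries of Sigma_k reflect the block (network)
## structure learnt for cluster k, here k = 1:
# image(1 * (mod$Sigma[, , 1] != 0))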
Returns a list with the following elements:
LLf | Final log-likelihood.
LL | Log-likelihood value at each iteration of the EM algorithm.
pi | A vector of length K of estimated mixture weights \pi_k.
c | An L \times K matrix whose columns are the estimated means c_k of the responses for each cluster.
Gamma | An L \times L \times K array of the estimated covariance matrices \Gamma_k of the responses.
A | A D \times L \times K array of the estimated regression matrices A_k.
b | A D \times K matrix whose columns are the estimated intercept vectors b_k.
Sigma | A D \times D \times K array of the estimated block-diagonal covariance matrices \Sigma_k, whose block structure is learnt from the data.
r | An N \times K matrix of posterior probabilities of cluster membership for the N observations.
nbpar | The number of parameters estimated in the model.
Emeline Perthame (emeline.perthame@pasteur.fr), Emilie Devijver (emilie.devijver@kuleuven.be), Melina Gallopin (melina.gallopin@u-psud.fr)
[1] E. Devijver, M. Gallopin, E. Perthame. Nonlinear network-based quantitative trait prediction from transcriptomic data. Submitted, 2017, available at https://arxiv.org/abs/1701.07899.
xLLiM-package, emgm, gllim_inverse_map, capushe-package, shock-package
data(data.xllim)
## Setting 5 components in the model
K = 5
## the model can be initialized by running an EM algorithm for Gaussian mixtures (EMGM);
## the resulting assignments can be passed to bllim through the in_r argument
r = emgm(data.xllim, init=K)
## then the bllim model is estimated on the responses and covariates
responses = data.xllim[1:2,] # 2 responses in rows and 100 observations in columns
covariates = data.xllim[3:52,] # 50 covariates in rows and 100 observations in columns
## if initialization is not specified, the model is automatically initialized by EMGM
# mod = bllim(responses,covariates,in_K=K)
## Prediction can be performed using prediction function gllim_inverse_map
# pred = gllim_inverse_map(covariates,mod)$x_exp