Elastic Net Penalized Structural Equation Models

Introduction




This vignette demonstrates the usage of the sparseSEM package via a simulated network as well as a Gene Regulatory Network (GRN) in budding yeast (Saccharomyces cerevisiae).

Simulated Network

The simulated network is relatively small and sparse for demonstration purposes. A total of 30 nodes with 30 edges were simulated to form a directed acyclic graph (DAG), following the description in Huang (2014) [1].

Yeast Gene Regulatory Network (GRN)

The dataset contains expression levels of 6,126 yeast ORFs and genotypes of 2,956 genetic markers from 112 yeast segregants derived from a cross between a laboratory strain (BY4716) and a wild strain (RM11-1a) [2].

The following steps were taken to arrive at the final dataset in data(yeast):

The gene expression profiles of 3,380 ORFs and the genetic markers of 1,162 eQTLs were then used for network inference with sparseSEM.
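Once the package is attached, the processed dataset can be inspected directly. Below is a minimal sketch; the comments assume the same nodes-by-samples layout as the simulated data used later in this vignette:

library(sparseSEM)
data(yeast)   # loads the processed expression matrix Y and the eQTL genotype matrix X
dim(Y)        # ORFs (rows) by segregants (columns)
dim(X)        # eQTL markers (rows) by segregants (columns)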




Quick Start

We give users a general idea of the package by using the simulated example to demonstrate basic package usage. By running the main functions and examining the outputs, users can get a better sense of how the package works, which functions are available, which parameters to choose, and where to seek help. More information on the GRN is provided in later sections.

Let us first clear the workspace and load the sparseSEM package:

rm(list = ls())
library(sparseSEM)

The simulated data consist of the matrices X, Y, B, and Missing. Y and X each have 30 rows (nodes) and 200 columns (samples), which can be verified as follows:

data(B);
data(Y);
data(X);
data(Missing);
cat("dimenstion of Y: ",dim(Y) )
cat("dimenstion of X: ",dim(X) )

Note that both B and Missing are optional arguments for all functions in the package.
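For instance, the two matrices can simply be left out of the call. A minimal sketch (fit0 is just an illustrative name; the same Y-and-X-only form is used for the yeast data later in this vignette):

fit0 = elasticNetSEM(Y, X, verbose = 1)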

We fit the model with a basic call to elasticNetSEM, here also supplying the optional Missing and B matrices:

set.seed(1)
output = elasticNetSEM(Y, X, Missing, B, verbose = 1); 
names(output)

"output" is a list containing all the relevant information of the fitted model. Users can examine the output by directly looking at each element in the list. Particularly, the sparse regression coefficients can be extracted as showed below:

fit_SEM = output$weight
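A quick check of how sparse the fitted network is (a simple sketch, not a package function):

dim(fit_SEM)                   # 30 x 30 coefficient matrix over the 30 nodes
length(which(fit_SEM != 0))    # number of inferred edges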

Here we can use the plot.matrix package to visually compare the sparseSEM-estimated network structure with the simulated ground truth:

library('plot.matrix')
par(mfrow = c(1, 2),mar=c(5.1, 4.1, 4.1, 4.1))
plot(B)
plot(fit_SEM)

The left plot is the simulation ground truth, and the right plot is the sparseSEM-inferred network. The plots show that 29 out of the 30 edges in the network are captured by sparseSEM (the missed edge connects node 2 (column 2) to node 26 (row 26)). Meanwhile, the estimated edge weights are somewhat smaller in the fitted model due to shrinkage from the elastic net penalty.
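The visual comparison can also be checked numerically by comparing the non-zero patterns of the two matrices (a minimal sketch):

sum(B != 0 & fit_SEM != 0)     # true edges recovered (29 of the 30 here)
sum(B == 0 & fit_SEM != 0)     # inferred edges not present in the ground truth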

Cross Validation (CV)

The hyperparameters in the elastic net are (alpha, lambda), which control the number of non-zero effects to be selected, and cross-validation (CV) is perhaps the simplest and most widely used method for choosing their values. While elasticNetSEM performs all of this computation behind the scenes, elasticNetSEMcv allows users to control more details of the CV, and can be called as follows.

set.seed(1)
cvfit = elasticNetSEMcv(Y, X, Missing, B, alpha_factors = c(0.75, 0.5, 0.25),
                      lambda_factors=c(0.1, 0.01, 0.001), kFold = 5, verbose  = 1);
names(cvfit)

elasticNetSEMcv returns a cvfit object, a list with all the ingredients of the CV and the final fit obtained with the CV-selected hyperparameters. We can print the CV results, the selected hyperparameters, and the corresponding coefficients. For example, the results for different hyperparameter combinations and the corresponding prediction errors are shown below:

head(cvfit$cv)

Note that the values supplied for the hyperparameters (alpha, lambda) are alpha_factors and lambda_factors, which are weights applied to the actual (alpha, lambda) and should lie in the range [0, 1]; i.e., $\alpha = \text{alpha\_factor} \cdot \alpha_0$ and $\lambda = \text{lambda\_factor} \cdot \lambda_0$. Here $\alpha_0 = 1$, since in the elastic net $0 \leq \alpha \leq 1$. In contrast, $\lambda_0$ is an algorithm-computed value such that there is one and only one non-zero element in the matrix $\hat{\mathbf{B}}$. See Huang (2014) [1] for details on the lasso discarding rule for SEM.

As shown in the plots below, with this choice of alpha_factors and lambda_factors, all 30 edges are captured:

par(mfrow = c(1, 2),mar=c(5.1, 4.1, 4.1, 4.1))
plot(B)
plot(cvfit$fit$weight)

With the ground truth B provided, the functions also compute performance statistics in fit$statistics, a 6x1 array recording:

            1. correct positive 
            2. total positive   
            3. false positive   
            4. positive detected  
            5. Power of detection (PD) = correct positive/total positive  
            6. False Discovery Rate (FDR) = false positive/positive detected
cvfit$fit$statistics
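These quantities can also be reproduced by hand from the ground truth B and the fitted weights, using the definitions above (a sketch based on the non-zero pattern of cvfit$fit$weight):

est = cvfit$fit$weight
TP  = sum(B != 0 & est != 0)         # correct positives
FP  = sum(B == 0 & est != 0)         # false positives
c(PD = TP / sum(B != 0),             # power of detection
  FDR = FP / sum(est != 0))          # false discovery rate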

Users may also notice that in the two examples above, fewer lambdas were computed than specified via the lambda_factors parameter (20 lambdas by default). This is due to the early-stop rule implemented in [1]: if the current lambda leads to a mean squared error (MSE) larger than that of the previous lambda plus one standard error (SE), the selection path is not continued. See the selection path and discarding rules in [1] for more details.
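The idea of the rule can be illustrated with a small sketch (the logic only, not the package's internal implementation):

# Walk down the lambda path and stop once the CV error exceeds the previous
# lambda's error by more than one standard error; return the index of the
# last lambda kept.
early_stop = function(mse, se) {
  for (k in 2:length(mse)) {
    if (mse[k] > mse[k - 1] + se[k - 1]) return(k - 1)
  }
  length(mse)
}
early_stop(mse = c(1.00, 0.82, 0.78, 0.95), se = rep(0.05, 4))   # returns 3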

Stability Selection (STS)

With a twist of the hyperparameters, the plots above show that the set of non-zero edges can differ slightly. Stability Selection [5, 6] is an algorithm that achieves stable effect selection with a controlled False Discovery Rate (FDR). The algorithm relies on bootstrapping (typically n = 100 rounds); in each round, half of the dataset is randomly selected and run through the lasso/elastic net selection path. The stable effects are those consistently selected across bootstraps, as measured by the bounded per-comparison error rate and FDR.
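Before calling the package function, the selection-frequency idea behind STS can be sketched in a few lines (a toy illustration with made-up frequencies and a hypothetical threshold, not the package's implementation):

# Hypothetical selection frequencies of four candidate edges over 100 bootstraps
sel_freq = c(e1 = 0.98, e2 = 0.95, e3 = 0.40, e4 = 0.05)
pi_thr   = 0.6                        # stability threshold pi [5, 6]
names(sel_freq)[sel_freq >= pi_thr]   # edges declared stable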

The following code shows an example of running STS:

tStart = proc.time()
set.seed(0)
output = enSEM_stability_selection(Y,X, Missing,B,
                                 alpha_factors = seq(1,0.1, -0.1), 
                                 lambda_factors =10^seq(-0.2,-3,-0.2), 
                                 kFold = 5,
                                 nBootstrap = 100,
                                 verbose = -1)
tEnd = proc.time()
simTime = tEnd - tStart;
print(simTime)
names(output)
cat("nSTS = ", length(which(output$STS !=0)))

One key aspect of the STS method is that the hyperparameters are generally set toward a higher penalty, so that only a few non-zero edges enter each fit. If most of the hyperparameter settings carry a small penalty, the number of non-zero edges becomes large, and no matter how the threshold (pi) [5, 6] is set, there will be a large number of false positives.

In the example above, the lambda_factors values are therefore kept from being too small, and all true effects are captured with no false positives, as shown below:

B[which(B != 0)] = 1   # binarize the ground truth for comparison with the binary STS output
par(mfrow = c(1, 2),mar=c(5.1, 4.1, 4.1, 4.1))
plot(B)
plot(output$STS)

Note that STS outputs a binary matrix, with $B_{ij} = 1$ indicating an edge from node $j$ to node $i$ and 0 otherwise. Parallel bootstrapping is also implemented via the R package parallel:

library(parallel)
cl <- makeCluster(4, type = "PSOCK")
clusterEvalQ(cl,{library(sparseSEM)})
output = enSEM_stability_selection_parallel(Y,X, Missing,B,
                                      alpha_factors = seq(1,0.1, -0.1), 
                                      lambda_factors =10^seq(-0.2,-3,-0.2), 
                                      kFold = 3,
                                      nBootstrap = 100,
                                      verbose = -1,
                                      clusters = cl)
stopCluster(cl) 




Yeast GRN

We next turn to a real-world scenario and apply sparseSEM to the yeast GRN dataset. While the simulated network above has 30 nodes, the yeast GRN has 3,380 nodes, which increases the computation time significantly: roughly 30 minutes with elasticNetSEM and 1.5 hours with enSEM_stability_selection on an Intel i5 CPU.

rm(list = ls())
library(sparseSEM)
data(yeast)
output = elasticNetSEM(Y, X, verbose = 1)
# STS
STS = enSEM_stability_selection(Y,X,verbose = -1)
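The fitted objects can then be examined in the same way as in the simulated example (a sketch assuming the same field names as in the earlier output):

dim(output$weight)       # inferred coefficient matrix over the 3,380 ORFs
sum(output$weight != 0)  # number of inferred regulatory edges
sum(STS$STS != 0)        # number of stable edges from stability selection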

For even larger networks, a parallel-computation version of sparseSEM is also available at the sparseSEM GitHub repository.


References

  1. Huang, A. (2014). "Sparse model learning for inferring genotype and phenotype associations." Ph.D. Dissertation, Chapter 7. University of Miami (1186).

  2. Brem RB, Yvert G, Clinton R, Kruglyak L: Genetic dissection of transcriptional regulation in budding yeast. Science 2002, 296:752-755.

  3. Kellis M, Patterson N, Endrizzi M, Birren B, Lander ES: Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature 2003, 423:241-254.

  4. Brem RB, Kruglyak L: The landscape of genetic complexity across 5,700 gene expression traits in yeast. Proceedings of the National Academy of Sciences of the United States of America 2005, 102:1572-1577

  5. Meinshausen, N. and Buhlmann, P., 2010. Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4), pp.417-473.

  6. Shah, R.D. and Samworth, R.J., 2013. Variable selection with error control: another look at stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(1), pp.55-80.


