This vignette demonstrates the usage of sparseSEM
package via a simulated network as well as a Gene Regulatory Network (GRN) in budding yeast (Saccharomyces cerevisiae).
The simulated network is relatively small and sparse for demonstration purpose. A total of 30 nodes with 30 edges were simulated to form a directed acyclic graph (DAG) following the description in Huang 2014 [1]:
The data contains expression levels of 6,126 yeast ORFs and 2,956 genetic markers from 112 yeast segregants from the cross of a laboratory strain (BY4716) and a wild strain (RM11-1a) [2].
The following steps were followed to arrived at the final dataset in data(yeast)
:
q-value
following the procedure described in [4]. The gene expression profile of 3,380 ORFs and genetic marker of 1,162 eQTLs were then used for network inference by sparseSEM
.
We will give users a general idea of the package by using the simulated example to demonstrates the basis package usage. Through running the main functions and examining the outputs, users may have a better idea on how the package works, what functions are available, which parameters to choose, as well as where to seek help. More information on the GRN will be provided in later sections.
Let us first clear up the workspace and load the sparseSEM
package:
rm(list = ls()) library(sparseSEM)
The simulated network includes matrices of X
, Y
, B
, and Missing
with 30 rows and 200 columns giving the following measurements:
Y
: The observed node response data with dimension of M (nodes) by N (samples);X
: The network node attribute (for causal effect inference) with dimension of M by N;B
: A sparse M by M matrix which contains the actual edges among the network of M nodes.Missing
: M by N matrix corresponding to elements of Y to denote any missing values.data(B); data(Y); data(X); data(Missing); cat("dimenstion of Y: ",dim(Y) ) cat("dimenstion of X: ",dim(X) )
Note that both B
and Missing
are optional to run all functions in the package.
We fit the model using the most basic call to elasticNetSEM
:
set.seed(1) output = elasticNetSEM(Y, X, Missing, B, verbose = 1); names(output)
"output" is a list containing all the relevant information of the fitted model. Users can examine the output by directly looking at each element in the list. Particularly, the sparse regression coefficients can be extracted as showed below:
fit_SEM = output$weight
Here we can use package plot.matrix
to visually examine the sparseSEM
estimated network structure vs. the simulated ground truth:
library('plot.matrix') par(mfrow = c(1, 2),mar=c(5.1, 4.1, 4.1, 4.1)) plot(B) plot(fit_SEM)
The left plot is the simulation ground truth, and the right plot is the sparseSEM
inferred network. The plot shows that 29 out of the 30 edges in the network are captured by sparseSEM
(misses 1 edge connect node_2 (column 2) to node_26 (row 26)). Meanwhile, the estimated edge weights are relatively smaller in the fitted model due to the elastic net
penalty.
The hyperparameters in elasitc net
are specified as (alpha, lambda) which control the number of non-zero effects to be selected, and cross-validation (CV) is perhaps the simplest and most widely used method in deciding their values. while elasticNetSEM
performs all the computation behind the scene, elasticNetSEMcv
is the function to allow users to control more details of CV, which can be called using the following code.
set.seed(1) cvfit = elasticNetSEMcv(Y, X, Missing, B, alpha_factors = c(0.75, 0.5, 0.25), lambda_factors=c(0.1, 0.01, 0.001), kFold = 5, verbose = 1); names(cvfit)
elasticNetSEMcv
returns a cvfit
object, which is a list with all the ingredients of CV and the final fit results using CV selected hyperparameters. We can print out the CV results, selected hyperparameters and the corresponding coefficients. For example, result using different hyperparameters and the corresponding prediction errors are shown below:
head(cvfit$cv)
Note that the values for hyperparameters (alpha, lambda) are alpha_factor
and lambda_factor
, which are the weights of the actual (alpha, lambda) and should be in range of [0,1]. I.e., $\alpha = alpha_factor \alpha_0$ and $\lambda = lambda_factor \lambda_0$. Here $\alpha_0 = 1$ as in elastic net
, $0\leq \alpha \leq 1$. However, $\lambda_0$ is algorithm computed value such that there is one and only one non-zero element in matrix $\hat{\mathbf{B}}$. See Huang (2014) [1] for details on the lasso
discarding rule for SEM.
As shown in the plot below, due to the granular choice of alpha_factor
and lambda_factor
, the performance result shows that all 30 edges are captured:
par(mfrow = c(1, 2),mar=c(5.1, 4.1, 4.1, 4.1)) plot(B) plot(cvfit$fit$weight)
With ground truth B
provided, the functions also compute the performance statistics in `fit$statistics$, which is 6x1 array keeping record of:
1. correct positive 2. total positive 3. false positive 4. positive detected 5. Power of detection (PD) = correct positive/total positive 6. False Discovery Rate (FDR) = false positive/positive detected
cvfit$fit$statistics
User may also notice that the above two examples only have a few lambdas
computed than the specified lambda_factors
parameter (20 lambdas
by default). This is due to the early stop rule implemented [1], i.e., if current lambda
leads to Mean Square Error (MSE) larger than that of the previous lambda
plus 1 Standand Error (STE) unit, then the selection path will not continue. See the selection path and discarding rules for more details [1].
With a twist of the hyperparameters, the above plots showed that the non-zero edges can be slightly different. Stability Selection [5,6] is an algorithm to achieve stable effect selection with controlled False Discovery Rate (FDR). The algorithm requires bootstrapping (typically n = 100 rounds), each round randomly selected half of the dataset to run through the lasso/elastic net selection path. Then the stable effects are those consistently selected in different bootstraps measured by the bounded pre-comparison error rate and FDR.
The following code shows the example of running STS:
tStart = proc.time() set.seed(0) output = enSEM_stability_selection(Y,X, Missing,B, alpha_factors = seq(1,0.1, -0.1), lambda_factors =10^seq(-0.2,-3,-0.2), kFold = 5, nBootstrap = 100, verbose = -1) tEnd = proc.time() simTime = tEnd - tStart; print(simTime) names(output) cat("nSTS = ", length(which(output$STS !=0)))
One key part of STS method is that, the hyperparameters generally are set toward higher penalty to allow few non-zero edges. This is due to the fact that, if most of the hyperparameters are with small penalty, the number of non-zero edges will be large such that no matter how the threhold (pi) [5, 6] is set, there will be large number of false positives.
The above example shows that we set the values of lambda_factors to not too small, and all true effects are captured with no false positives as shown below:
B[which(B!=0)] =1 par(mfrow = c(1, 2),mar=c(5.1, 4.1, 4.1, 4.1)) plot(B) plot(output$STS)
Note that STS output a binary 0 or 1 $B_{i j}$ indicating if there was an edge from node $j$ to node $i$. Parallel bootstrapping is also implemented with R package parallel
:
library(parallel) cl<-makeCluster(4,type="SOCK") clusterEvalQ(cl,{library(sparseSEM)}) output = enSEM_stability_selection_parallel(Y,X, Missing,B, alpha_factors = seq(1,0.1, -0.1), lambda_factors =10^seq(-0.2,-3,-0.2), kFold = 3, nBootstrap = 100, verbose = -1, clusters = cl) stopCluster(cl)
We next turn to the real world scenario to apply sparseSEM
to the yeast GRN dataset analysis. While the above simulated network has 30 nodes, the yeast GRN has 3380 nodes, which increases computation time significantly. Roughly the computation time for the yeast GRN is about 30 min with elasticNetSEM
, and 1.5 hour with enSEM_stability_selection
running on an Intel i5 CPU.
rm(list = ls()) library(sparseSEM) data(yeast) output = elasticNetSEM(Y, X, verbose = 1) # STS STS = enSEM_stability_selection(Y,X,verbose = -1)
For even larger network, a parallel computation version of sparseSEM
is also available at sparseSEM GitHub Repo.
Huang, A. (2014). "Sparse model learning for inferring genotype and phenotype associations." Ph.D Dissertation Chapter 7. University of Miami(1186).
Brem RB, Yvert $\mathrm{G}$, Clinton R, Kruglyak L: Genetic dissection of transcriptional regulation in budding yeast. Science 2002 296:752-755
Kellis M, Patterson N, Endrizzi M, Birren B, Lander ES: Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature $2003, \mathbf{4 2 3}: 241-254$
Brem RB, Kruglyak L: The landscape of genetic complexity across 5,700 gene expression traits in yeast. Proceedings of the National Academy of Sciences of the United States of America 2005, 102:1572-1577
Meinshausen, N. and Buhlmann, P., 2010. Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4), pp.417-473.
Shah, R.D. and Samworth, R.J., 2013. Variable selection with error control: another look at stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(1), pp.55-80.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.