NetGSA | R Documentation |
Tests the significance of pre-defined sets of genes (pathways) with respect to an outcome variable, such as the condition indicator (e.g. cancer vs. normal, etc.), based on the underlying biological networks.
NetGSA(A, x, group, pathways, lklMethod = "REHE",
       sampling = FALSE, sample_n = NULL, sample_p = NULL, minsize = 5,
       eta = 0.1, lim4kappa = 500)
A |
A list of weighted adjacency matrices. Typically returned from |
x |
The |
group |
Vector of class indicators of length |
pathways |
The npath by |
lklMethod |
Method used for variance component calculation: options are |
sampling |
(Logical) whether to subsample the observations and/or variables. See details. |
sample_n |
The ratio for subsampling the observations if |
sample_p |
The ratio for subsampling the variables if |
minsize |
Minimum number of genes in pathways to be considered. |
eta |
Approximation limit for the Influence matrix. See 'Details'. |
lim4kappa |
Limit for condition number (used to adjust |
The function NetGSA carries out a Network-based Gene Set Analysis, using the method described in Shojaie and Michailidis (2009) and Shojaie and Michailidis (2010). It can be used for gene set (pathway) enrichment analysis where the data come from K heterogeneous conditions, where K may be one, two, or more. NetGSA differs from Gene Set Analysis (Efron and Tibshirani, 2007) in that it incorporates the underlying biological networks. Therefore, when the networks encoded in A are empty, one should instead consider alternative approaches such as Gene Set Analysis (Efron and Tibshirani, 2007).
The NetGSA method is formulated in terms of a mixed linear model. Let X represent the rearrangement of data x into an np \times 1 column vector. Then
X = \Psi \beta + \Pi \gamma + \epsilon
where \beta is the vector of fixed effects, and \gamma and \epsilon are random effects and random errors, respectively. The underlying biological networks are encoded in the weighted adjacency matrices, which determine the influence matrix under each condition. The influence matrices further determine the design matrices \Psi and \Pi in the mixed linear model. Formally, the influence matrix under each condition represents the effect of each gene on all the other genes in the network and is calculated from the adjacency matrix (A[[k]] for the k-th condition). A small value of eta is used to make sure that the influence matrices are well-conditioned (i.e. their condition numbers are bounded by lim4kappa).
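The role of a small eta can be pictured with a generic conditioning example. The sketch below is not NetGSA's own construction of the influence matrix (that construction is not shown in this page); it only illustrates, under the assumption that eta acts like a small diagonal shift, how such a shift can bring a condition number under a limit such as lim4kappa. The matrix M is a hypothetical toy input.

```r
# Generic illustration only: a nearly singular symmetric matrix (hypothetical toy input).
M <- matrix(c(1, 1,
              1, 1 + 1e-8), nrow = 2, byrow = TRUE)

eta <- 0.1          # same default value as the NetGSA argument
lim4kappa <- 500    # same default value as the NetGSA argument

# Exact 2-norm condition number (ratio of largest to smallest singular value).
kappa_raw <- kappa(M, exact = TRUE)

# ASSUMPTION: eta is applied as a diagonal shift; NetGSA may use it differently.
M_shift <- M + eta * diag(nrow(M))
kappa_shift <- kappa(M_shift, exact = TRUE)

c(raw = kappa_raw, shifted = kappa_shift, limit = lim4kappa)
```

In this toy case the unshifted matrix is far above the limit, while the shifted one falls well below it.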
The problem is then to test the null hypothesis \ell\beta = 0 against the alternative \ell\beta \neq 0, where \ell is a contrast vector, optimally defined through the underlying networks.
For a one-sample or two-sample test, the test statistic T for each gene set has approximately a t-distribution under the null, whose degrees of freedom are estimated using the Satterthwaite approximation method. When analyzing complex experiments involving multiple conditions, often multiple contrast vectors of interest are considered for a specific subnetwork. Alternatively, one can combine the contrast vectors into a contrast matrix L, in which case a different test statistic F is used. Under the null, F has an F-distribution, whose degrees of freedom are calculated based on the contrast matrix L as well as the variances of \gamma and \epsilon. The fixed effects \beta are estimated by generalized least squares, and the estimate depends on the estimated variance components of \gamma and \epsilon.
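For orientation, the generalized least squares estimate mentioned above has the standard textbook form shown below. The symbol W (the covariance of X implied by the mixed model) is notation introduced here for illustration, and the Wald-type form written for T is an assumption about the general shape of the statistic; the cited papers give the exact definition used by the method.

```latex
W = \sigma^2_{\gamma}\,\Pi\Pi' + \sigma^2_{\epsilon}\,I,
\qquad
\hat{\beta} = \left(\Psi' W^{-1} \Psi\right)^{-1} \Psi' W^{-1} X,
\qquad
T = \frac{\ell\hat{\beta}}{\sqrt{\ell \left(\Psi' \hat{W}^{-1} \Psi\right)^{-1} \ell'}}
```

Here \hat{W} denotes W with the variance components replaced by their estimates, which is why \hat{\beta} and T both depend on the estimated \sigma^2_{\gamma} and \sigma^2_{\epsilon}.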
Estimation of the variance components (\sigma^2_{\epsilon} and \sigma^2_{\gamma}) can be done in several different ways after profiling out \sigma^2_{\epsilon}, including REML/ML, which uses Newton's method, and HE/REHE, which is based on the Haseman-Elston regression method. The latter uses the fact that Var(X) = \sigma^2_{\gamma}\Pi\Pi' + \sigma^2_{\epsilon}I, and applies ordinary least squares to solve for the unknown coefficients after vectorizing both sides. In particular, REHE uses nonnegative least squares for the regression and therefore ensures nonnegative estimates of the variance components. Due to the simple formulation, HE/REHE also allows subsampling with respect to both the samples and the variables, and is recommended especially when the problem is large (i.e. large p and/or large n).
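The vectorized regression behind HE/REHE can be sketched on simulated data. This is a minimal illustration, not the package's implementation: it assumes the fixed effects are zero so that X X' can be regressed directly on vec(\Pi\Pi') and vec(I), the design matrix Pi is random, and nnls_enumerate is a hypothetical helper that solves the small nonnegative least squares problem by trying every support.

```r
set.seed(1)

# Hypothetical helper: nonnegative least squares for a few regressors.
# The NNLS optimum equals ordinary least squares restricted to some subset of
# columns, so trying every subset and keeping the best feasible fit is exact.
nnls_enumerate <- function(Z, y) {
  k <- ncol(Z)
  supports <- unlist(lapply(seq_len(k), function(m) combn(k, m, simplify = FALSE)),
                     recursive = FALSE)
  best <- rep(0, k)
  best_rss <- sum(y^2)                    # RSS of the all-zero solution
  for (active <- NULL) break              # placeholder removed below
  for (active in supports) {
    Za <- Z[, active, drop = FALSE]
    coef <- qr.solve(Za, y)               # least squares on the active columns
    if (all(coef >= 0)) {
      rss <- sum((y - drop(Za %*% coef))^2)
      if (rss < best_rss) {
        best <- rep(0, k)
        best[active] <- coef
        best_rss <- rss
      }
    }
  }
  best
}

# Simulated model X = Pi %*% gamma + epsilon (ASSUMPTION: beta = 0, toy sizes).
n <- 200; q <- 30
Pi <- matrix(rnorm(n * q), n, q) / sqrt(q)
X <- drop(Pi %*% rnorm(q, sd = sqrt(2))) + rnorm(n, sd = sqrt(0.5))

# Vectorize both sides of Var(X) = s2.gamma * Pi Pi' + s2.epsilon * I.
he_design <- cbind(as.vector(tcrossprod(Pi)), as.vector(diag(n)))
he_response <- as.vector(tcrossprod(X))

he <- qr.solve(he_design, he_response)            # HE: ordinary least squares
rehe <- nnls_enumerate(he_design, he_response)    # REHE-style: nonnegative fit
```

The first entry of each result plays the role of \sigma^2_{\gamma} and the second that of \sigma^2_{\epsilon}. The ordinary least squares fit can return a negative value, whereas the nonnegative fit cannot.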
The pathway membership information is stored in pathways, which should be an npath x p matrix. See prepareAdjMat for details on how to prepare a suitable pathway membership object.
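A small toy example may help picture the expected layout. The sketch below assumes a 0/1 membership matrix with pathways in rows and genes in columns, assumes that genes index the rows of x (as in the example at the end of this page), and assumes that a pathway is kept when its size is at least minsize. All object names ending in _toy are hypothetical.

```r
# Toy membership matrix: 3 pathways (rows) by 8 genes (columns); 1 marks membership.
genes <- paste0("g", 1:8)
pathways_toy <- rbind(
  pathA = c(1, 1, 1, 1, 1, 0, 0, 0),
  pathB = c(0, 0, 0, 1, 1, 1, 1, 1),
  pathC = c(1, 0, 0, 0, 0, 0, 0, 1)
)
colnames(pathways_toy) <- genes

# Toy data matrix with genes in rows (ASSUMPTION based on the package example).
set.seed(1)
x_toy <- matrix(rnorm(8 * 6), nrow = 8, dimnames = list(genes, NULL))

# The columns of the membership matrix should line up with the genes in the data.
stopifnot(identical(colnames(pathways_toy), rownames(x_toy)))

# ASSUMPTION: pathways with at least `minsize` genes are retained.
minsize <- 5
sizes <- rowSums(pathways_toy)
kept <- pathways_toy[sizes >= minsize, , drop = FALSE]
rownames(kept)
```

With these toy inputs, the two pathways of size five are retained and the pathway of size two is dropped.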
This function can deal with both directed and undirected networks, which are specified via the option directed. Note that NetGSA uses slightly different procedures to calculate the influence matrices for directed and undirected networks.
In either case, the user can still apply NetGSA if only partial information on the adjacency matrices is available. The functions netEst.undir and netEst.dir provide details on how to estimate the weighted adjacency matrices from data based on available network information.
A list with components
results |
A data frame with pathway names, pathway sizes, p-values and false discovery rate corrected q-values, and test statistic for all pathways. |
beta |
Vector of fixed effects of length |
s2.epsilon |
Variance of the random errors |
s2.gamma |
Variance of the random effects |
graph |
List of components needed in |
Ali Shojaie and Jing Ma
Ma, J., Shojaie, A. & Michailidis, G. (2016) Network-based pathway enrichment analysis with incomplete network information. Bioinformatics 32(20):3165–3174. doi:10.1093/bioinformatics/btw410
Shojaie, A., & Michailidis, G. (2010). Network enrichment analysis in complex experiments. Statistical applications in genetics and molecular biology, 9(1), Article 22. https://pubmed.ncbi.nlm.nih.gov/20597848/.
Shojaie, A., & Michailidis, G. (2009). Analysis of gene sets based on the underlying regulatory network. Journal of Computational Biology, 16(3), 407-426. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3131840/
prepareAdjMat, netEst.dir, netEst.undir
## load the data
data("breastcancer2012_subset")

## consider genes from just 2 pathways
genenames <- unique(c(pathways[["Adipocytokine signaling pathway"]],
                      pathways[["Adrenergic signaling in cardiomyocytes"]]))
## keep only the rows of x (genes) that belong to those pathways
sx <- x[match(rownames(x), genenames, nomatch = 0L) > 0L, ]

## retrieve edges among these genes from the selected databases
db_edges <- obtainEdgeList(rownames(sx), databases = c("kegg", "reactome"))
## prepare the list of weighted adjacency matrices
adj_cluster <- prepareAdjMat(sx, group, databases = db_edges, cluster = TRUE)
## run NetGSA on the two pathways, restricted to the selected genes
out_cluster <- NetGSA(adj_cluster[["Adj"]], sx, group,
                      pathways_mat[c(1, 2), rownames(sx)],
                      lklMethod = "REHE", sampling = FALSE)