netgsa: Network-Based Gene Set Analysis


Reads the network information from any of the graphite databases specified by the user and constructs the adjacency matrices needed for NetGSA. This function also allows for clustering. See Details for more information.

prepareAdjMat(x, group, databases = NULL, cluster = TRUE, file_e=NULL, file_ne=NULL, lambda_c=1, penalize_diag=TRUE, eta=0.5)

`x`	The p x n data matrix with rows referring to genes and columns to samples. Row names should be unique and have gene ID types appended to them. The id and gene number must be separated by a colon. E.g. "ENTREZID:127550"
`group`	Vector of class indicators of length n. Identifies the condition for each of the n samples
`databases`	(Optional) Either (1) the result of a call to `obtainEdgeList` or (2) a character vector of graphite databases you wish to search for edges. Since one can search in multiple databases with different identifiers, genes are converted using `AnnotationDbi::select` and metabolites are converted using `graphite:::metabolites()`. Databases are also used to specify non-edges. If `NULL`, no external database information will be used. See Details for more information.
`cluster`	(Optional) Logical indicating whether or not to cluster genes to estimate adjacency matrix. If not specified, set to TRUE if there are > 2,500 genes (p > 2,500). The main use of clustering is to speed up calculation time. If the dimension of the problem, or equivalently the total number of unique genes across all pathways, is large, `prepareAdjMat` may be slow. If clustering is set to TRUE, the 0-1 adjacency matrix is used to detect clusters of genes within the connected components. Once gene clusterings are chosen, the weighted adjacency matrices are estimated for each cluster separately using `netEst.undir` or `netEst.dir`. Thus, the adjacency matrix for the full network is block diagonal with the blocks being the adjacency matrices from the clusters. Any edges between clusters are set to 0, so this can be thought of as an approximate weighted adjacency matrix. Six clustering algorithms from the `igraph` package are considered: `cluster_walktrap`, `cluster_leading_eigen`, `cluster_fast_greedy`, `cluster_label_prop`, `cluster_infomap`, and `cluster_louvain`. Clustering is performed on each connected component of size >1,000 genes. To ensure increases in speed, algorithms which produce a maximum cluster size of < 1,000 genes are considered first. Among those, the algorithm with the smallest edge loss is chosen. If all algorithms have a maximum cluster size > 1,000 genes the one with the smallest maximum cluster size is chosen. Edge loss is defined as the number of edges between genes of different clusters. These edges are "lost" since they are set to 0 in the block diagonal adjacency matrix. If clustering is set to FALSE, the 0-1 adjacency matrix is used to detect connected components and the weighted adjacency matrices are estimated for each connected component. Singleton clusters are combined into one cluster. This should not affect performance much since the gene in a singleton cluster should not have any edges to other genes.
`file_e`	(Optional) The name of the file from which the list of edges is to be read. This file is read in with `data.table::fread`. It must have 4 columns in the following order; the columns do not need to be named, but they must appear in this order: 1st column: source gene (base_gene_src), e.g. "7534"; 2nd column: gene identifier of the source gene (base_id_src), e.g. "ENTREZID"; 3rd column: destination gene (base_gene_dest), e.g. "8607"; 4th column: gene identifier of the destination gene (base_id_dest), e.g. "UNIPROT". This information cannot conflict with the user specified non-edges. That is, one cannot have the same edge in `file_e` and `file_ne`. In the case where the graph is undirected everything will be converted to an undirected edge or non-edge. Thus if the user specifies A->B as a directed non-edge it will be changed to an undirected non-edge if the graph is undirected. See Details for more information.
`file_ne`	(Optional) The name of the file from which the list of non-edges is to be read. This file is read in with `data.table::fread`. The edges in this file are negative in the sense that the corresponding vertices are not connected. The format of the file must be the same as for `file_e`. Again, each observation is assumed to be a directed edge. Thus, for a negative undirected edge, input two separate negative edges. In the case of conflicting information between `file_ne` and edges identified in a database, user non-edges are used. That is, if the user specifies A->B in `file_ne`, but there is an edge A->B in KEGG, the information in KEGG will be ignored and A->B will be treated as a non-edge. In the case where the graph is undirected everything will be converted to an undirected edge or non-edge. Thus if the user specifies A->B as a directed non-edge it will be changed to an undirected non-edge if the graph is undirected. See Details for more information.
`lambda_c`	(Non-negative) a vector or constant. `lambda_c` is multiplied by a constant depending on the data to determine the actual tuning parameter, `lambda`, used in estimating the network. If `lambda_c` is a vector, the optimal `lambda` will be chosen from this vector using `bic.netEst.undir`. Note that `lambda` is only used if the network is undirected. If the network is directed, the default value in `netEst.dir` is used instead. By default, `lambda_c` is set to 1. See `netEst.undir` and `netEst.dir` for more details.
`penalize_diag`	Logical. Whether or not to penalize diagonal entries when estimating weighted adjacency matrix. If TRUE a small penalty is used, otherwise no penalty is used.
`eta`	(Non-negative) a small constant needed for estimating the edge weights. By default, `eta` is set to 0.5. See `netEst.undir` for more details.
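
As a minimal sketch of the input formats described above, the following base R code builds row names of the form "ENTREZID:127550" and writes a four-column edge file. The toy matrix, the object names (`x_toy`, `group_toy`, `edges_toy`, `edge_file`) and the temporary file are illustrative assumptions, not part of the package.

```r
## Toy expression matrix: 3 genes (rows) x 4 samples (columns); values are random
set.seed(1)
x_toy <- matrix(rnorm(12), nrow = 3, ncol = 4)

## Prefix each gene ID with its identifier type, separated by a colon
entrez_ids <- c("127550", "7534", "8607")
rownames(x_toy) <- paste("ENTREZID", entrez_ids, sep = ":")

## Class indicator of length n = 4 (two conditions)
group_toy <- c(1, 1, 2, 2)

## Edge table in the required column order:
## source gene, source ID type, destination gene, destination ID type
edges_toy <- data.frame(
  base_gene_src  = "7534",
  base_id_src    = "ENTREZID",
  base_gene_dest = "8607",
  base_id_dest   = "ENTREZID",
  stringsAsFactors = FALSE
)

## Write a tab-separated file (hypothetical temporary path)
edge_file <- tempfile(fileext = ".txt")
write.table(edges_toy, edge_file, sep = "\t", quote = FALSE, row.names = FALSE)
```

Under these assumptions, `edge_file` is the kind of path one would supply as `file_e`; a non-edge file for `file_ne` would use the same layout.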

The function prepareAdjMat accepts network information from user specified sources as well as a list of graphite databases to search for edges. prepareAdjMat calculates the 0-1 adjacency matrices and runs netEst.undir or netEst.dir, depending on whether the graph is undirected or directed, respectively.

When searching for network information, prepareAdjMat makes some important assumptions about edges and non-edges. As already stated, the first is that in the case of conflicting information, user specified non-edges are given precedence.

prepareAdjMat uses obtainEdgeList to standardize and search the graphite databases for edges. For more information see ?obtainEdgeList. prepareAdjMat also uses database information to identify non-edges. If two genes both appear in the database edges but there is no edge between them, the pair is coded as a non-edge. The rationale is that if there were an edge between these two genes, it would be present in the databases.

prepareAdjMat assumes no information about genes not identified in the database edge lists. That is, if the user passes gene A, but gene A is not found in any of the edges in the databases, no information about gene A is assumed. Gene A will have neither edges nor non-edges.
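
The three assumptions above can be illustrated with a small base R sketch for an undirected graph. This is not the package's implementation; the gene labels, the toy edge tables and the `status` matrix (1 = edge, 0 = non-edge, NA = no information) are hypothetical and serve only to show the rules.

```r
## Genes supplied by the user (hypothetical labels)
genes <- c("A", "B", "C", "D")

## Edges found in the databases; gene D appears in none of them
db_edges <- data.frame(src = c("A", "B"), dest = c("B", "C"),
                       stringsAsFactors = FALSE)

## User-specified non-edge, which takes precedence over database edges
user_non_edges <- data.frame(src = "B", dest = "C", stringsAsFactors = FALSE)

## Start with no information about any pair
status <- matrix(NA_real_, length(genes), length(genes),
                 dimnames = list(genes, genes))

## Genes seen in the database edges: pairs without an edge become non-edges
known <- unique(c(db_edges$src, db_edges$dest))
status[known, known] <- 0
status[cbind(db_edges$src, db_edges$dest)] <- 1
status[cbind(db_edges$dest, db_edges$src)] <- 1   # undirected: mirror each edge

## User non-edges override any conflicting database edge
status[cbind(user_non_edges$src, user_non_edges$dest)] <- 0
status[cbind(user_non_edges$dest, user_non_edges$src)] <- 0

diag(status) <- NA   # self-pairs are not considered in this sketch
```

In this toy example A-B remains an edge, A-C is a non-edge inferred from the databases, B-C is a non-edge because of the user file, and every pair involving D stays `NA`.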

Once all the network and clustering information has been compiled, prepareAdjMat estimates the network. prepareAdjMat will automatically detect directed graphs, rearrange them to the correct order and use netEst.dir to estimate the network. When the graph is undirected netEst.undir will be used. For more information on these methods see ?netEst.dir and ?netEst.undir.
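
The rule for choosing among clustering algorithms, as described for the `cluster` argument, can be sketched in base R. This is only an illustration under stated assumptions: the candidate memberships are written by hand rather than produced by `igraph`, the names `algo_a`, `algo_b` and `algo_c` are hypothetical, and `max_size` stands in for the 1,000-gene limit.

```r
## Toy undirected edge list on 6 genes (hypothetical)
edges <- data.frame(from = c(1, 2, 3, 4, 5, 3),
                    to   = c(2, 3, 1, 5, 6, 4))

## Candidate cluster memberships, as if returned by different algorithms
candidates <- list(
  algo_a = c(1, 1, 1, 2, 2, 2),   # two clusters of size 3
  algo_b = c(1, 1, 2, 2, 3, 3),   # three clusters of size 2
  algo_c = c(1, 1, 1, 1, 1, 1)    # a single cluster of size 6
)

max_size <- 4   # stands in for the 1,000-gene limit in the text

## Edge loss: number of edges whose endpoints fall in different clusters
edge_loss <- sapply(candidates, function(m) sum(m[edges$from] != m[edges$to]))

## Size of the largest cluster under each candidate
largest <- sapply(candidates, function(m) max(table(m)))

## Prefer candidates whose largest cluster is below the limit
ok <- largest < max_size
chosen <- if (any(ok)) {
  names(which.min(edge_loss[ok]))   # smallest edge loss among eligible candidates
} else {
  names(which.min(largest))         # otherwise the smallest maximum cluster size
}
```

Here `algo_c` loses no edges but is excluded because its only cluster exceeds the limit, so the choice falls to the eligible candidate with the fewest lost edges.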

Importantly, prepareAdjMat returns the list of weighted adjacency matrices to be used as an input in NetGSA.

A list with components

`Adj`	A list of weighted adjacency matrices estimated from either `netEst.undir` or `netEst.dir`. That is, `length(Adj) = length(unique(group))`. One list of weighted adjacency matrices will be returned for each condition in group. If cluster = TRUE is specified, the list of adjacency matrices for each condition will have the same length as the number of clusters. The structure of Adj is Adj[[condition_number]][[cluster_adj_matrix]]. Note that even when `cluster = FALSE` the connected components are used as clusters. The last element, which is needed for plotting and is passed through to the output of `NetGSA`, is `edgelist`.
`invcov`	A list of inverse covariance matrices estimated from either `netEst.undir` or `netEst.dir`. That is, `length(invcov) = length(unique(group))`. One list of inverse covariance matrices will be returned for each condition in group. If cluster = TRUE is specified, the list of inverse covariance matrices for each condition will have the same length as the number of clusters. The structure of invcov is invcov[[condition_number]][[cluster_adj_matrix]].
`lambda`	A list of values of tuning parameters used for each condition in `group`. If cluster = TRUE is specified, the length of the list of tuning parameters for each condition will be the same length as the number of clusters.
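
To make the nesting Adj[[condition_number]][[cluster_adj_matrix]] concrete, the sketch below builds a mock object with that shape and assembles the block diagonal adjacency matrix for one condition. Only the `Adj` component is mocked; the values, the gene labels and the helper `block_diag` are hypothetical and not part of the package.

```r
## Helper to build a small named square matrix (hypothetical)
named_mat <- function(values, genes) {
  matrix(values, length(genes), length(genes), dimnames = list(genes, genes))
}

## Mock output: 2 conditions, each with 2 clusters (made-up weights)
out_mock <- list(
  Adj = list(
    list(named_mat(c(0, 0.4, 0.4, 0), c("g1", "g2")), named_mat(0, "g3")),
    list(named_mat(c(0, 0.7, 0.7, 0), c("g1", "g2")), named_mat(0, "g3"))
  )
)

## Assemble one condition's cluster matrices into a block diagonal matrix;
## entries between different clusters stay at 0
block_diag <- function(blocks) {
  genes <- unlist(lapply(blocks, rownames))
  full  <- matrix(0, length(genes), length(genes), dimnames = list(genes, genes))
  for (b in blocks) full[rownames(b), colnames(b)] <- b
  full
}

n_conditions <- length(out_mock$Adj)        # one entry per condition
n_clusters   <- length(out_mock$Adj[[1]])   # one matrix per cluster
A_cond1      <- block_diag(out_mock$Adj[[1]])
```

The same indexing pattern applies to `invcov`, which has the same nesting according to the description above.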

Michael Hellstern

Ma, J., Shojaie, A. & Michailidis, G. (2016) Network-based pathway enrichment analysis with incomplete network information. Bioinformatics 32(20):3165–3174.

NetGSA, netEst.dir, netEst.undir

## load the data
data("breastcancer2012")

## consider genes from the "ErbB signaling pathway" and "Jak-STAT signaling pathway"
genenames    <- unique(c(pathways[[24]], pathways[[52]]))
sx           <- x[match(rownames(x), genenames, nomatch = 0L) > 0L,]

adj_cluster    <- prepareAdjMat(sx, group, databases = c("kegg", "reactome", "biocarta"), cluster = TRUE)
adj_no_cluster <- prepareAdjMat(sx, group, databases = c("kegg", "reactome", "biocarta"), cluster = FALSE)