Introduction to dnapath

reset_options_scipen <- getOption("scipen")
reset_options_digits <- getOption("digits")
options(scipen = 1, digits = 5)
library(dnapath)
set.seed(12345)

This is a brief introduction to the dnapath package. This package integrates known pathway information into the differential network analysis of gene expression data. It allows for any network inference method to be used, and provides wrapper functions for several popular methods; these all start with run_ for ease of searching. Various helper function such as summary() and plot() are implemented for summarizing and visualizing the differential network results. This package is a companion to the paper:

Grimes, T., Potter, S. S., & Datta, S. (2019). Integrating gene regulatory pathways into differential network analysis of gene expression data. Scientific reports, 9(1), 1-12.


Installation

You can install dnapath from CRAN:

install.packages("dnapath")

Data

The package contains two datasets, meso and cancer_pathways. The "Package data" vignette shows how these two datasets were created, and it illustrates the use of various methods in the dnapath package.

These are the two primary inputs to the dnapath() method -- a gene expression dataset and a list of pathways -- and are used to demonstrate the package usage in this vignette.

Note: the dnapath package is not limited to cancer data. The analysis demonstrated in this introduction can be applied to any gene expression datasets from any species.

Overview of Functions

The main function of the package is dnapath(), which performs the differential network analysis on a gene expression dataset over a user-specified list of pathways. The output of this function is a dnapath_list object (or dnapath object if only one pathway is considered). An overview of its arguments is shown here:

An overview of the other methods available in the dnapath package is given in the following sections.

Summarizing and visualizing the results

The dnapath_list and dnapath objects obtained from running dnapath() can be summarized and visualized using various methdos.

Obtaining a list of Reactome pathways

The user may specify any list of gene sets to use in the analysis. However, for convenience dnapath provides a function for obtaining a list of Reactome pathways for a given species. This list may be useful for most analyses.

Renaming entrezgene IDs to gene symbols

The Reactome pathways are based on entrezgene IDs, which are a unique identifier for individual genes. It is recommended to also annotate RNA-seq data using entrezgene IDs when obtaining gene expression counts. However, when summarizing and visualizing the differential network analysis results, it may be preferred to use gene symbols. The functions in this section are used to rename the genes in a dnapath_list or dnapath object, or a pathway list.

Network inference methods

There are also several network inference methods provided in this package. Many of these are available in other R packages, but the format of their inputs and outputs can vary. Wrapper functions have been created to standardize these aspects. The method names are all prefixed with run_ (for easy searching) and can be used as the network_inference argument in dnapath(). The methods are listed here, and additional information/references can be found in the package documentation.

The default method used is run_pcor, which infers the gene-gene association network using partial correlations. It is important to note that different measures may be better suited for answering different research questions, however those details are beyond the scope of this vignette.

Note: For more information on any method, use the help() or ? command in R to open the documentation (for example help(run_aracne) or ?run_pcor).

Real data analysis example

This example will analyze the meso data provided in the package. (The "Package data" vignette shows how these data were obtained.) First we load the data and take a look at it.

data(meso)
str(meso)

The data is a list containing (1) a gene expression data set of 32 samples and 150 genes and (2) a vector indicating which group each sample belongs to -- stageii (group 1) or stage iv (group 2).

No pathway information

The differential network analysis compares the gene-gene association network between the two groups (stage ii vs stage iv). The dnapath package enables the use of known pathway information into the analysis, but a pathway list is not required. The dnapath() method can be run on the full gene expression dataset, without using pathways:

# Run dnapath using the gene expression and group information from meso dataset.
results <- dnapath(meso$gene_expression, 
                   pathway_list = NULL, 
                   group_labels = meso$groups)
results

Printing the results provides a summary of the differential network analysis. The output shows that "stageii" and "stageiv" groups were compared. It says a single "unnamed pathway" was analyzed containing 150 genes; this is referring to the fact that all genes in the dataset were analyzed together and no pathway list was provided. The mean expression level of all genes in each group are also shown (these are 9.3 and 9.4, respectively, for this data), and a p-value (0.723) of the overall differential connectivity.

The output shows the differential network analysis results at the gene level. This table can be obtained using summary(results), which returns a tibble that can be filtered and sorted using base R functions or methods from the dplyr package. The columns in this table include: genes, the genes analyzed in this pathway; dc_score, the differentially connectivity (DC) score estimated for this gene; p_value, the permutation p-value for the DC score (under the null hypothesis of no differential connectivity); p_value_m, monotonized p-values, which are a conservative estimate; mean_expr1, the mean expression of each gene in group 1; mean_expr2, the mean expression of each gene in group 2.

The plot function can be used to view the differential network. In this case, since all genes were analyzed together, the differential network will show all genes present in the gene expression dataset.

The alpha = 0.05 argument will remove any edges whose differential connectivity score p-value is above 0.05.

plot(results, alpha = 0.05, only_dc = TRUE)

With 150 genes, the visualization of the differential network is not very useful; it is difficult to extract much information from this plot. One of the benefits of using known pathway information is that it breaks down the analysis into smaller gene sets, and the resulting differential networks are more manageable and informative. These advantages are illustrated in the next section.

Using a pathway list

The main feature of dnapath is that it incorporates known gene pathways into the differential network analysis. A pathway is nothing more than a set of genes that is thought to belong to a particular biological process. The get_reactome_pathways method can be used to obtain a list of pathways from the Reactome database.

In this section, we will use the p53_pathways data that accompanies the package. TP53 is a oncogene that is often mutated in many cancers, and the p53_pathways list contains 13 pathways related to P53 signaling.

We refer the reader to the "Package data" vignette for details on how the p53_pathways list was created. That vignette illustrates how to use get_reactome_pathways().

The difference compared to the previous section is that the pathway_list argument is now specified when running dnapath(). Additionally, we will use the seed argument (which is optional) so that each pathway result can be easily reproduced individually.

data(meso) # Load the gene expression data
data(p53_pathways)

# Run the differential network analysis.
results <- dnapath(x = meso$gene_expression,
                   pathway_list = p53_pathways,
                   group_labels = meso$groups,
                   seed = 0)

results

The results can be filtered using the filter_pathways method to remove any pathways with differential connectivity p-values above some threshold. This is demonstrated in the following code, which uses a threshold of 0.1.

results <- filter_pathways(results, alpha_pathway = 0.1)
results

The pathways can also be easily sorted based on different values. For example, the follow code sorts the pathways so that the one with the highest number of differentially connected genes is listed first. Note: this method of ordering will tend to favor larger pathways, since there are more changes for that pathway to contain differentially connected genes just by chance.

results <- sort(results, decreasing = TRUE, by = "n_dc")
results

Pathways can be plotted one at a time by indexing the results list. The following code shows how to print out the first pathway.

Note: The layout of the nodes is random when generating plots. For reproducible plots, the random number generator seed needs to be set prior to plotting.

# The plot layout is stochastic. Setting the RNG seed allows for reproducible plots.
set.seed(0)
plot(results[[1]], alpha = 0.05, only_dc = TRUE)

This plot uses entrezgene IDs, which are the identifiers used in the gene expression and pathway list data. At this point in the analysis it can be convenient to translate these IDs to the more recognizable gene symbols. The rename_genes method can be used to achieve this.

Note: Internet connection is required to connect to biomaRt, which rename_genes uses to map entrezgene IDs to gene symbols. The dir_save argument can be set when running rename_genes() which will save the ID mapping obtained from biomaRt in the specified directory. This way, the mapping can be obtained once (using the internet connection) and accessed from memory in all future calls of rename_genes(). A temporary directory is used in this example.

results <- rename_genes(results, to = "symbol", species = "human",
                        dir_save = tempdir())
results[[1]] # Print the results for the first pathway.
set.seed(0) # Reset seed to use same layout as previous plot.
plot(results[[1]], alpha = 0.05, only_dc = TRUE)

The differential network plot shows a lot of information about the two groups being compared. First, the nodes are scaled based on the gene's average expression across both groups. If the gene is differentially expressed, then the node will be shaded blue or red depending on whether the gene has higher expression in group 1 or group 2, respectively. For example, in the plot above we see that the node for POU4F1 is red, indicating a higher expression in group 2. However, the size of this node is very small, indicating that it's mean expression is small across both groups. So, while it may have a high fold-change, its overall expression level is relatively low. This can be verified in the table above -- the second row shows that POU4F1 has mean expression of 0.476 in group 1 and 0.811 in group 2, which is an almost two-fold change in expression, but is overall a very low level relative to the expression levels of other genes in this pathway. Notice that none of the genes with a large nodes have any color, which indicates there is no substantial difference in expression levels for these between the two groups.

The edges in the network are similarly colored: a blue edge indicates the gene-gene association is stronger in group 1, and a red edge indicates the association is stronger in group 2. The thickness of the edges are scaled according to the magnitude of the association, and the opacity is scaled based on the p-value of the edge's differential connectivity score. That is, a thick, opaque (i.e. non-transparent) edge indicates a strong association with high statistical evidence of differential connectivity. But a thin, transparent edge indicates a relatively weaker association with less evidence of differential connectivity (high p-value). For example, the red edge between TP53 and BAND is thick -- indicating that the association is strong (in at least one of the two groups) compared to other connections -- and the edge is dark red -- indicating that the association is stronger in group 2. In contrast, the edge between BAND and AKT3 is thin -- indicating that the magnitude of association between these genes is relatively small compared to other associations -- and the edge is light blue -- indicating the association is stronger in group 1, but the statistical evidence of this association is relatively low.

The plot is useful for a quick overview of the findings; a detailed report of the differential connections can be obtained using the summarize_edges() function.

# Summary table of the edges in pathway 1.
summarize_edges(results[[1]], alpha = 0.05)

In row 6, we see the summary statistics for the BANP-TP53 edge. The values agree with our interpretation from the plot: there is an estimated association of -0.02 in group 1 and an estimate of 0.22 in group 2. This suggests that there is some difference in the perturbation in the genomic environment between these two groups that is either (1) causing an inactivation of a BANP-TP53 association in the stage II patients (group 1) or (2) an activation of this association in the stage IV patients (group 2). An external literature review may suggest that BANP promotes TP53 activation, which causes cell cycle arrest. If true, then the results of this analysis would suggest that this regulatory behavior is inactive in the stage II patients.

Note: The output of the summarize_edges() function (as well as other summary functions) is a tibble object, so it can be sorted, filtered, and otherwise manipulated using functions from the dplyr package. The follow code shows an example where the results are arranged by p_values and further filtered to remove any edges that have an association magnitude lower than 0.2

library(dplyr)
tab <- summarize_edges(results[[1]])
tab <- dplyr::arrange(tab, p_value, decreasing = FALSE)
tab <- dplyr::filter(tab, pmax(abs(nw1), abs(nw2)) > 0.2)
tab

Each edge in the differential network corresponds to a pair of genes. To further investigate the association of a pair of genes, and to see how their association differs between the two groups, the plot_pair() function can be used to create a scatterplot of the gene expression data. We demonstrate this function by visualizing BANP and TP53 genes.

plot_pair(results, "BANP", "TP53")

By default, this function will fit a loess curve for each group. Alternatively, a linear model can be used.

plot_pair(results, "BANP", "TP53", method = "lm")

These plots show the marginal association between two genes -- it does not take into account any the expression level of other observed genes. When we conducted the differential network analysis using dnapath() earlier in this example, the run_pcor() function was used by default as our measure of association. This is estimates the partial correlation, which is a conditional association measure that takes into account the other genes in the pathway. It is important to remember that the marginal plot created by plot_pair() does not show all the information that the differential network analysis uses, however it may nonetheless be a useful visualization to check the results obtained from the analysis.

Requiring genes to be significantly DC

There is an option when creating visualizations or generating summary tables of the differentially connected edges to require that at least one of the two genes for a given edge is also significantly differentially connected (as tested at the gene level). Enforcing this requirement can help reduce the false discovery rate (at the potential cost of reduced sensitivity). Setting the require_dc_genes argument to TRUE in the plot() function and summarize_edges() function will enforce this requirement, as shown in the code below. Note how there fewer differentiallly connected edges shown here compared to the previous section.

set.seed(0) # Reset seed to use same layout as previous plot.
plot(results[[1]], alpha = 0.05, only_dc = TRUE, require_dc_genes = TRUE)
summarize_edges(results[[1]], alpha = 0.05, require_dc_genes = TRUE)

Visualizing the shared edges

In many cases it may be useful to determine which gene-gene associations are shared across both groups in addition to those that are different. However, the utility of visualizing the network with both shared and differential edges is dependent on the association measure used. In this example, we used partial correlations; while this is a conditional measure that takes into account the expression of all the genes in the pathway, the specific estimator used will result in many nonzero estimates for the gene-gene associations. So, as we see in plot generated by the following code, the network is full of non-differentially connected edges.

set.seed(0) # Reset seed to use same layout as previous plot.
plot(results[[1]], alpha = 0.05)

Note: This type of visualization will provide more useful insights when a sparse estimator is used for the gene-gene associations (for example, using run_aracne as the network inference method, as shown in the next section).

Using other network inference methods

The default network inference method is run_pcor, which uses partial correlations to estimate the gene association network in the two groups. This is changed by setting the network_inference argument in dnapath(). For example:

# Run the differntial network analysis using ARACNE.
results <- dnapath(x = meso$gene_expression,
                   pathway_list = p53_pathways,
                   groups = meso$groups,
                   network_inference = run_aracne)

Note: This code is not run here because it requires the minet package that is available on Bioconductor.

All of the summary and plotting functions behave the same as before regardless of the network inference method used. Note however, if the chosen method produces a dense network (such as run_corr), then the visualization may become more cluttered. In addition, some methods are more computationally intensive (such as run_genie3), and the differential network analysis will become more time consuming. To help counter this, the number of permutations performed for the significance testing can be lowered (the default is n_perm = 100). For example:

# Run the differntial network analysis using GENIE3 with 20 permutations.
results <- dnapath(x = meso$gene_expression,
                   pathway_list = p53_pathways,
                   group_labels = meso$groups,
                   network_inference = run_genie3,
                   n_perm = 20)

Alternatively, there may be tuning parameters in the network inference method that can speed up computation. For example, in run_genie3, the nTrees argument might be lowered to trade-off performance for computation time. This parameter can be adjusted using the additional ... argument of dnapath.

# Run the differntial network analysis using GENIE3 with 20 permutations.
results <- dnapath(x = meso$gene_expression,
                   pathway_list = p53_pathways,
                   group_labels = meso$groups,
                   network_inference = run_genie3,
                   nTrees = 10)

Summary

In this brief introduction to the dnapath package, we have reviewed the basic steps for conducting the differential network analysis and for summarizing the results with tables and plots. The examples here used the Mesothelioma dataset that accompanies the package, but we reiterate that the package is not limited to analyzing cancer datasets.

options(scipen = reset_options_scipen, digits = reset_options_digits)


Try the dnapath package in your browser

Any scripts or data that you put into this service are public.

dnapath documentation built on May 9, 2022, 9:05 a.m.