SourceSet package

Introduction {#intro}

One of the most promising and widely used computational approaches for analyzing gene expression data is gene set analysis. Gene set analysis moves from a gene-centered perspective towards a gene set-centered perspective, where the gene sets are usually defined as groups of functionally related genes. An example of a gene set is a group of genes participating in the same pathway.

Among gene set analyses, topological pathway analysis aims to improve inferential analysis by exploiting the explicit pathway information on biological interactions. Indeed, a biological pathway can be converted into a graphical structure, where nodes represent genes and edges represent biochemical interactions between them.

All methods proposed within the context of topological pathway analysis use marginal approaches and are therefore unable to distinguish between the so-called primary genes representing the source of perturbation -- for example, a mutation, a copy number variation or an epigentic change -- from those that are merely affected by the propagation of that perturbation.

SourseSet implements a new method for identifying a primary dysregulation within a perturbed pathway. Set within the framework of Gaussian graphical models, our method compares all marginal and conditional distributions induced by the underlying graph, and uses the results to infer the set of primary genes potentially responsible for the differential behavior. For a detailed exposition, we refer to the original articles [@SourceSetmethod].

Although our primary aim is the analysis of gene expression data, the proposed method is general and can be applied whenever data from two conditions are assumed to come from two multivariate normal distributions with a common and known graphical structure. For this reason, in the first part of this vignette, we use a more general terminology: _variables_ and _graphs_, instead of _genes_ and _pathways_.

Overview {#over}

Given a list of graphs representing a chosen set of pathways, and a matrix of gene expression values measured in two experimental conditions, SourceSet functions

SourceSet contains four core functions: sourceSet implements the proposed algorithm, while infoSource, easyLookSource, sourceSankeyDiagram provide the user with tools for visualizing and interpreting the obtained results. In addition, sourceCytoscape and sourceUnionCytoscape provide a connection with Cytoscape, a well known bioinformatics tool for visualizing, exploring and manipulating biological networks.

In the following, two examples illustrate the use of the SourceSet package:

  1. Section Getting started uses a toy example to introduce the main idea behind the source set, as well as to offer a first look at the arguments and the output of the main functions;
  2. Section Getting deeper features an analysis of a real dataset, taken from a well known benchmark ALL.

Getting started {#start}

Install SourceSet from the CRAN repository.

help(package = "SourceSet")

Function sourceSet: input arguments {#sourcefun}

We first have a look at the main function sourceSet which performs the actual analysis.


The user is required to specify the following arguments:

Two additional arguments:

Data preparation {#data}

We consider an example featured in the Simulation study of @SourceSetpackage. All the necessary parameters are contained in the data object data("simulation"). For details on generating parameters which emulate two experimental conditions within the Gaussian graphical models framework, we refer to the Supplementary material of the main article and simPATHy CRAN package.


In this case, it was assumed that the perturbation affects directly only gene 5 (primary dysregulation). The perturbation is then propagated towards remaining genes (secondary dysregulation) by means of a network simulation$graph.

Parameters of the first or control condition, the mean $\mu_1$ and the variance matrix $\Sigma_1$, are given by simulation$condition1. Parameters of the second condition, $\mu_2$ and $\Sigma_2$, reflecting the effect of the foregoing strong perturbation are contained in simulation$condition2$5$2`. We use these parameters to generate two random samples of size 50 from the corresponding multivariate normal distributions.

# sample size

# parameters of control condition

# parameters of perturbed condition (`5`: true source; `2`: dysregulation intensity, strong)

# condition 1 
data.cond1<-rmvnorm(n = n,mean =param.cond1$mu ,sigma =param.cond1$S )
# condition 2
data.cond2<-rmvnorm(n = n,mean =param.cond2$mu ,sigma=param.cond2$S)

# Input arguments for the sourceSet function

simulation$graphis a graphNEL object representing an undirected graph consisting of 10 nodes and 5 cliques. The largest clique contains $p=4$ nodes, and since $n_1=n_2=n=50$ we can set shrinkage=FALSE, and use the maximum likelihood estimate of the covariance matrix. Furthermore, given that $\min(n_1,n_2)\gg p$, we can use asymptotic $p$-values (permute=FALSE).


# number of cliques
# size of larger clique

Quick usage {#quick}

We use the function sourceSet to analyze simulated data.

A progress bar shows the status of permutations[^permute] and the elapsed time for each input graph.

result<-sourceSet(graphs ,data ,classes ,seed = 123 ,permute =FALSE ,shrink =FALSE, alpha=0.05  )

sourceSet function returns a sourceSetList class object containing useful information in the form of lists, where the number of lists equals the number of input graph (in this case 1). Each list contains the results of the analysis. In particular,


As anticipated, the source set is able to distinguish the primary dysregulation (r result$source.node5$primarySet) from the set of nodes affected by the perturbation due to network propagation (r paste(setdiff(unique(unlist(result$source.node5$orderingSet)),result$source.node5$primarySet),sep=", ")).

# source set: primary disregulation

# secondary disregulation

# all affected variables

Understand the output {#output}

Source set algorithm decomposes each input graph in $k$ different clique orderings, where $k$ is the number of maximal cliques for a given graph. Different orderings are obtained by setting each clique as the root of the underlying junction tree (rip function in the gRbase package). This procedure is implemented in the utility function ripAllRootsClique.

Each ordering is composed of $k$ components, and each component is associated to a clique and a separator (the first separator in the ordering is taken to be empty). A component represents a conditional distribution of the clique variables conditional on the separator variables. The source set algorithm tests equality of these distributions in the two groups. Components may repeat across different orderings, and all information relative to them can be found in Decomposition. In particular, a list of unique components (across all clique orderings) can be found in Components, while the list of variables contained within each clique and separator is in Elements.

For the graph of our example, the number of different orderings is r length(result$source.node5$Decompositions) and the number of unique components is r nrow(result$source.node5$Components). For example, when clique C5 is root, featured components are comp7, comp13, comp5, comp3 and comp12. In particular, comp12 represents the marginal distribution of the clique C5, while comp7 represents the conditional distribution of the clique C1 conditional on the separator S1.

# number of orderings
# number of unique components
# ordering with root clique 'C3'

The function will return an ordering source set ($\hat{D}_{G,i}$) for each of the $k$ orderings. Each set is composed of the components for which the hypothesis of equality between the two groups was rejected. This information is contained in orderingSet, while the information regarding the multiple testing correction can be found in Threshold.

If we look at the ordering given by the root C5, the adjusted threshold that controls the FWER at the desired level $\alpha=$ r round(result$source.node5$Threshold$alpha,2) is r round(result$source.node5$Threshold$value,4). The components whose equality is rejected at this level are comp13 e comp12. The set $\hat{D}_{G,i}$ will thus contain both variables in C2, and variables in C5.

# alpha and corrected threshold

# ordering source set when 'C5' is used as root clique

# manual indentification of the ordering source set

The estimated source set (or primary set) consists of variables that are common to all orderings source sets (the node 5). While, the secondary set consists of variables that are affected by some form of dysregulation (i.e., appear in at least one ordering source set) but are not responsible for the primary dysregulation. In this case, the algorithm is thus able to distinguish the primary and the secondary dysregolation.

# source set of each ordering

# primary disregulation

# secondary disregulation

# all affaceted variables

Statistics {#stat}

Although the interpretation of the source set for a single graph can seem intuitive, the interpretation of the results for a collection of overlapping graphs can be challenging. To simplify this task, SourceSet offers the function infoSource. Given a sourceSetList object, infoSource provides useful summaries of the obtained results, guiding the user in identifying interesting variables.


The list info$graph summarizes the results of the individual input graphs. Here we find some summary statistics regarding the number of nodes within the estimated source set (n.primary), the secondary set (n.secondary), within the graph (n.graph), as well as the number of connected components of the underlying graph (n.cluster). The relative size of the estimated source set and the set of all the variable affected by some form of dysregulation (with respect to the graph size) is given in primary.impact and total.impact, respectively. Finally, a $p$-value for the hypothesis of equality of the two distributions associated to the given graph is reported.

knitr::kable(info$graph,caption = "> info$graph")

info$variable is a list with information regarding variables of the input graphs. Although some of the indices bear the same name as above, the interpretation is now slightly different. In particular:

Ideally, variables of the primary dysregulation will be elements of the source set in all input graphs that contain them and will thus have high values of source.impact and score. However, if a given variable appears in a single graph, and belongs to its source set, these indices can be deceptive. For this reason, relevance serves to identify variables that apart from being good candidates for primary genes, also appear frequently in the input graphs. Which index is to be preferred depends on the objective of the analysis: in case of exploratory analysis, we suggest to rely on relevance.

In our toy example, the specificity will be 1 for all the considered variables, while the only variable with relevance different from 0 is variable 5. The variable 5 also achieves the maximum score.

knitr::kable(info$variable[,-1],caption=" > info$variable")

Visual summary {#visual}

An alternative, more intuitive, way of exploring the results is to use visual summaries offered by easyLookSource and sourceSankeyDiagram. As before, the input is a sourceSetList object. Additional parameters may be needed to customize the display.


The function easyLookSource summarizes the results through a heatmap. The plot is composed of a matrix in which rows ($i$) represent input graphs (pathways) and, columns ($j$) represent variables (genes). Each cell$_{i,j}$ can take one of the following configurations:

In the plot, the pathways are vertically ordered -- top to bottom -- according to the numbers of nodes in the source set. On the other hand, genes are horizontally ordered -- from left to right-- based on the number of times they appear in a source set.



The function sourceSankeyDiagram highlights the relationships among nodes, graphs, and source sets, by summarizing the results through a Sankey diagram. The layout is organized on three levels:

The three levels are to be read from left to right. A link between left element a and right element b must be interpret as "element a is contained in element b". The implementation of the sourceSankeyDiagram function takes advantage of the D3 library (JavaScript), making the plot interactive. In fact, it is possible to vertically shift the displayed elements, and to view some useful information by positioning the cursor over items and links.

To better illustrate this visual tool, we consider a slight modification of our toy example so that the true source set is composed of variables 5, 9, 8 e 10.

data2.cond1<-rmvnorm(n = n,mean =simulation$condition1$mu ,sigma =simulation$condition1$S )
data2.cond2<-rmvnorm(n = n,mean =simulation$condition2$`10`$`2`$mu ,sigma =simulation$condition2$`10`$`2`$S)

# Input arguments for the sourceSet function

result2<-sourceSet(graphs,data2,classes,seed=222,permute = FALSE,shrink = FALSE)
sourceSankeyDiagram(result2,height = 100,width = 800)

sourceSankeyDiagram(result2,height = 150,width = 800)

Cytoscape {#cytovisual}

SourceSet package offers various tools for the analysis and dynamic network manipulation. The sourceCytoscape function, thanks to the connection with the Cytoscape software, allows the user to create a collection of graphs to be visualized in a unique session, while documenting interesting findings. Before executing the following two commands, it is necessary to launch Cytoscape (see Note on cytoscape).

The input is as before an object of the sourceSetList class. A subset of the analyzed graphs can be selected by setting the parameter name.graphs; if unspecified all analyzed graphs will be visualized. It is also possible to call the sourceCytoscape function multiple times, with all the graphs being visualized in a unique session within a collection specified by

# Lunch cytoscape and run the following commands

# simulation 1: sourceset composed by variable 5
cytoID.5<-sourceCytoscape(result, = "Simulation")
# simulation 2: sourceset composed by variable 10,9,8,5
cytoID.10<-sourceCytoscape(result2, = "Simulation")

The visual node attributes size and fill color are defined in a dynamic manner through a visual mapping[^style] based on the indices available by the infoSource function (automatically uploaded in the bottom panel - right side). A discrete mapper between source attribute and size is applied: a big size if the variable belongs to the secondary set (2), a medium size if the variable belongs to the primary set (1), and a small size otherwise (0). On the other hand, a color gradient mapper between fill node color and relevance is adopted.

The default style can be changed manually either within Cytoscape (for further information see manual) or within an R package r2cytoscape through network SUID returned by the sourceCytoscape function (for further details see manual).

Getting deeper: ABL/BCR chimera {#deep}

This section guides the user in the source set analysis of a real case study, both in terms of data preparation and results discussion.

As an example, we used a well-known benchmark published by @Chiaretti on Acute Lymphocytic Leukemia (ALL) cells associated with known genotypic abnormalities in adult patients. The dataset is available in ALL BioC package and consists of microarray expressions and phenotypical information from 128 different individuals affected by ALL.

The expression values (deriving from Affymetrix single channel technology) are already appropriately normalized according to robust multiarray analysis and quantile normalization.

library(ALL); data("ALL")

Data preparation: ALL data {#dataALL}

Some genotype abnormalities are known to be responsible for different transformation mechanisms of ALL and, as a consequence, of different response to treatment. For this reason, the assessment will focus on the ability of the source set algorithm to identify genes for which there are documented evidences in the origin of the phenotype under study (i.e., chimera genes).

Comparing patients with and without the B-cell receptor (ABL/BCR) gene rearrangement, we expect that chimera genes will be present in the source set of pathways that contained them. Moreover, we foresee that ABL and BCR appear among the most relevant genes in the meta-analysis.

We need to retrieve from the ExpressionSet object the expression matrix and the corresponding sample information for the individuals of interest. Specifically, we are interested in the subset of patients with B-cell type and BCR/ABL translocation (class 2) or without translocation (class 1). This information is hosted in the BT and mol.biol columns of ALL phenotype data.

# Retrieve expression matrix and phenotypes

# Select individuals:
# BT: type and stage of the disease ('B' indicates B-cell ALL)
# mol.biol: molecular biology of cancer ('BCR/ABL' and 'NEG')
ind<- intersect(grep("B",pheno.all$BT), which(pheno.all$mol.biol %in% c("BCR/ABL","NEG")))
code<- rownames(pheno.all)[ind]
group<- paste(pheno.all$mol.biol[ind])

Moreover, it is convenient to use Entrez gene IDs instead of manufacturer identifiers. It helps us to map genes in pathways. To this purpose we need the hgu95av2.db BioC package, as required in the annotation specification of ALL dataset (see ALL@annotation).

As some IDs could be repeated or not annotated, the final dataset will generally have a different size from the initial one; in case of not unique mapping IDs, we summarized them by the mean value.

# Convert identifiers:
# map between manufacturer identifiers and Entrez Gene identifiers
mapped_probes <- mappedkeys(hgu95av2ENTREZID)
# remove not mapped probes
data.all<-data.all[rownames(data.all) %in% mapped_probes,]
# convert identifiers<-paste(hgu95av2ENTREZID[rownames(data.all)])
# merge not unique code on mean value
data.all<-apply(data.all,2,function(x,f) { tapply(x,f,mean) },

After this selection our dataset consists of $n_1 = 42$ observations from the control condition (NEG, absence of rearrangement), $n_2 = 37$ observations from the second experimental condition (BCR/ABL, presence of gene rearrangement) and r nrow(data.all) measured gene expression levels.

# sourceSet function arguments:
classes<-sapply(group,function(x) switch(x,"BCR/ABL"=2,"NEG"=1))

Pathway: graphite package {#graphite}

The primary interest of our work is not in the detection of the structure of a pathway because we consider it fixed _a priori_. To incorporate topology information into source set analysis, biological pathways need to be translated into a graph object, either directed or undirected. Due to the descriptive nature of pathways and their inherent complexity, there is no simple recipe for this conversion that can be applied in every situation.

In general, sourceSet function gives to the user full freedom in providing the underlying pathways, requiring only specific input data format (i.e., graphNEL objects). So, the user can provide a list of manually curated pathways, or use developed software to translate the bases of knowledge. To date, the most exhaustive resource available for this task is graphite BioC package.

By way of example, in ALL case study, we employed graphite to retrieve the graphical structure of a selection of KEGG pathways[^chimera] that contain at least one of chimera genes. In particular, we selected the Chronic myeloid leukemia pathway (i.e., the target pathway): it describes the impact of the ABL/BCR fusion genes in the cell.

library(graphite); library(graph)
# pathways selection
names<-c("Axon guidance","Cell cycle","Chronic myeloid leukemia","ErbB signaling pathway",
"Neurotrophin signaling pathway","Pathways in cancer","Ras signaling pathway","Viral myocarditis")
# retrieve a list of pathways from a database for a given species
pathways  <- pathways("hsapiens", "kegg")[names]
# convert the node identifiers of pathways

# For each pathway, build a graphNEL object representing its topology
graphs<-lapply(pathways,function(p) pathwayGraph(p))

# Match node IDs with the names of the data matrix columns (delete the prefix 'ENTREZID:')
# (graphite version 1.24.1)
for(i in 1:length(graphs)) graph::nodes(graphs[[i]])<-gsub("ENTREZID:","",graph::nodes(graphs[[i]]))

We observe that pathways are commonly translated into directed graphs, while the source set algorithm works with decomposable structures. However, it should be stressed that we can always obtain a decomposable one in a few steps (i.e., moralization and triangulation) and this feature is internally implemented in the sourceSet function.

graphs$`Chronic myeloid leukemia`

Results and discussion {#ALLresults}

path<-system.file("extdata","ALLsourceresult.RData",package = "SourceSet")
load(file = path)

We use the sourceSet function to analyze the ALL dataset and the selected pathways. In this case, to examine a collection of graphs, the regularized estimate of the covariance matrix (shrink=TRUE) and the permutational p-value distribution (permute=TRUE) are preferable because of the medium/low number of replicates per class (for more details see section Function sourceSet).

# It requires about 16 minutes:
# run instead: load(file=system.file("extdata","ALLsourceresult.RData",package = "SourceSet"))
results.all<-sourceSet(graphs,data,classes,seed =111 ,permute =TRUE ,shrink =TRUE )

The results are summarized through the easyLookSource function, where only genes that appear in at least one of the source sets of the investigated pathways are shown (max.num.variable). Moreover, using the map.gene.names argument, all the visualization functions of the SourceSet package allow converting the identifiers used in the expression data matrix with customized labels. To facilitate the identification of the chimera genes we used gene symbols.

ABL1 (first column) is annotated in seven pathways, and apart from one (i.e., Axon guidance), it is always identified in the source set. While BCR (second column) is annotated in two pathways and for both, it is detected in the source set. Moreover, it is worth noting that, as expected, ABL and BCR are the only genes in the source set of the target pathway (sixth row).

# Convert identifiers:
# map between Entrez Gene identifiers and gene symbols
mapped.genes.symbol <- as.list(org.Hs.egSYMBOL[rownames(data.all)])

# Lists of primary genes for the analyzed graphs
primary<-lapply(results.all,function(x) x$primarySet)
# Number of primary genes 

easyLookSource(sourceObj = results.all, = mapped.genes.symbol,
               maxnum.variable = n.primary,
               label.variable = "Genes",label.graph = "Pathways")

All this information and much more (such as useful indexes for exploratory analysis, see section Statistics) can be obtained in a tabular format using the infoSource function, inside the variable list. We point out that chimera genes also achieve the highest score and relevance values among all the annotated genes.

info.all<-infoSource(results.all, = mapped.genes.symbol)
knitr::kable(info.all$variable[1:5,],caption = "> info.all$variable")

If we look at the info.all$graph for the Chronic myeloid leukemia pathway, we notice that the source set algorithm highlights the fundamental role of the chimera, which with marginal methods, would be hidden because of the propagation of the perturbation (n.secondary), involving other 16 out of 67 genes.

knitr::kable(info.all$graph,caption = "> info.all$graph")

The module composed of the chimera genes (module 6), is also detected in Pathway in cancer.

For Viral Myocarditis pathway, we observe that the source set consists of three disconnected groups of primary genes, one of which (module 7) comprises ABL1. This behavior could be due to the high fragmentation of the pathway itself; in fact, it is composed of six distinct clusters (info.all$graph$n.cluster) and a total of only 25 nodes (info.all$graph$n.graph).

sourceSankeyDiagram(results.all,height = 600,width = 800, = mapped.genes.symbol)

sourceSankeyDiagram(results.all,height = 600,width = 800, = mapped.genes.symbol)

To deepen the interpretation of each pathway by highlighting the results of the source set analysis, we can take advantage of the Cytoscape interface (see scetion Cytoscape).

In this case, it may be useful to subdivide the pathways into several groups according to a biological rationale (for example distinguishing those of signal from the others) and highlight the relations between source sets of different groups through their graphical union. To do this, simply enter the names of the pathways of each group in the name.graphs argument and load them into different collections.

# NB: Remember to launch cytoscape before running the following commands

# Create two collections of pathways to visualize the results

# Signaling collection + union source set
cytoID.signaling<-sourceCytoscape(results.all, name.graphs = graph.signaling, ="SignalingPathway", = mapped.genes.symbol)
cytoID.signaling.union<-sourceUnionCytoscape(results.all, name.graphs =graph.signaling , ="SignalingPathway" , ="SignalingUnion", =mapped.genes.symbol)

# Other collection + union source set
cytoID.other<-sourceCytoscape(results.all, name.graphs = graph.other, ="OtherPathway", = mapped.genes.symbol)
cytoID.other.union<-sourceUnionCytoscape(results.all ,name.graphs =graph.other, ="OtherPathway" , ="OtherUnion", =mapped.genes.symbol)

Note on Cytoscape {#cyto}

The sourceCytoscape andsourceUnionCytoscape functions use the r2cytoscape package to connect to Cytoscape from R using CyREST. r2cytoscape can be downloaded from the Bioconductor or GitHub repository (for old version of R) as follows:

# Install from GitHub

To enable the two display functions to work properly, three simple steps are required:

  1. Download [Cytoscape] ( (version 3.3 or later)
  2. Complete installation wizard [^commandline]
  3. Launch Cytoscape (before calling the functions).

Further details on the usage can be found in help (package =" r2cytoscape ").


[^permute]: even if the argument permute is set to FALSE the function will permute the dataset; these permutations will be used to calculate the multiple testing adjusted cut-off for the asymptotic p-values.

[^commandline]: under some versions it is necessary to install manually Command Line Implementation.

[^style]: sometimes it may be necessary to rerun the command to apply the default style.

[^chimera]: the complete analysis of the whole KEGG collection is reported in the original manuscript, see @SourceSetpackage

Try the SourceSet package in your browser

Any scripts or data that you put into this service are public.

SourceSet documentation built on Oct. 30, 2019, 9:38 a.m.