sourceSet: Source Set

View source: R/mainFunctions.R

Description

Identify the sets of variables that are potential sources of differential behavior (i.e., the primary genes) between two experimental conditions. The two experimental conditions are associated with a set of graphs, where each graph represents the topology of a biological pathway.

Usage

sourceSet(graphs, data, classes, seed = NULL, theta = 1,
  permute = TRUE, alpha = 0.05, shrink = FALSE,
  return.permutations = FALSE)

Arguments

graphs

a list of graphNEL objects representing the pathways to be analyzed.

data

a matrix of expression levels with column names for genes and row names for samples; gene names must be unique.

classes

a vector of length equal to the number of rows of data. It indicates the class (condition) of each statistical unit. Only two classes, labeled as 1 and 2, are allowed.

seed

integer value to get a reproducible random result. See Random.

theta

positive numeric value greater than or equal to 1 (the default is 1) that defines the number of permutations. If permute=TRUE, (m/alpha × theta) permutations are used, where m is the number of unique conditional tests to be performed; otherwise, (1/alpha × theta) permutations are used.

permute

if TRUE permutation p-values are provided; if FALSE, asymptotic p-values are returned. NOTE: even if the argument permute is set to FALSE the function will permute the dataset; these permutations will be used to calculate the adjusted cut-off for the asymptotic p-values.

alpha

the p-value threshold. Denotes the level at which FWER is controlled for each input graph.

shrink

if TRUE, regularized estimation of the covariance matrices is performed; otherwise, maximum likelihood estimation is used.

return.permutations

if TRUE, the function returns the matrix of test statistic values for the supplied (first row) and the permuted datasets.
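
The requirements on data and classes listed above can be checked before calling sourceSet. The following is a minimal sketch in base R, not part of the package: check_inputs is a hypothetical helper name, the requirement that both classes be present is an assumption, and the graphs argument is not checked.

```r
# Hypothetical helper (not part of SourceSet): checks the documented
# requirements on `data` and `classes` before calling sourceSet().
check_inputs <- function(data, classes) {
  stopifnot(
    is.matrix(data),
    !is.null(colnames(data)),          # genes are in columns
    !anyDuplicated(colnames(data)),    # gene names must be unique
    length(classes) == nrow(data),     # one label per sample (row)
    all(classes %in% c(1, 2)),         # only labels 1 and 2 are allowed
    length(unique(classes)) == 2       # assumption: both classes are present
  )
  invisible(TRUE)
}

# Toy input: 6 samples (rows) by 3 genes (columns)
set.seed(1)
toy_data <- matrix(rnorm(18), nrow = 6,
                   dimnames = list(paste0("s", 1:6), c("g1", "g2", "g3")))
toy_classes <- rep(c(1, 2), each = 3)
check_inputs(toy_data, toy_classes)
```

The helper stops with an error at the first violated requirement, so it can be placed directly before the sourceSet call.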

Details

The sourceSet approach models the data of the same pathway in two different experimental conditions as realizations of two Gaussian graphical models sharing the same decomposable graph G. Here, G = (V,E) is obtained from the pathway topology conversion, where V and E represent genes and biochemical reactions, respectively.

We give full freedom to the user in providing the underlying graph G, requiring only a specific input format (i.e., a graphNEL object). The user can therefore provide a list of manually curated pathways or use existing software to translate knowledge bases into graphs. To date, the most complete software available for this task is the graphite R package (Sales et al. 2017).

The source set algorithm infers the set of primary genes (i.e., the source set) following - for each graph - five steps:

Although the interpretation of the source set for a single graph is intuitive, the interpretation of the collection of results associated with a set of pathways might be complex. For this reason, we propose a guideline for the meta-analysis, providing descriptive statistics and predefined plots. See infoSource, easyLookSource, sourceSankeyDiagram, sourceCytoscape and sourceUnionCytoscape.

Value

The output of the function is an object of the sourceSetList class. It contains as many lists as the input graphs, and each of them provides the following variables:

Note

If the permute and/or shrink parameters violate the conditions required for the existence of the full-rank maximum likelihood estimates, the algorithm may change the user settings through internal controls.

Indeed, if the user wants to use the MLE of the covariance matrix (shrink=FALSE), all cliques - in all pathways - must satisfy the n > p_i condition, where n is the number of samples for the smaller class and p_i is the cardinality of the largest clique in the i-th pathway. If even one clique does not satisfy this requirement, the regularized estimate must be used. When a regularized estimate is employed (shrink=TRUE), the analytical null distribution of the test statistics is no longer available, and we rely on permutation methods to obtain the associated p-values.
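
The n > p_i condition can be checked by hand before choosing shrink. The sketch below is illustrative only and is not the package's internal control: mle_is_feasible is a hypothetical helper, and it assumes the cardinality of the largest clique of each pathway has already been computed and is passed in as a numeric vector.

```r
# Hypothetical helper (not part of SourceSet): TRUE when the smaller class
# has more samples than the largest clique of every pathway (n > p_i).
mle_is_feasible <- function(classes, largest_clique_sizes) {
  n <- min(table(classes))          # number of samples in the smaller class
  all(n > largest_clique_sizes)     # n > p_i must hold for every pathway i
}

classes <- c(rep(1, 50), rep(2, 40))   # smaller class has 40 samples

# All cliques are smaller than 40: the MLE can be used (shrink = FALSE)
mle_is_feasible(classes, largest_clique_sizes = c(5, 12, 30))

# One pathway has a clique of size 45: the regularized estimate is needed
mle_is_feasible(classes, largest_clique_sizes = c(5, 12, 45))
```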

To address the multiple testing problem, we use two versions of the method proposed by Westfall and Young (2017), which uses permutations to obtain the joint distribution of the p-values. More specifically, when the maximum likelihood estimates of the covariance matrices are used (shrink=FALSE), the asymptotic p-values and the maxT approach are adopted. If instead the regularized estimates are calculated (shrink=TRUE), the asymptotic distribution is no longer valid, and the minP version, together with the per-hypothesis permutation p-values, is needed to obtain the joint distribution of the p-values. The number of permutations depends on the method, the alpha level chosen, and the number of hypotheses. A minimum of 500 and a maximum of 10,000 permutations are allowed.
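
To make the dependence on the method, alpha and the number of hypotheses concrete, the rule given for the theta argument can be combined with the bounds stated above. This is a rough sketch, not the package's code: n_permutations is a hypothetical helper, and rounding up and clamping to the 500 and 10,000 limits are assumptions about how the final count is obtained.

```r
# Hypothetical helper (not part of SourceSet). Rounding up and clamping to
# the documented bounds are assumptions about how the count is finalized.
n_permutations <- function(m, alpha = 0.05, theta = 1, permute = TRUE) {
  # m / alpha * theta when permutation p-values are requested,
  # 1 / alpha * theta when asymptotic p-values are used
  raw <- if (permute) m / alpha * theta else 1 / alpha * theta
  min(max(ceiling(raw), 500), 10000)
}

n_permutations(m = 40, alpha = 0.05, permute = TRUE)    # 40 / 0.05 = 800
n_permutations(m = 40, alpha = 0.05, permute = FALSE)   # 1 / 0.05 = 20, raised to 500
n_permutations(m = 1000, alpha = 0.05, permute = TRUE)  # 20000, capped at 10000
```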

References

Sales, G. et al. (2017). graphite: GRAPH Interaction from pathway Topological Environment. R package version 1.22.0.

Westfall, P. and Young, S. (2017). Resampling-based multiple testing: examples and methods for p-value adjustment. Wiley.

Djordjilovic, V. and Chiogna, M. (2017). Searching for a Source of Difference: a Graphical Model Approach. Working Paper Series, 4/2017, Padova.

Salviato et al. (2019). SourceSet: a graphical model approach to identify primary genes in perturbed biological pathways. (Accepted - PLOS Computational Biology).

See Also

pathways, infoSource, easyLookSource, sourceSankeyDiagram, sourceCytoscape and sourceUnionCytoscape

Examples

#### Toy example: only one graph
if (require(mvtnorm)) {
  # Generate two random samples of size 50 from two multivariate normal distributions
  n <- 50
  # true parameters of class 1 and class 2 (from the `simulation` dataset)
  param.class1 <- simulation$condition1
  param.class2 <- simulation$condition2$`10`$`2`

  # simulated dataset
  data.class1 <- rmvnorm(n = n, mean = param.class1$mu, sigma = param.class1$S)
  data.class2 <- rmvnorm(n = n, mean = param.class2$mu, sigma = param.class2$S)

  # Input arguments for the sourceSet function
  data <- rbind(data.class1, data.class2)
  classes <- c(rep(1, nrow(data.class1)), rep(2, nrow(data.class2)))
  graphs <- list("toy.graph" = simulation$graph)

  result <- sourceSet(graphs, data, classes, seed = 123,
                      permute = FALSE, shrink = FALSE, alpha = 0.05)

  # source set: primary dysregulation (toy.graph)
  result$toy.graph$primarySet
  # secondary dysregulation (toy.graph)
  result$toy.graph$secondarySet
  # all affected variables
  unique(unlist(result$toy.graph$orderingSet))

  # summary statistics
  info <- infoSource(result)
  info$variable
  info$graph

  # visual summaries
  easyLookSource(result)
  sourceSankeyDiagram(result)
}

# launch Cytoscape, then run (requires `result` from the toy example above):

sourceCytoscape(result, name.graphs = "toy.graph", collection.name = "Example")
sourceUnionCytoscape(result, collection.name = "Example")


### Real data:
# see vignette, section Getting deepening
vignette("SourceSet")

SourceSet documentation built on Oct. 30, 2019, 9:38 a.m.