PAGI.Main: A novel pathway identification approach based on global...


View source: R/PAGI.1.0.R

Description

PAGI.Main attempts to identify dysregulated pathways, which are influenced by both the internal effect of pathways and crosstalk between pathways, by integrating pathway topological information and differences between two biological phenotypes.

Usage

PAGI.Main(dataset, class.labels, nperm = 100, p.val.threshold = -1, FDR.threshold = 0.01,
          gs.size.threshold.min = 25, gs.size.threshold.max = 500)

Arguments

dataset

A dataframe of gene expression data whose first column contains gene symbols and whose column names are sample names.

class.labels

A vector of binary labels that assigns each sample to one of the two phenotype classes.

nperm

An integer. The number of random permutations. The default value is 100.

p.val.threshold

A value. The nominal (NOM) p-value significance threshold; detailed results are presented for pathways below this threshold. The default value is -1, which means no threshold.

FDR.threshold

A value. The FDR q-value significance threshold; detailed results are presented for pathways below this threshold. The default value is 0.01.

gs.size.threshold.min

An integer. The minimum size (in genes) for pathways to be considered. The default value is 25.

gs.size.threshold.max

An integer. The maximum size (in genes) for pathways to be considered. The default value is 500.
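
The two size thresholds can be pictured as a filter on pathway gene sets. The following is a minimal sketch, not PAGI's internal code: it assumes that pathway size is counted after mapping to the genes present in the dataset and that both bounds are inclusive. The helper filter.by.size, the gene identifiers and the pathway names are hypothetical.

```r
# Hypothetical helper, not part of PAGI: keep gene sets whose size, counted
# after mapping to the genes in the dataset, lies within [min.size, max.size].
# Inclusive bounds are an assumption.
filter.by.size <- function(gene.sets, dataset.genes, min.size = 25, max.size = 500) {
  mapped <- lapply(gene.sets, intersect, dataset.genes)
  sizes <- lengths(mapped)
  mapped[sizes >= min.size & sizes <= max.size]
}

# Toy input: made-up gene identifiers and pathway names.
dataset.genes <- paste0("G", 1:100)
gene.sets <- list(
  small  = paste0("G", 1:10),    # 10 mapped genes: below the minimum
  medium = paste0("G", 1:40),    # 40 mapped genes: kept
  sparse = paste0("G", 90:130)   # 41 genes, but only 11 are in the dataset
)
kept <- filter.by.size(gene.sets, dataset.genes)
names(kept)
```

With the default thresholds, only the toy pathway "medium" passes the filter in this sketch.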

Details

Given a gene expression data set and a vector of binary class labels, the function identifies dysregulated pathways in three main steps:

(1) Genes with an absolute t-score greater than 0 are mapped to a global graph, which is reconstructed from the gene relationships extracted from each pathway in the KEGG database and from the genes shared between pathways.

(2) A global influence factor (GIF) is defined to distinguish the non-equivalence of genes influenced by both the internal effect of pathways and crosstalk between pathways in the global network. The random walk with restart (RWR) algorithm is used to evaluate the GIF by integrating the global network topology and the correlation of each gene with the phenotype.

(3) Cumulative distribution functions (CDFs) are used to prioritize the dysregulated pathways. Permutation is used to estimate the statistical significance of pathways (nominal p-values), and the FDR is used to account for false positives.
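
To make the random walk with restart idea concrete, here is a minimal generic sketch on a toy graph. It is not PAGI's internal implementation: the function rwr, the restart probability of 0.7, the column normalisation, the convergence tolerance and the toy gene names and scores are all assumptions chosen for illustration. The seed vector plays the role of the gene-phenotype correlation (here, absolute t-scores), and the stationary probabilities play the role of a per-gene influence score.

```r
# Generic random walk with restart; an illustrative sketch only.
rwr <- function(adj, p0, restart = 0.7, tol = 1e-10, max.iter = 1000) {
  col.sums <- colSums(adj)
  col.sums[col.sums == 0] <- 1          # avoid division by zero for isolated nodes
  W <- sweep(adj, 2, col.sums, "/")     # column-normalised transition matrix
  p0 <- p0 / sum(p0)                    # seed vector as a probability distribution
  p <- p0
  for (i in seq_len(max.iter)) {
    # One step: follow an edge with prob. (1 - restart), jump back to the seeds otherwise.
    p.next <- (1 - restart) * as.vector(W %*% p) + restart * p0
    delta <- sum(abs(p.next - p))
    p <- p.next
    if (delta < tol) break
  }
  p
}

# Toy undirected path graph A - B - C - D (made-up genes).
genes <- c("A", "B", "C", "D")
adj <- matrix(0, 4, 4, dimnames = list(genes, genes))
adj["A", "B"] <- adj["B", "A"] <- 1
adj["B", "C"] <- adj["C", "B"] <- 1
adj["C", "D"] <- adj["D", "C"] <- 1

# Made-up absolute t-scores used as seed weights.
abs.t <- c(A = 3.0, B = 0.5, C = 0.5, D = 1.0)
gif <- rwr(adj, abs.t)
round(gif, 3)
```

In this sketch a gene's score depends both on its own seed weight and on the weights of its network neighbours, which is the intuition behind combining topology with phenotype correlation.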

The argument dataset is a gene expression data set stored in a dataframe. The first column of the dataframe contains gene symbols, and the column names of the dataframe are the sample names.
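
As an illustration of the expected input layout, the following sketch builds a small toy dataframe and a matching label vector, then checks their shape. The gene symbols, expression values and the "0"/"1" label coding are made up for the example, and the assumption that labels follow the order of the sample columns is ours; consult the package's example data for the exact conventions.

```r
# Toy data only: values are random and the label coding is an assumption.
set.seed(1)
toy.dataset <- data.frame(
  GeneSymbol = c("TP53", "EGFR", "MYC", "BRCA1"),  # first column: gene symbols
  sample1 = rnorm(4),                              # remaining columns: one per sample
  sample2 = rnorm(4),
  sample3 = rnorm(4),
  sample4 = rnorm(4),
  stringsAsFactors = FALSE
)

# One binary label per sample column, assumed to be in column order.
toy.class.labels <- c("0", "0", "1", "1")

# Basic shape checks before passing the objects on.
stopifnot(
  is.character(toy.dataset[[1]]),
  all(sapply(toy.dataset[-1], is.numeric)),
  length(toy.class.labels) == ncol(toy.dataset) - 1,
  length(unique(toy.class.labels)) == 2
)
```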

Value

A list. It includes two elements: SummaryResult and PathwayList.

SummaryResult is a dataframe that summarizes the results for all pathways. Each row of the dataframe represents a pathway. Its columns include "Pathway Name", "SIZE", "PathwayID", "Pathway Score", "NOM p-val", "FDR q-val", "Tag percentage" (percent of gene set before running enrichment peak), "Gene percentage" (percent of gene list before running enrichment peak) and "Signal strength" (enrichment signal strength).

PathwayList is a list that presents the detailed results of pathways with NOM p-val < p.val.threshold or FDR < FDR.threshold. Each element of the list is a dataframe, and each row of the dataframe represents a gene. Its columns include "Gene number in the (sorted) pathway", "gene symbol from the gene express data", "location of the gene in the sorted gene list", "the T-score of gene between two biological states", "global influence impactor" and "if the gene contribute to the score of pathway".
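
The sketch below shows one way to pick significant pathways out of a return value with this shape. It uses a hand-made mock object rather than real PAGI output: the pathway names and numbers are invented, only a subset of the columns is included, and it assumes that the list elements are named SummaryResult and PathwayList and that the p-value columns are numeric (if they are stored as text, convert them with as.numeric first). The examples on this page index the list by position instead.

```r
# Mock object with the documented shape; all values are made up.
summary.result <- data.frame(
  "Pathway Name" = c("pathway A", "pathway B", "pathway C"),
  "SIZE"         = c(40, 120, 60),
  "NOM p-val"    = c(0.00, 0.02, 0.30),
  "FDR q-val"    = c(0.005, 0.04, 0.60),
  check.names = FALSE,        # keep column names that contain spaces
  stringsAsFactors = FALSE
)
mock.result <- list(SummaryResult = summary.result, PathwayList = list())

# Column names contain spaces, so use [[ ]] rather than $ to access them.
fdr <- mock.result$SummaryResult[["FDR q-val"]]
significant <- mock.result$SummaryResult[fdr < 0.01, ]
significant[["Pathway Name"]]
```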

Author(s)

Junwei Han <hanjunwei1981@163.com>, Yanjun Xu <tonghua605@163.com>, Haixiu Yang <yanghaixiu@ems.hrbmu.edu.cn>, Chunquan Li <lcqbio@yahoo.com.cn> and Xia Li <lixia@hrbmu.edu.cn>

References

Subramanian, A., Tamayo, P., Mootha, V.K., Mukherjee, S., Ebert, B.L., Gillette, M.A., Paulovich, A., Pomeroy, S.L., Golub, T.R., Lander, E.S. et al. (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A, 102, 15545-15550.

Li, C., Li, X., Miao, Y., Wang, Q., Jiang, W., Xu, C., Li, J., Han, J., Zhang, F., Gong, B. et al. (2009) SubpathwayMiner: a software package for flexible identification of pathways. Nucleic Acids Res, 37, e131.

Examples

## Not run: 

##########identify dysregulated pathways by using the function PAGI.Main###########
#example 1
#get example data
dataset<-getdataset()
class.labels<-getclass.labels()

#identify dysregulated pathways
result<-PAGI.Main(dataset,class.labels,nperm = 100,p.val.threshold = -1,FDR.threshold = 0.01,
gs.size.threshold.min = 25, gs.size.threshold.max = 500 )

#print the summary results of pathways to screen
result[[1]][1:10,]

#The result is a dataframe. Its rows are ranked by false discovery rate (FDR), and each
#row is a pathway. Its columns include "Pathway Name", "SIZE", "PathwayID",
#"Pathway Score", "NOM p-val", "FDR q-val", "Tag percentage", "Gene percentage" and
#"Signal strength". They correspond to the pathway name, the number of genes from the
#gene expression profile that were mapped to the pathway, the pathway ID, the pathway
#score, the nominal p-value, the FDR value, the percent of gene set before running
#enrichment peak, the percent of gene list before running enrichment peak, and the
#enrichment signal strength.

#print the detail results of pathways to screen
result[[2]][1:5]

#The result is a list. Each element of the list is a dataframe which presents the detailed
#results of genes in a pathway with FDR q-value < 0.01. Each row of the dataframe represents a gene.
#Its columns include "Gene number in the (sorted) pathway", "gene symbol from the gene express data",
#"location of the gene in the sorted gene list", "the T-score of gene between two biological states",
#"global influence impactor", "if the gene contribute to the score of pathway".

#write the summary results of pathways to a tab-delimited file.
write.table(result[[1]], file = "SUMMARY RESULTS.txt", quote=FALSE, row.names=FALSE, sep = "\t")

#write the detailed results of genes for each pathway with FDR q-value < 0.01 to a tab-delimited file.
#seq_along() avoids iterating over 1:0 when no pathway passes the threshold.
for(i in seq_along(result[[2]])){
  gene.report<-result[[2]][[i]]
  filename <- paste(names(result[[2]])[i],".txt", sep="")
  write.table(gene.report, file = filename, quote=FALSE, row.names=FALSE, sep = "\t")
}

#example 2
#get example data
dataset<-read.table(paste(system.file(package="PAGI"),"/localdata/dataset.txt",sep=""),
header=TRUE,sep="\t",quote="\"")
class.labels<-as.character(read.table(paste(system.file(package="PAGI"),
"/localdata/class.labels.txt",sep=""),quote="\"", stringsAsFactors=FALSE)[1,])

#identify dysregulated pathways
result<-PAGI.Main(dataset,class.labels,nperm = 100,p.val.threshold = -1,FDR.threshold = 0.01,
gs.size.threshold.min = 25, gs.size.threshold.max = 500 )

#print the summary results of pathways to screen
result[[1]][1:10,]

#The result is a dataframe. Its rows are ranked by false discovery rate (FDR), and each
#row is a pathway. Its columns include "Pathway Name", "SIZE", "PathwayID",
#"Pathway Score", "NOM p-val", "FDR q-val", "Tag percentage", "Gene percentage" and
#"Signal strength". They correspond to the pathway name, the number of genes from the
#gene expression profile that were mapped to the pathway, the pathway ID, the pathway
#score, the nominal p-value, the FDR value, the percent of gene set before running
#enrichment peak, the percent of gene list before running enrichment peak, and the
#enrichment signal strength.

#print the detail results of pathways to screen
result[[2]][1:5]

#The result is a list. Each element of the list is a dataframe which presents the detailed
#results of genes in a pathway with FDR q-value < 0.01. Each row of the dataframe represents a gene.
#Its columns include "Gene number in the (sorted) pathway", "gene symbol from the gene express data",
#"location of the gene in the sorted gene list", "the T-score of gene between two biological states",
#"global influence impactor", "if the gene contribute to the score of pathway".

#write the summary results of pathways to a tab-delimited file.
write.table(result[[1]], file = "SUMMARY RESULTS.txt", quote=FALSE, row.names=FALSE, sep = "\t")

#write the detailed results of genes for each pathway with FDR q-value < 0.01 to a tab-delimited file.
#seq_along() avoids iterating over 1:0 when no pathway passes the threshold.
for(i in seq_along(result[[2]])){
  gene.report<-result[[2]][[i]]
  filename <- paste(names(result[[2]])[i],".txt", sep="")
  write.table(gene.report, file = filename, quote=FALSE, row.names=FALSE, sep = "\t")
}

## End(Not run)

PAGI documentation built on May 1, 2019, 7:48 p.m.