library(sigora)
library(prora)
library(readxl)
library(UpSetR)
library(msigdbr)

How does over-representation analysis (ORA) work?

The input is a user-specified list of genes/proteins which are differentially expressed in two experiments. We test whether this list is significantly associated with a pathway or a set of pathways. This analysis depends on the background which is the set of pathways that we consider for this test. The larger the background, the less pathways can be identified as differentially regulated. All genes on the list are treated as equally important. The list contains the genes with estimates/fold changes larger than a certain threshold.

As a test, Fisher's exact test is used, which is based on the hypergeometric distribution. One issue is multiple testing since we test multiple pathways. Corrections such as Bonferroni are used.

See further: Lucas Kook, "Pathway Analysis" slides, p. 4, and http://www.nonlinear.com/progenesis/qi/v2.0/faq/should-i-use-enrichment-or-over-representation-analysis-for-pathways-data.aspx

How does signature over-representation analysis (SIGORA) work?

The difference of SIGORA to ORA is that it looks at gene pairs instead of individual genes for the association of genes with pathways. Usually, genes belong to multiple pathways. The probability is a lot higher to find a combination of two genes that is unique for one pathway. Hence, the relevant pathways can be better identified on the level of gene pairs. For a sigora analysis, two steps are needed:

\begin{enumerate} \item For pathways of a repository, identify the gene pair signatures, \emph{i.e.} weighted pairs of genes that are unique for the given pathway. We call this the gene pair signature repository (GPS repo). \item Perform an ORA on these signatures. \end{enumerate}

Example from the SIGORA package description (step 2)

The package comes with precomputed GPS repositories for KEGG human and mouse (kegH and kegM) and Reactome human and mouse (reaH and reaM).

## example data
data(kegH)
## select 50 genes from 3 human KEGG pathways
a <- genesFromRandomPathways(seed = 12345, kegH, 3, 50)
## the genes are
a[["genes"]]

## sigora analyis (for saving results: saveFile = "myResultsKEGG.csv")
sigoraRes <- sigora(GPSrepo = kegH, queryList = a[["genes"]], level = 4)
## again, the three originally selected pathways were:
a[["selectedPathways"]]
## genes related to the second most significant result:
getGenes(sigoraRes, 2)
## number of genes related to the most significant result:
nrow(getGenes(sigoraRes, 1))
## traditional ora identifies dozens of statistically significant pathways!
oraRes <- ora(a[["genes"]], kegH)
nrow(oraRes)
## first 10 pathways:
oraRes[1:10,]

In the SIGORA analysis, only 2 pathways are identified as significant. More precisely, 2 of the 3 randomly selected pathways in the query list.

In the ORA analysis, 62 pathways are identified as significant. This is because genes belong to multiple pathways but gene pairs are more unique and identify the relevant pathways better.

\textbf{Question:} How can an UpSet plot for the ORA analysis be created? It would need the set of genes involved in each pathway.

Create the GPS repository (step 1)

The package provides a function makeGPS() to create a GPS repository from the user's repository of choice, \emph{e.g.} Gene Ontology Biological Processes. The repository format should be a tab delimited file or a datafrane with three columns ordered \textbf{PathwayID, PathwayName, Gene}. The function makeGPS() identifies GPSs and PUGs.

Note: This function relies on package slam, which should be installed from CRAN. It is fairly memory intensive, and it is recommended to be run on a machine with at least 6GB of RAM. Also, make sure to save and reuse the resulting GPS repository in future analyses!

The following functions from package slam are used:

The first creates a sparse matrix and the second is used to compute the cross-product for sparse matrices (faster implementation).

Consider now an example from the SIGORA package description to see how makeGPS() can be used. Note that sigora::idmap is a table with Ensembl IDs, Entrez IDs and Symbols for human and mouse. The data nciTable gives the National cancer instiute (NCI) human gene-pathway associations.

## example data
data(nciTable)
head(nciTable)

## create a GPS repository (for future reuse: saveFile='nciH.rda')
nciH <- makeGPS(pathwayTable = nciTable)
## set up a gene list by finding all words starting with IL... 
## in the Symbol column in idmap
ils <- grep("^IL", idmap[, "Symbol"], value = TRUE)
## sigora analysis
ilnci <- sigora(GPSrepo = nciH, queryList = ils, level = 3)

5 pathways identified as significant. It did not take long (around 1 sec).

\textbf{Question:} Does it find all of the "^IL" pathways?

\newpage

ID mapping

The package SIGORA works for Ensembl IDs, Entrez IDs and Gene Symbols (HGNC/ MGI). The GPS repository and the gene list can have any of those three identifiers and they will be mapped via idmap if not the same identifier is used. You can load the querylist as follows:

In the following, we try different ID mappings.

Parsing FASTA headers to Uniprot IDs

## example data
dog <- read_excel("../sampleData/CanisLupus.xlsx")
yeast <- read_excel("../sampleData/YEAST_Example.xlsx")

## bring it to the right format
dog <- dog[, c("top_protein", "estimate")]
colnames(dog) <- c("protein_Id", "estimate")
yeast <- yeast[, c("TopProteinName", "pseudo.log2FC")]
colnames(yeast) <- c("protein_Id", "estimate")

## FASTA header --> Uniprot

dog_uniprot <- prora::get_UniprotID_from_fasta_header(dog)
head(dog_uniprot)
nrow(dog) - nrow(dog_uniprot) # is 0 but we have NAs so we lost some?
# we lost (in percent)
100 / nrow(dog) * sum(is.na(dog_uniprot$UniprotID))

yeast_uniprot <- prora::get_UniprotID_from_fasta_header(yeast)
nrow(yeast) - nrow(yeast_uniprot) # is 0 but we have NAs so we lost some?
# we lost (in percent)
100 / nrow(yeast) * sum(is.na(yeast_uniprot$UniprotID))

\textbf{Question:} Why does the prora vignette say that $0$ genes are lost when mapping FASTA header to Uniprot IDs? One can check that there are NAs, so maybe comparing number of rows is not the correct way to identify lost IDs..?

\textbf{Question:} The vignette shows that sigora works with Uniprot IDs. Why? From reading the package description, it looks like it works only for Ensembl IDs, Entrez IDs and Gene Symbols.

Note: Creating a GPS repository from the GO repository using the function makeGPS_wrappR() does not work using the Uniprot IDs for dog and yeast.

We should try with makeGPS(). For this, we need to bring the yeast and dog datasets in the right format. We would need the columns \textbf{PathwayID, PathwayName, Gene} with Ensembl or Entrez IDs.

Note: The function checkIDmappingEfficiency() does not work for this test data.

Mapping Uniprot IDs to Entrez IDs

## Uniprot --> Entrez
dog_uniprot_entrez <- map_ids_uniprot(dog_uniprot, ID_col = "UniprotID")
# we lost additionally to what we lost before (in percent)
100 / sum(!is.na(dog_uniprot$UniprotID)) * 
  (sum(is.na(dog_uniprot_entrez$P_ENTREZGENEID)) - sum(is.na(dog_uniprot_entrez$UniprotID)))
# one to many mapping?
nrow(dog_uniprot_entrez) - nrow(dog_uniprot)

yeast_uniprot_entrez <- map_ids_uniprot(yeast_uniprot, ID_col = "UniprotID")
# we lost additionally to what we lost before (in percent)
100 / sum(!is.na(yeast_uniprot$UniprotID)) *
  (sum(is.na(yeast_uniprot_entrez$P_ENTREZGENEID)) - sum(is.na(yeast_uniprot_entrez$UniprotID)))
# one to many mapping?
nrow(yeast_uniprot_entrez) - nrow(yeast_uniprot)

Use the MSigDB table for creating a GPS repository

## what species are there?
#msigdbr_show_species()

## example tables
## from category GO and subcategory BP
msig_dog <- msigdbr(species = "Canis lupus familiaris", 
                    category = "C5", 
                    subcategory = "BP")
nrow(msig_dog)

msig_yeast <- msigdbr(species = "Saccharomyces cerevisiae", 
                      category = "C5", 
                      subcategory = "BP")
nrow(msig_yeast)


## create a pathway table
## make it compatible
colnames(dog_uniprot_entrez) <-
  c("UniprotID", "entrez_gene", "protein_Id", "estimate")
dog_uniprot_entrez$entrez_gene <-
  as.integer(dog_uniprot_entrez$entrez_gene)

colnames(yeast_uniprot_entrez) <-
  c("UniprotID", "entrez_gene", "protein_Id", "estimate")
yeast_uniprot_entrez$entrez_gene <-
  as.integer(yeast_uniprot_entrez$entrez_gene)
## join
msig_dog <- dplyr::inner_join(dog_uniprot_entrez, msig_dog, 
                              by = "entrez_gene")
# we lost (in percent)
100 / sum(!is.na(dog_uniprot_entrez$entrez_gene)) * 
  (sum(!is.na(dog_uniprot_entrez$entrez_gene)) - dim(table(msig_dog$entrez_gene)))

msig_yeast <- dplyr::inner_join(yeast_uniprot_entrez, msig_yeast, 
                                by = "entrez_gene")
# we lost (in percent)
100 / sum(!is.na(yeast_uniprot_entrez$entrez_gene)) * 
  (sum(!is.na(yeast_uniprot_entrez$entrez_gene)) - dim(table(msig_yeast$entrez_gene)))
## bring it to the right format for makeGPS()
msig_dogTable <- msig_dog[, c("gs_id", "gs_name", "entrez_gene")]
colnames(msig_dogTable) <- c("PathwayID", "PathwayName", "Gene")

msig_yeastTable <- msig_yeast[, c("gs_id", "gs_name", "entrez_gene")]
colnames(msig_yeastTable) <- c("PathwayID", "PathwayName", "Gene")

## create a GPS repository
msigDog <- makeGPS(pathwayTable = msig_dogTable)
msigYeast <- makeGPS(pathwayTable = msig_yeastTable)

## set up the query list
listDog <- genesFromRandomPathways(seed=12345, msigDog, 3, 50)
listYeast <- genesFromRandomPathways(seed=12345, msigYeast, 3, 50)
## the genes are
listDog[["genes"]]
listYeast[["genes"]]

## sigora analysis
res_d <- sigora(GPSrepo = msigDog, 
                queryList = listDog[["genes"]], 
                level = 4)
res_y <- sigora(GPSrepo = msigYeast, 
                queryList = listYeast[["genes"]], 
                level = 4)

The SIGORA analysis identifies three pathways for dog (two of which are among the three randomly selected) and three pathways for yeast (none of the randomly selected, however).

Try the mapping procedure with human test data from the package

## example data --> Uniprot --> Entrez
data("exampleContrastData")
dd <- prora::get_UniprotID_from_fasta_header(exampleContrastData)
dd <- map_ids_uniprot(dd, ID_col = "UniprotID")
colnames(dd) <- c("UniprotID", "entrez_gene", "protein_Id", "estimate")
dd$entrez_gene <- as.integer(dd$entrez_gene)

## pathway table
msig_h <- msigdbr(species = "Homo sapiens", 
                  category = "C5", 
                  subcategory = "BP")
msig_hTable <- dplyr::inner_join(dd, msig_h, by = "entrez_gene")
msig_hTable <- msig_hTable[, c("gs_id", "gs_name", "entrez_gene")]
colnames(msig_hTable) <- c("PathwayID", "PathwayName", "Gene")

## GPS repository
msigH <- makeGPS(pathwayTable = msig_hTable)

## sigora analysis
# trick to use sigoraWrappR
colnames(dd) <- c("a", "UniprotID", "protein_Id", "estimate")
res_h_wrapp <- sigoraWrappR(data = dd,
                            threshold = 0.6,
                            score_col = "estimate",
                            GPSrepos = msigH,
                            greater_than = TRUE)
# directly with sigora
listHuman <- genesFromRandomPathways(seed=12345, msigH, 3, 50)
res_h <- sigora(GPSrepo = msigH, 
                queryList = listHuman[["genes"]], 
                level = 4)

Note: sigoraWrappR() uses the estimates and a threshold to obtain the gene list whereas sigora() needs a precomputed gene list, here randomly selected.

Extensions/ Questions

Only certain repositories?

According to the SIGORA package description, there is no limitation regarding repositories.

Only for human and mouse?

The function makeGPS works for dog and yeast using MSigDBr.

The SIGORA internal idmap (when it is needed) does only work for human and mouse. In the sigora() function, idmap is used only if the following is TRUE:

That means, idmap is used when the queryList and the GPSrepo have no intersection. In the three previous examples for dog, yeast and human using MSigDBr, this if statement is FALSE.

What does the level argument in the sigora() function mean?

Some repositories are hierarchical, \emph{e.g.} Reactome and GO. Hierarchical means that a parent pathway contains all the pathways from its children pathways.

If the repository is hierarchical, children pathways would not have GPSs and would therefore not be detected in a sigora analysis. In this case, SIGORA does the GPS finding procedure in several steps, in other words for several levels:

The GPS repository for a hierarchical repository then consists of sublists of GPS repositories for all the levels, \emph{e.g.} L1, L2,\dots, L5. Specifying the argument "levels=4" in the sigora() function then corresponds to sigora analyses with the GPS repositories up to level 4.



protViz/prora documentation built on Dec. 12, 2021, 12:32 a.m.