# An Introduction to GAPGOM In GAPGOM: GAPGOM (novel Gene Annotation Prediction and other GO Metrics)

library(knitr)
library(kableExtra)
knitr::opts_chunk$set( collapse = TRUE, comment = "#>", error = FALSE ) library(GAPGOM)  ## Citation When using GAPGOM, please cite the following; • Ehsani R, Drablos F: TopoICSim: a new semantic similarity measure based on gene ontology. BMC Bioinformatics 2016, 17(1):296. DOI: https://doi.org/10.1186/s12859-016-1160-0 • Ehsani R, Drablos F: Measures of co-expression for improved function prediction of long non-coding RNAs. BMC Bioinformatics 2018. Accepted. ## Introduction GAPGOM (novel Gene Annotation Prediction and other GO Metrics) is an R package with tools and algorithms for estimating correlation of gene expression, enriched terms in gene sets, and semantic distance between sets of gene ontology (GO) terms. This package has been made for predicting the annotation of un-annotated gene(s), in particular with respect to GO, and testing such predictions. The prediction is done by comparing expression patterns between a query gene and a library of annotated genes, and annotate the query gene by enriched terms from the set of genes with similar expression pattern (often described as "guilt by association"). For correlation of gene expression, GAPGOM is introducing LNCRNA2GOA which is a novel tool. The main interface for expression data is currently the Fantom5 data, using Bioconductor's ExpressionSet class. For semantic similarity of GO terms (in particular for testing predictions), the package is using TopoICSim. It makes use of GO data via the GOSemSim package with the godata() interface. GO consists of three main Ontologies; molecular function (MF), biological process (BP) and cellular component (CC). ## Installation Before installing, the package has quite a few dependencies in both cran and Bioconductor. You can run the following block of code (preferably line-by-line because of prompts) to install these and the package itself. ### NEEDED (depends, suggests) if (!requireNamespace("BiocManager")) install.packages("BiocManager") BiocManager::install("GAPGOM", dependencies = TRUE)  ## Expression data interfaces ### ID support As of v0.2.7 and up, all AnnotationDbi IDs should be supported. However, we recommend usage of EntrezIDs because this is the most widely supported ID in this and other packages. If you find issues with respect to ID support, please notify this issue on the package repository. If you want to (or have to) convert IDs manually, the BiomaRt package is recommended. However, translating IDs to other types is lossy and does not always translate well. ### Expression data (FANTOM5) As of right now, the package has one main dataset interface for expression data; The Fantom5 dataset. For other datasets, an ExpressionSet has to be manually made as described later in this chapter. There are a few helper functions to make this data usable. The Fantom5 dataset is only available for the human and mouse genome. Examples of helper functions/interfaces can be found below; # download the fantom5 data file fantom_file <- fantom_download("./", organism = "mouse", noprompt = TRUE) # saves filename # load the file (use fantom_file variable if doing all at once) ft5 <- fantom_load_raw("./mm9.cage_peak_phase1and2combined_tpm_ann.osc.txt", verbose = TRUE) # remove first two rows from fantom5 data (these are seperate statistis, # we just need expressionvalues) ft5$df <- ft5$df[3:nrow(ft5$df),]

# convert the raw fantom table to an ExpressionSet
expset <- fantom_to_expset(ft5, verbose = TRUE)


Do note that all standard columns are necessary when converting to an ExpressionSet in this way!

### Manually specifying an ExpressionSet

Because loading expression data right now is a bit limited, this paragraph will describe how to convert expression data to an ExpressionSet object. We will give an example with randomly selected expression values and IDs. In some cases if you want something specific, defining it this way can actually be better (More control over extra data that goes into the object/better interoperability between packages).

Minimal requirement for an ExpressionSet;

• expression values
• Unique IDs of a certain type. AnnotationDbi keys are the only IDs that are supported right now.

Each row of expression values should have corresponding IDs, with the ID-type as its column name.

Random expression value generation;

# select x random IDs
x_entries <- 1000

go_data <- GAPGOM::set_go_data("human", "BP", computeIC = FALSE)
random_ids <- unique(sample(go_data@geneAnno$ENTREZID, x_entries)) # and only keep # uniques # make general dataframe. expressions <- data.frame(random_ids) colnames(expressions) <- "ENTREZID" expressions$ID

# n expression values depending on the amount of unique IDs that are present
expressionvalues <- abs(rnorm(length(random_ids)*6))*x_entries
expressions[,2:7] <- expressionvalues


Converting the expression dataframe to an ExpressionSet;

expression_matrix <- as.matrix(expressions[,2:ncol(expressions)])
rownames(expression_matrix) <- expressions$ENTREZID featuredat <- as.data.frame(expressions$ENTREZID) # And everything else besides expressionvalues (preferably you don't even need to include the IDs themselves here!)
rownames(featuredat) <- expressions$ENTREZID # because they will be the rownames anyway. expset <- ExpressionSet(expression_matrix, featureData = new("AnnotatedDataFrame", data=featuredat)) # To see how it is structured; head(expset) head(assayData(expset)[["exprs"]]) # where expressionvalues are stored. head(pData(featureData(expset))) # where other information is stored.  ## LNCRNA2GOA (expression similarity) ### Background The LNCRNA2GOA (long non-coding RNA to GO Annotation) or expression_prediction() uses various methods/measures to determine similar genes with similar expression pattern; pearson, spearman, kendall, sobolev and fisher. This calculates scores between expression value sets given a query gene. These scores are used to identify genes for the enrichment analysis, which will be sorted by significance before being returned. It is also possible to find similar expression patterns by using the combined method. The Sobolev and Fisher metrics are so far unique to this package (at least in the context of this type of analysis), all the others are standard methods from the R cor() function. Details of the novel methods are described below (quoted from paper [1], references edited). #### Sobolev metric In this section, we use definitions and notations as in [2]. We start with the usual p-inner product. Let$f$,$g$be real-valued functions (in this case$f$and$g$values are the expression vectors of two genes$f$and$g$):$\langle f,g\rangle_{p} = (\sum_{k=1}^{n}\mid f_k.g_k\mid^p)^\frac{1}{p}$(2) By this notation, Sobolev inner product, norm and meter of degree$k$respectively can be defined by:$\langle f,g\rangle_{p,a}^{S} = \langle f,g\rangle_p+\alpha\langle D^kf,D^kg\rangle_p$(3)$\mid\mid f\mid\mid_{p,k,\alpha}^S = \sqrt{\langle f,f\rangle_{p,\alpha}^S}$(4)$d_{p,k,\alpha}^S(f,g) = \mid\mid f-g\mid\mid_{p,k,\alpha}^S$(5) where$D^k$is the$k$th differential operator. For the special case$p=2$and$\alpha=1$an interesting connection to the Fourier-transform of analysis can be made; let$\hat{f}$be the Fourier-transform$f\hat{f}(\omega_k) = \sum_{j=1}^{N-1}g_jexp(-i\frac{2\pi kj}{N})$(6) Where$\omega_k=\frac{2\pi k}{N}$and$i=\sqrt{-1}$. Finally the norm can be written as$\mid\mid f\mid\mid_{2,k,1}^S = \sqrt{\sum_{j=1}^{N-1}(1+\omega_j)^k\mid \hat{f}(\omega_j)\mid^2}$(7) In this work metric (5) with norm (7) and$k=1$was used. #### Fisher metric In this section we use definitions and notations such as in [3]. To define Fisher information metric we first introduce the n-simplex$P_n$defined by$P_n={x\in R^{n+1}:\forall i, x_n \ge0,\sum_{i=1}^{n+1}xi=1}$(8) The coordinates${x_i}$describe the probability of observing different outcomes in a single experiment (or expression value of a gene in$i$th cell type). The Fisher information metric on$P_n$can be defined by$Jij = \sum_{k=1}^{n+1}\frac{1}{x_k}\frac{\partial x_k}{\partial x_i}\frac{\partial x_k}{\partial x_j}$(9) We now define a well-known representation of the Fisher information as a pull-back metric from the positive n-sphere$S_n^+S_n^+={x\in R^n;\forall i,x_n\ge0,\sum_{i=1}^{n+1}x^2=1}$(10) The transformation$T: P_n\to S_n^+$defined by$T(x)=(\sqrt{x_1}, \dots, \sqrt{x_n+1})$(11) pulls back the Euclidean metric on the surface of the sphere to the Fisher information on the multinomial simplex. Actually, the geodesic distance for$x,y \in P_n$under the Fisher information metric may be defined by measuring the length of the great circle on$S_n^+$between$T(x)$and$T(y)d(x,y) = acos(\sum_{i=1}^{n+1}\sqrt{x_iy_i})$(12) The LNCRNA2GOA method can also be used on other novel genes besides lncRNAs. ### Example #### Scores + Enrichment The following example is an arbitrary use-case. Meaning that this is just an example and does not (necessarily) imply a certain question/real life use-case. id_select_vector represents a vector of gene ids that you'd want to use for annotation enrichment (if left empty, algorithm will use all available gene ids in the ExpressionSet). # Example with default dataset, take a look at the data documentation # to fully grasp what's going on with the making of the filter etc. (Biobase # ExpressionSet) # keep everything that is a protein coding gene (for annotation) filter_vector <- fData(GAPGOM::expset)[( fData(GAPGOM::expset)$GeneType=="protein_coding"),]$GeneID # set gid and run. gid <- "ENSG00000228630" result <- GAPGOM::expression_prediction(gid, GAPGOM::expset, "human", "BP", id_translation_df = GAPGOM::id_translation_df, id_select_vector = filter_vector, method = "combine", verbose = TRUE, filter_pvals = TRUE) kable(result) %>% kable_styling() %>% scroll_box(width = "100%", height = "500px")  Here we display the results, you can see that it has 6 columns; • GOID Describing the significantly similar GO terms • Ontology Describing the ontology of the result. • Pvalue The P-value/significance of the result. • FDR bonferoni normalized P-value • Term The description of the GO term. • used_method Used scoring method for getting the result. Besides this, there is an optional parameter for different GO labeling/annotation; id_translation_df. This dataframe should contain the following; • rownames$\to$rownames of the expression set • first column$\to$Gene ID (e.g. EntrezID). Gene should be the same as in the expression dataset. • second colum$\to$GO IDs. This can also drastically improve calculation time as most of the time is spent on querying for this translation. #### Only scores There is also another algorithm that allows you to just calculate scores and skip the enrichment; # Example with default dataset, take a look at the data documentation # to fully grasp what's going on with making of the filter etc. (Biobase # ExpressionSet) # set an artbitrary gene you want to find similarities for. (5th row in this # case) gid <- "ENSG00000228630" result <- GAPGOM::expression_semantic_scoring(gid, GAPGOM::expset) kable(result[1:100,]) %>% kable_styling() %>% scroll_box(width = "100%", height = "500px")  We can see that this function returns a different dataframe; • original_ids The identifier of the gene expression row • score The similarity score/correlation calculated by one of the methods. • used_method The used method used to calculate the score. The rownames also represent the gene expression row. Only the first 100 rows are shown, otherwise the table would be quite big. Enrichment and GO annotation/translation needs to be done manually after this step. However, this should be quite doable with a bit of help from GOSemSim. #### Original dataset The original publication used the lncRNA2Function data [4], to test if the results are the same, a small script has been made to reproduce the same results and is located at the package install directory under scripts. The script folder also contains the two original scripts for the algorithm, but not neccesarily its data. The data (along with the scripts) can be found on the following websites: Besides this, the scripts folder also contains a proof-of-concept script for potentially doing the analysis on unannotated transcripts (by finding the closest gene). This is eventually meant as a sort of substitute to the famous GREAT tool. ## TopoICSim ### Background TopoICSim or Topological Information Content Similarity, is a method to measure the similarity between two GO terms given the information content and topology of the GO DAG tree. Unlike other similar measures, it considers both the shortest and longest DAG paths between two terms, not just the longest or shortest path(s). The paths along the GO DAG tree get weighted with the information content between two terms. For the information content the following forumla is used;$IC(t) = -log(p(t))$(1) Where$t$is the (GO) term. The IC is calculated by GOSemSim, and based on the frequency of the specific go term ($p(t)$). A GO tree can be described as a triplet$\Lambda=(G,\Sigma,R)$, where$G$is the set of GO terms,$\Sigma$is the set of hierarchical relations between GO terms (mostly defined as is_a or part_of) [5], and$R$is a triplet$(t_i, t_j, \xi)$, where$t_i,t_j\in G$and$\xi\in R$and$t_i\xi t_j$. The$\xi$relationship is an oriented child-parent relation. Top level node of the GO rDAG is the Root, which is a direct parent of the MF, BP and CC nodes. These nodes are called aspect-specific roots and we refer to them as root in the following. A path$P$of length$n$between two terms$t_i,t_j$can be defined as in (23).$P:G\times G\to G\times G\dots\times G=G^{n+1};\P(t_i,t_j) = (t_i,t_j+1,\dots,t_j)$(23) Here$\foralls$,$i\le s<j$,$\exists\xi_s\in \Sigma$,$\exists\tau_s\in R$,$r_s=(t_s,t_{s+1}, \xi_s)$. Because$G$is an rDAG, there might be multiple paths between two terms, so we represent all paths between two terms$t_i,t_j$according to (24).$\mathcal{A}(t_i,t_j)=\underset{P}{\cup}P(t_i,t_j)$(24) We use Inverse Information Content (IIC) values to define shortest and longest paths for two given terms$t_i,t_j$as shown in (25-27).$SP(t_i,t_j) = \underset{P\in A(t_i,t_j)}{argminIIC(P)}$(25)$LP(t_i,t_j) = \underset{P\in A(t_i,t_j)}{argmaxIIC(P)}$(26)$IIC(P) = \sum_{t\in P}\frac{1}{IC(t)}$(27) The standard definition was used to calculate$IC(t)$as shown in (28)$IC(t) = -log\frac{G_t}{G_\mathrm{Tot}}$(28) Here$G_t$is the number of genes annotated by the term$t$and$G_\mathrm{Tot}$is the total number of genes. The distribution of IC is not uniform in the rDAG, so it is possible to have two paths with different lengths but with same IICs. To overcome this problem we weight paths by their lengths, so the definitions in (25) and (26) can be updated according to (29) and (30).$wSP(t_i,t_j)=SP(t_i,t_j)\times len(P)$(29)$wLP(t_i,t_j)=LP(t_i,t_j)\times len(P)$(30) Now let$ComAnc(t_i,t_j)$be the set of all common ancestors for two given terms$(t_i,t_j)$. First we define the disjuntive common ancestors as a subset of$ComAnc(t_i,t_j)$as in (31).$DisComAnc(t_i,t_j) = {x\in ComAnc(t_i,t_j)\mid P(x, root)\cap C(x)=\varnothing}$(31) Here$P(x,root)$is the path between$x$and$root$and$C(x)$is set of all immediate children for$x$. For each disjuntive common ancestor$x$in$DisComAnc(t_i,t_j)$, we define the distance between$t_i,t_j$as the ratio of the weighted shortest path between$t_i,t_j$which passes from$x$to the weighted longest path between$x$and$root$, as in (32-33).$D(t_i,t_j,x) = \frac{wSP(t_i,t_j,x)}{wLP(x,root)}$(32)$wSP(t_i,t_j,x) = wSP(t_i,x)+wSP(t_j,x)$(33) Now the distance for two terms$t_i,t_j$can be defined according to (34).$D(t_i,t_j)=\underset{x\in DisComAnc(t_i,t_j)}{min}D(t_i,t_j,x)$(34) We convert distance values by the$\frac{Arctan(.)}{\pi/2}$function, and the measure for two GO terms$t_i$and$t_j$can be defined as in (35).$S(t_i,t_j) = 1-\frac{Arcatan(D(t_i,t_j))}{\pi/2}$(35) Note that$root$refers to one of three first levels in the rDAG. So if$DisComAnc(t_i,t_j)={root}$then$D(t_i,t_j)=\infty$and$S(t_i,t_j)=0$. Also if$t_i = t_j$then$D(t_i,t_j)=0$and$S(t_i,t_j)=1$. Finally let$S=[s_{ij}]{n\times m}$be a similarity matrix for two given fenes or gene products$g1, g2$with GO terms$t{11},t_{12},\dots,t_{1n}$and$t_{21},t_{12},\dots,t_{2m}$where$s_{ij}$is the similarity between the GO terms$t_{1i}$and$t_{2j}$. We used a rcmax method to calculate similarity between$g1, g2$, as defined in (36).$\begin{aligned}TopoICSim(g_1,g_2)&=rcmax(S)\&=rcmax\left(\frac{\sum_{i=1}^n\underset{j}{maxs_{ij}}}{n},\frac{\sum_{i=1}^m\underset{i}{maxs_{ij}}}{m}\right)\end{aligned}$(36) We also tested other methods on the similarity matrix, in particular average and BMA, but in general$rcmax$gave the best performance for TopoICSim (data not shown). Besides all this, there's also an algorithm for the geneset level, you can calculate interset/intraset similarities of these with (13) and (14). Or simply use R's mean on the resulting matrix.$IntraSetSim(S_k)=\frac{\sum_{i=1}^n\sum_{j=1}^mSim(g_{ki},g_{kj})}{n^2}$(13)$InterSetSim(S_k)=\frac{\sum_{i=1}^n\sum_{j=1}^mSim(g_{ki},g_{kj})}{n\times m}$(14) (13) is equal to (14) in the specific case of comparing the geneset to itself. All forumulas/explanations are quoted from the paper (references edited) [6]. ### Examples The following example uses the pfam clan gene set to measure intraset similarity. For the single gene, EntrezID 218 and 501 are compared. result <- GAPGOM::topo_ic_sim_genes("human", "MF", "218", "501", progress_bar = FALSE) kable(result$AllGoPairs) %>% kable_styling() %>% scroll_box(width = "100%", height = "500px")
result$GeneSim # genelist mode list1 <- c("126133","221","218","216","8854","220","219","160428","224", "222","8659","501","64577","223","217","4329","10840","7915","5832") # ONLY A PART OF THE GENELIST IS USED BECAUSE OF R CHECK TIME CONTRAINTS result <- GAPGOM::topo_ic_sim_genes("human", "MF", list1[1:3], list1[1:3], progress_bar = FALSE) kable(result$AllGoPairs) %>% kable_styling() %>% scroll_box(width = "100%", height = "500px")
kable(result$GeneSim) %>% kable_styling() %>% scroll_box(width = "100%", height = "500px") mean(result$GeneSim)


Here we can see the output of TopoICsim, it is a list containg 2 items;

• AllGoPairs

The $n\times m$ matrix of GO terms, including their similarities. Some values might be NA because these were non-occuring pairs. You can add AllGoPairs to the parameters next time you run TopoICSim, to possibly speed up computations (they will be used as pre-calculated scores to fill in occurring pairs).

• GeneSim

The gene/geneset similarities depending on your input. Can be 1 number, or a matrix displaying all possible combinations. $\to$ The mean of the geneset matrix shows the intraset/interset similarity.

#### Custom genes

Besides this, you can also define custom genes for TopoICSim. This consists of an arbitrary amount of GO terms. The custom genes have to individually defined in a named list.

custom <- list(cus1=c("GO:0016787", "GO:0042802", "GO:0005524"))
result <- GAPGOM::topo_ic_sim_genes("human", "MF", "218", "501",
custom_genes1 = custom, drop = NULL, verbose = TRUE, progress_bar = FALSE)
result


Here we define a custom gene named "cus1" with the GO terms; "GO:0016787", "GO:0042802", "GO:0005524", it will be added to the first gene vector (218). If you want to only have a custom gene, you can define an empty vector with c() for the respective vector.

#### Other notes.

With TopoICSim there is a pre-calculated score matrix available, this can be turned on/off. The scores might become deprecated however, as soon as one of the org.DB packages gets an update. For this reason, we advise mostly to keep this option off. You can also pre-calculate some GO's yourself using the custom genes. all_go_pairs can then be used as a precalculated score matrix, only intersecting/present GO terms will be used.

## Parrallel processing and big data

As of right now, parallel processing is not supported. The other algorithms aren't parallelized yet, because of implementation difficulties regarding the amount and type of dependencies that the algorithms rely on. Possibly, in the future, parallel processing will be supported. It might however be possible to divide the work in TopoICSim (on gene pair level). It runs for each unique pair of genes, which you may be able to generate using the hidden GAPGOM:::.unique_combos function. The All_go_pairs object in the result can then be combined together with other results. No support is offered however in regards to trying to make this work. Tip: The all_go_pairs argument in topo_ic_sim_genes() doesn't automatically create a new, bigger matrix, it only uses the overlapping or present GO terms of the analysis based on the input genes.

## Performance and benchmarks

The performance of this package is well tested in the Benchmarks vignette, in which benchmarks are also prepared.

## Contact and support

For questions, contact, or support use the (Bioconductor) git repository or contact us via the Bioconductor forums.

## SessionInfo

sessionInfo()


## References

• [1] Ehsani R, Drablos F: Measures of co-expression for improved function prediction of long non-coding RNAs. BMC Bioinformatics 2018. Accepted.
• [2] Villmann T: Sobolev metrics for learning of functional data - mathematical and theoretical aspects. In: Machine Learning Reports. Edited by Villmann T, Schleif F-M, vol. 1. Leipzig, Germany: Medical Department, University of Leipzig; 2007: 1-13.
• [3] Lebanon G: Learning riemannian metrics. In: Proceedings of the Nineteenth conference on Uncertainty in Artificial Intelligence; Acapulco, Mexico. Morgan Kaufmann Publishers Inc. 2003: 362-369
• [4] Jian Q: LncRNA2Function: a comprehensive resource for functional investigation of human lncRNAs based on RNA-seq data. In: BMC Genomics 2015. DOI: 10.1186/1471-2164-16-S3-S2
• [5] Benabderrahmane S, Smail-Tabbone, Poch O, Napoli A, Devignes MD. IntelliGO: a new vector-based semantic similarity measure including annotation origin. In: BMC Bioinformatics. 2010;11:588.
• [6] Ehsani R, Drablos F: TopoICSim: a new semantic similarity measure based on gene ontology. In: BMC Bioinformatics 2016, 17(1):296. DOI: 10.1186/s12859-016-1160-0

## Try the GAPGOM package in your browser

Any scripts or data that you put into this service are public.

GAPGOM documentation built on Nov. 8, 2020, 8:08 p.m.