BiocStyle::markdown()
cat(" <style> /* Ocultar inicialmente el contenido colapsable */ .collapsible-content { display: none; } /* Cambiar el puntero del ratón en los encabezados colapsables */ .collapsible-header { cursor: pointer; } .collapsible-header:hover { text-decoration: underline; } /* Título de Nivel 1 */ h1 { font-size: 1.2em; color: #0e5775; font-weight: bold; } /* Título de Nivel 2 */ h2.collapsible-header { font-size: 1.1em; color: #12769f; margin-left: 20px; font-weight: bold; } /* Título de Nivel 3 */ h3.collapsible-header { font-size: 1.0em; color: #16a3dc; margin-left: 40px; font-weight: bold; } /* Mantén el contenido alineado con los encabezados */ .collapsible-content { margin-left: inherit; } #appendix-appendix { display: none; } /* Estilo para el contenedor de referencias */ #refs { list-style-type: none; margin-left: 20px; padding-left: 20px; text-align: justify; } /* Estilo para la viñeta */ #refs > div::before { content: '•'; /* Define la viñeta como un círculo */ font-size: 1.2em; /* Tamaño de la viñeta */ color: #000; /* Color de la viñeta */ margin-right: 10px; /* Espacio entre la viñeta y el texto */ flex-shrink: 0; /* Asegura que la viñeta no se reduce en líneas largas */ } .sidenote { text-align: justify; float: right; /* Posiciona el pie de página en el lado derecho */ width: 28%; max-width: 28%; /* Ajusta el ancho relativo al contenedor principal */ padding-left: 20px; /* Añade margen interno dentro del pie de página */ box-sizing: border-box; /* Asegura que padding no desborde el ancho del contenedor */ } </style> ")
cat(" <script> document.addEventListener('DOMContentLoaded', function () { const excludedSections = ['Abstract', 'Package', 'Author Information', 'Date']; const headers = document.querySelectorAll('h2, h3'); headers.forEach(header => { const headerText = header.textContent.trim(); // Excluir encabezados específicos if (excludedSections.some(section => headerText.includes(section))) { return; } // Encuentra todo el contenido hasta el próximo encabezado const content = []; let next = header.nextElementSibling; while (next && !/^H[1-6]$/.test(next.tagName)) { content.push(next); next = next.nextElementSibling; } // Crear contenedor colapsable const contentWrapper = document.createElement('div'); contentWrapper.className = 'collapsible-content'; content.forEach(node => contentWrapper.appendChild(node)); // Agregar funcionalidad de colapsar/expandir header.classList.add('collapsible-header'); header.addEventListener('click', () => { const isVisible = contentWrapper.style.display === 'block'; contentWrapper.style.display = isVisible ? 'none' : 'block'; }); // Insertar el contenido después del encabezado header.parentNode.insertBefore(contentWrapper, next); }); }); </script> ")
knitr::opts_chunk$set( comment = "" )
library(goSorensen)
\begin{equation} \begin{aligned} & \begin{matrix} \hspace{0.6cm} L_1 & L_2 & \hspace{0.1cm} L_3 & \hspace{0.1cm} \ldots \hspace{0.1cm} & L_{s-1} \end{matrix} \ \mathfrak{D} = \begin{matrix} L_2 \ L_3 \ L_4 \ \vdots \ L_s \end{matrix} & \begin{pmatrix} \mathfrak{d}{21} & & & & \ \mathfrak{d}{31} & \mathfrak{d}{32} & & & \ \mathfrak{d}{41} & \mathfrak{d}{42} & \mathfrak{d}{43} & & \ \vdots & \vdots & \vdots & & \ \mathfrak{d}{s1} & \mathfrak{d}{s2} & \mathfrak{d}{s3} & \ldots & \mathfrak{d}{s(s-1)} \ \end{pmatrix} \end{aligned} \label{it-diss} \end{equation}
Given $s$ gene lists, $L_1, L_2, \dots, L_s$, each element $\mathfrak{d}_{ij}$ of the matrix $\mathfrak{D}$ tries to quantify how different or distant are the lists $L_i$ and $L_j$ in the following terms: "A higher equivalence threshold than this would cause these lists to be declared equivalent". These dissimilarities are based on the irrelevance (equivalence) threshold that would make them statistically significant, in the sense of the equivalence tests introduced in @flores2022equivalence: "An equivalence test between features lists, based on the Sorensen - Dice index and the joint frequencies of GO term enrichment." This concept was introduced in an attempt to give this dissimilarity a more inferential meaning, not merely descriptive as could be the Sorensen--Dice dissimilarity used directly.
But analyzing simultaneously $s$ gene lists poses a multiple testing problem, so the algorithm to obtain the matrix should be expressed as:
Step 1. Establish $h = s(s-1) / 2$ equivalence hypothesis tests. Each test $ET_I$ (with $I = 1, \ldots, h$) compares the pair of lists $L_{i}, L_{j}$ (with $i, j = 1, 2, \ldots, s, i \neq j$). Equivalence is declared when the null hypothesis of non-equivalence is rejected.
Step 2. According to a criterion of type I error control in multiple testing, let $\mathfrak{d}h$ be the smallest irrelevance limit for which all $h$ null hypothesis are rejected. The default adjusting criterion in goSorensen
is the Holm's method ("holm" option in the R function p.adjust
). So, $\mathfrak{d}_h$ is computed as:
$$\mathfrak{d}_h = \text{min} \left(\mathfrak{d}: P{(I)}(\mathfrak{d}) \leq \frac{\alpha}{h+1-I}; \hspace{0.5cm} I=1, 2, \ldots,h \right).$$
Step 3. Use $\mathfrak{d}h$ as the irrelevance limit dissimilarity between the lists $L_i, L_j$ corresponding to the last position in the vector of ordered p-values $P{(1)}(\mathfrak{d}{h}) \le \ldots \le P{(h)}(\mathfrak{d}{h})$, i.e. $\mathfrak{d}{ij} = \mathfrak{d}_{h}.$
Step 4. Set $h = h - 1$, excluding the comparison between the lists $L_i, L_j$ from Step $3$, and go back to Step $2$ until $h=0$.
The dataset used in this vignette, allOncoGeneLists
, is derived from gene lists compiled at http://www.bushmanlab.org/links/genelists, which contains an extensive collection of gene lists associated with cancer. The goSorensen
package loads this dataset using the command data(allOncoGeneLists)
:
data("allOncoGeneLists")
allOncoGeneLists
is an object of class list containing 7 elements. Each element is a character vector that holds the ENTREZ identifiers for the respective gene lists:
# name and length of the gene lists: sapply(allOncoGeneLists, length)
One method to compute the dissimilarity matrix $\mathfrak{D}$ for a given ontology and level (e.g., ontology BP, level 4) is to directly input the gene lists into the sorenThreshold
function:
# Previously load the genomic annotation package for the studied specie: library(org.Hs.eg.db) humanEntrezIDs <- keys(org.Hs.eg.db, keytype = "ENTREZID") # Computing the irrelevance-threshold matrix of dissimilarities dissMatrx_BP4 <- sorenThreshold(allOncoGeneLists, geneUniverse = humanEntrezIDs, orgPackg = "org.Hs.eg.db", onto = "BP", GOLevel = 4, trace = FALSE) dissMatrx_BP4
options(digits = 4) data("dissMatrx_BP4") data("cont_all_BP4") dissMatrx_BP4
Alternatively, we can previously compute all the enrichment contingency tables for all 21 possible pairs of lists from allOncoGeneLists
using the function buildEnrichTable.
The result is an object of class tableList
, from which we can obtain the dissimilarity matrix as follows:
# Previously compute the enrichment contingency tables: cont_all_BP4 <- buildEnrichTable(allOncoGeneLists, geneUniverse = humanEntrezIDs, orgPackg = "org.Hs.eg.db", onto = "BP", GOLevel = 4)
# Computing the irrelevance-threshold matrix of dissimilarities from the # enrichment contingency table "cont_all_BP4": dissMatrx_BP4 <- sorenThreshold(cont_all_BP4, trace = FALSE) dissMatrx_BP4
Both alternatives yield the same results; however, the latter approach (whenever possible, if all contingency tables are already available) is much faster, as it eliminates the internal time required to generate the contingency tables since they are provided as arguments to the function sorenThreshold
.
The dissimilarity matrix dissMatrx_BP4
is obtained using equivalence tests based on the normal asymptotic distribution. If the user prefers the tests to be based on the bootstrap distribution, they should set the argument boot = TRUE
(the default is FALSE
), as shown below:
boot_dissMatrx_BP4 <- sorenThreshold(allOncoGeneLists, geneUniverse = humanEntrezIDs, orgPackg = "org.Hs.eg.db", onto = "BP", GOLevel = 4, boot = TRUE, # use bootstrap distribution trace = FALSE) boot_dissMatrx_BP4
sorenThreshold(cont_all_BP4, boot = T, trace = FALSE)
Or, it is faster if we already have the enrichment contingency tables:
boot_dissMatrx_BP4 <- sorenThreshold(cont_all_BP4, boot = TRUE, trace = FALSE) boot_dissMatrx_BP4
An important and useful attribute of this dissimilarity matrix (whether calculated using the normal or bootstrap distribution) is attr(, "all2x2Tables")
, which contains the enrichment contingency tables for all possible pairs of lists among the set of gene lists being compared. These contingency tables, in turn, include the matrix of GO terms enrichment in its attribute attr(, "enriched")
.
For more details about the matrix of GO terms enrichment, the enrichment contingency tables and the equivalence tests (including computation, outputs, attributes, etc.), refer to the vignette An Introduction to the goSorensen R Package.
To provide users with a quick visualization, the goSorensen
package includes the objects cont_all_BP4
and dismatBP4
, which can be accessed using data(cont_all_BP4)
and data(dismatBP4)
.
Note that gene lists, GO terms, and Bioconductor may change over time. So, consider these objects only as illustrative examples, valid exclusively for the dataset allOncoGeneLists
at a specific time. The current version of these results was generated with Bioconductor version 3.20. The same comment is applicable to other objects included in goSorensen
for quick visualization, some of which are also described in this vignette.
To calculate dissimilarity matrices for multiple ontologies and GO levels, we use the allSorenThreshold
function. The first option is to provide the lists of genes to compare as the main input, as shown below:
# For example, for GO levels 3 to 10 and for the ontologies BP, CC and MF: allDissMatrx <- allSorenThreshold(allOncoGeneLists, geneUniverse = humanEntrezIDs, orgPackg = "org.Hs.eg.db", ontos = c("BP", "CC", "MF"), GOLevels = 3:10, trace = FALSE)
Another option is previously computing all the enrichment contingency tables from allOncoGeneLists
using the function allBuildEnrichTable.
The result is an object of class allTableList
, from which we can obtain the dissimilarity matrix as follows:
# Previously compute the enrichment contingency tables: allContTabs <- allBuildEnrichTable(allOncoGeneLists, geneUniverse = humanEntrezIDs, orgPackg = "org.Hs.eg.db", ontos = c("BP", "CC", "MF"), GOLevels = 3:10) # Computing the irrelevance-threshold matrix of dissimilarities from the # enrichment contingency tables "allContTabs": allDissMatrx <- allSorenThreshold(allContTabs, trace = FALSE)
Remember that, by default, these matrices are obtained from equivalence tests based on the normal distribution. If the user prefers to use the bootstrap distribution instead, they must set the argument boot = TRUE
inside the allSorenThreshold
function.
Given the large number of outcomes produced in allDissMatrx
, displaying all of them is not feasible. However, for a quick visualization, the goSorensen
package includes the object allDissMatrx
, which can be accessed using data(allDissMatrx)
.
Below, we present a dendrogram of the dissimilarity matrix for the specific case of the allOncoGeneLists
dataset, for the BP ontology and GO Level 4, which was computed in Section 2.1 of this vignette (object dissMatrx_BP4
):
clust.threshold <- hclustThreshold(dissMatrx_BP4) plot(clust.threshold)
Since the dissimilarities used to form the biplot are based on an inferential measure of significant equivalence, we can deduce that the closer two or more lists are (forming groups), the stronger the evidence suggesting that these lists are functionally similar, and the opposite for distant lists.
For example, in this particular dendrogram, two groups appear to be formed, as two pairs of lists are very close to each other. The first group consists of sanger and vogelstein, while the second group includes miscellaneous and waldman. So, it can then be inferred that the pairs of lists within each group provide stronger evidence of equivalence.
Multidimensional Scaling (MDS) transforms the initial dissimilarity matrix into Euclidean distances, preserving as much of the original variability as possible. In this case, we compute the first two dimensions, which capture the highest proportion of the original variability. Like in the dendrogram example, we use the dissimilarity matrix dissMatrx_BP4
computed in Section 2.1 of this vignette:
# multidimensional scaling analysis: mds <- as.data.frame(cmdscale(dissMatrx_BP4, k = 2))
Below we plot the MDS-Biplot:
library(ggplot2) library(ggrepel) graph <- ggplot() + geom_point(aes(mds[,1], mds[,2]), color = "blue", size = 3) + geom_text_repel(aes(mds[,1], mds[,2], label = attr(dissMatrx_BP4, "Labels")), color = "black", size = 3) + xlab("Dim 1") + ylab("Dim 2") + theme_light() graph
The biplot suggests that the closest lists (clusters) are Waldman-Vogelstein and Sanger-Miscellaneous; therefore, there is stronger evidence for equivalence among these lists.
Characterizing the axes of the biplot on which these groups are formed is of great interest, as it can help determine the biological functions associated with the formation of these groups. Specifically, it allows us to identify the GO terms involved in the evidence supporting the biological similarity between the compared gene lists. In the following section, we propose a 5-step procedure to characterize the axes of the biplot and identify the GO terms associated with the formation of these groups. This is a very preliminary version of a work in progress.
In this vignette, we will apply the 5-step procedure to Dimension 1. However, the technique would be the same if we were to apply it to Dimension 2 or any other dimension.
Split the axis (in this case, Dimension 1) in three parts to identify the lists located at the extremes. For example, we could allocate 20% to each extreme, leaving 60% in the middle, as follows:
# Split the axis 20% to the left, 60% to the middle and 20% to the right: prop <- c(0.2, 0.6, 0.2) # Sort according dimension 1: sorted <- mds[order(mds[, 1]), ] # Determine the range for dimension 1. range <- sorted[, 1][c(1, nrow(mds))] # Find the cut-points to split the axis: cutpoints <- (cumsum(prop)[1:2] * diff(range)) + range[1] # Identify lists to the left: lleft <- rownames(sorted[sorted[, 1] < cutpoints[1], ]) lleft # Identify lists to the right lright <- rownames(sorted[sorted[, 1] > cutpoints[2], ]) lright
Visualizing in the Biplot:
graph + geom_vline(xintercept = cutpoints, color = "red", linetype = "dashed", linewidth = 0.75)
The basic idea is that if the enrichment of a GO term differs between these two groups of lists at the extremes, we can deduce that this GO term discriminates the groups, thus explaining their formation.
Obtain the matrix of GO terms enrichment for the lists located at the extremes detected in Step 1. Remember, this matrix is stored as an attribute of the enrichment contingency tables. Therefore:
options(max.print = 16)
# enrichment contingency tables: contTablesBP4 <- attr(dissMatrx_BP4, "all2x2Tables") # matrix of GO term enrichment for the lists located at the extreme left tableleft <- attr(contTablesBP4, "enriched")[, lleft, drop = FALSE] tableleft
options(max.print = 50)
# matrix of GO term enrichment for the lists located at the extreme right tableright <- attr(contTablesBP4, "enriched")[, lright, drop = FALSE] tableright
Actually, the contingency tables saved in the objects contTablesBP4
and cont_all_BP4
(from Section 2.1) are exactly the same.
Compute the means and variances for each of the GO terms detected in each extreme group. In other words, calculate the means and variances by rows of the GO term enrichment matrix for the lists at the extremes:
# function to compute mean and standard deviation: mean_sd <- function(x){ c("mean" = mean(x), "sd" = ifelse(length(x) == 1, 0, sd(x))) } # mean and sd of the lists to the extreme left: lmnsd <- apply(tableleft, 1, mean_sd) # mean and sd of the lists to the extreme right: rmnsd <- apply(tableright, 1, mean_sd)
Establish a statistic to assess the degree of enrichment disparity between extreme groups for each GO term. This measure should consider that a GO term discriminates a dimension if:
i) the lists show opposing patterns of enrichment in the GO term, indicated by a large difference between $| L\overline{X}{GO_j} - R \overline{X}{GO_j}|$ ($L$ represents Left and $R$ Rigth), and ii) the behavior of the lists remains stable at each extreme, indicated by a small amount of $LS^2{GO_j} + RS^2{GO_j}$.
Considering these specific details, the following statistic, is suggested:
$$st_j = \dfrac{| L\overline{X}{GO_j} - R \overline{X}{GO_j}|}{\sqrt{\dfrac{LS^2{GO_j}}{l} + \dfrac{RS^2{GO_j}}{r} + \epsilon }}$$ with $l$ the number of lists to the extreme left and $r$ the number of lists to the extreme right. Using $\epsilon= 0.00000001$ the statistic is computed as follows:
nl <- ncol(tableleft) nr <- ncol(tableright) t_stat <- abs(lmnsd[1, ] - rmnsd[1, ]) / sqrt((((lmnsd[2, ] / nl) + (rmnsd[2, ] / nr))) + 0.00000001)
GO terms with the highest statistic are those with a value closest to $\frac{1}{\sqrt{\epsilon}}$. These GO terms are the main contributors to the formation of clusters in the analyzed dimension. Therefore, they are strongly related to the equivalence between the compared gene lists. They are detected as follows:
result <- t_stat[t_stat == max(t_stat)] result
In addition, we can identify the biological functions associated with the detected GO terms:
library(GO.db) library(DT) # Previous function to get the description of the identified GO terms: get_go_description <- function(go_id) { go_term <- Term(GOTERM[[go_id]]) return(go_term) } # GO terms description: datatable(data.frame(GO_term = names(result), Description = sapply(names(result), get_go_description, USE.NAMES = TRUE)), filter = "top", rownames = FALSE)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.