In pablof1988/goSorensen: Statistical inference based on the Sorensen-Dice dissimilarity and the Gene Ontology (GO)

BiocStyle::markdown()

cat("
<style>
/* Ocultar inicialmente el contenido colapsable */
.collapsible-content {
    display: none;
}

/* Cambiar el puntero del ratón en los encabezados colapsables */
.collapsible-header {
    cursor: pointer;
}

.collapsible-header:hover {
    text-decoration: underline;
}

/* Título de Nivel 1 */
h1 {
    font-size: 1.2em; 
    color: #0e5775; 
    font-weight: bold;
}

/* Título de Nivel 2 */
h2.collapsible-header {
    font-size: 1.1em; 
    color: #12769f; 
    margin-left: 20px; 
    font-weight: bold;
}

/* Título de Nivel 3 */
h3.collapsible-header {
    font-size: 1.0em; 
    color: #16a3dc; 
    margin-left: 40px; 
    font-weight: bold;
}

/* Mantén el contenido alineado con los encabezados */
.collapsible-content {
    margin-left: inherit;
}

#appendix-appendix {
    display: none;
}

/* Estilo para el contenedor de referencias */
#refs {
    list-style-type: none; 
    margin-left: 20px;
    padding-left: 20px; 
    text-align: justify;
}


/* Estilo para la viñeta */
#refs > div::before {
    content: '•'; /* Define la viñeta como un círculo */
    font-size: 1.2em; /* Tamaño de la viñeta */
    color: #000; /* Color de la viñeta */
    margin-right: 10px; /* Espacio entre la viñeta y el texto */
    flex-shrink: 0; /* Asegura que la viñeta no se reduce en líneas largas */
}

.sidenote {
    text-align: justify; 
    float: right; /* Posiciona el pie de página en el lado derecho */
    width: 28%; 
    max-width: 28%; /* Ajusta el ancho relativo al contenedor principal */
    padding-left: 20px; /* Añade margen interno dentro del pie de página */
    box-sizing: border-box; /* Asegura que padding no desborde el ancho del contenedor */
}

</style>
")

cat("
<script>
document.addEventListener('DOMContentLoaded', function () {
    const excludedSections = ['Abstract', 'Package', 'Author Information', 'Date'];

    const headers = document.querySelectorAll('h2, h3');
    headers.forEach(header => {
        const headerText = header.textContent.trim();

        // Excluir encabezados específicos
        if (excludedSections.some(section => headerText.includes(section))) {
            return;
        }

        // Encuentra todo el contenido hasta el próximo encabezado
        const content = [];
        let next = header.nextElementSibling;
        while (next && !/^H[1-6]$/.test(next.tagName)) {
            content.push(next);
            next = next.nextElementSibling;
        }

        // Crear contenedor colapsable
        const contentWrapper = document.createElement('div');
        contentWrapper.className = 'collapsible-content';
        content.forEach(node => contentWrapper.appendChild(node));

        // Agregar funcionalidad de colapsar/expandir
        header.classList.add('collapsible-header');
        header.addEventListener('click', () => {
            const isVisible = contentWrapper.style.display === 'block';
            contentWrapper.style.display = isVisible ? 'none' : 'block';
        });

        // Insertar el contenido después del encabezado
        header.parentNode.insertBefore(contentWrapper, next);
    });
});
</script>
")

knitr::opts_chunk$set(
  comment = ""
)

library(goSorensen)

PRELIMINARIES.

Theoretical Framework of Reference

\begin{equation} \begin{aligned} & \begin{matrix} \hspace{0.6cm} L_1 & L_2 & \hspace{0.1cm} L_3 & \hspace{0.1cm} \ldots \hspace{0.1cm} & L_{s-1} \end{matrix} \ \mathfrak{D} = \begin{matrix} L_2 \ L_3 \ L_4 \ \vdots \ L_s \end{matrix} & \begin{pmatrix} \mathfrak{d}{21} & & & & \ \mathfrak{d}{31} & \mathfrak{d}{32} & & & \ \mathfrak{d}{41} & \mathfrak{d}{42} & \mathfrak{d}{43} & & \ \vdots & \vdots & \vdots & & \ \mathfrak{d}{s1} & \mathfrak{d}{s2} & \mathfrak{d}{s3} & \ldots & \mathfrak{d}{s(s-1)} \ \end{pmatrix} \end{aligned} \label{it-diss} \end{equation}

Given $s$ gene lists, $L_1, L_2, \dots, L_s$, each element $\mathfrak{d}_{ij}$ of the matrix $\mathfrak{D}$ tries to quantify how different or distant are the lists $L_i$ and $L_j$ in the following terms: "A higher equivalence threshold than this would cause these lists to be declared equivalent". These dissimilarities are based on the irrelevance (equivalence) threshold that would make them statistically significant, in the sense of the equivalence tests introduced in @flores2022equivalence: "An equivalence test between features lists, based on the Sorensen - Dice index and the joint frequencies of GO term enrichment." This concept was introduced in an attempt to give this dissimilarity a more inferential meaning, not merely descriptive as could be the Sorensen--Dice dissimilarity used directly.

But analyzing simultaneously $s$ gene lists poses a multiple testing problem, so the algorithm to obtain the matrix should be expressed as:

Step 1. Establish $h = s(s-1) / 2$ equivalence hypothesis tests. Each test $ET_I$ (with $I = 1, \ldots, h$) compares the pair of lists $L_{i}, L_{j}$ (with $i, j = 1, 2, \ldots, s, i \neq j$). Equivalence is declared when the null hypothesis of non-equivalence is rejected.
Step 2. According to a criterion of type I error control in multiple testing, let $\mathfrak{d}h$ be the smallest irrelevance limit for which all $h$ null hypothesis are rejected. The default adjusting criterion in goSorensen is the Holm's method ("holm" option in the R function p.adjust). So, $\mathfrak{d}_h$ is computed as: $$\mathfrak{d}_h = \text{min} \left(\mathfrak{d}: P{(I)}(\mathfrak{d}) \leq \frac{\alpha}{h+1-I}; \hspace{0.5cm} I=1, 2, \ldots,h \right).$$
Step 3. Use $\mathfrak{d}h$ as the irrelevance limit dissimilarity between the lists $L_i, L_j$ corresponding to the last position in the vector of ordered p-values $P{(1)}(\mathfrak{d}{h}) \le \ldots \le P{(h)}(\mathfrak{d}{h})$, i.e. $\mathfrak{d}{ij} = \mathfrak{d}_{h}.$
Step 4. Set $h = h - 1$, excluding the comparison between the lists $L_i, L_j$ from Step $3$, and go back to Step $2$ until $h=0$.

Data Used in this Vignette.

The dataset used in this vignette, allOncoGeneLists, is derived from gene lists compiled at http://www.bushmanlab.org/links/genelists, which contains an extensive collection of gene lists associated with cancer. The goSorensen package loads this dataset using the command data(allOncoGeneLists):

data("allOncoGeneLists")

allOncoGeneLists is an object of class list containing 7 elements. Each element is a character vector that holds the ENTREZ identifiers for the respective gene lists:

# name and length of the gene lists:
sapply(allOncoGeneLists, length)

Obtaining the Irrelevance-threshold Matrix of Dissimilarities.

For a Specific Ontology and one GO level.

One method to compute the dissimilarity matrix $\mathfrak{D}$ for a given ontology and level (e.g., ontology BP, level 4) is to directly input the gene lists into the sorenThreshold function:

# Previously load the genomic annotation package for the studied specie:
library(org.Hs.eg.db)
humanEntrezIDs <- keys(org.Hs.eg.db, keytype = "ENTREZID")

# Computing the irrelevance-threshold matrix of dissimilarities
dissMatrx_BP4 <- sorenThreshold(allOncoGeneLists, 
                            geneUniverse = humanEntrezIDs, 
                            orgPackg = "org.Hs.eg.db",
                            onto = "BP", 
                            GOLevel = 4,
                            trace = FALSE)
dissMatrx_BP4

options(digits = 4)
data("dissMatrx_BP4")
data("cont_all_BP4")
dissMatrx_BP4

Alternatively, we can previously compute all the enrichment contingency tables for all 21 possible pairs of lists from allOncoGeneLists using the function buildEnrichTable. The result is an object of class tableList, from which we can obtain the dissimilarity matrix as follows:

# Previously compute the enrichment contingency tables:
cont_all_BP4 <- buildEnrichTable(allOncoGeneLists,
                                 geneUniverse = humanEntrezIDs, 
                                 orgPackg = "org.Hs.eg.db", 
                                 onto = "BP", 
                                 GOLevel = 4)

# Computing the irrelevance-threshold matrix of dissimilarities from the 
# enrichment contingency table "cont_all_BP4":
dissMatrx_BP4 <- sorenThreshold(cont_all_BP4, 
                                trace = FALSE)
dissMatrx_BP4

Both alternatives yield the same results; however, the latter approach (whenever possible, if all contingency tables are already available) is much faster, as it eliminates the internal time required to generate the contingency tables since they are provided as arguments to the function sorenThreshold.

The dissimilarity matrix dissMatrx_BP4 is obtained using equivalence tests based on the normal asymptotic distribution. If the user prefers the tests to be based on the bootstrap distribution, they should set the argument boot = TRUE (the default is FALSE), as shown below:

boot_dissMatrx_BP4 <- sorenThreshold(allOncoGeneLists, 
                            geneUniverse = humanEntrezIDs, 
                            orgPackg = "org.Hs.eg.db",
                            onto = "BP", 
                            GOLevel = 4, 
                            boot = TRUE, # use bootstrap distribution
                            trace = FALSE)
boot_dissMatrx_BP4

sorenThreshold(cont_all_BP4, boot = TRUE, trace = FALSE)

Or, it is faster if we already have the enrichment contingency tables:

boot_dissMatrx_BP4 <- sorenThreshold(cont_all_BP4, 
                                     boot = TRUE, 
                                     trace = FALSE)
boot_dissMatrx_BP4

An important and useful attribute of this dissimilarity matrix (whether calculated using the normal or bootstrap distribution) is attr(, "all2x2Tables"), which contains the enrichment contingency tables for all possible pairs of lists among the set of gene lists being compared. These contingency tables, in turn, include the matrix of GO terms enrichment in its attribute attr(, "enriched").

For more details about the matrix of GO terms enrichment, the enrichment contingency tables and the equivalence tests (including computation, outputs, attributes, etc.), refer to the vignette An Introduction to the goSorensen R Package.

NOTE: {.unnumbered}

To provide users with a quick visualization, the goSorensen package includes the objects cont_all_BP4 and dismatBP4, which can be accessed using data(cont_all_BP4) and data(dismatBP4).

Note that gene lists, GO terms, and Bioconductor may change over time. So, consider these objects only as illustrative examples, valid exclusively for the dataset allOncoGeneLists at a specific time. The current version of these results was generated with Bioconductor version 3.20. The same comment is applicable to other objects included in goSorensen for quick visualization, some of which are also described in this vignette.

For More than One Ontology and GO Level.

To calculate dissimilarity matrices for multiple ontologies and GO levels, we use the allSorenThreshold function. The first option is to provide the lists of genes to compare as the main input, as shown below:

# For example, for GO levels 3 to 10 and for the ontologies BP, CC and MF:
allDissMatrx <- allSorenThreshold(allOncoGeneLists,
                                  geneUniverse = humanEntrezIDs, 
                                  orgPackg = "org.Hs.eg.db", 
                                  ontos = c("BP", "CC", "MF"), 
                                  GOLevels = 3:10, 
                                  trace = FALSE)

Another option is previously computing all the enrichment contingency tables from allOncoGeneLists using the function allBuildEnrichTable. The result is an object of class allTableList, from which we can obtain the dissimilarity matrix as follows:

# Previously compute the enrichment contingency tables:
allContTabs <- allBuildEnrichTable(allOncoGeneLists,
                                   geneUniverse = humanEntrezIDs, 
                                   orgPackg = "org.Hs.eg.db", 
                                   ontos = c("BP", "CC", "MF"), 
                                   GOLevels = 3:10)

# Computing the irrelevance-threshold matrix of dissimilarities from the 
# enrichment contingency tables "allContTabs":
allDissMatrx <- allSorenThreshold(allContTabs,
                                  trace = FALSE)

Remember that, by default, these matrices are obtained from equivalence tests based on the normal distribution. If the user prefers to use the bootstrap distribution instead, they must set the argument boot = TRUE inside the allSorenThreshold function.

NOTE: {.unnumbered}

Given the large number of outcomes produced in allDissMatrx, displaying all of them is not feasible. However, for a quick visualization, the goSorensen package includes the object allDissMatrx, which can be accessed using data(allDissMatrx).

Representing a Dissimilarity Matrix in Statistical Graphs.

Dendrogram.

Below, we present a dendrogram of the dissimilarity matrix for the specific case of the allOncoGeneLists dataset, for the BP ontology and GO Level 4, which was computed in Section 2.1 of this vignette (object dissMatrx_BP4):

clust.threshold <- hclustThreshold(dissMatrx_BP4)
plot(clust.threshold)

Since the dissimilarities used to form the biplot are based on an inferential measure of significant equivalence, we can deduce that the closer two or more lists are (forming groups), the stronger the evidence suggesting that these lists are functionally similar, and the opposite for distant lists.

For example, in this particular dendrogram, two groups appear to be formed, as two pairs of lists are very close to each other. The first group consists of sanger and vogelstein, while the second group includes miscellaneous and waldman. So, it can then be inferred that the pairs of lists within each group provide stronger evidence of equivalence.

MDS-Biplot.

Multidimensional Scaling (MDS) transforms the initial dissimilarity matrix into Euclidean distances, preserving as much of the original variability as possible. In this case, we compute the first two dimensions, which capture the highest proportion of the original variability. Like in the dendrogram example, we use the dissimilarity matrix dissMatrx_BP4 computed in Section 2.1 of this vignette:

# multidimensional scaling analysis:
mds <- as.data.frame(cmdscale(dissMatrx_BP4, k = 2))

Below we plot the MDS-Biplot:

library(ggplot2)
library(ggrepel)
graph <- ggplot() +
 geom_point(aes(mds[,1], mds[,2]), color = "blue", size = 3) +
 geom_text_repel(aes(mds[,1], mds[,2], label = attr(dissMatrx_BP4, "Labels")),
                 color = "black", size = 3) +
 xlab("Dim 1") +
 ylab("Dim 2") +
 theme_light()
graph

The biplot suggests that the closest lists (clusters) are Waldman-Vogelstein and Sanger-Miscellaneous; therefore, there is stronger evidence for equivalence among these lists.

Characterizing the axes of the biplot on which these groups are formed is of great interest, as it can help determine the biological functions associated with the formation of these groups. Specifically, it allows us to identify the GO terms involved in the evidence supporting the biological similarity between the compared gene lists. In the following section, we propose a 5-step procedure to characterize the axes of the biplot and identify the GO terms associated with the formation of these groups. This is a very preliminary version of a work in progress.

Characterizing MDS-Biplot Dimensions.

In this vignette, we will apply the 5-step procedure to Dimension 1. However, the technique would be the same if we were to apply it to Dimension 2 or any other dimension.

Step 1. Split the biplot axis:

Split the axis (in this case, Dimension 1) in three parts to identify the lists located at the extremes. For example, we could allocate 20% to each extreme, leaving 60% in the middle, as follows:

# Split the axis 20% to the left, 60% to the middle and 20% to the right:
prop <- c(0.2, 0.6, 0.2) 

# Sort according  dimension 1:
sorted <- mds[order(mds[, 1]), ] 

# Determine the range for dimension 1.
range <- sorted[, 1][c(1, nrow(mds))] 

# Find the cut-points to split the axis:
cutpoints <- (cumsum(prop)[1:2] * diff(range)) + range[1]

# Identify lists to the left:  
lleft <- rownames(sorted[sorted[, 1] < cutpoints[1], ])
lleft

# Identify lists to the right
lright <- rownames(sorted[sorted[, 1] > cutpoints[2], ])
lright

Visualizing in the Biplot:

graph +
  geom_vline(xintercept = cutpoints, color = "red", 
             linetype = "dashed", linewidth = 0.75)

The basic idea is that if the enrichment of a GO term differs between these two groups of lists at the extremes, we can deduce that this GO term discriminates the groups, thus explaining their formation.

Step 2. Obtain the matrix of GO terms enrichment:

Obtain the matrix of GO terms enrichment for the lists located at the extremes detected in Step 1. Remember, this matrix is stored as an attribute of the enrichment contingency tables. Therefore:

options(max.print = 16)

# enrichment contingency tables:
contTablesBP4 <- attr(dissMatrx_BP4, "all2x2Tables")

# matrix of GO term enrichment for the lists located at the extreme left
tableleft <- attr(contTablesBP4, "enriched")[, lleft, drop = FALSE]
tableleft

options(max.print = 50)

# matrix of GO term enrichment for the lists located at the extreme right
tableright <- attr(contTablesBP4, "enriched")[, lright, drop = FALSE]
tableright

Actually, the contingency tables saved in the objects contTablesBP4 and cont_all_BP4 (from Section 2.1) are exactly the same.

Step 3. Compute means and variances:

Compute the means and variances for each of the GO terms detected in each extreme group. In other words, calculate the means and variances by rows of the GO term enrichment matrix for the lists at the extremes:

# function to compute mean and standard deviation:
mean_sd <- function(x){
  c("mean" = mean(x), "sd" = ifelse(length(x) == 1, 0, sd(x)))
}

# mean and sd of the lists to the extreme left:
lmnsd <- apply(tableleft, 1, mean_sd)

# mean and sd of the lists to the extreme right:
rmnsd <- apply(tableright, 1, mean_sd)

Step 4. Establish a statistic:

Establish a statistic to assess the degree of enrichment disparity between extreme groups for each GO term. This measure should consider that a GO term discriminates a dimension if:

i) the lists show opposing patterns of enrichment in the GO term, indicated by a large difference between $| L\overline{X}{GO_j} - R \overline{X}{GO_j}|$ ($L$ represents Left and $R$ Rigth), and ii) the behavior of the lists remains stable at each extreme, indicated by a small amount of $LS^2{GO_j} + RS^2{GO_j}$.

Considering these specific details, the following statistic, is suggested:

$$st_j = \dfrac{| L\overline{X}{GO_j} - R \overline{X}{GO_j}|}{\sqrt{\dfrac{LS^2{GO_j}}{l} + \dfrac{RS^2{GO_j}}{r} + \epsilon }}$$ with $l$ the number of lists to the extreme left and $r$ the number of lists to the extreme right. Using $\epsilon= 0.00000001$ the statistic is computed as follows:

nl <- ncol(tableleft) 
nr <- ncol(tableright) 
t_stat <- abs(lmnsd[1, ] - rmnsd[1, ]) / 
  sqrt((((lmnsd[2, ] / nl) + (rmnsd[2, ] / nr))) + 0.00000001)

Step 5. Select the GO terms with the highest statistic:

GO terms with the highest statistic are those with a value closest to $\frac{1}{\sqrt{\epsilon}}$. These GO terms are the main contributors to the formation of clusters in the analyzed dimension. Therefore, they are strongly related to the equivalence between the compared gene lists. They are detected as follows:

result <- t_stat[t_stat == max(t_stat)]
result

In addition, we can identify the biological functions associated with the detected GO terms:

library(GO.db)
library(DT)
# Previous function to get the description of the identified GO terms:
get_go_description <- function(go_id) {
  go_term <- Term(GOTERM[[go_id]])
  return(go_term)
}

# GO terms description:
datatable(data.frame(GO_term = names(result),
                     Description = sapply(names(result), get_go_description, 
                                           USE.NAMES = TRUE)), 
          filter = "top", rownames = FALSE)

References {.unnumbered}

pablof1988/goSorensen documentation built on April 17, 2025, 10:44 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

pablof1988/goSorensen
Statistical inference based on the Sorensen-Dice dissimilarity and the Gene Ontology (GO)

In pablof1988/goSorensen: Statistical inference based on the Sorensen-Dice dissimilarity and the Gene Ontology (GO)

PRELIMINARIES.

Theoretical Framework of Reference

Data Used in this Vignette.

Obtaining the Irrelevance-threshold Matrix of Dissimilarities.

For a Specific Ontology and one GO level.

NOTE: {.unnumbered}

For More than One Ontology and GO Level.

NOTE: {.unnumbered}

Representing a Dissimilarity Matrix in Statistical Graphs.

Dendrogram.

MDS-Biplot.

Characterizing MDS-Biplot Dimensions.

Step 1. Split the biplot axis:

Step 2. Obtain the matrix of GO terms enrichment:

Step 3. Compute means and variances:

Step 4. Establish a statistic:

Step 5. Select the GO terms with the highest statistic:

References {.unnumbered}

R Package Documentation

Browse R Packages

We want your feedback!

pablof1988/goSorensen Statistical inference based on the Sorensen-Dice dissimilarity and the Gene Ontology (GO)

In pablof1988/goSorensen: Statistical inference based on the Sorensen-Dice dissimilarity and the Gene Ontology (GO)

PRELIMINARIES.

Theoretical Framework of Reference

Data Used in this Vignette.

Obtaining the Irrelevance-threshold Matrix of Dissimilarities.

For a Specific Ontology and one GO level.

NOTE: {.unnumbered}

For More than One Ontology and GO Level.

NOTE: {.unnumbered}

Representing a Dissimilarity Matrix in Statistical Graphs.

Dendrogram.

MDS-Biplot.

Characterizing MDS-Biplot Dimensions.

Step 1. Split the biplot axis:

Step 2. Obtain the matrix of GO terms enrichment:

Step 3. Compute means and variances:

Step 4. Establish a statistic:

Step 5. Select the GO terms with the highest statistic:

References {.unnumbered}

R Package Documentation

Browse R Packages

We want your feedback!

pablof1988/goSorensen
Statistical inference based on the Sorensen-Dice dissimilarity and the Gene Ontology (GO)