In priesgo/TCGAome: MultiOmics (just RNAseq and RPPA...) Data Analyzer on TCGA data

knitr::opts_chunk$set(echo = TRUE, collapse = TRUE, comment = "#>")
devtools::load_all()
#output: rmarkdown::html_vignette
#vignette: >
#  %\VignetteIndexEntry{Enrichment Analysis & Dimensionality Reduction}
#  %\VignetteEngine{knitr::rmarkdown}
#  %\VignetteEncoding{UTF-8}

Using Object Oriented (OO) programming

Using OO allows achieving high cohesion and low coupling, which facilitates all the software development cycle from development to maintenance, including testing. Following the principle "Don't repeat yourself" (https://en.wikipedia.org/wiki/Don't_repeat_yourself) is also enhanced with a good OO design. There are 3 OO models in R, we will be using S4 as it is the model most used in Bioconductor. Furthermore, we want to foster code reuse from existing Bioconductor S4 classes.

Enrichment class model

Annotations

This model allows us to support multiple annotations for enrichment. So far the supported annotations based on AnnotationDbi data source and some custom resources are:

Gene Ontology: biological process, molecular function and cell component (from org.Hs.eg.db)
KEGG pathways (from org.Hs.eg.db)
OMIM diseases (from org.Hs.eg.db)
HPO phenotypes (from http://compbio.charite.de/jenkins/job/hpo.annotations.monthly/lastStableBuild/artifact/annotation/ALL_SOURCES_TYPICAL_FEATURES_phenotype_to_genes.txt)

These annotations can be loaded as follows:

goa_bp <- TCGAome::load_goa(ontology = "BP")
goa_mf <- TCGAome::load_goa(ontology = "MF")
goa_cc <- TCGAome::load_goa(ontology = "CC")
kegg <- TCGAome::load_kegg()
omim <- TCGAome::load_omim()
hpo <- TCGAome::load_hpo()

The S4 object created contains the following slots:

str(goa_bp, list.len = 5, vec.len = 5)

Distribution of annotations

We can visually explore the distribution of annotations and explore the differences between genes and terms for the different resources.

plot(goa_bp)

or alternatively:

plot_scatter(goa_bp)
plot_violin(goa_bp)

A summary on the annotations can be obtained with the print method:

print(goa_bp)

Comparing the different annotations we can observe that the three GO ontologies are the richest ... To be continued...

goa_bp_violin <- plot_violin(goa_bp) + ggplot2::guides(colour=FALSE)
goa_mf_violin <- plot_violin(goa_mf) + ggplot2::guides(colour=FALSE)
goa_cc_violin <- plot_violin(goa_cc) + ggplot2::guides(colour=FALSE)
kegg_violin <- plot_violin(kegg) + ggplot2::guides(colour=FALSE)
omim_violin <- plot_violin(omim) + ggplot2::guides(colour=FALSE)
hpo_violin <- plot_violin(hpo) + ggplot2::guides(colour=FALSE)

cowplot::plot_grid(goa_bp_violin, goa_mf_violin, goa_cc_violin, 
                   kegg_violin, omim_violin, hpo_violin,
                   nrow = 2)

goa_bp_scatter <- plot_scatter(goa_bp) + ggplot2::guides(colour=FALSE, shape=FALSE)
goa_mf_scatter <- plot_scatter(goa_mf) + ggplot2::guides(colour=FALSE, shape=FALSE)
goa_cc_scatter <- plot_scatter(goa_cc) + ggplot2::guides(colour=FALSE, shape=FALSE)
kegg_scatter <- plot_scatter(kegg) + ggplot2::guides(colour=FALSE, shape=FALSE)
omim_scatter <- plot_scatter(omim) + ggplot2::guides(colour=FALSE, shape=FALSE)
hpo_scatter <- plot_scatter(hpo) + ggplot2::guides(colour=FALSE, shape=FALSE)

cowplot::plot_grid(goa_bp_scatter, goa_mf_scatter, goa_cc_scatter, 
                   kegg_scatter, omim_scatter, hpo_scatter,
                   nrow = 2)

print(goa_mf)
print(goa_cc)
print(kegg)
print(omim)
print(hpo)

Metrics

For any term within the TCGAome::GeneAnnotations object we can obtain its relative frequency of annotation, which might be useful to compute the Information Content within a term.

random_term = goa_bp@term2gene$Term[runif(1, max = length(goa_bp@term2gene$Term))]
random_term
TCGAome::get_term_freq(goa_bp, random_term)

For any two terms within the GeneAnnotations object we can obtain its functional similarity based on the binary distances implemented in TCGAome.

random_term1 = goa_bp@term2gene$Term[runif(1, max = length(goa_bp@term2gene$Term))]
random_term2 = goa_bp@term2gene$Term[runif(1, max = length(goa_bp@term2gene$Term))]
random_term1
random_term2
TCGAome::get_functional_similarity(goa_bp, random_term1, random_term2, distance_measure = "cosine")

The supported binary distances are: UI, cosine, Bray-Curtis and binary.

This model should be extended to support semantic similarity measures based on the ontology structure. This is a special case for some of these annotation resources which are also backed by an ontology, like GO and HPO. On the previous implementation of TCGAome we used GoSemSim for the semantic similarity within GO terms. We need to study if this can be easily extended to other ontologies.

Extend annotation support

It is relatively simple to extend the support to additional annotations, we just need to provide the association between genes and terms in a tall-skinny data frame with columns "Gene" and "Term".

uniKeys <- AnnotationDbi::keys(org.Hs.eg.db::org.Hs.eg.db, keytype="SYMBOL")
cols <- c("PATH")
kegg_raw <- AnnotationDbi::select(org.Hs.eg.db::org.Hs.eg.db, keys=uniKeys, columns=cols, keytype="SYMBOL")
kegg_raw <- kegg_raw[, c(1, 2)]
colnames(kegg_raw) <- c("Gene", "Term")
kegg <- TCGAome::GeneAnnotations(raw_annotations = kegg_raw, name="KEGG-Human")

Enrichment

The previous TCGAome version used the package topGO for computing the enrichment of GO terms. This package is limited to GO. The computation employed was a Fisher's test, we were not making use of the advanced functionality in topGO. Thus, in order to gain flexibility the enrichment computation was reimplemented inside the class TCGAome::GeneListEnrichment.

We can compute enrichment for a given list of genes based on a preloaded annotation:

gene_list <- c("ZNF638", "HNRNPU", "PPIAL4G", "RAPH1", "USP7", "SUMO1P3",
          "TMEM189.UBE2V1", "ZNF837", "LPCAT4", "ZFPL1", "STAT3", "XRCC1",
          "STMN1", "PGR", "RB1", "KDR", "YBX1", "YAP1", "FOXO3", "SYK", "RAB17",
          "TTC8", "SLC22A5", "C3orf18", "ANKRA2", "LBR", "B3GNT5", "ANP32E",
          "JOSD1", "ZNF695", "ESR1", "INPP4B", "PDK1", "TSC2", "AR", "HSPA1A",
          "CDH3", "SMAD4", "CASP7", "GMPS", "NDC80", "EZH2", "MELK", "CDC45",
          "CRY2", "KLHDC1", "MEIS3P1", "FBXL5", "EHD2", "CCNB1", "GSK3A",
          "DVL3", "NFKB1", "COL6A1", "CCND1", "BAK1")
goa_bp_enrichment <- TCGAome::get_enrichment(goa_bp, gene_list = gene_list)

Or alternatively:

goa_bp_enrichment <- TCGAome::GeneListEnrichment(gene_annotations = goa_bp, gene_list = gene_list)

Explore results:

print(goa_bp_enrichment)

And extract significant results:

head(TCGAome::get_significant_results(goa_bp_enrichment, significance_thr = 0.05, adj_method = "none"))

TODO: visualize enrichment results

TODO: add a print function to enrichment object

Visualizing enrichment results (clustering + MDS)

To visualize enriched terms we use a dimensionality reduction approach based on clustering our results, select then a representative member of each cluster and plot them into 2D by using multidimensional scaling.

Tu run this pipeline we need to create an object of type "TermsClustering" as follows:

goa_bp_term_clustering <- TCGAome::TermsClustering(goa_bp, goa_bp_enrichment, distance_measure = "cosine", significance_thr = 0.05, adj_method = "none", max_clusters = 10)
print(goa_bp_term_clustering)

Or alternatively:

goa_bp_term_clustering <- TCGAome::get_terms_clustering(goa_bp_enrichment, goa_bp, distance_measure = "cosine", significance_thr = 0.05, adj_method = "none", max_clusters = 10)

The clustering is performed based on a similarity metric between terms. The supported similarity metrics are based on the genes annotated to each term. As genes are functional elements we called it functional similarity. We support so far several metrics based on binary vector distances, every term is associated to a binary vector with the length of the total number of genes that our term annotations contain. Each position in the vector contains 1 or 0 if the given term is annotated with that gene or not. So far, we support a number of binary distances: Union-Intersection (UI), cosine, Bray-Curtis and binary.

To decide the optimal number of clusters we perform a silhouette analysis. We use by default a maximum of clusters of 10 as otherwise results become quickly complex, this can be changed using parameter max_clusters.

Evaluate the optimal number of clusters with silhouette analysis:

TCGAome::plot_silhouette_analysis(goa_bp_term_clustering)

We perform a PAM clustering with the obtained optimal number of clusters and select a term representative for each of the clusters. The criteria to select the term representative is as follows: for each cluster if there is any term with a frequency of annotation below a given threshold select the term with the lowest enrichment p-value; otherwise select the term with the lowest p-value across all terms independently of their frequency of annotation.

Also based on the similarity metrics between terms we can perform a Multidimensional Scaling to plot our clusters in a reduced number of dimensions. We have two alternatives: perform the MDS on every significant term that was clustered or perform it just on the subset of cluster representatives. To evaluate the MDS we need to inspect the explained variance by the first 3 components, which are the ones we will be visualizing.

TCGAome::plot_explained_variance(goa_bp_term_clustering)

Finally plot the results of the clustering and MDS.

TCGAome::plot_mds(goa_bp_term_clustering, all = TRUE)

As this scatter plot might quickly get very noisy you may plot only the representative terms for each cluster in the MDS performed over the subset of cluster representatives.

TCGAome::plot_mds(goa_bp_term_clustering, all = FALSE)

We can do the same for other annotation resources.

goa_mf_enrichment <- TCGAome::get_enrichment(goa_mf, gene_list = gene_list)
goa_mf_term_clustering <- TCGAome::TermsClustering(goa_mf, goa_mf_enrichment, distance_measure = "cosine", significance_thr = 0.05, adj_method = "none", max_clusters = 10)
TCGAome::plot_mds(goa_mf_term_clustering, all = TRUE)

goa_cc_enrichment <- TCGAome::get_enrichment(goa_cc, gene_list = gene_list)
goa_cc_term_clustering <- TCGAome::TermsClustering(goa_cc, goa_cc_enrichment, distance_measure = "cosine", significance_thr = 0.05, adj_method = "none", max_clusters = 10)
TCGAome::plot_mds(goa_cc_term_clustering, all = TRUE)

kegg_enrichment <- TCGAome::get_enrichment(kegg, gene_list = gene_list)
kegg_term_clustering <- TCGAome::TermsClustering(kegg, kegg_enrichment, distance_measure = "cosine", significance_thr = 0.05, adj_method = "none", max_clusters = 10)
TCGAome::plot_mds(kegg_term_clustering, all = TRUE)

hpo_enrichment <- TCGAome::get_enrichment(hpo, gene_list = gene_list)
hpo_term_clustering <- TCGAome::TermsClustering(hpo, hpo_enrichment, distance_measure = "cosine", significance_thr = 0.05, adj_method = "none", max_clusters = 10)
TCGAome::plot_mds(hpo_term_clustering, all = TRUE)

omim_enrichment <- TCGAome::get_enrichment(omim, gene_list = gene_list)
omim_term_clustering <- TCGAome::TermsClustering(omim, omim_enrichment, distance_measure = "cosine", significance_thr = 0.05, adj_method = "none", max_clusters = 10)
TCGAome::plot_mds(omim_term_clustering, all = TRUE)

Evaluating the different similarity metrics

The similarity metric employed for clustering and MDS affects how terms are grouped. It is important to assess how well a given similarity metric explains the variance in our annotation terms in the first components of the MDS that will be used to plot our results. This analysis is specific for every annotation and enrichment results, no magic recipe that can be reused.

First thing is to evaluate the explained variance in our MDS space by using each of the similarity metrics.

goa_bp_term_clustering_cosine <- TCGAome::TermsClustering(goa_bp, goa_bp_enrichment, distance_measure = "cosine", significance_thr = 0.05, adj_method = "none", max_clusters = 20)
goa_bp_term_clustering_ui <- TCGAome::TermsClustering(goa_bp, goa_bp_enrichment, distance_measure = "UI", significance_thr = 0.05, adj_method = "none", max_clusters = 20)
goa_bp_term_clustering_bray <- TCGAome::TermsClustering(goa_bp, goa_bp_enrichment, distance_measure = "bray-curtis", significance_thr = 0.05, adj_method = "none", max_clusters = 20)
goa_bp_term_clustering_binary <- TCGAome::TermsClustering(goa_bp, goa_bp_enrichment, distance_measure = "binary", significance_thr = 0.05, adj_method = "none", max_clusters = 20)

explained_variance_cosine <- goa_bp_term_clustering_cosine@explained_variance
explained_variance_ui <- goa_bp_term_clustering_ui@explained_variance
explained_variance_bray <- goa_bp_term_clustering_bray@explained_variance
explained_variance_binary <- goa_bp_term_clustering_binary@explained_variance
explained_variance_repr_cosine <- goa_bp_term_clustering_cosine@explained_variance_repr
explained_variance_repr_ui <- goa_bp_term_clustering_ui@explained_variance_repr
explained_variance_repr_bray <- goa_bp_term_clustering_bray@explained_variance_repr
explained_variance_repr_binary <- goa_bp_term_clustering_binary@explained_variance_repr

explained_variance <- rbind(
    cbind(set = "all",
          metric = "cosine", 
          explained_variance_cosine[explained_variance_cosine$component <= 10, ]),
    cbind(set = "all",
          metric = "UI", 
          explained_variance_ui[explained_variance_ui$component <= 10, ]),
    cbind(set = "all",
          metric = "Bray-Curtis", 
          explained_variance_bray[explained_variance_bray$component <= 10, ]),
    cbind(set = "all",
          metric = "binary", 
          explained_variance_binary[explained_variance_binary$component <= 10, ]),
    cbind(set = "representatives",
          metric = "cosine", 
          explained_variance_repr_cosine[explained_variance_repr_cosine$component <= 10, ]),
    cbind(set = "representatives",
          metric = "UI", 
          explained_variance_repr_ui[explained_variance_repr_ui$component <= 10, ]),
    cbind(set = "representatives",
          metric = "Bray-Curtis", 
          explained_variance_repr_bray[explained_variance_repr_bray$component <= 10, ]),
    cbind(set = "representatives",
          metric = "binary", 
          explained_variance_repr_binary[explained_variance_repr_binary$component <= 10, ])
)

ggplot2::ggplot(
    data = explained_variance,
    ggplot2::aes(x = component, y = explained_variance, colour = metric, linetype = set)) +
    ggplot2::geom_point(
        shape = I(16),
        size = 3) +
    ggplot2::geom_line(size = 0.2) +
    ggplot2::scale_x_continuous(
        breaks=unique(explained_variance$component)) +
    ggplot2::scale_y_continuous(label = scales::percent) +
    ggplot2::labs(y = "Explained variance", x = "Component") +
    ggplot2::theme_bw()

It is also important to assess how the similarity metric affects the optimal number of clusters.

silhouette_cosine <- goa_bp_term_clustering_cosine@silhouette
silhouette_ui <- goa_bp_term_clustering_ui@silhouette
silhouette_bray <- goa_bp_term_clustering_bray@silhouette
silhouette_binary <- goa_bp_term_clustering_binary@silhouette

silhouette <- rbind(
    cbind(metric = "cosine", 
          silhouette_cosine),
    cbind(metric = "UI", 
          silhouette_ui),
    cbind(metric = "Bray-Curtis", 
          silhouette_bray),
    cbind(metric = "binary", 
          silhouette_binary)
)

ggplot2::ggplot(
    data = silhouette,
    ggplot2::aes(x = k, y = avg_width, colour = metric)) +
    ggplot2::geom_line(
        size = 0.2,
        linetype = 2) +
    ggplot2::geom_point(
      shape = I(16),
      size = 3) +
    ggplot2::scale_x_continuous(
      breaks=unique(silhouette$k)) +
    ggplot2::scale_y_continuous() +
    ggplot2::labs(y = "Average silhouette width", x = "Number of clusters") +
    ggplot2::theme_bw()