knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

VESPUCCI is the gene expression database for grapevine and we can access it via its GraphQL interface, called COMPASS. The rCOMPASS package is a R package that wraps some functionalities to simplify communication with the COMPASS interface.

In this first exercise we will perform few basic operations with VESPUCCI. We will create a Module starting from few genes and then automatically extend it by adding more genes. We will also have a look at gene and samples annotations.:

# devtools::install_github("onertipaday/rcompass")
library(rcompass)

The COMPASS GraphQL endpoint might hosts different compendia. At the moment there is only the VESPUCCI compendium, but there are different version of VESPUCCI, and each version might have data normalized in different ways. In this case there are 2 versions of VESPUCCI, version 1.0 (legacy) and version 2.0 (latest). The latter has data normalized in 2 different ways, TPM normalization and LIMMA (the default one) while the legacy version has the legacy normalization only (i.e. per-sample logratios). For every query we will need to indicate the compendium we want to use, if no version, no normalization and no database is specified the r get_available_compendia()$versions[2] values will be used.

paste("The Vitis gene expression compendium version", get_available_compendia()$versions[2], "normalized contains values for", totalCounts(version="1.0", aggregate_type="biofeatures"), " biological features, measured for", totalCounts(version="2.0", aggregate_type="sampleSets"), " sample sets. This corresponds to a total of", totalCounts(aggregate_type = "experiments"), " experiments and", totalCounts(aggregate_type = "samples"), " samples measured on", totalCounts(aggregate_type = "platforms")," different platforms divided in ")
knitr::kable( table(get_platform_information()$source))

Genes

Let's build our module starting from a bunch of genes that might come from a previous analysis. Gene name used are the VIT_ ids

gene_names <- c('VIT_00s0246g00220','VIT_00s0332g00060','VIT_00s0332g00110','VIT_00s0332g00160','VIT_00s0396g00010','VIT_00s0505g00030','VIT_00s0505g00060','VIT_00s0873g00020','VIT_00s0904g00010')

We can query the compendium with the list of gene names and get a list of BiologicalFeature objects that represents our genes of interest. The easiest way to know which are the valid filter values to be used is to use the autocompletion (ALT + SPACEBAR) in the COMPASS GraphQL interface at http://compass.fmach.it/graphql.

bf_ids <- get_biofeature_id(compendium = "vespucci", id_In = gene_names, useIds = FALSE) 
knitr::kable(bf_ids)

Each BiologicalFeature object has several fields. In our case, since the BiologicalFeature represent a gene, we have sequence. The id is the identifier used internally in the database.

knitr::kable(get_biofeature_by_name(name_In = gene_names))

Get ids from aliases using SPARQL

Genes might have different aliases, all of which are defined in VESPUCCI as annotations. Since annotations (for both genes and samples) are represented as RDF triples using different ontologies, we can perform a SPARQL query to retrieve all the biological features (genes here) that has an alias NCIT_C41095 term on the NCIT Ontology, http://purl.obolibrary.org/obo/) equal to the ones specified in the list.

my_query <- "SELECT ?s ?p ?o WHERE { {?s <http://purl.obolibrary.org/obo/NCIT_C41095> 'B9S8R7'} UNION {?s <http://purl.obolibrary.org/obo/NCIT_C41095> 'Q7M2G6'} UNION {?s <http://purl.obolibrary.org/obo/NCIT_C41095> 'D7SZ93'} UNION {?s <http://purl.obolibrary.org/obo/NCIT_C41095> 'B8XIJ8'} UNION {?s <http://purl.obolibrary.org/obo/NCIT_C41095> 'Vv00s0125g00280'} UNION {?s <http://purl.obolibrary.org/obo/NCIT_C41095> 'Vv00s0187g00140'} UNION {?s <http://purl.obolibrary.org/obo/NCIT_C41095> 'Vv00s0246g00010'} UNION {?s <http://purl.obolibrary.org/obo/NCIT_C41095> 'Vv00s0246g00080'} UNION {?s <http://purl.obolibrary.org/obo/NCIT_C41095> 'Vv00s0438g00020'} UNION {?s <http://purl.obolibrary.org/obo/NCIT_C41095> 'Vv00s0246g00200'} }"
sparql_results <- get_sparql_annotation_triples(target = "biofeature", query = my_query) 
sparql_results[,1]  

alternatively we can use the wrapper function get_ids_from_alias():

aliases <- c('B9S8R7','Q7M2G6','D7SZ93','B8XIJ8','Vv00s0125g00280','Vv00s0187g00140','Vv00s0246g00010','Vv00s0246g00080','Vv00s0438g00020','Vv00s0246g00200')
sparql_results_check <- get_ids_from_alias(target = "biofeature", alias_names = aliases)

and check results are the same:

identical(sparql_results[,1],sparql_results_check)

Now we add the genes id retrieved using either the sparql query or the wrapper function get_ids_from_alias() to the original genes id list.

gene_ids_module <- c(bf_ids$id, sparql_results_check)
gene_names_module <- c(bf_ids$name, get_biofeature_id(id_In = sparql_results_check)$name)

Create a Module

We are now ready to create our first module. If we pass a list of genes (BiologicalFeature objects) it will automatically retrieve the "best" conditions (SampleSet objects).

module_1 <- create_module_bf(biofeaturesNames = gene_ids_module, 
                             normalization = "limma", 
                             top_n = 50,
                             useIds = TRUE)

A module can be thought as a matrix composed by BiologicalFeature objects (genes in our case) as rows and SampleSet objects (conditions contrasts in our case) as columns. Each SampleSet, as the name suggests, is composed by 1 or more Sample objects. Since we are using the legacy normalization each SampleSet is composed by 2 samples, one Test sample and one Reference sample and the value associated is the logratio between the Test and Reference for a specific gene (BiologicalFeature).

Plot a module

The easiest way to visualize a module is using a heatmap. The Plot object wraps a module and can return different plot type with different format (the Plotly JSON format or HTML). To see an heatmap embedded in rstudio we can retrieve the HTML code of the Plot (that uses Plotly) and display it directly into the viewer.

view_plot(module = module_1, 
          normalization = "limma",
          plotType = "module_heatmap_expression",
          alternativeColoring = FALSE,
          min = -6, max = 6)

When using the heatmap it is possible to use the button Get Module IDs in order to copy on the clipboard BioFeatures and SampleSets IDs after having selected a part of the module with the left mouse button. For example after having selected one small block of up-regulated genes and clicked the Get Module IDs button, we can paste a JSON structure to be used for create a new module:

json_selection <- '{"sample_sets":["U2FtcGxlU2V0VHlwZToyMTg3Mg==","U2FtcGxlU2V0VHlwZToyMTg2Mg==","U2FtcGxlU2V0VHlwZToyMTAyNg==","U2FtcGxlU2V0VHlwZTo2NTU2","U2FtcGxlU2V0VHlwZToyMTAyNw==","U2FtcGxlU2V0VHlwZToyMTAyOA==","U2FtcGxlU2V0VHlwZTo1ODc2","U2FtcGxlU2V0VHlwZTo1ODg2","U2FtcGxlU2V0VHlwZToyMTAyOQ==","U2FtcGxlU2V0VHlwZTo1ODU4","U2FtcGxlU2V0VHlwZTo1ODYy","U2FtcGxlU2V0VHlwZTo1OTE3","U2FtcGxlU2V0VHlwZToxOTEyNg==","U2FtcGxlU2V0VHlwZToyMTg1OQ=="],"biological_features":["QmlvRmVhdHVyZVR5cGU6Mjc1NTU=","QmlvRmVhdHVyZVR5cGU6Mjg1MDQ=","QmlvRmVhdHVyZVR5cGU6MjgxODM=","QmlvRmVhdHVyZVR5cGU6Mjc0Mzc=","QmlvRmVhdHVyZVR5cGU6MjgyODE=","QmlvRmVhdHVyZVR5cGU6MjgyNzg=","QmlvRmVhdHVyZVR5cGU6MjgwODE=","QmlvRmVhdHVyZVR5cGU6Mjc5MDA=","QmlvRmVhdHVyZVR5cGU6Mjc4OTY=","QmlvRmVhdHVyZVR5cGU6Mjc4OTI=","QmlvRmVhdHVyZVR5cGU6Mjg1OTg=","QmlvRmVhdHVyZVR5cGU6Mjg2MDk=","QmlvRmVhdHVyZVR5cGU6Mjg2MzA="]}'

submodule_ids <- RJSONIO::fromJSON(json_selection)

submodule_1 <- create_module(biofeaturesNames = submodule_ids$biological_features,
                             samplesetNames = submodule_ids$sample_sets,
                             normalization = "limma", useIds = TRUE)

and visualize it

view_plot(module = submodule_1, 
          normalization = "limma",
          plotType = "module_heatmap_expression",
          alternativeColoring = FALSE,
          min = -6, max = 6)

Biological Features (genes) and Sample Sets (contrasts) have no particular order in a module. But since the heatmap is clustered we can retrieve the sorted list of Biological Features and Sample Sets respectively using the sorted = TRUE parameter.

mod_tmp <- create_module_bf(biofeaturesNames = gene_ids_module, 
                            normalization = "limma", 
                            top_n = 50,
                            sorted = TRUE,
                            useIds = TRUE)
# my_sorted_indexes <- plot_module_heatmap(module = module_1, normalization = "limma", plot = FALSE, sorted = TRUE)
# mod_tmp = module_1[match(my_sorted_indexes$sortedBiofeatures,rownames(module_1)),
#                    match(my_sorted_indexes$sortedSamplesets,colnames(module_1))]

Alternatively you can use directly any R function to cluster data and retrieve the sorted rows and columns of the matrix, e.g.:

pheat_cl <- pheatmap::pheatmap(mat =  SummarizedExperiment::assay(mod_tmp), silent = TRUE)
mod_tmp <- mod_tmp[pheat_cl$tree_row$order,pheat_cl$tree_col$order]

Export it in another format

Since we created a data.frame from the module we can now export it in another format (e.g. .csv, .tsv, etc.) and keep working with it elsewhere with different tools:

write.table(module_1, file="/tmp/module_1.tsv", sep="\t", col.names=NA, row.names=T)

Plot correlation network

Given a module we can plot the correlation network between Biological Features (genes) by looking at the Pearson correlation coefficient of the module's Sample Sets (contrast). The default threshold is 0.8, but it can be changed. Edges in orange represent anti-correlation.

view_plot(module = module_1, 
          normalization = "limma",
          plotType = "module_coexpression_network", 
          threshold = 0.8)

Automatically expand a module

One of the classic way to use VESPUCCI is to start with a set of genes selected from previous analysis, have a look at the top conditions in which they show a consistent (and strong) behavior and then look for other genes that behave the same across those same conditions. The way to do it is to plot the distribution of genes that show a similar behavior of the module's genes across the same condition in order pick up a suitable threshold.

view_plot(module = module_1, normalization = "limma",
          plotType = "biological_features_uncentered_correlation_distribution")

For example, if we pick a threshold around 0.9 (looking at the top plot) we might expect to have around 15 Biological Features (genes) with a score equal or greater than that, to be added in our module. In this case let's just pick the top 15 genes and create e new module.

top_bf <- plot_module_distribution(module = module_1, 
                                   normalization = "limma",
                                   plotType = "biological_features_uncentered_correlation_distribution",
                                   getRank = TRUE)

The new module is created by adding Biological Feature objects to the old module.

NB Being the ranking based on the full biofeatures set we remove the duplicates ids (keeping unique ids only) and create a new model.

module_2 <- create_module(biofeaturesNames = unique(c(rownames(module_1), top_bf$id[1:15])), 
                             normalization = "limma", 
                             useIds = TRUE)

We can have a look at the heatmap of the new module

view_plot(module = module_2, 
          normalization = "limma",
          plotType = "module_heatmap_expression",
          alternativeColoring = TRUE)

Annotations

In VESPUCCI, both Biological Feature (gene) and Sample annotations are represented using RDF. We can visualize annotation as a set of RDF triples.

samples <- my_sorted_indexes$sortedSamplesets
samples_info <- get_samples_info(id_In = samples, normalization = "limma", useIds = TRUE)
knitr::kable(samples_info)
knitr::kable(get_sample_by_gsm(sampleName_In =  samples_info$sampleName))
knitr::kable(get_annotation_triples(ids =  get_sample_by_gsm(sampleName_In =  samples_info$sampleName)$sampleId))

Another way is to visualize the annotation is with RDF graph. Here we show the RDF graphs for indexed biofeatures.

 plot_annotation_network(ids = my_sorted_indexes$sortedBiofeatures, viewer = T)

Annotations can also be used to retrieve Sample or Biological Features (gene) starting from a SPARQL query. For example we might want to select all seed Sample.

sparql = "SELECT ?s ?p ?o WHERE { ?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.obolibrary.org/obo/NCIT_C19157> . ?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.obolibrary.org/obo/PO_0009010>}"

The term NCIT_C19157 indicates specimen while PO_0009010 is the Plant Ontology term for seed. To indicate a gene, instead the term NCIT_C16612 is used.

Here we retrieve all the samples used based on the sparql query:

seed_samples <- get_sparql_annotation_triples(target = "sample", 
                                              query = sparql,
                                              normalization = "limma")[,1]

Now we can collect all the Sample Sets in which these samples are used and then create a new module based on Sample Sets only leaving VESPUCCI to figure out which are the genes (Biological Feature objects) that show a consistent up or down-regulated profile.

seed_ss <- get_sampleset_by_sampleid(normalization = "limma", samples = seed_samples)
module_3 <- create_module(samplesetNames = unique(seed_ss$id), useIds = TRUE, normalization = "limma")

And now we can plot the heatmap

view_plot(module = module_3, 
          normalization = "limma",
          plotType = "module_heatmap_expression",
          alternativeColoring = TRUE)

Module summary

One way to have a general understanding about what is contained in a module, in terms of sample conditions, is to use the get the Module Description Summary Object.

description <- describe_module(module = module_1, normalization = "limma")

Enrichment

Another way to have a general understanding about which genes and conditions are present in our module we can explore the Module Enrichment Object. The enrichment is calculated using an Hypergeometric test with p-values corrected for multiple testing using Benjamini-Hochberg.

(enrichment <- enrich_module(module = module_1, normalization = "limma"))

SampleSet Short Descriptions

Another way to have a general descriptions of the different conditions is to use the SampleSet short description. It is calculated using Ontological Terms and tries to highlight similarities and differences between sample annotation within a SampleSet if any. In case a SampleSet is a contrast (as in limma and legacy normalizations) the description is divided by a double pipe || with similar terms between reference and test conditions on the left side and different terms on the right side.

knitr::kable(get_sampleset_id(id_In = colnames(module_1)[1], normalization = "limma", useIds = TRUE))

Session information

print(sessionInfo(), locale = FALSE)


onertipaday/rcompass documentation built on Sept. 18, 2021, 5:13 p.m.