knitr::opts_chunk$set(echo = TRUE)

Goal: Demonstrate monarchr disease, genes, homologs and phenotype methods.

We'll look for genes in model organisms that might suggest future focus for a human cancer phenotype.

To do this we need to define:

We'll try to do these using monarch as follows:

  1. We'll attempt to use Monarch to find a list of cancer phenotypes, and select one of these as our goal.
  2. We'll use Monarch to find what human genes are associated with that cancer phenotype.
  3. We'll use Monarch to find homologs to those genes in other model organisms.
  4. We'll use Monarch to find interactions to those genes in the other model organisms.
  5. We'll use Monarch to map the interactions back to human.
  6. We'll use mygene mygene.info to retrieve useful ids for monarch.
  7. We'll use Monarch to describe the interactions novel to the model organism network as projected on human.

In other vignettes, we'll use some of the results we produce here to pull other related information from monarch and non-monarch resources.

Selecting a cancer to work with: Uveal Melanoma

To ensure a richly documented dataset, we have selected a cancer from selected cancer listings at TCGA for which both sample collection is completed and public data available.

Under 'head and neck' cancer, we found uveal melanoma. We'll use this for our example.

Before proceeding further, we wondered: "Are there any animal models of uveal melanoma?"

A quick pubmed search did reveal mouse models for uveal melanoma.

Uveal Melanoma as a disease at Monarch Initiative.

It would be great if we could simply search for the disease by name and retrieve the top result.

Using monarchr: Find a useful ID for 'uveal melanoma'.

Load monarchr.

library(monarchr)

Search for "uveal melanoma" and display top results.

term <- "uveal melanoma"
results <- biolink_search(term)
um <- results$search_results # keep just the results, dropping the 'response' object.
#um
head(um)

Note that the results are presented in order of the 'score' column. This is the same ordering as we observe doing searches at the website.

# subset(um, select = c(id, score, label, category, synonym))
head(subset(um, select = c(id, score, label, category, synonym)), 10)

In this case, the top result is a perfect match for our phrase.

We save the 'id' for this top result. The id is the Monarch Disease Onotology (MONDO) ID and will be most useful for pulling other results from monarchr.

We print the definition for 'uveal melanoma'.

um_id <- um[1,]$id
um[um$id == um_id,]$definition

This definition is the same text as given in the 'Overview' tab from our search of the website in the previous section, MONDO:0006486.

Human Genes associated with uveal melanoma.

Fetching the human genes associated with the disease is simple.

um_genes <- bioentity_genes_assoc_w_disease(um_id)
um_genes <- um_genes$genes # ignore the response object.
um_genes

The HGNC IDs are most useful for further searches with monarchr. Obviously we can subset for the particular relation we want. For this vignette, we take all relation types.

HGNC derives from Human identifiers by the HUGO gene nomenclaure committee HUGO.

Note, the same gene can be returned more than once if there is more than one relation(?) or MONDO id (see notes).

um_hgnc <- um_genes$subject.id
dups <- um_genes[duplicated(um_genes$subject.id),]$subject.id
um_genes[um_genes$subject.id %in% dups, ]

Save the unique gene IDs.

um_genes_ids <- unique(um_genes$object.id)

Homologs of human genes.

Find homologs to each of the human genes.

We'd like to use something like the bioentityset function 'homologs' to do this all at once. But it doesn't seem to be working. I tried with our HGNC genes as well as with the NCBI example genes in their supplied example. We'll watch as the API evolves. Until then, we have to search for each gene's homolog one at a time.

homologs <- lapply(um_genes_ids, bioentity_homologs)
homologs <- lapply(homologs, "[[", "homologs")
homologs <- do.call('rbind', homologs)

Subset homologs to some organisms.

Subset to some of the more completely researched model organisms. Note that most or all of the mouse and rat are in 1 to 1 orthology relationships, and the other organisms a little less so.

#unique(homologs$object.taxon) # all we can choose from
orgs <- c('Mus musculus', "Rattus norvegicus", "Danio rerio", "Drosophila melanogaster", "Caenorhabditis elegans", "Saccharomyces cerevisiae S288C")
ho <- homologs[homologs$object.taxon %in% orgs,]
ho <- subset(ho, select= c(subject.taxon, subject.id, object.id, object.taxon))
ho <- ho[order(ho$object.taxon),]
split(ho, ho$object.taxon) # list these by organism.

Get interactions of the orthologs.

Monarch Initiative has indicated that while only 51% of human genes have an annotation associated with them, if one includes annotations of orthologs to those genes in other organisms, 89% have some annotation Monarch Initiative Nucleic Acids Research paper.

A possible inference from that is that by looking for interactions in other organisms, we might extend what we know about interactions for humans.

Of course we also want interactions in humans.

Find all interactions within each species.

all_genes <- c(um_genes_ids, ho$object.id)
all_intxns <- lapply(all_genes, bioentity_interactions_assoc_w_gene)
all_intxns <- lapply(all_intxns, "[[", "interactions")
all_intxns <- do.call('rbind', all_intxns)
all_intxns <- all_intxns[all_intxns$subject.taxon ==  all_intxns$object.taxon, ]

# unique(all_intxns$relation) # "interacts with" "genetically interacts with"

# nrow(all_intxns) # 3456
# length(all_intxns$object.id) # 3456
# length(unique(c(all_intxns$object.id, all_intxns$subject.id))) # 2677
# length(unique(all_intxns$object.id)) # 2646

all_intxns_split <- split(all_intxns, all_intxns$subject.taxon)

lapply(all_intxns_split, nrow)

Note, sometimes interactions are reported across species. We dropped those.

3484 interactions were found, 720 in human. Including both the subject and object genes, we have a total set of 2755 genes. We didn't find interactions for yeast. We don't know if the lack of yeast interactions is a monarch shortcoming or indicative of there really being no interactions for these particular yeast genes.

(TODO: evaluate counts inline as they'll change with future updates to monarch initiative's data sources.)

Orthologs to the interactions.

This is getting to be a large dataset. Let's continue by focusing on only one of the original human uveal melanoma genes and its orthologs and interactors of those.

We'll focus on the human gene 'BAP1', which has the unique id 'HGNC:950'.

bap1 <- "HGNC:950"
bap1_homologs <- homologs[homologs$subject.id == bap1,] # includes paralogs.
# all_intxns[all_intxns$subject.id %in% bap1_homologs$object.id,] # does not include interactions to human bap1
# Include interacations to human bap1 and all orthologs of bap1.
bap1_homolog_intxns <- all_intxns[all_intxns$subject.id %in% c(bap1, bap1_homologs$object.id),]
bap1_homolog_intxns_split <- split(bap1_homolog_intxns, bap1_homolog_intxns$object.taxon)
lapply(bap1_homolog_intxns_split, nrow)

We found 97 interactions in humans and 50-100 in other organisms.

Note! Some of these counts are suspiciously close to our default max_rows used to query monarch. We could go back and change the all_intxns to allow for more rows, but we'll continue on, understanding we are limiting our results, for now.

TODO: yeah, we should go back and redo that, but we only need BAP1 interactions.

Each of these tibbles represent graphs with a central node of BAP1 (or BAP1's orthologs) connected to each of its interactors.

We can continue from here to look at qualities of these BAP1 centralized network in data from each organism. That is left as an exercise.

We try to fix our limited results, and proceed only with mouse and human BAP1, as there are lots of interactions.

bap1_and_mouse_bap1 <- c(bap1, "MGI:1206586")
hm_bap1_homolog_intxns <- lapply(bap1_and_mouse_bap1, bioentity_interactions_assoc_w_gene, rows = 3000)  
hm_bap1_intxns <- lapply(hm_bap1_homolog_intxns, "[[", "interactions")
hm_bap1_intxns <- do.call('rbind', hm_bap1_intxns)
split(hm_bap1_intxns, hm_bap1_intxns$subject.id)

We found all the possible BAP1 interactions as neither mouse nor human exceed our row limits.

We restrict to only within organism matches.

hm_bap1_intxns <- hm_bap1_intxns[hm_bap1_intxns$subject.taxon == hm_bap1_intxns$object.taxon,]

Map the interactions in the other organisms back to human.

Now, we get a little crazy, and take the point of view that each organism's BAP1 centralized network represents a version of all networks which might be relevant (or not) to human BAP1.

We project the interactions back to human to get their human orthologs.

We already have all the human interactors to BAP1, we just save them with a new name.

We need to project the mouse interactions back to human.

bap1_human_intxns <- hm_bap1_intxns[hm_bap1_intxns$subject.taxon == "Homo sapiens",]
bap1_orthos_mouse <- hm_bap1_intxns[hm_bap1_intxns$subject.taxon == "Mus musculus",]

# we add back the mouse bap1 homolog to all the other mouse genes.
bap1_orthos_mouse_ids <- unique(c(bap1_orthos_mouse$subject.id, 
                                  bap1_orthos_mouse$object.id))
# x <- lapply(bm, bioentity_homologs, homolog_taxon="NCBITaxon:9606")
# Nicer to server with sleep.
bap1_orthos_mouse2human <- lapply(bap1_orthos_mouse_ids, 
                                  function(x) {
                                    r <- bioentity_homologs(gene = x, homolog_taxon = "NCBITaxon:9606", rows = 3000)
                                    Sys.sleep(0.1)
                                    r
                                    })
m2h <- lapply(bap1_orthos_mouse2human, "[[", "homologs")
m2h <- do.call('rbind', m2h)
nrow(m2h) # 105
length(unique(m2h$object.id)) # 102

We note imperfect 1 to 1 orthology led to the discrepancy in our counts above. Some genes have multiple orthology relationships.

m2h[m2h$object.id %in% m2h[duplicated(m2h$object.id),]$object.id,]

When we project these mouse interactors to BAP1 back to human, we have a projected network of 102 unique human genes.

Recall we had a network of 98 human genes using only the human interactions. (97 interactions + BAP1 as the hub)

Compare human network and human network as projected from mouse.

What is the difference between the human network and the projected mouse network?

Note, we're switching to working with the gene symbols rather than HGNC IDs.

h_genes <- c(bap1_human_intxns$object, "BAP1")
m2h_genes <- m2h$object

Common to both.

length(intersect(h_genes, m2h_genes))

Altogether.

length(union(h_genes, m2h_genes))

Unique within the human network.

nrow(setdiff(h_genes, m2h_genes))

unique within the mouse genes.

uniq_mouse <- setdiff(m2h_genes, h_genes)
uniq_mouse

We expand the potential network of genes interact with BAP1 by considering interactions in mouse, and projecting those interactions back to human.

To the extent anything is known of the new genes (the unique to mouse genes) vis a vis cancer or melanoma, these genes might be useful to pursue further. Alternatively, if nothing is known with respect to cancer for these genes, they might be interesting as novel targets for discovery.

Are the novel mouse genes associated with melanoma?

In the appendix, we note that MONDO:0005105 is the identifier for Melanoma.

We ask if any of the genes unique to the mouse network (as projected to human) are noted as melanoma genes.

melanoma <- bioentity_genes_assoc_w_disease("MONDO:0005105")
mel_genes <- unique(melanoma$genes$subject) # gene symbols
length(mel_genes)
intersect(mel_genes, uniq_mouse)

No new melanoma genes from the mouse network. This isn't too surprising as relatively few genes of any sort have been associated with melanoma according to the MONDO ontology.

Nonetheless, the genes might be useful for futher study with regards to melanoma.

Are the novel mouse genes associated with cancer?

Cancer has MONDO id 'MONDO:0004992'.

We suspect there are lots of genes tagged for cancer.

Indeed, https://monarchinitiative.org/disease/MONDO:0004992 shows 3055.

We need them all instead of the default 100 that monarchr provides. But we need to play nice with the server (either because of server limits or because of maximum size limits for the json, or both.)

Note, we get far more than 3035 genes. The Monarch Initiative website count included a handfulfrom non-human organims. We narrowed them to human, and took only the unique results.

#TODO: user shouldn't have to do this, wrap it in all methods instead.
check_for_more = TRUE
start_row = 0
group = 1
results = list()
while(check_for_more == TRUE) {
  # print(paste("fetching", group, "start", start_row))   # Monitor progress when using interactively.
  cancer <- bioentity_genes_assoc_w_disease("MONDO:0004992", 
                                            rows = 100, 
                                            start=start_row, 
                                            fetch_objects = FALSE)
  if (nrow(cancer$genes) < 100) {
    check_for_more <- FALSE
  }
  start_row <- start_row + 100
  group <- group + 1
  results <- c(results, cancer$genes[cancer$genes$subject.taxon == "Homo sapiens", ]$subject)
  Sys.sleep(0.5)
}
cancer_genes <- unique(unlist(results))
intersect(cancer_genes, uniq_mouse)

These 2 genes are BAP1 interactors in mouse. They are not noted as BAP1 interactors in human. In human, they are noted as having some association with cancer.

Get more info on the 2 new genes identified from projecting mouse to human.

Because we now have gene symbols, instead of gene IDs, further searches with monarchr are difficult.

mygene.info is handy to convert symbols back to useful monarch ids.

We use mygene.info to get back to an HGNC identifier.

suppressMessages(library(mygene))
new_genes <- intersect(cancer_genes, uniq_mouse)
# queryMany(new_genes, scopes = "symbol", species = "human", fields = "all")
new_genes_myinfo <- queryMany(new_genes, scopes = "symbol", species = "human", fields = "HGNC")$HGNC
new_genes_hgnc <- paste0("HGNC:", new_genes_myinfo)
# We need to give some time for mygene.info to respond.
Sys.sleep(1)

List disease and phenotype associations (including cancer) for our 2 candidates.

Then we explore how these candidate human BAP1 interactors are annotated for disease.

new_genes <- lapply(new_genes_hgnc, bioentity_diseases_assoc_w_gene)
new_genes <- lapply(new_genes, "[[", "diseases")
new_genes <- do.call(rbind, new_genes)
new_genes

Of the new genes of interest, KDM1A has the most cancer disease terms associated with it.

What are some of the phenotypes associated with either of the new genes.

new_genes <- lapply(new_genes_hgnc, bioentity_phenotypes_assoc_w_gene)
new_genes <- lapply(new_genes, "[[", "phenotypes")
new_genes <- do.call("rbind", new_genes)
new_genes

With a neoplasm listed for PKCA, maybe we were too hasty to suggest focusing more on KDM1A.

We think this sufficiently demonstrates the value of looking at the multiple types of annotations as are readily available from Monarch Initiative.

We note that both of these genes are subjects of intensive study in a variety of organisms. What is novel to our discovery here is that in the case of BAP1, we might want to consider each of these (PKCA, KDM1A) as possible interactors in human. And by extension of BAP1's role in uveal melanoma, possible roles in uveal melanoma.

Finally, we looked at interactions of only 1 (BAP1) of the 9 'melanoma' genes in detail. And we only projected interactors from one organism (mouse) back to human. If we wanted to cast as wide a net as possible, we could repeat these steps using all model organisms and interactions to all of the 9 melanoma genes. Of course we could also go in the other direction, identifying new research targets or questions for model organisms based on orthology to humans.

Summary

Appendix

The website equivalent of searching for 'uveal melanoma'.

If we were using the Monarch Initiative Website Directly, we would:

Note the following tabs and counts at the website:

We'll be interested in the overview, to make sure we found the disease we are targeting. And we'll want the genes to move onto additional steps.

In addition, we see that uveal melanoma is classified under 'melanoma' which is classified under 'cancer'. We make note of these MONDO IDs for validation or analysis steps in this or other vignettes.

Note, the entire MONDO ontology can be found here: MONDO.

Note, when we searched using monarchr the (label, category, synonym) columns in the tibble correspond to (Term, Category, Matching String) columns as observed at the website result. The exception is that the website displayed only one of possibly many synonyms until we drilled further into the results. (TODO(?): How is the particular synonym chosen by monarch? It is not obvious.).

Note, the number of unique genes (9) for uveal melanoma using monarchr is the same as listed under the Genes tab at the website.

Monarch returns all genes at and below a disease ontology's level (i.e. node or leaf).

When we searched for MONDO:0006486, we also returned genes for MONDO:0007966. Searching monarch's website for MONDO:0007966 we see it is directly below the MONDO:0006486 'uveal melanoma' term. So this makes sense, monarchr is returning both the genes at the disease ontology level we requested, and all those underneath it as well. You can confirm this by clicking on each of the nodes and leafs at the website below uveal melonoma. The only leaf that shows genes is 'susceptibility to melanoma', all other genes are classified solely at the higher level of 'uveal melanoma'. (That isn't the entire story, we do see some variants and phenotypes at the other nodes and leaves.)

We need to check server limits, or at least play nice.

TODO: We haven't seen any server limits in the headers from our TODO: previous requests, but it is still on us to play nice.

response\$content\$objects might be a handy shortcut for some searches.

We note that in the json when we searched for cancer genes using the MONDO term, cancer\$response\$content\$objects" gets all 3055 unique objects, even when asking for a limited number of rows. This suggests a shortcut, where we don't have to query for all information.

However, these "cancer\$response\$content\$objects" are in form of gene IDs rather than gene names. Therefore, we had to loop through all the detailed results.



charlieccarey/monarchr documentation built on Dec. 12, 2023, 12:57 p.m.