We would like to be able to use existing R APIs for annotating genes based on gene IDs as the starting point. Thereby we can keep up to date with new Ensembl
versions that are published and are flexible whether IGIS (gRED) or GI-HDAP (pRED) is used.
For gRED molecules, the canonical source for Ensembl
(Entrez
for legacy) annotations is IGIS. Therefore we want to be able to fetch the required feature data columns for the genes based on the IDs from IGIS.
Fortunately, the nice R-package igis
is available.
connection <- connect(prefix = prefix(object), db = "igis")
where object
is a HermesData
object and db
identifies the data base to use.prefix()
is a new generic function/method to access slot of same name in HermesData object: it is either "ENSG" or "GeneID" and filled at creation.connection
result has a class: here Igis
igis::Igis
where we use "human" species and the version is determined
by the gene IDs format found in object
.annotation
method: this can have a getter function, where we just give back
the relevant data frame columns from rowData(object)
.annotation(object) <- query(genes = genes(object), connection)
genes()
is a new method for generic defined in package GenomicFeatures
to access gene names in objectquery
and a method for character
, Igis
signature.Just for illustration we create a new class here, but it would be the same for HermesData
and RangedHermesData
: just an additional slot prefix
.
# this will be folded into both HermesData and RangedHermesData class definitions: .HD <- setClass( Class = "HD", contains = "HermesData", slots = c(prefix = "character") ) # this will be like this: setGeneric("prefix", def = function(object, ...) { object@prefix }) # the constructor then also fills the prefix slot - this will later be folded into HermesData() # constructor: HD <- function(object) { start <- HermesData(object) gene_ids <- rownames(object) prefix <- if (all(grepl("^ENSG", gene_ids))) { "ENSG" } else if (all(grepl("^GeneID", gene_ids))) { "GeneID" } else { stop("hermes currently only supports EnsemblID or EntrezID as gene IDs") } .HD( start, prefix = prefix ) }
So now our typical example is:
object <- HD(summarized_experiment) prefix(object)
Actually this is just a synonym for rownames()
really.
In order to ensure that, we need to add another validation check for HermesData objects:
we require that its rownames equal the GeneID column in rowData.
library(igis) #' @importFrom GenomicFeatures genes setMethod( f = "genes", signature = c(x = "AnyHermesData"), definition = function(x, ...) { rownames(x) } ) # later assert this via validation method: identical(rownames(object), rowData(object)$GeneID) # then we can safely use this: head(genes(object))
As mentioned above, this can later be extended easily to other db
options.
connect <- function(prefix = c("ENSG", "GeneID"), db = "igis") { prefix <- match.arg(prefix) db <- match.arg(db) if (db == "igis") { version <- switch(prefix, ENSG = "4.0", GeneID = "3.0" ) igis::Igis(species = "human", version = version) } } connection <- connect(prefix(object)) class(connection)
This gives back the relevant data frame columns from rowData(object)
.
# Define a package constant: .row_data_annotation_cols <- c( "symbol", "desc", "GeneID", "chromosome", "size" ) #' @importFrom BiocGenerics annotation setMethod( f = "annotation", signature = c(object = "AnyHermesData"), definition = function(object, ...) { rowData(object)[, .row_data_annotation_cols] } ) head(annotation(object))
Now we are ready to query from Igis e.g. - again we define this as a generic function so that it can be easily extended once we need other connections.
setGeneric("query", def = function(genes, connection) {}) setMethod( f = "query", signature = c(genes = "character", connection = "Igis"), definition = function(genes, connection) { trans <- igis::transcriptsByGeneId( geneIds = genes, canonical = TRUE, igis = connection ) df <- cbind( setNames( mcols(trans), c( "symbol", "GeneID", "desc" ) ), Chromosome = as.vector(seqnames(trans)), StartBP = start(trans), EndBP = end(trans), WidthBP = width(trans) ) rownames(df) <- genes df[, .row_data_annotation_cols] } )
Let's try this out:
new_annotations <- query(genes(object), connection) head(new_annotations)
Note: If there is too much time delay between the connection creation and the query, then the error:
could not run statement: MySQL server has gone away
can be thrown. In that case just create a new connection again.
This basically just uses the data frame replacement method: That part of the rowData
which is the
annotation gets replaced with the provided data frame.
setReplaceMethod( f = "annotation", signature = c(object = "AnyHermesData", value = "DataFrame"), function(object, value) { # In production add here more checks/assertions if needed (e.g. test # whether wrong things could be inserted here into rowData) - # do we need to align the row names etc. # and then finally insert: rowData(object)[, .row_data_annotation_cols] <- value[, .row_data_annotation_cols] object } )
Let's try:
# first remove some info # rowData(object)[, "CanonicalTranscript"] <- "" # then fill in annotation(object) <- new_annotations head(annotation(object))
But the class of object is still:
class(object) rd <- rowData(object) head(rd) head(annotation(object))
Gene type can be found using igis::variationConsequence()
, in the biotype
slot.
For a single gene this works well:
connection <- connect_igis("ENSG") gene <- "ENSG00000163734" transcripts <- igis::transcriptsByGeneId( geneIds = gene, canonical = TRUE, igis = connection ) class(transcripts) transcripts cons <- igis::variationConsequence( igis = connection, allele = "T", location = transcripts, species = "human" ) cons$biotype
So probably we can do the same principle as what was done for igis::transcriptsByGeneId()
in that instead of providing one specific gene, we provide a list of all genes. Note that below we also have the other prefix (GeneID
which means Entrez
or IGIS version 3).
object <- HermesData(summarized_experiment) prefix(object) connection <- connect_igis(prefix(object)) all_transcripts <- igis::transcriptsByGeneId( geneIds = genes(object), canonical = TRUE, igis = connection ) all_transcripts
So let's try this:
all_cons <- igis::variationConsequence( igis = connection, allele = "T", location = all_transcripts, species = "human" )
This gives
Error in getListElement(x, i, ...) : GRanges objects don't support [[, as.list(), lapply(), or unlist() at the moment
So seems to me that there is a bug in igis
here.
I looked this up on Google and found other people reporting this, and being referred to R 4.0.3 with later BioConductor release. I tested this in my local RStudio with R 4.0.3 and indeed this runs there fine:
library(igis) connection <- Igis(species = "human", version = "4.0") genes <- c("ENSG00000163734", "ENSG00000163735") transcripts <- igis::transcriptsByGeneId( geneIds = genes, canonical = TRUE, igis = connection ) cons <- variationConsequence( igis = connection, allele = "T", location = transcripts, species = "human" ) cons$biotype
As the NEST team overall will switch to R 4.0.x in the next cycle, it seems better to postpone this feature implementation to the next cycle.
As an alternative we could have worked around this for now, e.g. via querying each range separately:
for (i in 1:length(genes(object))) { gene_type <- igis::variationConsequence( allele = "T", location = transcripts[i, ], igis = connection, species = "human" ) df <- cbind(transcripts[i, ], gene_type$biotype) }
However this will be pretty slow, so does not seem like a good option. Also, the feature request is not urgent at all, so it's fine to have this in an upcoming release.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.