MirnahostgenesDb-usage: Retrieving miRNA information and host gene definitions
In jotsetung/mirhostgenes: mirhostgenes: putting miRNAs into genomic context


Use and retrieve miRNA host gene definitions stored in a corresponding database. Such database packages can be created using the makeMirhostgenesPackage function (see the corresponding help page for more information). For basic usage of the database and package, see the help page for MirhostDb.

## S4 method for signature 'MirhostDb'
hostgenes(x, columns=listColumns(x, "host_gene"),
                                filter, order.by="gene_id",
                                order.type="asc", return.type="DataFrame")

## S4 method for signature 'MirhostDb'
hostgenesBy(x, by="pre_mirna_algn",
                                  columns=listColumns(x, "host_gene"), filter,
                                  return.type="DataFrame", drop.empty=TRUE,
                                  use.names=FALSE)

## S4 method for signature 'MirhostDb'
hosttx(x, columns=listColumns(x, "host_tx"), filter,
                             order.by="tx_id", order.type="asc",
                             return.type="DataFrame")

## S4 method for signature 'MirhostDb'
hosttxBy(x, by="pre_mirna_algn",
                               columns=listColumns(x, "host_tx"), filter,
                               return.type="DataFrame", drop.empty=TRUE,
                               use.names=FALSE)

## S4 method for signature 'MirhostDb'
matmirnas(x, columns=listColumns(x, "mat_mirna"),
                                filter, order.by="mat_mirna_id",
                                order.type="asc", return.type="DataFrame")

## S4 method for signature 'MirhostDb'
matmirnasBy(x, by="pre_mirna_algn",
                                  columns=listColumns(x, "mat_mirna"), filter,
                                  return.type="DataFrame", use.names=FALSE)

## S4 method for signature 'MirhostDb'
matmirnasInMultiplePremirnas(x,columns=c(listColumns(x, "mat_mirna"),
                                                   "pre_mirna_id", "pre_mirna_name"),
                                                   filter=list(),
                                                   return.type="DataFrame")

## S4 method for signature 'MirhostDb'
premirnas(x, columns=listColumns(x, "pre_mirna"), filter,
                                order.by="pre_mirna_id", order.type="asc",
                                return.type="DataFrame")

## S4 method for signature 'MirhostDb'
premirnasBy(x, by="mat_mirna",
                                  columns=listColumns(x, "pre_mirna"),
                                  filter, return.type="DataFrame",
                                  use.names=FALSE)

## S4 method for signature 'MirhostDb'
premirnasWithMultipleAlignments(x,
                                                      columns=listColumns(x, "pre_mirna"),
                                                      filter=list(),
                                                      return.type="DataFrame")

## S4 method for signature 'MirhostDb'
probesets(x, columns=listColumns(x, "array_feature"), filter,
                                order.by="probeset_id", order.type="asc",
                                return.type="DataFrame")

## S4 method for signature 'MirhostDb'
probesetsBy(x, by="pre_mirna_algn",
                                  columns=listColumns(x, "array_feature"),
                                  filter, return.type="DataFrame",
                                  drop.empty=TRUE,
                                  use.names=FALSE)

(in alphabetical order)

`by`	For `hostgenesBy`, `hosttxBy`, `matmirnasBy`, `premirnasBy` and `probesetsBy`: the attribute by which the entries should be grouped. Allowed values are `"pre_mirna_algn"`, `"pre_mirna"`, `"mat_mirna"`, `"mirfam"`, `"host_gene"`, `"host_tx"`, `"database"` or `"probeset"`, which group results by the alignment ID of the pre-miRNA, by pre-miRNA, mature miRNA, miRNA family, host gene, host transcript, database or microarray probe set associated with the host transcript, respectively. The default for all methods except `premirnasBy` is `"pre_mirna_algn"`, so entries are returned grouped by the unique pre-miRNA alignments; the default for `premirnasBy` is `"mat_mirna"`. `by="database"` groups the entries by the database in which the transcript/gene model of the host transcript/gene was defined. To get a list of all databases use the `listDatabases` method.
`columns`	Character vector of columns (attributes) to return from the database. For a complete list of available columns use the methods `listTables` or `listColumns`.
`drop.empty`	For `hostgenesBy`, `hosttxBy` and `probesetsBy`: whether empty list elements should be dropped (the default). If `FALSE`, all elements are returned, including empty ones, e.g. those representing pre-miRNAs for which no host gene or transcript was defined (if `by="pre_mirna"`).
`filter`	A single filter instance or `list` of filter instances to be used to fetch specific elements from the database. See help for `PositionFilter` or `AnnotationFilter` in package `ensembldb` for information on filter objects and their use.

`order.by`	The column by which the result should be ordered. Can also be a string with multiple columns, separated by a `","`.
`order.type`	Either `"asc"` or `"desc"` depending on whether the results should be returned in ascending or descending order.
`return.type`	Specifies the class of the result object. Allowed values are `"data.frame"` or `"DataFrame"` (the default). Additionally, for methods `matmirnas`, `matmirnasBy`, `premirnas` and `premirnasBy`, `return.type="GRanges"` can be specified, which returns a `GRanges` object for the mature miRNA or pre-miRNA (i.e. representing its genomic alignment) with all additional annotations added as metadata columns. Note that methods `premirnasBy`, `matmirnasBy`, `hostgenesBy`, `hosttxBy` and `probesetsBy` split the result by the argument `by`; for these methods, `return.type` specifies the class of the elements in the `list` (for `return.type="data.frame"`) or `SplitDataFrameList` (for `return.type="DataFrame"`) that is returned.
`use.names`	Uses, if available, the names instead of the IDs to group elements (e.g. the pre-miRNA name instead of the pre-miRNA ID). Note that the gene name (symbol) might be empty for some genes, in which case all entries for genes without a name are grouped together.
`x`	The `MirhostDb` instance from which the data should be retrieved.
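
The comma-separated form of `order.by` together with `order.type` can be pictured with a small base R sketch. The helper `order_rows` and the mock table below are hypothetical and only emulate the described ordering on a plain `data.frame`; they are not part of the package, whose own ordering is done inside the database query.

```r
## Hypothetical helper: emulate order.by / order.type on a plain data.frame.
order_rows <- function(df, order.by, order.type = "asc") {
    ## Split "col1,col2" into individual column names.
    cols <- trimws(strsplit(order.by, ",", fixed = TRUE)[[1]])
    keys <- unname(as.list(df[cols]))
    ## Sort by all keys, ascending or descending.
    idx <- do.call(order, c(keys, list(decreasing = order.type == "desc")))
    df[idx, , drop = FALSE]
}

## Mock data (invented names, not taken from a real database).
pre <- data.frame(
    seq_name = c("2", "1", "1"),
    pre_mirna_name = c("mir-c", "mir-b", "mir-a"),
    stringsAsFactors = FALSE
)

asc <- order_rows(pre, "seq_name,pre_mirna_name")
desc <- order_rows(pre, "seq_name,pre_mirna_name", order.type = "desc")
asc$pre_mirna_name
desc$pre_mirna_name
```

With the real methods the same string would simply be passed as the `order.by` argument.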

These methods provide access to all miRNA-related information in the database (i.e. mature miRNAs and pre-miRNAs).

matmirnas

Returns all mature miRNAs from the database along with optional additional columns from other database tables (which can be empty for some mature miRNAs). Note that column "sequence" returns the actual RNA sequence of the mature miRNA, not the genomic DNA defined by the columns "mat_mirna_seq_start" and "mat_mirna_seq_end". Also, be aware that mature miRNAs encoded in several pre-miRNAs or in pre-miRNAs with multiple genomic alignments are listed in multiple rows of the results table (as their start and end coordinates differ). To get a unique list of miRNAs, the columns argument should be set to c("mat_mirna_id", "mat_mirna_name").

The method returns a DataFrame, data.frame or GRanges depending on the value of the return.type argument ("DataFrame", "data.frame" or "GRanges", respectively). Entries in the returned object are ordered according to the parameter order.by, NOT by the order of values in any filter objects supplied.
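
Why restricting the columns gives a unique list can be seen on a mock table. The sketch below uses invented IDs and coordinates and plain base R; it only illustrates the principle that rows differing solely in their coordinates collapse once the coordinate columns are left out, and makes no claim about how the package builds its query.

```r
## Mock result: mature miRNA "MIMAT_X" is encoded in two pre-miRNAs and
## therefore appears with two different start coordinates (invented values).
mat <- data.frame(
    mat_mirna_id = c("MIMAT_X", "MIMAT_X", "MIMAT_Y"),
    mat_mirna_name = c("miR-x", "miR-x", "miR-y"),
    mat_mirna_seq_start = c(100L, 5000L, 42L),
    stringsAsFactors = FALSE
)
nrow(mat)

## Keeping only ID and name makes the two "MIMAT_X" rows identical.
uniq <- unique(mat[, c("mat_mirna_id", "mat_mirna_name")])
nrow(uniq)
```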

matmirnasBy

Returns a CompressedSplitDataFrameList of DataFrames or a list of data.frames with the names of the list being the ids by which the mature miRNAs are grouped (e.g. pre-miRNA ids) and the elements of the list being the mature miRNA entries. Similar to matmirnas, column "sequence" in the result object contains the RNA sequence of the mature miRNA.

The method returns a SplitDataFrameList (list of DataFrames), a list of data.frames or a GRangesList, depending on the value of the parameter return.type ("DataFrame", "data.frame" or "GRanges", respectively). The results are ordered by the value of the by parameter.

matmirnasInMultiplePremirnas

Returns mature miRNAs which are encoded in more than one pre-miRNA. The return object is the same as for matmirnas.

premirnas

Returns pre-miRNAs defined in miRBase along with optional additional columns from other database tables (which can be NA for some pre-miRNAs). Note that column "sequence" returns the actual RNA sequence of the pre-miRNA, not the genomic DNA defined by the columns "pre_mirna_seq_start" and "pre_mirna_seq_end". Also, some pre-miRNAs might have multiple genomic alignments and might thus be listed multiple times in the returned object.

The method returns a DataFrame, data.frame or GRanges depending on the value of the return.type argument ("DataFrame", "data.frame" or "GRanges", respectively). Entries in the returned object are ordered according to the parameter order.by, NOT by the order of values in any filter objects supplied.

premirnasBy

Returns a CompressedSplitDataFrameList of DataFrames or a list of data.frames with the names of the list being the ids by which the pre-miRNAs are grouped (e.g. mature miRNA ids) and the elements of the list being the pre-miRNA entries.

The method returns a SplitDataFrameList (list of DataFrames), a list of data.frames or a GRangesList, depending on the value of the parameter return.type ("DataFrame", "data.frame" or "GRanges", respectively). The results are ordered by the value of the by parameter.

premirnasWithMultipleAlignments

Returns pre-miRNAs which are encoded in several genomic loci. The return object is the same as for premirnas.

These methods retrieve host genes and transcripts as well as the microarray features (probe sets) targeting them.

hostgenes

Returns all predicted host genes from the database along with optional additional columns from other database tables. Host genes with gene_biotype equal to "miRNA" should be treated with care, as they represent the actual pre-miRNAs: Ensembl defines genes for some of the pre-miRNAs defined in miRBase. The column/attribute database specifies in which database the gene is defined ("core", "otherfeatures" and "vega" indicating the Ensembl core database with all known genes, the Ensembl otherfeatures database and the manually curated Ensembl vega database, respectively).

The method returns a DataFrame or data.frame depending on the value of the return.type argument ("DataFrame" or "data.frame"). Entries in the returned object are ordered according to the parameter order.by, NOT by the order of values in any filter objects supplied.

hostgenesBy

Returns a CompressedSplitDataFrameList of DataFrames or a list of data.frames with the names of the list being the ids by which the host genes are grouped (e.g. pre-miRNA ids) and the elements of the list being the host gene entries. Note that by default empty elements are dropped (see parameter drop.empty).

The method returns a SplitDataFrameList (list of DataFrames) or a list of data.frames depending on the value of the parameter return.type ("DataFrame" or "data.frame"). The results are ordered by the value of the by parameter.
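
The effect of drop.empty can be pictured with base R's split on a mock table. Everything below (IDs, table layout) is invented for illustration. In this sketch an empty group is a zero-row data.frame; how the package itself represents empty elements may differ.

```r
## Mock host gene table (invented IDs); "MI_B" has no host gene.
host_genes <- data.frame(
    pre_mirna_id = c("MI_A", "MI_A", "MI_C"),
    gene_id = c("G1", "G2", "G3"),
    stringsAsFactors = FALSE
)
all_pre <- c("MI_A", "MI_B", "MI_C")

## Group by pre-miRNA; the factor levels list every pre-miRNA.
grp <- factor(host_genes$pre_mirna_id, levels = all_pre)

kept <- split(host_genes, grp, drop = TRUE)   # analogous to drop.empty=TRUE
full <- split(host_genes, grp, drop = FALSE)  # analogous to drop.empty=FALSE

names(kept)
names(full)
nrow(full[["MI_B"]])
```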

hosttx

Returns all predicted host transcripts from the database along with optional additional columns from other database tables. Note that a host transcript that hosts several pre-miRNAs is listed in multiple rows of the result table (one for each pre-miRNA). To get a unique list of host transcripts, the columns parameter should be restricted to c("tx_id", "tx_biotype", "gene_id"). The columns in_intron and in_exon specify in which intron or exon of the transcript the pre-miRNA is encoded (0 for not in intron or exon), exon_id indicates the exon id for exonic pre-miRNAs and the column is_outside indicates whether the pre-miRNA is only partially inside the transcript. See the package's vignette for a detailed description.

The method returns a DataFrame or data.frame depending on the value of the return.type argument ("DataFrame" or "data.frame"). Entries in the returned object are ordered according to the parameter order.by, NOT by the order of values in any filter objects supplied.
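
As a rough illustration of how the in_intron and in_exon columns might be read, the sketch below labels each row of a mock table. The values are invented, and the rule applied (a value above 0 names the intron or exon, 0 means "not in one") is only the reading given in the description above. Whether a row with both columns at 0 occurs in a real database is not stated here.

```r
## Mock host transcript table (invented values).
host_tx <- data.frame(
    tx_id = c("T1", "T2", "T3"),
    in_intron = c(2L, 0L, 0L),
    in_exon = c(0L, 1L, 0L),
    stringsAsFactors = FALSE
)

## Label each pre-miRNA/transcript pair by where the pre-miRNA lies.
host_tx$location <- ifelse(
    host_tx$in_exon > 0, "exonic",
    ifelse(host_tx$in_intron > 0, "intronic", "neither")
)
host_tx
```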

hosttxBy

Returns a CompressedSplitDataFrameList of DataFrames or a list of data.frames with the names of the list being the ids by which the host transcripts are grouped (e.g. pre-miRNA ids) and the elements of the list being the host transcript entries. Note that by default empty elements are dropped (see parameter drop.empty).

The method returns a SplitDataFrameList (list of DataFrames) or a list of data.frames depending on the value of the parameter return.type ("DataFrame" or "data.frame"). The results are ordered by the value of the by parameter.

probesets

Returns microarray probe sets which were found to target the host transcripts. Note that probe sets for different microarrays can be stored in the database, so it might be advisable to use an ArrayFilter to restrict the result to probe sets for one specific microarray (use listArrays to get an overview of all microarrays for which probe sets are available).

The method returns a DataFrame or data.frame depending on the value of the return.type argument ("DataFrame" or "data.frame"). Entries in the returned object are ordered according to the parameter order.by, NOT by the order of values in any filter objects supplied.

probesetsBy

Returns microarray probe sets grouped by the column specified with the argument by.

The method returns a SplitDataFrameList (list of DataFrames) or a list of data.frames depending on the value of the parameter return.type ("DataFrame" or "data.frame"). The results are ordered by the value of the by parameter.

The default grouping of transcripts or genes for hosttxBy and hostgenesBy is by pre_mirna_algn (i.e. the alignment ID of the pre-miRNA). Pre-miRNAs might have multiple genomic alignments, so grouping by pre-miRNA instead could combine transcripts or genes that are encoded on different chromosomes.

For the matmirnas, premirnas, hostgenes and hosttx methods the internal SQL call uses a left join starting from the respective table (e.g. "mature_mirna" for matmirnas). It therefore returns all entries from that table, with NAs for columns from other tables if no value from those tables is linked to an entry in the first table. As a result, a call to premirnas with columns set to "pre_mirna_name" and "tx_id" will return the names of all pre-miRNAs and the IDs of their respective putative host transcripts, or NA if none was defined. A call to hosttx with the same columns will, however, return fewer results from the database, as pre-miRNAs without a specified host transcript are not returned (see the last example).
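
The asymmetry of the left join can be reproduced with base R's merge on two mock tables. The table contents are invented, and merge with all.x=TRUE stands in for the SQL left join; this is a sketch of the principle, not of the package's actual query.

```r
## Mock tables (invented names): only "mir-a" has host transcripts.
pre_mirna <- data.frame(
    pre_mirna_name = c("mir-a", "mir-b", "mir-c"),
    stringsAsFactors = FALSE
)
host_tx <- data.frame(
    tx_id = c("T1", "T2"),
    pre_mirna_name = c("mir-a", "mir-a"),
    stringsAsFactors = FALSE
)

## Left join starting from the pre-miRNA table: every pre-miRNA is kept.
from_pre <- merge(pre_mirna, host_tx, by = "pre_mirna_name", all.x = TRUE)
## Left join starting from the host transcript table: only linked rows.
from_tx <- merge(host_tx, pre_mirna, by = "pre_mirna_name", all.x = TRUE)

nrow(from_pre)
nrow(from_tx)
sum(is.na(from_pre$tx_id))
```

The first join keeps "mir-b" and "mir-c" with an NA transcript ID, while the second contains only the two rows for "mir-a".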

In functions matmirnasBy, premirnasBy, hostgenesBy and hosttxBy, the internal left join starts from the database table in which the attribute (column) specified with the by argument is defined. As a consequence, entries for which the column specified with by is empty are NOT returned. To get all entries from the database, the methods matmirnas, premirnas, hostgenes and hosttx can be used instead, adding additional column names to the columns argument.
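
The consequence for the grouping methods can be sketched the same way: when the join starts from the table that defines the grouping attribute, entries without a value for that attribute never enter the result. The mock tables and column names below (such as mirfam_id) are assumptions made for illustration only.

```r
## Mock tables (invented): "mir-b" belongs to no miRNA family.
mirfam <- data.frame(
    mirfam_id = "F1",
    mirfam_name = "fam1",
    stringsAsFactors = FALSE
)
pre_mirna <- data.frame(
    pre_mirna_name = c("mir-a", "mir-b", "mir-c"),
    mirfam_id = c("F1", NA, "F1"),
    stringsAsFactors = FALSE
)

## Left join starting from the table of the grouping attribute.
joined <- merge(mirfam, pre_mirna, by = "mirfam_id", all.x = TRUE)
grouped <- split(joined, joined$mirfam_name)

names(grouped)
grouped[["fam1"]]$pre_mirna_name
```

"mir-b" is absent from the grouped result; starting from the pre-miRNA table instead would keep it with an NA family.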

Johannes Rainer

MirhostDb, listColumns, listTables, makeMirhostgenesPackage, PositionFilter

library(MirhostDb.Hsapiens.v75.v20)

## define a "shortcut" to the database
Mhdb <- MirhostDb.Hsapiens.v75.v20

##***************************************
##
##  mature miRNAs
##
##***************************************

## Simply get all mature miRNAs; the result is, however, not a unique list of miRNAs,
## since miRNAs from pre-miRNAs with multiple genomic alignments are listed in
## multiple rows.
MatMir <- matmirnas(Mhdb)
MatMir
length(unique(MatMir$mat_mirna_id))

## Get mat_mirna and pre_mirna entries for mature miRNA MIMAT0000062.
MatMir <- matmirnas(Mhdb,
                    columns=unique(c(listColumns(Mhdb, "mat_mirna"),
                        listColumns(Mhdb, "pre_mirna"))),
                    filter=list(MatMirnaIdFilter("MIMAT0000062")))
MatMir
## The same mature miRNA is encoded in 3 different pre-miRNAs.

## Get all mature miRNAs along with their pre-miRNAs in which they are encoded
## and their sequence.
MatMir <- matmirnas(Mhdb, columns=c("mat_mirna_id", "mat_mirna_name",
                              "pre_mirna_name", "seq_name", "sequence"))
MatMir
length(unique(MatMir$mat_mirna_id))
length(unique(MatMir$pre_mirna_name))

## Get all mature miRNAs along with the potential host gene in which they are encoded.
MatMir <- matmirnas(Mhdb, columns=c("mat_mirna_id", "mat_mirna_name",
                              "seq_name", "gene_id", "gene_name", "gene_biotype"))
MatMir
## The mature miRNAs present in host genes.
MatMir.inhg <- MatMir[ !is.na(MatMir$gene_id), ]
MatMir.nohg <- MatMir[ is.na(MatMir$gene_id), ]

MatMir.inhg
## However, a considerable number of these "host genes" are actually the pre-miRNAs themselves,
## some of which are stored in the Ensembl database as "gene" with the biotype "miRNA".
table(MatMir.inhg$gene_biotype)

## Now, get all mature miRNAs for which the gene_biotype!=miRNA.
MatMir <- matmirnas(Mhdb, columns=c("mat_mirna_id", "mat_mirna_name",
                              "seq_name", "gene_id", "gene_name", "gene_biotype"),
                    filter=list(GeneBiotypeFilter("miRNA", condition="!=")))
MatMir
sum(is.na(MatMir$gene_biotype))
table(MatMir$gene_biotype)

## Get all mature miRNAs as GRanges.
matmirnas(Mhdb, return.type="GRanges")

## Get all mature miRNAs that are encoded in more than one pre-miRNA.
matmirnasInMultiplePremirnas(Mhdb)

##***************************
## matmirnasBy
## Get all mature miRNAs grouped by pre-miRNA.
matmirnasBy(Mhdb, by="pre_mirna")

## Get all mature miRNAs grouped by mirfam as GRanges.
matmirnasBy(Mhdb, by="mirfam", return.type="GRanges")

## Get mature miRNAs for pre-miRNA miR-16-1 and miR-16-2.
matmirnasBy(Mhdb,
            filter=list(PreMirnaFilter(c("hsa-mir-16-2", "hsa-mir-16-1"))))



##***************************************
##
##  pre-miRNAs
##
##***************************************

## Get all pre-miRNAs.
PreMir <- premirnas(Mhdb)
PreMir
length(unique(PreMir$pre_mirna_name))

## Get all pre-miRNAs as GRanges.
premirnas(Mhdb, return.type="GRanges")

## Get all pre-miRNAs along with their miRNA family and their sequence.
## Since we don't ask for the pre_mirna_seq_start and end we get a
## unique table of pre-miRNAs.
PreMir <- premirnas(Mhdb, columns=c("pre_mirna_name", "mirfam_name",
                              "sequence"))
PreMir

## We have some pre-miRNAs without family
sum(is.na(PreMir$mirfam_name))
## but none without sequence.
sum(is.na(PreMir$sequence))

## Get all pre-miRNAs with multiple genomic alignments.
premirnasWithMultipleAlignments(Mhdb)

##***************************
## premirnasBy
## Get the pre-miRNAs by the mature_mirna.
PB <- premirnasBy(Mhdb, by="mat_mirna")

## Also request additional columns and fetch all pre-miRNAs for host gene SMC4:
premirnasBy(Mhdb, columns=c("pre_mirna_name", "sequence", "mirfam_name",
                      "mat_mirna_name"), filter=list(GenenameFilter("SMC4")))

## Get all pre-miRNAs by host_gene SMC4.
premirnasBy(Mhdb, by="host_gene", filter=list(GenenameFilter("SMC4")))


## Get all pre-miRNAs by host_gene SMC4 as GRanges.
premirnasBy(Mhdb, by="host_gene", filter=list(GenenameFilter("SMC4")),
            return.type="GRanges")


##***************************************
##
##  host transcripts
##
##***************************************

## Get all host transcripts from the database.
HT <- hosttx(Mhdb)
HT
nrow(HT)
## The same host_tx might be the host for multiple miRNAs, thus we do have non-unique tx_ids.
length(unique(HT$tx_id))

## Get a unique table of host transcripts.
HT <- hosttx(Mhdb, columns=c("tx_id", "tx_biotype", "gene_id"))
HT
nrow(HT)
length(unique(HT$tx_id))

## Get the host transcripts along with the corresponding gene.
HT <- hosttx(Mhdb, columns=c("tx_id", "in_intron", "in_exon", "gene_id",
                       "gene_name", "entrezid", "database"))
HT
## In what databases are these transcripts defined?
table(HT$database)
nrow(HT)

## Note that the information from the various databases is redundant
## (e.g. the same gene can be defined in the Ensembl core database as
## well as in the NCBI RefSeq database, whose genes are provided through
## the Ensembl otherfeatures database).
## To avoid getting redundant entries it is possible to use a
## DatabaseFilter:
HT <- hosttx(Mhdb, columns=c("tx_id", "in_intron", "in_exon", "gene_id",
                       "gene_name", "entrezid", "database"),
             filter=list(DatabaseFilter("core")))
HT
nrow(HT)



## Include now also the pre_mirna ids.
HT <- hosttx(Mhdb, columns=c("tx_id", "in_intron", "in_exon", "gene_id",
                       "gene_name", "entrezid", "database",
                       "pre_mirna_id", "pre_mirna_name"))
HT
nrow(HT)
## We have now more rows, since different pre-miRNAs might be
## associated with the same host_tx.
length(unique(HT$tx_id))


##***************************
## hosttxBy
## Get the host transcripts by the pre-miRNA.
## This will automatically drop empty entries, i.e. pre-miRNAs for which
## no host transcript was defined.
HT <- hosttxBy(Mhdb, by="pre_mirna", columns=c("tx_id", "tx_biotype",
                                         "in_intron", "in_exon",
                                         "pre_mirna_name"))
HT

## To get all of them we can set drop.empty=FALSE.
HT <- hosttxBy(Mhdb, by="pre_mirna",
               columns=c("tx_id", "tx_biotype", "in_intron", "in_exon",
                   "pre_mirna_name"), drop.empty=FALSE)
HT

## There are however also some without any entries:
empties <- unlist(lapply(HT, function(z){ return(all(is.na(z$tx_id))) }))
sum(empties)
HT[ empties ]

## Host transcripts by gene.
HT <- hosttxBy(Mhdb, by="host_gene")
HT



##***************************************
##
##  host genes
##
##***************************************

## With the host genes it is just the same as above.
HG <- hostgenes(Mhdb)
HG
length(unique(HG$gene_id))
nrow(HG)


##***************************
## hostgenesBy
## Get the host genes by the pre-miRNA.
HG <- hostgenesBy(Mhdb, by="pre_mirna")
HG

## Get host genes by mirfam.
HG <- hostgenesBy(Mhdb, by="mirfam",
                  columns=c("gene_id", "gene_name", "mirfam_name"))
HG



##***************************************
##
##  probe sets
##
##***************************************

## First get a list of microarrays for which probe sets are available.
listArrays(Mhdb)

AF <- ArrayFilter("HG-U133_Plus_2")

## Get all probe sets from the database along with the gene name and
## the pre-miRNA name.
PS <- probesets(Mhdb, columns=c(listColumns(Mhdb, "array_feature" ),
                          "gene_name", "pre_mirna_name"), filter=list(AF))
PS

## Get all probe sets grouped by pre-miRNA name.
PS <- probesetsBy(Mhdb, by="pre_mirna", use.names=TRUE, filter=list(AF))
PS


##***************************************
##
##  The effect of the left join
##
##***************************************
## Get all pre-miRNAs and the ID of the host transcript.
fromPre <- premirnas(Mhdb, columns=c("pre_mirna_name", "tx_id"))
## Get the same columns, but starting from table "host_tx"
fromTx <- hosttx(Mhdb, columns=c("pre_mirna_name", "tx_id"))
## We have fewer rows for the latter query.
nrow(fromPre)
nrow(fromTx)

## The reason is that pre-miRNAs without a host transcript are not returned
## by the second query, while they are by the first.
sum(is.na(fromPre$tx_id))
sum(is.na(fromTx$tx_id))