README.md

Table of Contents

miRTarBase R package

This packages provides the miRNA-target gene interactions (MTI) from the miRTarBase.

Installation

Clone the git repository and build the package using:

R CMD build mirtarbase-db

To install the package call:

R CMD INSTALL mirtarbase.db_*

After that you can use the package as any other R package. The package provides also a vignette with some examples that can be opened, after loading the library in R (library(mirtarbase.db)), by the openVignette() command.

Create the database file

Using the code in this section we create the SQLite database for the package. The workflow is the following: first we have to download the XLS spreadsheet from the miRTarBase web site, which we have to open in Excel and export as tabulator delimited text file. We will then read this file into R in order format it for our requirements (i.e. rename the column names) and we store this file than into a SQLite database.

Assuming that the Excel file has been downloaded from http://mirtarbase.mbc.nctu.edu.tw/php/download.php and has been exported as a tabulator delimited txt file to orig_files/txt. Next we read this file, rename the columns and export it again as a tab delimited text file (over-writing the original one).

It is a little more complicated to process this file due to the following problems:

All in all, we hope that we don't encounter some more issues in the next release.

library( RSQLite )
## the XLS file contains ' and " inside cells, thus have to set qupte=""
MTI <- read.table( "orig_files/txt/miRTarBase_MTI.txt", sep="\t", as.is=TRUE, header=TRUE, check.names=FALSE, quote="" )
## re-formating column names.
CN <- tolower( colnames( MTI ) )
CN <- gsub( CN, pattern="\"", replacement="" )
CN <- gsub( CN, pattern="[(|)]", replacement="" )
CN <- gsub( CN, pattern=" ", replacement="_", fixed=TRUE )
colnames( MTI ) <- CN
## now I want to get rid of ALL \", ", ' in the table!
character.CN <- CN[ !CN %in% c( "target_gene_entrez_gene_id", "references_pmid" ) ]
## loop through all these columns and replace "\""
for( current.CN in character.CN ){
    MTI[ , current.CN ] <- gsub( MTI[ , current.CN ], pattern="\"", replacement="" )
}
## now, replace all remaining " and ' in experiments
MTI[ , "experiments" ] <- gsub( MTI[ , "experiments" ], pattern="'", replacement="", fixed=TRUE )
## fixing the WeaK thing.
MTI[ , "support_type" ] <- gsub( MTI[ , "support_type" ], pattern="WeaK", replacement="Weak", fixed=TRUE )
## fix NA entrezids
which.NAs <- which( is.na( MTI$target_gene_entrez_gene_id ) )
## for all of these, check if there is another gene, same species.
for( idx in which.NAs ){
    tmp <- MTI[ MTI$target_gene==MTI[ idx, "target_gene" ] &
                   MTI$species_target_gene==MTI[ idx, "species_target_gene" ], , drop=FALSE ]
    if( length( tmp[ !is.na( tmp$target_gene_entrez_gene_id ), "target_gene_entrez_gene_id" ] ) > 0 ){
        MTI[ idx, "target_gene_entrez_gene_id" ] <-
            unique( tmp[ !is.na( tmp$target_gene_entrez_gene_id ), "target_gene_entrez_gene_id" ] )
    }
}
## exporting the file again.
## write.table( MTI, file="orig_files/txt/miRTarBase_MTI.txt", sep="\t", row.names=FALSE )

We could also create the database in a cleaner way, i.e. as a real relational database with a table for the miRNA target gene interaction, a table with the evidence (Report) and one relating each other. The question remains whether that would improve it's performance. Note that each publication can have more than one report.

## want to know whether the "report" is unique
## (i.e. pubmed id, experiments and support type) across all miRNAs are specific for a miRNA.
Report <- data.frame( matrix( ncol=4, nrow=0 ), stringsAsFactors=FALSE )
colnames( Report ) <- c( "report_id", "experiments", "support_type", "references_pmid" )
mti2report <- matrix( ncol=2, nrow=0 )
colnames( mti2report ) <- c( "mti_id", "report_id" )
for( i in 1:nrow( MTI ) ){
    current.rep <- MTI[ i, c( "experiments", "support_type", "references_pmid" ) ]
    ## check if it's in Report
    idx <- which( rownames( Report )==paste0( current.rep[ 1, ], collapse="-" ) )
    if( length( idx ) > 0 ){
        ## have it already
        #cat( "have already the report\n" )
        report.id <- Report[ idx, "report_id" ]
    }else{
        rownames( current.rep ) <- paste0( current.rep[ 1, ], collapse="-" )
        report.id <- paste0( "MTIR", sprintf( "%04d", ( nrow( Report )+1 ) ) )
        current.rep <- cbind( report_id=report.id, current.rep, stringsAsFactors=FALSE )
        Report <- rbind( Report, current.rep )
    }
    ## add it to mti2report.
    mti2report <- rbind( mti2report, c( MTI[ i, 1 ], report.id ) )
}
## at last define the MTI table.
MTI.table <- unique( MTI[ , !( colnames( MTI ) %in% colnames( Report ) ) ] )

## ok, now we do have all tables.
con <- dbConnect( dbDriver( "SQLite" ), dbname="inst/extdata/db/mirtarbase_rel.db" )
if( dbExistsTable( con, "mti" ) ){
    dbRemoveTable( con, "mti" )
}
if( dbExistsTable( con, "report" ) ){
    dbRemoveTable( con, "report" )
}
if( dbExistsTable( con, "mti2report" ) ){
    dbRemoveTable( con, "mti2report" )
}
## mti
dbWriteTable( con, name="mti", MTI.table, row.names=FALSE )
dbGetQuery( con, "create index mirna_idx on mti (mirna);" )
dbGetQuery( con, "create index target_gene_idx on mti (target_gene);" )
dbGetQuery( con, "create index target_gene_entrez_idx on mti (target_gene_entrez_gene_id);" )
## report
dbWriteTable( con, name="report", Report, row.names=FALSE )
dbGetQuery( con, "create index report_id_idx on report (report_id);" )
## mti2report
dbWriteTable( con, name="mti2report", data.frame( mti2report, stringsAsFactors=FALSE ), row.names=FALSE )
dbGetQuery( con, "create index mti2report_report_idx on mti2report (report_id);" )
dbGetQuery( con, "create index mti2report_mti_idx on mti2report (mti_id);" )

dbDisconnect( con )

Next we create the database for the package. This requires a working installation of SQLite.

con <- dbConnect( dbDriver( "SQLite" ), dbname="inst/extdata/db/mirtarbase.db" )
if( dbExistsTable( con, "mirtarbase" ) ){
    dbRemoveTable( con, "mirtarbase" )
}
dbWriteTable( con, name="mirtarbase", MTI, row.names=FALSE )
dbGetQuery( con, "create index mirna_idx on mirtarbase (mirna);" )
dbGetQuery( con, "create index target_gene_idx on mirtarbase (target_gene);" )
dbGetQuery( con, "create index target_gene_entrez_idx on mirtarbase (target_gene_entrez_gene_id);" )
dbDisconnect( con )

Development

At present (<2014-07-28>), the miRTarBase can only be exported as a XLS spread sheet. This file contains one line per MTI and publication: - MTI MIRT000140: online: 2 references, XLS sheet: one (one reference missing in XLS sheet). - MTI MIRT001206: online: 8 references, XLS sheet: 3 rows (several rows missing). - MTI MIRT003413: online: 4 references, XLS sheet: 2 rows (2 rows missing): - PMID: 18328430: experiments: Luciferase reporter assay//Microarray//Western blot; evidence: Functional MTI. Online: also Other listed as experiment. - PMID: 19422085: experiments: Luciferase reporter assay//Microarray//qRT-PCR//Western blot; evidence: Functional MTI. Online: also Other listed as experiment.

Thus, an MTI between the same miRNA and target gene (for the same species!) is stored in several rows in the XLS sheet. Each row seems however be related to one publication, and the field experiments seems to list all experiments performed in that publication.

One question is whether the XLS sheet should be stored as-is to the database, or whether a clean relational database should be created. The benefit from the latter approach would be to allow more specific queries, e.g. all MTIs based on a certain support type.

It would be possible to create a relational database with 3 tables, one describing the MTI, one listing the experiments performed in a publication to validate this MTI and one allowing for possible n:m relations (although it's not clear whether these really exist… yet).

The MTI class

The MTI class represents a miRNA target gene interaction. There should only be one MTI for a miRNA target gene combination, with multiple evidences and eventually multiple publications. The unique identifier for a MTI is the identifier used in the miRTarBase (e.g. MIRT001206).

class MTI L_ class Report L_ experiments: lists all experiments that have been performed. L_ pmid: returns the (PMID) of the report. L_ supportedBy: list support type(s) (evidences). L_ reports: returns the Report(s) of the MTI. L_ experiments: returns all experiments (of the Report(s)). L_ pmids: returns the PMIDs of the Report(s). L_ supportedBy

Changelog:

TODOs

DONE Create the sqlite database file from the tab delimited txt file.

DONE Create a class representing an interaction (?).

What would be the benefits from this?

Did that. That's the MTI object.

DONE Create some helper functions to retrieve information from the DB.

DONE Create a txt file with all informations (version, date etc) from the miRTarBase and read it when calling mirtarbase()

DONE Implement the function to get MTIs for a specified gene.

DONE Create a vignette for this package.

CANCELED Include also the phenomiR database?

This makes only sense, if the phenomiR does provide additional information. Will not do that, since the guys from miRTarBase claim that they provide the most data and include also other databases.

DONE implement a central getMti function.

This function should take either one or more gene or miRNA IDs as input and return a list of MTI objects. Split the data.frame by miRTarBase IDs, make one MTI for each data.frame and nrow Report classes.

DONE Implement functions to create MTI objects from a data.frame (and vice versa). [4/4]

These should be put into convertfunctions.R.

DONE Implement all methods for the Report class [4/4].

DONE Implement all methods for the MTI class [12/12].

DONE Implement a function that retrieves additional miRNA annotations for a MTI.

The idea is to get the pre-miRNA or miRNA family for a given mature miRNA. This should then also be exported if MTIs are exported as a data.frame.

DONE Implent additional functions [2/2]



jotsetung/mirtarbase-db documentation built on May 19, 2019, 9:42 p.m.