In pkrog/biodb: biodb, a library and a development framework for connecting to chemical and biological databases

source(system.file('vignettes_inc.R', package='biodb'))

Introduction

The contents of the database entries, once parsed, are stored by biodb into objects of the class BiodbEntry.

The BiodbEntry class is an RC (Reference Class, also known as R5) class, not an S3 or S4 class. RC instances are never copied implicitly by R. This means that each instance is shared by all parts of your code: if one part of your code modifies or deletes a BiodbEntry object, every other part of your code that holds this object is affected by the modification. See Reference classes and the vignette make_vignette_ref('details') for more explanations.
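
The sharing behaviour of RC instances can be illustrated without biodb. The sketch below uses a toy class, `ToyEntry`, which is a hypothetical stand-in and not part of biodb; only `setRefClass()` and `copy()` from the standard methods package are used.

```r
library(methods)

# ToyEntry is a hypothetical stand-in for BiodbEntry, not a biodb class.
ToyEntry <- setRefClass("ToyEntry",
    fields = list(name = "character"),
    methods = list(
        setName = function(value) {
            name <<- value
            invisible(.self)
        }
    )
)

a <- ToyEntry$new(name = "guanosine")
b <- a                      # no copy is made: a and b are the same object
b$setName("changed via b")
print(a$name)               # the change is visible through a as well

d <- a$copy()               # an explicit copy is independent
d$setName("changed via d")
print(a$name)               # a is unaffected by changes made to d
```

An explicit call to `copy()` is the only way here to get an independent object, which is why modifications to a shared entry should be made deliberately.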

biodb uses identifiers (IDs) to retrieve and manipulate BiodbEntry instances indirectly. Those identifiers are, in case of web server databases, the official accession numbers provided by these databases.

We will see in this vignette how to retrieve entries using a connector, manipulate the fields of an entry, free entry instances from memory and delete their content from the disk cache, search for entries in a database, convert entries into data frames or JSON, copy all entries of a database into a new empty database, and merge the entries of several databases into a single database.

To start we need to instantiate the package main class:

mybiodb <- biodb::BiodbMain$new()

For the demonstration of this vignette, we will use an extract of the ChEBI [@hastings2012_chebi] database, that we have put inside a TSV file.

Here is the TSV file:

chebi.tsv <- system.file("extdata", "chebi_extract.tsv", package='biodb')

And now we create the connector to this file, using the CSV file connector type (comp.csv.file):

chebi <- mybiodb$getFactory()$createConn('comp.csv.file', url=chebi.tsv)

Getting entries

To retrieve entries, we first need to get their identifiers. We can either ask the connector to give us the full list of all entry identifiers:

chebi$getEntryIds()

or get the first n entry IDs:

chebi$getEntryIds(max.results=3)

Another way of getting entry IDs is to search the database using a filter. Here we search for entries by name:

chebi$searchForEntries(list(name='deoxyguanosine'))

Now we search by mass:

chebi$searchForEntries(list(monoisotopic.mass=list(value=283.0917, delta=0.1)))

And finally by both name and mass:

chebi$searchForEntries(list(name='guanosine', monoisotopic.mass=list(value=283.0917, delta=0.1)))
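
To make the filter semantics concrete, here is a minimal base R sketch of what such a search could compute. It assumes that `delta` is an absolute tolerance around `value`, and that the name filter is a case-insensitive substring match; both are assumptions for illustration, not a description of biodb's implementation. The table, its IDs and the helper functions `inMassRange()` and `nameMatches()` are hypothetical.

```r
# Toy table: IDs are placeholders, masses are monoisotopic masses in Da.
toy <- data.frame(
    id = c("id1", "id2", "id3"),
    name = c("guanosine", "2'-deoxyguanosine", "adenosine"),
    monoisotopic.mass = c(283.0917, 267.0968, 267.0968),
    stringsAsFactors = FALSE
)

# Assumed semantics: keep masses inside [value - delta, value + delta].
inMassRange <- function(mass, value, delta) {
    abs(mass - value) <= delta
}

# Assumed semantics: case-insensitive substring match on the name.
nameMatches <- function(name, pattern) {
    grepl(tolower(pattern), tolower(name), fixed = TRUE)
}

byName <- toy$id[nameMatches(toy$name, "guanosine")]
byMass <- toy$id[inMassRange(toy$monoisotopic.mass, 283.0917, 0.1)]
# Combining filters keeps only the rows that satisfy both conditions.
byBoth <- toy$id[nameMatches(toy$name, "guanosine") &
                 inMassRange(toy$monoisotopic.mass, 283.0917, 0.1)]
print(byBoth)
```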

Now that we have identifiers, we can get entry objects. First we choose two identifiers:

ids <- chebi$searchForEntries(list(monoisotopic.mass=list(value=283.0917, delta=0.1)), max.results=2)

Then we get the corresponding list of entry instances:

chebi$getEntry(ids)

Entry fields

The content of an entry is stored inside its fields. To access the values contained in the fields, or information about the fields, you need to call methods on the entry object.

First, we get an entry object:

e <- chebi$getEntry(ids[[1]])

To get a list of all fields having a value inside an entry object, call:

e$getFieldNames()

To get the value of a field, call:

e$getFieldValue('name')

To get all the mass fields, run:

e$getFieldsByType('mass')

If you want more information about a field, you have to access the entry fields instance:

mybiodb$getEntryFields()$get('monoisotopic.mass')

Conversion

Entries may be converted into lists of values, data frames, and JSON.

To convert a single entry into a data frame, run (result in \@ref(tab:entryToDf)):

x <- e$getFieldsAsDataframe()

knitr::kable(head(x), "pipe", caption="Converting an entry to a data frame")

Several options are available to control which fields are output. For instance, you can select the set of fields by their name (result in \@ref(tab:filterByName)):

x <- e$getFieldsAsDataframe(fields=c('name', 'monoisotopic.mass'))

knitr::kable(head(x), "pipe", caption="Selecting fields by names")

or by their type (result in \@ref(tab:filterByType)):

x <- e$getFieldsAsDataframe(fields.type='mass')

knitr::kable(head(x), "pipe", caption="Selecting fields by type")

In case of entries with fields that contain multiple values, other options are useful. This is the case for mass spectrum entries. If we get an entry from an extract of Massbank [@horai2010_massbank]:

massSqliteFile <- system.file("extdata", "generated", "massbank_extract_full.sqlite", package='biodb')
massbank <- mybiodb$getFactory()$createConn('mass.sqlite', url=massSqliteFile)
massbankEntry <- massbank$getEntry('KNA00776')

we can select the fields of cardinality one only (result in \@ref(tab:filterCardOne)):

x <- massbankEntry$getFieldsAsDataframe(only.card.one=TRUE)

knitr::kable(head(x), "pipe", caption="Selecting fields with only one value")

or get all the fields, in which case fields with more than one value will have their values concatenated into a string using a default separator (result in \@ref(tab:concatenated)):

x <- massbankEntry$getFieldsAsDataframe(only.card.one=FALSE)

knitr::kable(head(x), "pipe", caption="Concatenate multiple values")

It is also possible to get one value per line for fields with cardinality greater than one (result in \@ref(tab:onePerLine)):

x <- massbankEntry$getFieldsAsDataframe(only.card.one=FALSE, flatten=FALSE)

knitr::kable(head(x), "pipe", caption="Output one value per row")

And we can limit the number of values for each field (result in \@ref(tab:oneValuePerField)):

x <- massbankEntry$getFieldsAsDataframe(only.card.one=FALSE, limit=1)

knitr::kable(head(x), "pipe", caption="Output only one value for each field")
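
The three ways of handling multi-valued fields described above (concatenating, one value per row, limiting the number of values) can be pictured with plain base R. This is only a sketch on a made-up entry stored as a list; the ";" separator is an assumption and not necessarily the default used by biodb.

```r
# Toy entry: the field names mimic biodb style, the values are made up.
toyEntry <- list(
    accession = "X1",
    peak.mz = c(100.5, 200.25, 300.125)
)

# One row per entry: multi-valued fields are pasted into a single string.
# The ";" separator is an assumption for this sketch.
concatenated <- as.data.frame(
    lapply(toyEntry, function(v) paste(v, collapse = ";")),
    stringsAsFactors = FALSE
)

# One row per value: the single-valued field is repeated on each row.
onePerRow <- data.frame(
    accession = toyEntry$accession,
    peak.mz = toyEntry$peak.mz,
    stringsAsFactors = FALSE
)

# At most one value per field: keep only the first value.
limited <- as.data.frame(
    lapply(toyEntry, function(v) head(v, 1)),
    stringsAsFactors = FALSE
)

print(concatenated)
print(onePerRow)
print(limited)
```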

A list of several entries can also be converted into a data frame (result in \@ref(tab:entriesToDf)):

entries <- chebi$getEntry(chebi$getEntryIds(max.results=3))
x <- mybiodb$entriesToDataframe(entries)

knitr::kable(head(x), "pipe", caption="Converting a list of entries into a data frame")

or to JSON:

mybiodb$entriesToJson(entries)

Memory usage

Each time you call the getEntry() method, biodb first checks whether the entries you requested are already in memory. If this is the case, it returns them; otherwise it looks into the cache on disk for downloaded contents. If the entry contents have never been downloaded, the connector contacts the database to get the missing contents and saves them into the cache. From the contents, biodb creates the corresponding BiodbEntry objects.
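
The lookup order just described (memory, then disk cache, then the remote database) can be sketched in a few lines of base R. Everything in this sketch is hypothetical: two environments play the role of the memory and disk caches, `fetchRemote()` stands in for a download, and `getToyEntry()` is not a biodb function.

```r
# All names below are hypothetical; this is not biodb's implementation.
memCache <- new.env()     # entry objects held in memory
diskCache <- new.env()    # stands in for content files in the cache folder
state <- new.env()
state$downloads <- 0      # counts simulated remote requests

fetchRemote <- function(id) {
    state$downloads <- state$downloads + 1
    paste("content of", id)
}

getToyEntry <- function(id) {
    # 1. Return the entry object if it is already in memory.
    if (exists(id, envir = memCache, inherits = FALSE))
        return(get(id, envir = memCache))
    # 2. Otherwise look for the content on disk, downloading it if missing.
    if (!exists(id, envir = diskCache, inherits = FALSE))
        assign(id, fetchRemote(id), envir = diskCache)
    content <- get(id, envir = diskCache)
    # 3. Build the entry object from the content and keep it in memory.
    entry <- list(id = id, content = content)
    assign(id, entry, envir = memCache)
    entry
}

e1 <- getToyEntry("toy1")                    # triggers one download
e2 <- getToyEntry("toy1")                    # served from memory
rm(list = ls(memCache), envir = memCache)    # free the memory cache
e3 <- getToyEntry("toy1")                    # rebuilt from the disk cache
print(state$downloads)
```

In this model, emptying the memory cache does not trigger a new download, whereas emptying the disk cache would.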

You may want either to free memory by removing entry objects from it, or to delete entry contents from the cache in order to download more recent versions of the entries. To remove entries from memory, run:

chebi$deleteAllEntriesFromVolatileCache()

To remove the entry content files from the cache folder, run:

chebi$deleteAllEntriesFromPersistentCache()

To remove all cache files attached to a connector, run:

chebi$deleteWholePersistentCache()

This will also delete the cached results of all HTTP requests and all downloads, including the possible download of the whole database, thus forcing biodb to download the data from the database again.

Copy

Entry objects from any connector can be copied into a writable connector.

If we create a new connector to a SQLite file that does not exist:

sqliteOutputFile <- tempfile(pattern="biodb_copy_entries_new_db", fileext='.sqlite')
newDbConn <- mybiodb$getFactory()$createConn('comp.sqlite', url=sqliteOutputFile)

And allow modifications for this connector:

newDbConn$allowEditing()
newDbConn$allowWriting()

We can copy all entries from another connector into it:

mybiodb$copyDb(chebi, newDbConn)

And finally write the entries into the SQLite file:

newDbConn$write()

Merging databases

In this section we will merge entries from three different databases into a single database.

For the demonstration we will use the ChEBI connector already created, and create two other connectors.

A connector to the Uniprot [@uniprotConsortium2016UniProtKB] database:

uniprot.tsv <- system.file("extdata", "uniprot_extract.tsv", package='biodb')
uniprot <- mybiodb$getFactory()$createConn('comp.csv.file', url=uniprot.tsv)

A connector to the ExPASy enzyme [@bairoch2000_expasy] database:

expasy.tsv <- system.file("extdata", "expasy_enzyme_extract.tsv", package='biodb')
expasy <- mybiodb$getFactory()$createConn('comp.csv.file', url=expasy.tsv)

Merging the entries

We will now merge the entries into a single database. However, we will not use the entries of the three databases in the same way. The ChEBI and Uniprot entries will just be put together, since there is no link between them. The ExPASy entries, in contrast, will be used to add missing fields to the Uniprot entries. We are able to do that because the Uniprot entries have a field 'expasy.enzyme.id' that links them to the ExPASy entries.

We will write a function that takes a Uniprot entry, searches for the ExPASy entry it references, and takes the missing fields from it:

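completeUniprotEntry <- function(e) {
    # ID of the ExPASy entry referenced by this Uniprot entry, if any
    expasy.id <- e$getFieldValue('expasy.enzyme.id')
    # Test the length first so that && always receives a single logical
    if (length(expasy.id) == 1 && ! is.na(expasy.id)) {
        ex <- expasy$getEntry(expasy.id)
        if ( ! is.null(ex)) {
            # Copy the fields of interest from the ExPASy entry
            for (field in c('catalytic.activity', 'cofactor')) {
                v <- ex$getFieldValue(field)
                # Skip fields that are empty or contain only NA values
                if (length(v) > 0 && ! all(is.na(v)))
                    e$setFieldValue(field, v)
            }
        }
    }
}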

Remember that biodb uses the RC (Reference Classes, or R5) OOP model. This means that we work with references to objects, so an instance modified inside the function is also modified for the caller, and the function does not need to return anything.

Now we will get all entries from Uniprot and run the function to complete all entries:

uniprot.entries <- uniprot$getEntry(uniprot$getEntryIds())
invisible(lapply(uniprot.entries, completeUniprotEntry))
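
The pattern above, following a link field and filling in missing values in place, can be reproduced on toy objects. In this sketch `ToyLinkedEntry`, `completeToyEntry()`, the field names and all values are hypothetical; only the method names `getFieldValue()` and `setFieldValue()` are borrowed from the text, and their behaviour here is a simplification.

```r
library(methods)

# ToyLinkedEntry is a hypothetical stand-in for BiodbEntry.
ToyLinkedEntry <- setRefClass("ToyLinkedEntry",
    fields = list(values = "list"),
    methods = list(
        getFieldValue = function(field) {
            # Return NA for a field that has no value (a simplification)
            if (field %in% names(values)) values[[field]] else NA
        },
        setFieldValue = function(field, value) {
            values[[field]] <<- value
            invisible(.self)
        }
    )
)

# Made-up "enzyme" entries, indexed by ID.
enzymes <- list(
    E1 = ToyLinkedEntry$new(values = list(cofactor = "made-up cofactor"))
)

# Made-up "protein" entries; only the first one links to an enzyme.
proteins <- list(
    ToyLinkedEntry$new(values = list(accession = "P1", enzyme.id = "E1")),
    ToyLinkedEntry$new(values = list(accession = "P2"))
)

completeToyEntry <- function(e) {
    id <- e$getFieldValue("enzyme.id")
    if (length(id) == 1 && !is.na(id) && id %in% names(enzymes))
        e$setFieldValue("cofactor", enzymes[[id]]$getFieldValue("cofactor"))
}

# The return value is discarded: the entries are modified in place.
invisible(lapply(proteins, completeToyEntry))
print(proteins[[1]]$getFieldValue("cofactor"))
print(proteins[[2]]$getFieldValue("cofactor"))
```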

Finally we get all entries from our ChEBI extract, merge all our entries into a single data frame and save it in a file (see content in \@ref(tab:mergedData)):

chebi.entries <- chebi$getEntry(chebi$getEntryIds())
all.entries.df <- mybiodb$entriesToDataframe(c(chebi.entries, uniprot.entries))
output.file <- tempfile(pattern="biodb_merged_entries", fileext='.tsv')
write.table(all.entries.df, file=output.file, sep="\t", row.names=FALSE)

knitr::kable(head(all.entries.df), "pipe", caption="Merged data")
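
Entries coming from different databases do not share the same set of fields, so a merged data frame necessarily contains empty cells. The base R sketch below shows one way such a merge can be done, by filling the missing columns with NA before stacking the rows. It is an illustration on made-up data, with a hypothetical helper `bindFill()`, and does not claim to reflect how entriesToDataframe() works internally.

```r
# Made-up rows standing in for data frames built from two databases.
compoundDf <- data.frame(accession = "C1", formula = "H2O",
                         stringsAsFactors = FALSE)
proteinDf <- data.frame(accession = "P1", gene.symbol = "ABC",
                        stringsAsFactors = FALSE)

# Stack data frames whose columns differ, filling the gaps with NA.
bindFill <- function(...) {
    dfs <- list(...)
    cols <- unique(unlist(lapply(dfs, names)))
    filled <- lapply(dfs, function(df) {
        for (col in setdiff(cols, names(df)))
            df[[col]] <- NA
        df[, cols, drop = FALSE]
    })
    do.call(rbind, filled)
}

merged <- bindFill(compoundDf, proteinDf)

# Save as TSV, as in the text, and read the file back.
toyFile <- tempfile(fileext = ".tsv")
write.table(merged, file = toyFile, sep = "\t", row.names = FALSE)
reread <- read.table(toyFile, sep = "\t", header = TRUE,
                     stringsAsFactors = FALSE)
print(merged)
```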

Use a writable database

Instead of building the data frame, we could have used a writable database as seen earlier. Here is a new file database for which we enable editing (for inserting new entries) and writing (for saving it to disk):

newDbOutputFile <- tempfile(pattern="biodb_merged_entries_new_db", fileext='.tsv')
newDbConn <- mybiodb$getFactory()$createConn('comp.csv.file', url=newDbOutputFile)
newDbConn$allowEditing()
newDbConn$allowWriting()

Now we copy entries into this new database:

mybiodb$copyDb(chebi, newDbConn)
mybiodb$copyDb(uniprot, newDbConn)

And finally we write the database:

newDbConn$write()

Closing biodb instance

Do not forget to terminate your biodb instance once you are done with it:

mybiodb$terminate()

Session information

sessionInfo()

References

pkrog/biodb documentation built on Nov. 29, 2022, 4:24 a.m.
