regutools: an R package for the extraction of gene regulatory networks from RegulonDB

knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)
## For links
library("BiocStyle")

## Track time spent on making the vignette
startTime <- Sys.time()

## Bib setup
library("RefManageR")

## Write bibliography information
bib <- c(
    R = citation(),
    AnnotationDbi = citation("AnnotationDbi"),
    AnnotationHub = citation("AnnotationHub"),
    BiocFileCache = citation("BiocFileCache"),
    BiocStyle = citation("BiocStyle"),
    Biostrings = citation("Biostrings"),
    DBI = citation("DBI"),
    GenomicRanges = citation("GenomicRanges"),
    Gviz = citation("Gviz"),
    IRanges = citation("IRanges"),
    knitr = citation("knitr")[3],
    RCy3 = citation("RCy3"),
    RefManageR = citation("RefManageR")[1],
    regutools = citation("regutools")[1],
    regutoolsPaper = citation("regutools")[2],
    rmarkdown = citation("rmarkdown")[1],
    RSQLite = citation("RSQLite"),
    S4Vectors = citation("S4Vectors"),
    sessioninfo = citation("sessioninfo"),
    testthat = citation("testthat")
)

Basics

Install regutools

R is an open-source statistical environment which can be easily modified to enhance its functionality via packages. r Biocpkg('regutools') is a R package available via Bioconductor. R can be installed on any operating system from CRAN after which you can install r Biocpkg('regutools') by using the following commands in your R session:

if (!requireNamespace("BiocManager", quietly = TRUE)) {
      install.packages("BiocManager")
  }

BiocManager::install("regutools")

## Check that you have a valid Bioconductor installation
BiocManager::valid()

Required knowledge

r Biocpkg('regutools') is based on many other packages and in particular in those that have implemented the infrastructure needed for dealing with genomic and sequence data. That is, packages like r Biocpkg('Biostrings') that allow you to work with sequences and r Biocpkg('GenomicRanges') for data on genomic coordinates. A r Biocpkg('regutools') user is not expected to deal with those packages directly but will need to be familiar with them to understand the results r Biocpkg('regutools') generates. Furthermore, it'll be useful for the user to know the syntax of r Biocpkg('AnnotationHub') r Citep(bib[['AnnotationHub']]) in order to query and load the data provided by this package.

If you are asking yourself the question "Where do I start using Bioconductor?" you might be interested in this blog post.

Asking for help

As package developers, we try to explain clearly how to use our packages and in which order to use the functions. But R and Bioconductor have a steep learning curve so it is critical to learn where to ask for help. The blog post quoted above mentions some but we would like to highlight the Bioconductor support site as the main resource for getting help regarding Bioconductor. Other alternatives are available such as creating GitHub issues and tweeting. However, please note that if you want to receive help you should adhere to the posting guidelines. It is particularly critical that you provide a small reproducible example and your session information so package developers can track down the source of the error.

Citing regutools

We hope that r Biocpkg('regutools') will be useful for your research. Please use the following information to cite the package and the overall approach. Thank you!

## Citation info
citation("regutools")

Overview

Escherichia coli K-12 (E. coli) is the best bacterial organism studied to date. Thousands of papers have been published using E. coli as a model system asking how genes are regulated. The throughput of these experiments range from single-gene studies to whole-genome approaches. Twenty years ago, the database RegulonDB started collecting, curating and organizing this information into a centralized resource. Thanks to this huge efforts, researchers have had an easy way to access all these data in a database, facilitating the advancements of complete fields, such as systems biology.

The analysis of high-throughput experiments -such as RNA-seq or ChIP-seq- often requires the integration of databases such as RegulonDB in order to give biological interpretations to these data. The regutools package is designed to facilitate such integration by providing programmatic access to RegulonDB within the R environment r Citep(bib[['regutoolsPaper']]). The package retrieves information from the RegulonDB database into Bioconductor objects, ready for downstream analyses.

The package defines the object regulondb, which is a data structure that contains the path to a SQLite database retrieved from RegulonDB along with metadata such as database version and reference genome. The function connect_database() will retrieve the latest version of the database and connect to it. It will download the database using r Biocpkg('AnnotationHub') or a backup mechanism if necessary. The regutools package contains functions with the most popular queries to regutools, such as retrieving information of which gene targets are regulated by a transcription factor. But users can also design queries that are specific to their analyses. This vignette describes how to use the provided functions and how to design programmatic queries to regutools. The general syntax of the function calls of this package is result <- functionCalled( regulondb, arguments ).

The regulondb object

The regulondb object is an extension of an SQLiteConnection class that host a connection to a database with the table structure defined in the RegulonDB database. It contains additional slots that specify the organism, genome version and database version. The function regulondb() is the constructor function of regulondb objects. This function receives as input a file path to the database file as well as information about the annotation as character vectors.

library("regutools")

## Other packages used
library("Biostrings")

## Connect to the RegulonDB database
regulondb_conn <- connect_database()

## Build a regulondb object
e_coli_regulondb <-
    regulondb(
        database_conn = regulondb_conn,
        organism = "E.coli",
        database_version = "1",
        genome_version = "1"
    )

e_coli_regulondb

In order to get an overview of the tables present in a regulondb object, we can use the function list_datasets(). This function will output all the available tables (datasets) that can be used to build queries.

list_datasets(e_coli_regulondb)

For each table in the database, users can explore the fields (or attributes) of each table using the function list_attributes.

head(list_attributes(e_coli_regulondb, "GENE"), 8)

Retrieving data

Since the regulondb object is an extension of the SQLiteConnection, users can retrieve data from a regulondb object using the function dbGetQuery(). Additionally, this package provides a wrapper function to build queries to the database. This function is called get_dataset() and has a very similar syntax to the getBM() function from the biomaRt package. The main arguments of the get_dataset() function are a regulondb object, a dataset (or table) of the database, the fields of the dataset to retrieve (attributes) and filters to specify what information to get. The code below shows an example where three attributes from the dataset "GENE" for the genes araC, crp and lacI. Note that if the filters= parameter is empty, the function will retrieve all the hits it find in the database.

get_dataset(
    regulondb = e_coli_regulondb,
    dataset = "GENE",
    attributes = c("posleft", "posright", "strand", "name"),
    filters = list("name" = c("araC", "crp", "lacI"))
)

Some of the filters, such as posright or posleft, can be filtered by specifying intervals. For example, the code below indicates that all the start positions *posright" should be between position 1 and position 5000 of the genome. The parameter inverval= is used to specify that the filter for that field will be defined by an interval rather than a exact match.

get_dataset(
    e_coli_regulondb,
    attributes = c("posright", "name"),
    filters = list("posright" = c(1, 5000)),
    interval = "posright",
    dataset = "GENE"
)

The regulondb_result object and integration into the BioC ecosystem

By default, the function get_dataset() outputs a regulondb_result object, which is an extension of a DataFrame that stores information about the query used to generate this object. This additional information includes the organism name, the database and genome versions, and the table (or dataset) of the regulondb object that was queried by get_dataset().

res <- get_dataset(
    regulondb = e_coli_regulondb,
    dataset = "GENE",
    attributes = c("posleft", "posright", "strand", "name"),
    filters = list("name" = c("araC", "crp", "lacI"))
)
slotNames(res)

To enable integration with other Bioconductor packages, we provide the function convert_to_granges() which converts a regulondb_result object into a GRanges object whenever possible. For example, the result stored in the variable res has genomic coordinates and it is thus possible convert res into a GRanges object.

convert_to_granges(res)

An alternative way to get to the same result is to use the parameter output_format= directly in the function get_dataset().

get_dataset(
    regulondb = e_coli_regulondb,
    dataset = "GENE",
    attributes = c("posleft", "posright", "strand", "name"),
    filters = list("name" = c("araC", "crp", "lacI")),
    output_format = "GRanges"
)

In a similar manner, the function convert_to_biostrings() converts regulondb objects into objects from the r BiocStyle::Biocpkg("Biostrings") package. Possible outputs of convert_to_biostrings() are DNAStringSet objects if seq_type="DNA" or a BStringSet if seq_type="product".

res_dnastring <- get_dataset(
    regulondb = e_coli_regulondb,
    dataset = "GENE",
    attributes = c("posleft", "posright", "strand", "name", "dna_sequence"),
    filters = list("name" = c("araC", "crp", "lacI"))
)
res_dnastring <-
    convert_to_biostrings(res_dnastring, seq_type = "DNA")
res_dnastring
GenomicRanges::mcols(res_dnastring)
res_prodstring <- get_dataset(
    regulondb = e_coli_regulondb,
    dataset = "GENE",
    attributes = c("posleft", "posright", "strand", "name", "product_sequence"),
    filters = list("name" = c("araC", "crp", "lacI"))
)
res_prodstring <-
    convert_to_biostrings(res_prodstring, seq_type = "product")
mcols(res_prodstring)

As with the GRanges output mentioned above, it is possible for the output of get_dataset() to be a DNAStringSet object by specifying the parameter output_format="DNAStringSet" or a BStringSet object by specifying output_format="BStringSet". Note that the functions to convert regulondb_result objects will throw errors if there is insufficient information for the coercion to occur. For example, we will get an error if we try to convert into a GRanges object when genomic coordinates are missing from the regulondb_result object.

Building your own queries

In the regutools package, we have implemented features that are commonly used when querying data from databases: filtering results by partial matching, filtering by numeric intervals, and building complex queries.

Partial matching

The code below illustrates the concept of partial matching, in which by setting the parameter partialmatch= to "name", the query returns all the gene name in which the word ara is contained.

get_dataset(
    e_coli_regulondb,
    attributes = c("posright", "name"),
    filters = list("name" = "ara"),
    partialmatch = "name",
    dataset = "GENE"
)

Note that setting the parameter partialmatch= to NULL will only return genes where the name string is identical to ara.

Filtering by numeric intervals

In addition to partial matching, queries can be filtered by numeric intervals. For example, in the code below, the parameter interv= is set to "posright". By doing this assignment, we are specifying that the values for "posright" must lie between the values of posright specified in the filter= parameter. Thus, the result of this query will be genes whose right positions lie between the coordinates 2000 and 4000000. Note that the use of the interv= parameter in the code below is equivalent to setting the parameter output_format= to "GRanges" and further subsetting the GRanges object using the function subsetByOverlaps().

get_dataset(
    e_coli_regulondb,
    attributes = c("name", "strand", "posright", "product_name"),
    dataset = "GENE",
    filters = list(posright = c("2000", "4000000")),
    interval = "posright"
)

Retrieving genomic elements

Based on genomic coordinates, the code below retrieves all genomic elements whose positions lie between the coordinates provided as a GRanges object. If no aditional parameters are provided, the result will retrieve genes that relies within the first 5000pb from the E. coli genome.

get_dna_objects(e_coli_regulondb)

Especific genomic positions can be provided within the parameter grange. It is important to provide a genomic range that covers as minimal the length of one genomic element.

grange <- GenomicRanges::GRanges(
    "chr",
    IRanges::IRanges(5000, 10000)
)
get_dna_objects(e_coli_regulondb, grange)

Aditional genomic elements such as "-35 promoter box", "gene", "promoter", "Regulatory Interaction", "sRNA interaction", or "terminator" can be selected.

grange <- GenomicRanges::GRanges(
    "chr",
    IRanges::IRanges(5000, 10000)
)
get_dna_objects(e_coli_regulondb, grange, elements = c("gene", "promoter"))

Evenmore, the genomic elements retrieved can be observed in a Genome Browser-like plot. The genomic elements are annotated using a UCSC genome as reference, it is important to provide a valid chromosome name for annotation purpose.

e_coli_regulondb <-
    regulondb(
        database_conn = regulondb_conn,
        organism = "chr",
        database_version = "1",
        genome_version = "1"
    )

grange <- GenomicRanges::GRanges("chr", IRanges::IRanges(5000, 10000))
plot_dna_objects(e_coli_regulondb, grange, elements = c("gene", "promoter"))

Complex filters

The examples so far have considered queries in which the results are filtered according to single fields from tables. In order to build queries with filters from more than one field, several filters names can be passed as a list to the filters= argument. Additionally the and= argument is used to specify whether the filtering conditions of the result of the query must be satisfied (and=TRUE) or if satisfying a single condition is enough (and=FALSE).

For example, the code below extracts the genes where either name or product_name contain the word Ara or Ara, respectively, if the gene is in the forward strand or if the right position of the gene is between 2000 and 40000000.

nrow(
    get_dataset(
        e_coli_regulondb,
        attributes = c("name", "strand", "posright", "product_name"),
        dataset = "GENE",
        filters = list(
            name = c("ARA"),
            product_name = c("Ara"),
            strand = c("forward"),
            posright = c("2000", "4000000")
        ),
        and = FALSE,
        partialmatch = c("name", "product_name"),
        interval = "posright"
    )
)

The query below, which is identical to the query above except the and= is set to TRUE, returns the genes where all of the specified conditions are satisfied.

nrow(
    get_dataset(
        e_coli_regulondb,
        attributes = c("name", "strand", "posright", "product_name"),
        dataset = "GENE",
        filters = list(
            name = c("ARA"),
            product_name = c("Ara"),
            strand = c("forward"),
            posright = c("2000", "4000000")
        ),
        and = TRUE,
        partialmatch = c("name", "product_name"),
        interval = "posright"
    )
)

Functions with implement popular queries

The regutools package provides functions that encode the most frequent queries that are submitted to the RegulonDB web resource.

Extracting regulatory networks

One of the most important information that the RegulonDB database contains are manually curated regulatory networks. The regutools package exports the function get_gene_regulators(), which inputs a vector of gene names and outputs information about what are the transcription factors that regulate such genes together with the regulatory effect (activator, repressor or dual).

get_gene_regulators(e_coli_regulondb, c("araC", "fis", "crp"))

Similarly, the get_regulatory_network() function retrieves all the regulatory network from a regulondb object. By default, the output will be a list of transcription factor-gene pairs, indicating which transcription factor regulates which gene.

head(get_regulatory_network(e_coli_regulondb))

But users can also set the parameter type= to "GENE-GENE" or "TF-GENE" to retrieve gene-gene regulatory networks or transcription factor-transcription factor regulatory networks, respectively.

Users can also use the get_regulatory_summary() to retrieve a summary of the transcription factor regulated the expression of a set of given genes. The parameter gene_regulators can receive either a vector of gene names or the output of a call to the function get_gene_regulators(). The resulting output is a regulondb_result object in which each row shows a transcription factor, and the columns display information about the number, percentage and regulatory activity exerted to the genes.

get_regulatory_summary(e_coli_regulondb,
    gene_regulators = c("araC", "modB")
)

Visualizing networks using cytoscape

Software tools such as Cytoscape are often useful for interactive exploration of data and visualization of networks. The function get_regulatory_network() has a parameter cytograph= that if set to TRUE, it will visualize the network in a cytoscape session. Of note, this feature will only work if the user has Cytoscape open in their computer.

get_regulatory_network(e_coli_regulondb, cytograph = TRUE)

Transcription factor binding sites

Another common request to the RegulonDB database is to obtain the genomic coordinates and sequences of the DNA binding sites for a given transcription factor. This query is implemented in the function get_binding_sites(), in which the results are formatted according to the parameter output_format= as either a GRanges object or a Biostrings object.

get_binding_sites(e_coli_regulondb, transcription_factor = "AraC")
get_binding_sites(e_coli_regulondb,
    transcription_factor = "AraC",
    output_format = "Biostrings"
)

A note about CDSB

This was a project accomplished by members of the Community of Bioinformatics Software Developers (CDSB in Spanish). In part CDSB was formed to help R users in Latin America become R/Bioconductor developers. For more information about CDSB, the CDSB workshops or its online community, please check the CDSB website which is available in both Spanish and English.

Reproducibility

The r Biocpkg('regutools') package r Citep(bib[['regutools']]) was made possible thanks to:

Code for creating the vignette

## Create the vignette
library("rmarkdown")
system.time(render("regutools.Rmd"))

## Extract the R code
library("knitr")
knit("regutools.Rmd", tangle = TRUE)

Date the vignette was generated.

## Date the vignette was generated
Sys.time()

Wallclock time spent generating the vignette.

## Processing time in seconds
totalTime <- diff(c(startTime, Sys.time()))
round(totalTime, digits = 3)

R session information.

## Session info
library("sessioninfo")
options(width = 120)
session_info()

Bibliography

This vignette was generated using r Biocpkg('BiocStyle') r Citep(bib[['BiocStyle']]), r CRANpkg('knitr') r Citep(bib[['knitr']]) and r CRANpkg('rmarkdown') r Citep(bib[['rmarkdown']]) running behind the scenes.

Citations made with r CRANpkg('RefManageR') r Citep(bib[['RefManageR']]).

## Print bibliography
PrintBibliography(bib, .opts = list(hyperlink = "to.doc", style = "html"))


Try the regutools package in your browser

Any scripts or data that you put into this service are public.

regutools documentation built on Dec. 20, 2020, 2 a.m.