In ropensci/restez: Create and Query a Local Copy of 'GenBank' in R

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

1. Download GenBank

In the first section of this introduction to restez we will explain how to download parts of GenBank on to your computer and from these downloads, construct a local database. GenBank can be downloaded from NCBI using the File Transfer Protocol (FTP), the online portal for these downloads can be viewed from your web browser, FTP NCBI GenBank. GenBank is hosted at this portal as 1000s of compressed sequence files called flatfiles. New versions of these flatfiles are released once a month as new sequence information is uploaded. For information on the latest GenBank release, see the NCBI statistics page.

restez downloads a selection of these flatfiles and unpacks them into an SQL-like database (specifically, DuckDB) that can be queried using restez's functions. Because there are potentially huge amounts of sequence information, all of which is unlikely to be of interest to a user, restez allows you to select the sets of flatfiles to download based on GenBank's domains. For the most part these sets of flatfile types are determined by taxonomy, for example, a user can download all primate or plant sequences.

Here, we will explain how to set up restez, download genbank and then create a database from the downloaded files.

1.1 Setting up

Before we can download anything, we need to tell restez where we would like to store the downloaded files and the eventual GenBank database. To do this we use the restez_path_set() function. With this function a user can provide a system file path to a folder within which the function will create a new folder called 'restez'. In this example, we will create a restez folder in a temporary folder.

rstz_pth <- tempdir()
restez::restez_path_set(filepath = rstz_pth)
restez::db_delete(everything = TRUE)

# set a random number generator seed for reproducibility
set.seed(12345)

library(restez)
# rstz_pth <- tempdir()
restez_path_set(filepath = rstz_pth)

A user must always set the restez path in order to use the restez library. Additionally, by running restez_path_set() on a new folder messages are printed to console telling us that the path was created and that a downloads subfolder was created.

Note, whenever you intend to use restez you will need to specify the restez path. If you are using it regularly, it might be a good idea to set it using your .Rprofile. restez_path_set() only needs to be run once per R session, but we will use it repeatedly in the chunks below just as a reminder.

1.2 Download

Now we can download GenBank files using db_download(). This is an interactive function that looks up the latest GenBank release, parses the release notes, and prints to console the available sets of flatfiles that can be downloaded. For this example, we will select the smallest available domain which is 'Unannotated' and can be pre-specified with '20'. Note that the numbering scheme may change between GenBank releases, so you shouldn't assume a given number will correspond to the same domain if you run it again later. Instead, you can run this function without the preselection argument and wait for the function to prompt you for a domain selection, which will show the current numbering scheme.

After launching, the download function will ask you whether it is OK to download and set up a database for the selected sequence files. If the selection is likely to be large you can always quit the process using either Esc or Ctrl+c. Please note that the diskspace estimates are preliminary. The package tries to be conservative in its estimates.

db_download(preselection = '20')

Now the download has begun, we just need to wait for it to finish. For the largest domains, this can take quite a while and may be prone to server errors. In this case, you may want to set the max_tries argument of db_download() to a relatively high number to automatically re-try downloading until that number of tries is exceeded (in this case, be sure to set overwrite to FALSE or the download will start over from scratch).

1.3 Create a database

After the download has completed, we need to use db_create() to create our local database. This looks in the downloads folder of our restez path, breaks these files up into separate GenBank records and adds them to the SQL-like database. Again, for very large collections of sequences this can take quite a while.

db_create()

db_create() allows a user to specify minimum and maximum sequence lengths. It's always a good idea to limit the number of sequences in a database so that look-up times are faster. If you know you are only interested in certain lengths of sequences it is a good idea to limit the sequences in the database at this stage. This can also be done with the acc_filter argument, as described in the "Tips and Tricks" vignette. You can always run db_create() again to change the limits. You will simply need to delete the original database first with db_delete().

1.4 Checking the setup

After the download and the database creation steps are complete, we can confirm our setup using restez_status().

restez_path_set(rstz_pth)
restez_status()

The status function allows a user to always touch base with their restez set-up. It can be run at any point, before and/or after download or database creation. It is also designed to provide useful messages to a user for what they need to do next in order for them to make queries to their database. If you ever get confused, run restez_status() to see what is happening.

Additionally, if you are developing your own functions that depend on restez, you can use restez_ready() which will simply return TRUE or FALSE on whether the database can be queried.

restez_path_set(rstz_pth)
if (restez_ready()) {
  print('Database is ready to be queried!')
} else {
  print('Database cannot be queried :-(')
}

2. Query

Once a restez database has been set up we can query the database using the gb_*_get() functions. These functions allow us to retrieve specific columns in the SQL-like database: 'sequence', 'definition', 'accession', 'version' and 'organism'. Also, they allow us to get the whole text formatted record and sequence data in fasta format. In this example, we can use list_db_ids() to identify accession numbers in the database. We could also use entrez_search() provided the database contains sequences of interest to us, see 'search and fetch'.

restez_path_set(rstz_pth)
ids <- suppressWarnings(list_db_ids(db = 'nucleotide', n = 100))
(id <- sample(ids, 1))
# sequence
str(gb_sequence_get(id))
# definition
(gb_definition_get(id))
# version
(gb_version_get(id))
# organism
(gb_organism_get(id))
# fasta
cat(gb_fasta_get(id))
# Note, for large databases these requests can take a long time to complete.
# Always try and limit the size of the database by specifying min and max
# sequence lengths with db_create()
# Also note, if an id is not present in the database nothing is returned
cat(gb_fasta_get(id = c(id, 'notanid')))

Additionally, for more flexibility and options for extracting sequence record information see GenBank record parsing.

3. Entrez

Entrez wrappers are part of the restez package. These allow a user to make use of the local GenBank using functions that were built for rentrez. This minimizes the amount of coding changes required for any Entrez dependent code.

Currently, only entrez_fetch() is available with restez and only text formatted rettypes are allowed.

restez_path_set(rstz_pth)
ids <- suppressWarnings(list_db_ids(db = 'nucleotide', n = 100))
(id <- sample(ids, 1))
# get fasta sequences with entrez
res <- restez::entrez_fetch(db = 'nucleotide', id = id, rettype = 'fasta')
cat(res)
# entrez_fetch will also search via Entrez for any ids not in the db
plant_sequence <- 'AY952423'
res <- restez::entrez_fetch(db = 'nucleotide', id = c(id, plant_sequence),
                            rettype = 'fasta')
cat(res)