In joelnitta/gbfetch: Fetch sequences from GenBank into R

gbfetch

knitr::opts_chunk$set(echo = TRUE)

gbfetch makes it easy to download DNA sequences from GenBank directly into R.

You can provide a query string exactly as you would for a GenBank search, and it will return the corresponding sequences or metadata in a tidy format.

Installation

gbfetch is only available on GitHub, and must be installed with devtools or remotes.

devtools::install_github("joelnitta/gbfetch")

Dependencies

Most R packages gbfetch depends on will be automatically detected and installed from CRAN if needed. There are two exceptions.

jntools is only available on GitHub, and must be installed first.

devtools::install_github("joelnitta/jntools")

biofiles is not required to install gbfetch, but is a needed for assemble_gene_set(), so install it if you want to use this function. biofiles is only availble from BioConductor, and itself has dependencies that are only available from BioConductor.

BiocManager::install("biofiles")

GenBank API key

The fetch_sequences() and fetch_metadata() functions in this package use taxize and rentrez functions under the hood to fetch taxonomic data and GenBank sequences. Although not required, enabling the NCBI Entrez API key will increase the number of requests you are allowed to make per second. This should be set up using taxize::use_entrez() prior to running fetch_ functions or your query might get rejected and no sequences returned.

Large downloads

This package is meant for downloading moderate amounts of data, not whole swaths of GenBank. If you want to access large amounts of data, the restez or biomartr packages might be better options.

I am not aware of an upper limit on the number of sequences that can be downloaded, but I haven't tried more than ca. 10,000 at a time (which takes just a few minutes with fetch_sequences()). fetch_metadata() takes significantly longer, because it uses taxize to get species names, which is rather slow. Obtaining the metadata for ca. 10,000 sequences might take up to an hour or so.

Trying to fetch too many sequenes might result in your IP address getting rejected by the API (at least temporarily), so use with caution!

Data format

Sequence data are stored using the DNAbin class from the ape package. If you work with BioConductor packages that use the DNAString class, they will need to be converted. I'm not aware of a as.DNAString() function that can do this, so I wrote one.

Similar work

rentrez (which gbfetch uses under the hood) has much more sophisticated capabilities for querying NCBI databases.

The ape package has a function, read.GenBank() that can read in a DNA sequence from GenBank given its accession number.

restez allows you to download entire chunks of GenBank for querying locally. This is almost certainly the way to go if you are interested in downloading lots of data.

biomartr is designed for downloading and working with whole genomes on GenBank.

phylotaR queries GenBank for a taxonomic group of interest and automatically assembles a set of orthologous loci for phylogenetic analysis.

PyPHLAWD is a Python pipeline for assembling a set of loci for phylogenetic analysis.

Examples

(See notes above about setting your API key before running these)

Fetch sequences and metadata

Download all rbcL sequences for Crepidomanes minutum. The query string is formatted exactly as if we were searching on GenBank.

Though you don't need to do this in practice, I will use tictoc to give an idea of how long this query takes.

library(gbfetch)
library(tictoc) # Optional; for timing the example

query_string <- "Crepidomanes minutum[ORGN] AND rbcl[Gene]"

tic() # Set a timer

fetch_sequences(query_string)

toc() # See how long it took

Download associated metadata for the sequences.

tic() # Set a timer

fetch_metadata(query_string)

toc()

Assemble a set of genes from genomes

The number of whole or partial genomes in GenBank is increasing rapidly. Although fetch_sequnces() is useful for downloading small (e.g., single gene) sequences, we may also want to download multiple genes from a single genome or genomes. That is where fetch_gene_from_genome() comes in.

Let's download three genes of interest from the Diplazium striatum plastome, which has GenBank accession number KY427346.

# KY427346 is the GenBank accession no. for the Diplazium striatum plastome

tic() # Set a timer

genes_to_get <- c("rbcL", "matK", "psbA")
fetch_gene_from_genome(genes_to_get, "KY427346")

toc()

Scaling up, assemble_gene_set() can assemble multiple genes of interest from multiple genomes into a list. Let's get the same three genes for Cystopteris protrusa (GenBank accession no. KP136830) and Diplazium striatum.

# KY427346 is the GenBank accession no. for the Diplazium striatum plastome

tic() # Set a timer

assemble_gene_set(
  accessions = c("KP136830", "KY427346"), 
  genes = genes_to_get)

toc()