knitr::opts_chunk$set(echo = TRUE)
gbfetch
makes it easy to download DNA sequences from GenBank directly into R.
You can provide a query string exactly as you would for a GenBank search, and it will return the corresponding sequences or metadata in a tidy format.
gbfetch
is only available on GitHub, and must be installed with devtools
or remotes
.
devtools::install_github("joelnitta/gbfetch")
Most R packages gbfetch
depends on will be automatically detected and installed from CRAN if needed. There are two exceptions.
jntools
is only available on GitHub, and must be installed first.
devtools::install_github("joelnitta/jntools")
biofiles
is not required to install gbfetch
, but is a needed for assemble_gene_set()
, so install it if you want to use this function. biofiles
is only availble from BioConductor, and itself has dependencies that are only available from BioConductor.
BiocManager::install("biofiles")
The fetch_sequences()
and fetch_metadata()
functions in this package use taxize
and rentrez
functions under the hood to fetch taxonomic data and GenBank sequences. Although not required, enabling the NCBI Entrez API key will increase the number of requests you are allowed to make per second. This should be set up using taxize::use_entrez()
prior to running fetch_
functions or your query might get rejected and no sequences returned.
This package is meant for downloading moderate amounts of data, not whole swaths of GenBank. If you want to access large amounts of data, the restez or biomartr packages might be better options.
I am not aware of an upper limit on the number of sequences that can be downloaded, but I haven't tried more than ca. 10,000 at a time (which takes just a few minutes with fetch_sequences()
). fetch_metadata()
takes significantly longer, because it uses taxize
to get species names, which is rather slow. Obtaining the metadata for ca. 10,000 sequences might take up to an hour or so.
Trying to fetch too many sequenes might result in your IP address getting rejected by the API (at least temporarily), so use with caution!
Sequence data are stored using the DNAbin
class from the ape
package. If you work with BioConductor packages that use the DNAString
class, they will need to be converted. I'm not aware of a as.DNAString()
function that can do this, so I wrote one.
rentrez (which gbfetch
uses under the hood) has much more sophisticated capabilities for querying NCBI databases.
The ape package has a function, read.GenBank() that can read in a DNA sequence from GenBank given its accession number.
restez allows you to download entire chunks of GenBank for querying locally. This is almost certainly the way to go if you are interested in downloading lots of data.
biomartr is designed for downloading and working with whole genomes on GenBank.
phylotaR queries GenBank for a taxonomic group of interest and automatically assembles a set of orthologous loci for phylogenetic analysis.
PyPHLAWD is a Python pipeline for assembling a set of loci for phylogenetic analysis.
(See notes above about setting your API key before running these)
Download all rbcL sequences for Crepidomanes minutum. The query string is formatted exactly as if we were searching on GenBank.
Though you don't need to do this in practice, I will use tictoc
to give an idea of how long this query takes.
library(gbfetch) library(tictoc) # Optional; for timing the example query_string <- "Crepidomanes minutum[ORGN] AND rbcl[Gene]" tic() # Set a timer fetch_sequences(query_string) toc() # See how long it took
Download associated metadata for the sequences.
tic() # Set a timer fetch_metadata(query_string) toc()
The number of whole or partial genomes in GenBank is increasing rapidly. Although fetch_sequnces()
is useful for downloading small (e.g., single gene) sequences, we may also want to download multiple genes from a single genome or genomes. That is where fetch_gene_from_genome()
comes in.
Let's download three genes of interest from the Diplazium striatum plastome, which has GenBank accession number KY427346.
# KY427346 is the GenBank accession no. for the Diplazium striatum plastome tic() # Set a timer genes_to_get <- c("rbcL", "matK", "psbA") fetch_gene_from_genome(genes_to_get, "KY427346") toc()
Scaling up, assemble_gene_set()
can assemble multiple genes of interest from multiple genomes into a list. Let's get the same three genes for Cystopteris protrusa (GenBank accession no. KP136830) and Diplazium striatum.
# KY427346 is the GenBank accession no. for the Diplazium striatum plastome tic() # Set a timer assemble_gene_set( accessions = c("KP136830", "KY427346"), genes = genes_to_get) toc()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.