knitr::opts_chunk$set(collapse = FALSE, comment = '#', cache = TRUE)
R package abbagadabba (amplicon-based biodiversity assessment, gap analysis, database building and more) provides functions to download DNA sequence data from NCBI based on species names.
abbagadabba is in active development and only exists on GitHub. It can be installed using devtools (which is a stable package available on CRAN).
library(devtools) install_github("Maine-eDNA/abbagadabba")
Suppose we have a list of species for which we want to retrieve sequence data. This list might include "bad names" (e.g. synonyms of misspellings). We want to first correct as many of those bad names as we can. Simultaneously, we'll compile all information about the taxonomic hierarchy of those species
library(abbagadabba) cleanNames <- getNCBITaxonomy(c('Idiomyia sproati', 'Drosophil murphy', 'no body')) cleanNames
We were able to fix Drosophil murphy
, making it Drosophila murphyi
, and Idiomyia sproati
, making it Drosophila sproati
, but we couldn't match no body
.
Now we need to get all sequence identifiers associated with those species:
goodNames <- cleanNames$ncbi_name[!is.na(cleanNames$ncbi_name)] seqIDs <- getNCBISeqID(goodNames) head(seqIDs)
Finally we can feed those IDs into the function to retrieve the sequences themselves. For the purpose of this example, we'll just look at a few sequences
seqData <- getGenBankSeqs(seqIDs[1:2]) seqData
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.