```{R setup}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE,
  warning = FALSE,
  error = FALSE,
  message = FALSE
)
```
Accurate taxonomic assignment in metabarcoding studies relies on a well-curated reference database of DNA marker sequences to compare query sequences against. taxreturn is an R package for retrieving DNA barcode sequences and associated taxonomic annotations from public databases, curating these sequences, and formatting them into training sets compatible with popular taxonomic classifiers. The primary public nucleotide databases of relevance to eukaryotic metabarcoding are the NCBI GenBank database and the Barcode of Life Data System (BOLD), both of which taxreturn supports.
While GenBank hosts a greater volume of sequence data overall, BOLD is a more curated DNA barcoding database that aims to maintain consistent links between sequences, validated morphological specimens, and associated specimen collection metadata. Despite the best efforts of submitters, both databases have well-documented issues with barcode sequences being insufficiently annotated, annotated with the incorrect species, or shared between multiple morpho-species. Curating public reference data is therefore crucial for ensuring the robustness of a metabarcoding analysis, and this is the role that the taxreturn package aims to fulfill.
The main steps of the taxreturn workflow are as follows:

1. Fetch annotated sequences from NCBI GenBank and BOLD.
2. Remove non-homologous loci by mapping sequences against a profile hidden Markov model of the target marker.
3. Detect and remove putatively misannotated sequences by comparing taxonomy within sequence clusters.
4. Filter putative pseudogenes using stop codons and frameshifts, and resolve taxonomic synonyms.
5. Prune over-represented groups, then reformat and output the curated database for use with taxonomic classifiers.
The following vignette runs through an example analysis: retrieving insect COI barcode sequences and mitochondrial genomes, curating them, and outputting a reference database.
```{R install}
devtools::install_github("alexpiper/taxreturn")
library(taxreturn)

# Load other necessary packages for this workflow
library(Biostrings)
library(ape)
library(tidyverse)
```
The first step is to retrieve public reference sequences from NCBI GenBank and BOLD. This can be done with the fetch_seqs function, which wraps interfaces to the NCBI Entrez and BOLD APIs respectively. This function can either take a single higher-level taxon name, i.e. the genus 'Scaptodrosophila', or a vector of taxon names read from a file.
The marker option determines which locus is searched for. Note that the naming of loci differs between GenBank and BOLD, so I suggest conducting a test search on their respective websites to confirm the desired marker is returned. For COI, the search term 'COI[GENE] OR COX1[GENE] OR COXI[GENE]' is suitable for GenBank, while BOLD uses 'COI-5P', as in the chunks below.
This function can be multithreaded by setting multithread to the desired number of cores, or to TRUE to autodetect the number of cores. Note that querying the server is the main bottleneck here, and using too many threads may overload the server and produce errors.
Depending on the number of sequences you are downloading and the speed of your internet connection, this step can take anywhere from minutes to hours; I suggest running large searches of over a million sequences overnight.
```{R fetch-genbank}
## Fetch sequences from GenBank by searching for a taxon name
genbank_seqs <- fetch_seqs(
  "Scaptodrosophila", database = "genbank",
  marker = "COI[GENE] OR COX1[GENE] OR COXI[GENE]",
  output = "gb-binom", multithread = FALSE, quiet = FALSE
)

## OR fetch sequences from a species list
spp_list <- readLines("species_list.txt")
genbank_seqs <- fetch_seqs(
  spp_list, database = "genbank",
  marker = "COI[GENE] OR COX1[GENE] OR COXI[GENE]",
  output = "gb-binom", multithread = FALSE, quiet = FALSE
)
```
```{R fetch-bold}
## Fetch sequences from BOLD by searching for a taxon name
bold_seqs <- fetch_seqs(
  "Scaptodrosophila", database = "bold", marker = "COI-5P",
  output = "gb-binom", multithread = FALSE, quiet = FALSE
)

## OR fetch sequences from a species list
spp_list <- readLines("species_list.txt")
bold_seqs <- fetch_seqs(
  spp_list, database = "bold", marker = "COI-5P",
  output = "gb-binom", multithread = FALSE, quiet = FALSE
)
```
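As noted above, these downloads can also be multithreaded. A minimal sketch, assuming four cores are available (a hypothetical setting; reduce it if the servers begin returning errors):

```{R fetch-multithread}
# Same GenBank search as above, but spread across 4 worker threads
genbank_seqs <- fetch_seqs(
  "Scaptodrosophila", database = "genbank",
  marker = "COI[GENE] OR COX1[GENE] OR COXI[GENE]",
  output = "gb-binom", multithread = 4, quiet = FALSE
)
```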
The fetch_seqs function can also download mitochondrial genomes, accepting the special argument marker = "mitochondria" to do this.
```{R fetch-mito}
# Fetch mitochondrial genomes from GenBank by searching for a taxon name
genbank_mito <- fetch_seqs(
  "Scaptodrosophila", database = "genbank", marker = "mitochondria",
  output = "gb-binom", multithread = FALSE, quiet = FALSE
)

## OR fetch sequences from a species list
spp_list <- readLines("species_list.txt")
genbank_mito <- fetch_seqs(
  spp_list, database = "genbank", marker = "mitochondria",
  output = "gb-binom", multithread = FALSE, quiet = FALSE
)
```
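Before moving on, it can be worth a quick sanity check on what was retrieved. A sketch, assuming the objects downloaded above (the names returned by fetch_seqs carry the accession and taxonomy):

```{R check-downloads}
# How many sequences were retrieved from each source?
length(genbank_seqs)
length(bold_seqs)
length(genbank_mito)

# Inspect the first few sequence names
head(names(genbank_seqs))
```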
Due to the aforementioned issues with misannotated or low-quality sequences in public reference databases, we will apply a number of filtering steps.
The first sequence curation step is to remove non-homologous loci by mapping to a reference profile hidden Markov model (PHMM) of the target locus. PHMMs turn a multiple sequence alignment into a position-specific scoring system suitable for searching databases for remotely homologous sequences. This approach circumvents the bias that can be introduced when mapping sequences to a single reference sequence. To create the PHMM we will use the aphid R package and a curated alignment of insect COI sequences obtained from the Midori dataset (Machida et al., 2017), trimmed to the Folmer region.
A pre-trained model of the conventional COI barcode or 'Folmer' region can be loaded from the package data as below. This model was trained on the MIDORI longest dataset of all COI sequences.
```{R load-model}
# This loads the model into the workspace
data("model", package = "taxreturn")

# See what it looks like
print(model)
```
However, if you are working with a different barcode locus, or wish to improve accuracy by training on a specific taxonomic group, you can train a new PHMM using the aphid R package as below:
Note: if you are using a public dataset as a reference alignment to build the model, it may be worth curating it manually first, as a poor reference alignment will produce a poor model and greatly affect downstream analysis.
```{R train-model}
library(aphid)

# Read in the sequence dataset to be used in model training
# (derivePHMM expects a DNAbin object or a list of sequences)
seqs <- ape::read.FASTA("MIDORI_LONGEST_20180221_COI.fasta")
model <- aphid::derivePHMM(seqs)
```
Depending on the taxonomic classifier used, it can be beneficial to trim the reference sequences down to the actual region amplified by the primers. This can be achieved by subsetting the full PHMM using primer sequences or binding positions. Below we subset the model to the region amplified by the fwhF2-fwhR2n primers of Vamos et al. (2017).
However, if you wish to keep the full length sequences, skip this section.
```{R subset-model}
model <- subset_model(
  model,
  primers = c("GGDACWGGWTGAACWGTWTAYCCHCC", "GTRATWGCHCCDGCTARWACWGG"),
  trimprimers = TRUE
)
```
Now that we have our trained PHMM, it can be used with the map_to_model function to remove non-homologous sequences and extract the target locus from mitochondrial genomes and other longer sequences.
First, we will merge the sequences from each database together. As BOLD and GenBank share some sequences, we will dereplicate the merged file to just the unique accession numbers to speed things up.
As we only wish to keep the sequence data contained within the range of our alignment model (in this case the Folmer region of COI), we use the option shave = TRUE to remove all bases outside the range defined by the model. As we are working with a coding gene, we will also set check_frame = TRUE to ensure that any insertions or deletions are in multiples of 3, which is a quick and dirty way of identifying and removing pseudogenic sequences.
When extracting the target region from longer sequences such as mitochondrial genomes, the dynamic programming algorithm used to align to the PHMM can take a very long time, so this function implements a k-mer based pre-screen to reduce the number of wasteful alignments. This works by splitting any sequence that is more than twice the length of the input model into chunks, finding the chunk most similar to the reference using a rapid k-mer pre-screen, then using that chunk and its neighbours to seed the PHMM alignment. The stringency of this screen can be modified using the kmer_threshold parameter, with the default excluding any sequence more than 30% diverged (kmer_threshold = 0.3) for 5-mers (k = 5). To further speed things up, you can set multithread to the number of cores you wish to use, or TRUE to autodetect the number of cores.
```{R map-to-model}
# Merge all downloaded sequences
seqs <- concat_DNAbin(genbank_seqs, bold_seqs, genbank_mito)

# Remove sequence accessions that are duplicated across both databases,
# using a regex that extracts just the accession from each name
uniq_seqs <- seqs[!duplicated(str_extract(names(seqs), "^.*\\|"))]

# Remove non-homologous sequences
filtered <- map_to_model(
  uniq_seqs, model, min_score = 100, min_length = 100,
  max_N = Inf, max_gap = Inf, shave = TRUE, check_frame = TRUE,
  kmer_threshold = 0.5, k = 5, multithread = FALSE, quiet = FALSE
)
```
The other main form of misannotation that can affect metabarcoding reference data is incorrect taxonomy attached to the reference sequences. To detect putatively misannotated sequences, we cluster sequences at a specific similarity threshold (in this case 97% similarity) and compare the taxonomy at a chosen rank (here, genus) within clusters. A sequence is flagged if its taxonomy diverges from the other sequences within its cluster by more than a certain confidence (0.8 by default, meaning the putatively misannotated sequence must diverge from 4/5 of the other sequences in its cluster).
```{R mixed-clusters}
# Load the NCBI taxonomy
db <- get_ncbi_taxonomy()

# Find clusters containing mixed genus-level annotations
mixed_clusters <- get_mixed_clusters(
  x = filtered, db = db,
  rank = "genus",
  threshold = 0.97,
  return = "consensus",
  confidence = 0.6, quiet = FALSE
)

# Get accession numbers to remove
rem <- mixed_clusters$Acc

# Get current accession numbers
acc <- str_replace(names(filtered), "(?:.(?!;))+$", "")

# Remove any accessions that were found to be in mixed clusters
purged <- filtered[!acc %in% rem]
```
As the COI barcode is a protein-coding region, underlying evolutionary constraints mean that sequences containing stop codons, or indels of lengths that are not a multiple of 3, commonly indicate pseudogenes. The codon_filter function detects the reading frame of each sequence and ensures it contains no frameshifts or stop codons. Note that you must select the appropriate genetic code here, otherwise you risk losing all your sequences. Available codes can be found using Biostrings::GENETIC_CODE_TABLE. If you are not working with a protein-coding gene, skip this section.
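For example, the short identifier for the invertebrate mitochondrial code used in the next chunk can be looked up as follows (the name2 column of the table holds the identifier):

```{R genetic-codes}
# Find the invertebrate mitochondrial genetic code; the name2 column
# contains the short identifier (e.g. "SGC4") passed to codon_filter below
subset(Biostrings::GENETIC_CODE_TABLE, grepl("Invertebrate", name))
```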
```{R codon-filter}
# SGC4 is the invertebrate mitochondrial genetic code
purged <- codon_filter(
  purged,
  genetic_code = "SGC4",
  tryrc = TRUE,
  resolve_draws = "majority"
)
```
Classification of sequences against a reference taxonomy can be complicated by the existence of taxonomic synonyms. To resolve this, we use the NCBI synonyms list to ensure each name represents a currently valid taxon, and replace known synonyms with their accepted taxon names.
```{R resolve-synonyms}
resolved <- resolve_synonyms_ncbi(purged)

# Check for differences in names
names(resolved)[which(!names(resolved) %in% names(purged))]
```
In many cases groups of taxa are over-represented in databases, which can slow down and bias the taxonomic assignment process. Here we prune over-represented groups down to 5 sequences. This function has the option of discarding these sequences by length (removing smaller sequences first), or randomly.
We can also choose to de-duplicate the sequences here, which will remove all completely identical sequences before pruning down the group sizes.
```{R prune-groups}
# Prune group sizes down to 5, removing all identical sequences first
pruned <- prune_groups(
  resolved,
  max_group_size = 5,
  discardby = "length",
  dedup = TRUE,
  quiet = FALSE
)
```
Finally, we will reformat the fasta annotations to contain the full taxonomic hierarchy and write out the fasta file.
```{R reformat-hierarchy}
# Load the NCBI taxonomy
db <- get_ncbi_taxonomy()

# Reformat names to contain the complete taxonomic hierarchy
hierarchy <- reformat_hierarchy(pruned, db = db, quiet = FALSE)

# Check that this worked
head(names(hierarchy))

# Save a zipped fasta file of the curated reference database
write_fasta(
  hierarchy, file = "COI_reference_hierarchical.fa.gz", compress = TRUE
)
```
With the curated database complete, we can output a few summary metrics, such as the number of distinct taxa at each taxonomic rank:
```{R summary-metrics}
names(hierarchy) %>%
  str_split_fixed(";", n = 8) %>%
  as_tibble() %>%
  tidyr::separate(V1, into = c("Sequences", "tax_ids"), sep = "\\|") %>%
  magrittr::set_colnames(c(
    "Sequences", "tax_ids", "kingdom", "phylum", "class",
    "order", "family", "genus", "species"
  )) %>%
  summarise_all(n_distinct)
```
We can also generate a tree-like representation of the taxonomy, which can be output as a newick file for analysis in other software.
```{R tax2tree}
tree <- tax2tree(hierarchy, output = "phylo")
write.tree(tree, "tree.nwk")
```
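The phylo object can also be checked visually within R before export. A quick sketch using ape's plot.phylo:

```{R plot-tree}
# Quick visual check of the taxonomy tree structure
plot(tree, show.tip.label = FALSE, no.margin = TRUE)
```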
The taxreturn package supports reformatting of the reference database to work with a number of popular taxonomic classifiers.

The classifiers currently supported are:

* assignTaxonomy in DADA2, using a kingdom-to-genus hierarchy
* assignSpecies in DADA2, using genus-species binomials
* IDTAXA in the DECIPHER package
```{R dada2-formats}
# Load the NCBI taxonomy
db <- get_ncbi_taxonomy()

# Reformat to a kingdom-to-genus hierarchy
# suitable for the assignTaxonomy classifier in DADA2
dada2_gen <- reformat_dada2_gen(hierarchy, db = db, quiet = FALSE)
write_fasta(
  dada2_gen, file = "COI_reference_dada2gen.fa.gz", compress = TRUE
)

# Reformat to genus-species binomials suitable for assignSpecies in DADA2
dada2_spp <- reformat_dada2_spp(hierarchy)
write_fasta(
  dada2_spp, file = "COI_reference_dada2spp.fa.gz", compress = TRUE
)
```
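As a sketch of how these outputs are consumed downstream, assuming seqtab is a hypothetical ASV table produced by a standard DADA2 workflow (not created in this vignette):

```{R dada2-usage}
library(dada2)
# seqtab: hypothetical sequence table from dada2::makeSequenceTable()
taxa <- assignTaxonomy(seqtab, "COI_reference_dada2gen.fa.gz", multithread = TRUE)

# addSpecies wraps assignSpecies to append exact species-level matches
taxa <- addSpecies(taxa, "COI_reference_dada2spp.fa.gz")
```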
taxreturn implements a wrapper function for training the IDTAXA classifier, as below:
```{R IDTAXA}
# Load the NCBI taxonomy
db <- get_ncbi_taxonomy()
training_set <- train_idtaxa(
hierarchy, max_group_size = 10, max_iterations = 3,
allow_group_removal = TRUE, get_lineage = TRUE, db = db, quiet = FALSE
)
saveRDS(training_set, file = "trained_idtaxa.rds")
```
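The trained object can then be used for classification with the DECIPHER package. A sketch, assuming query_seqs is a hypothetical DNAStringSet of query sequences and that train_idtaxa returns a DECIPHER-compatible training set:

```{R idtaxa-classify}
library(DECIPHER)
# Classify hypothetical query sequences against the trained classifier
ids <- IdTaxa(query_seqs, training_set, strand = "both", processors = NULL)
plot(ids)
```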