knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval=FALSE,
  warning = FALSE,
  error = FALSE,
  message = FALSE
)
library(taxreturn)
library(Biostrings)
library(ape)
library(tidyverse)

Introduction

Accurate taxonomic assignment in metabarcoding studies relies on a well-curated reference database of DNA marker sequences to compare query sequences against. taxreturn is an R package for retrieving DNA barcode sequences and associated taxonomic annotations from public databases, curating these sequences, and formatting them into training sets compatible with popular taxonomic classifiers. The primary public nucleotide databases of relevance to Eukaryotic metabarcoding are the NCBI GenBank database and the Barcode of Life Data System (BOLD), both of which taxreturn supports.

While GenBank hosts greater overall sequence data, BOLD represents a more curated DNA barcoding database that aims to maintain consistent links between sequences, validated morphological specimens, and associated specimen collection metadata.Despite the best efforts of submitters, both these databases have well documented issues with barcode sequences being insufficiently annotated, annotated with the incorrect species, or multiple morpho-species sharing the same barcode. Curating public reference data is therefore crucial for ensuring the robustness of a metabarcoding analysis, and this is the role that the taxreturn package aims to fulfill.

The main steps of the taxreturn workflow are as follows:

A Worked example

The following vignette will run through an example analysis of retrieving some insect COI barcode sequences and Mitochondrial genomes, curating these, and outputting a reference database.

Install & load package

devtools::install_github("alexpiper/taxreturn")
library(taxreturn)

# Load other necessary packages for this workflow
library(Biostrings)
library(ape)
library(tidyverse)

Fetching sequences from GenBank and BOLD

The first step is to retrieve public reference data sequences from NCBI Genbank and BOLD. This can be done with the fetch_seqs function, which wraps an interface to entrez and BOLD API's respectively. This function can either take a single higher level rank, i.e. the genus 'Scaptodrosophila', or it can take a vector of taxon names from a file.

The marker option determines what will be in the search, note that the naming of loci differs between Genbank and bold so i suggest conducting a test search on their respective websites to confirm the desired marker. For COI,

This function can be multithreaded by setting multithread to the desired number of cores, or TRUE (to autodetect number of cores). Note that querying the server is the main bottleneck here, and using too many threads may overload the server and produce errors.

Depending on the amount of sequences you are downloading, and the speed of your internet connection this step can take from minutes to hours, i suggest running this overnight for large searches with over a million sequences.

Download sequences from NCBI GenBank

## Fetch sequences from GenBank by searching for a taxon name
genbank_seqs <- fetch_seqs(
  "Scaptodrosophila", database="genbank", marker="COI[GENE] OR COX1[GENE] OR COXI[GENE]", 
  output = "gb-binom", multithread = FALSE, quiet=FALSE
  )

## OR fetch sequences from a species list
spp_list <- readLines("species_list.txt")

genbank_seqs <- fetch_seqs(
  spp_list, database="genbank", marker="COI[GENE] OR COX1[GENE] OR COXI[GENE]",
  output = "gb-binom", multithread = FALSE, quiet=FALSE
  )

Download sequences from BOLD

## Fetch sequences from BOLD by searching for a taxon name
bold_seqs <- fetch_seqs(
  "Scaptodrosophila", database="bold", marker="COI-5P", output = "gb-binom", 
  multithread = FALSE, quiet=FALSE
  )

## OR fetch sequences from a species list
spp_list <- readLines("species_list.txt")

bold_seqs <- fetch_seqs(
  spp_list, database="bold", marker="COI-5P", output = "gb-binom",
  multithread = FALSE, quiet=FALSE
  )

Download mitochondrial genomes from GenBank

The fetch_seqs function can also download mitochondrial genomes, and accepts a special input argument of marker="mitochondria" to do this.

# Fetch mitochondrial genomes from genbank by searching for a taxon name
genbank_mito <- fetch_seqs(
  "Scaptodrosophila", database="genbank", marker="mitochondria",
  output = "gb-binom", multithread = FALSE, quiet=FALSE
  )

## OR fetch sequences from a species list
spp_list <- readLines("species_list.txt")

genbank_mito <- fetch_seqs(
  spp_list, database="genbank", marker="mitochondria",
  output = "gb-binom",  multithread = FALSE, quiet=FALSE
  )

Curating public reference sequences

Due to the aforementioned issues with misannotated or low quality sequences in public reference databases we will use a number of filtering steps.

Removing non-homologous sequences

The first sequence curation step is to remove non-homologous loci by mapping to a reference Profile Hidden Markov Model (PHMM) of the target loci. PHMMs turn a multiple sequence alignment into a position-specific scoring system suitable for searching databases for remotely homologous sequences. This approach circumvents the bias that can be introduced when mapping sequences to a single reference sequence. To create the PHMM model we will use the aphid R package, and a curated alignment of insect COI sequences obtained from the Midori dataset (Machida et al, 2017) and trimmed to the folmer region.

A pre-trained model of the conventional COI barcode or 'folmer' region can be loaded from the package data as below. This model was trained on the midori longest dataset of all COI sequences.

#This loads the model into the workspace
data("model", package="taxreturn")

#See what it looks like
print(model)

However if you are working with a different barcode locus, or if you wish to improve accuracy by training on a specific taxonomic group you can train a new PHMM model using the aphid R package as below:

Note: if you are using a public dataset as a reference alignment to build the model, it may be worth curating it manually first, as a poor reference alignment will produce a poor model and greatly affect downstream analysis.

library(aphid)

# Read in sequence dataset to be used in model training
seqs <-  readDNAStringSet("MIDORI_LONGEST_20180221_COI.fasta")

model <- aphid::derivePHMM(seqs)

Optional: Trim model to desired amplicon

Depending on the taxonomic classifier used, it can be beneficial to trim the reference sequences down to the actual region amplified by the primers. This can be achieved by subsetting the full PHMM model using primer sequences or binding positions. Below we will subset the model to the region amplified by the fwhF2-fwhR2n primers of Vamos et al 2017.

However, if you wish to keep the full length sequences, skip this section.

model <- subset_model(model, primers = c("GGDACWGGWTGAACWGTWTAYCCHCC", "GTRATWGCHCCDGCTARWACWGG"), trimprimers = TRUE)

Now that we have our trained PHMM, it can be used with the map_to_model function wto remove non-homolgous sequences and extract the target locus from mitochondrial genomes and longer sequences.

Firstly we will merge the sequences from each database together. As BOLD and GenBank share some sequence, we will dereplicate the merged file to only the unique accession numbers to speed things up.

As we only wish to look at the sequence data contained within the range of our alignment model (in this case the folmer region of COI), we use the option shave=TRUE to remove all bases outside the range defined by the model. As here we are looking at a coding gene, we will also set check_frame to TRUE to ensure that any insertions or deletions are in multiples of 3, which is a quick and dirty way of identifying and removing pseudogenic sequences.

When extracting the target region from longer sequences such as mitochondrial genomes, the dynamic programming algorithm used to align to the PHMM can take a very long time so this function implements a k-mer based pre-screening to reduce the number of wasteful alignments. This works by splitting any sequence that is more than twice the length of the input model into chunks, finding the most similar chunk to the reference using a rapid k-mer pre-screen, then using the most similar chunk and its neighbours to seed the PHMM alignment. The stringency of this screen can be modified using the kmer_threshold parameter, with the default excluding any sequence more than 30% diverged (kmer_threshold = 0.3) for 5-mers (k = 5). To further speed things up, you can set multithread to the number of cores you wish to use, or TRUE to autodetect number of cores.

# Merge all downloaded sequences
seqs <- concat_DNAbin(genbank_seqs, bold_seqs, genbank_mito)

# Remove those sequnce accessions that are identical across both databases
# Using a regex that splits to just the accessions
uniq_seqs <- seqs[!duplicated(str_extract(names(seqs), "^.*\\|" ))] 

#remove non-homologous sequences
filtered <- map_to_model(
  uniq_seqs, model, min_score = 100, min_length = 100, max_N=Inf, max_gap=Inf, 
  shave=TRUE, check_frame=TRUE, kmer_threshold = 0.5, k=5,
  multithread = FALSE, quiet=FALSE
  )

Resolve Contaminated sequences and misannotated taxonomy

The other main form of misannotation that can effect metabarcoding datasets is misannotated taxonomy for the reference sequences. To detect putatively misannotated sequences we will cluster sequences at a specific similarity threshold (in this case 97% similarity), and compare the taxonomy at a certain rank (Genus in this case) within clusters. A sequence will be flagged if its taxonomy diverges from the other sequences within its respective cluster by more than a certain confidence (0.8 by default, meaning the putatively misannotated sequence must diverge from 4/5 other sequences in its cluster)

# Load the NCBI taxonomy 
db <- get_ncbi_taxonomy()

#Get db
mixed_clusters <- get_mixed_clusters(
    x = filtered, db=db,
    rank = "genus",
    threshold = 0.97,
    return = "consensus",
    confidence=0.6, quiet = FALSE
    ) 

# Get accession numbers to remove
rem <- mixed_clusters$Acc

# Get current accession numbers
acc <- str_replace(names(filtered), "(?:.(?!;))+$", "")

# Remove any accessions that were found to be in mixed clusters
purged  <- filtered[!acc %in% rem]

Optional: Filter for stop codons

As the COI barcode is a protein coding region, underlying evolutionary constraints mean that any sequences containing stop codons or indels of lengths which are not a multiple of 3 commonly indicate pseudogenes. The codon_filter function detects the reading frame of the sequence, and ensures it contains no frameshifts or stop codons. Note that you must select the appropriate genetic code here otherwise you risk losing all your sequences. Available codes can be found using Biostrings::GENETIC_CODE_TABLE. If you are not working with a protein coding gene, skip this section.

#SGC4 is invertebrate mitochondrial genetic code
purged <- codon_filter(
  purged, genetic_code = "SGC4", tryrc = TRUE, resolve_draws = "majority"
  )

Resolve synonyms

Classification of sequences into reference taxonomy can be complicated by the existence of taxonomic synonyms. To resolve this we use the NCBI synonyms list to ensure each name represents a currently valid taxa, and replace known synonyms with accepted taxon names.

resolved <- resolve_synonyms_ncbi(purged)

#Check for differences in names
names(resolved)[which(!names(resolved) %in% names(purged))]

Prune large groups

In many cases groups of taxa are over-represented in databases, which can slow down and bias the taxonomic assignment process. Here we prune over-represented groups down to 5 sequences. This function has the option of discarding these sequences by length (removing smaller sequences first), or randomly.

We can also choose to de-duplicate the sequences here, which will remove all completely identical sequences before pruning down the group sizes.

#Prune group sizes down to 5, removing all identical sequences first
pruned <- prune_groups(
  resolved, max_group_size = 5, discardby = "length",
  dedup=TRUE, quiet = FALSE
  )

Reformat to taxonomic hierarchy

Finally, we will reformat to fasta annotations to contain the full taxonomic hierarchy and write out the fasta

#Load NCBI taxonomy
db <- get_ncbi_taxonomy()

#Reformat to complete taxonomic heirarchy 
hierarchy <- reformat_hierarchy(pruned, db=db, quiet=FALSE)

#See if this worked
head(names(hierarchy))

# Save a zipped fasta file of curated reference database
write_fasta(
  hierarchy, file = "COI_reference_hierarchial.fa.gz", compress = TRUE
  )

Summarise number of taxa in database

Finally, we can output a few summary metrics for our final curated database, such as the number of unique sequences for each taxonomic rank:

names(hierarchy) %>%
  str_split_fixed(";", n=8) %>%
  as_tibble() %>%
  tidyr::separate(V1, into=c("Sequences", "tax_ids"), sep = "\\|") %>%
  magrittr::set_colnames(c("Sequences", "tax_ids", "kingdom", "phylum",
                           "class", "order", "family", "genus", "species")) %>%
  summarise_all(n_distinct)

We can also generate a tree like representation of the taxonomy which we can output as a newick file for analysis in other software

tree <- tax2tree(hierarchy, output="phylo")
write.tree(tree, "tree.nwk")

Bonus: output trained classifiers

The taxreturn package supports reformatting of the reference database to work with

The classifiers currently supported are:

RDP

#Load NCBI taxonomy
db <- get_ncbi_taxonomy()

#Reformat to Kingdom to genus heirarchy 
# suitable for assigntaxonomy classifier in DADA2
dada2_gen <- reformat_dada2_gen(hierarchy, db = db, quiet = FALSE)
write_fasta(
  dada2_gen, file = "COI_reference_dada2gen.fa.gz", compress = TRUE
  )

#Reformat to genus species binomials as suitable for assignSpecies in DADA2
dada2_spp <- reformat_dada2_spp(hierarchy)
write_fasta(
  dada2_spp, file = "COI_reference_dada2spp.fa.gz", compress = TRUE
  )

IDTAXA

taxreturn implements a wrapper function for training the IDTAXA classifier, as below:

```{R IDTAXA}

Load NCBI taxonomy

db <- get_ncbi_taxonomy()

Train IDTAXA

training_set <- train_idtaxa( hierarchy, max_group_size = 10, max_iterations = 3,
allow_group_removal = TRUE, get_lineage = TRUE, db = db, quiet = FALSE )

Write out training set

saveRDS(training_set, file="trained_idtaxa.rds") ```



alexpiper/taxreturn documentation built on Sept. 14, 2024, 7:56 p.m.