README.md

CRAN_Status_Badge CRAN_Downloads_Badge DOI ORCiD Build_Status Project_Status License

Informatic sequence classification trees

insect is an R package for taxonomic identification of amplicon sequence variants generated by DNA meta-barcoding analysis. The learning and classification algorithms implemented in the package are based on full probabilistic models (profile hidden Markov models) and offer highly accurate taxon IDs, albeit at a relatively high computational cost.

The package also contains functions for searching and downloading reference sequences and taxonomic information from NCBI, a "virtual PCR" tool for sequence trimming, a function for "purging" erroneously labeled reference sequences, and several other handy tools.

insect is designed to be used in conjunction with the dada2 pipeline or any other de-noising tool that produces a list of amplicon sequence variants (ASVs). While unfiltered sequences can also be processed with high accuracy, the insect classification algorithm is relatively slow, since it uses a computationally intensive dynamic programming algorithm to find the likelihood values of each sequence given the models at each node of the classification tree. Hence an appropriately filtered input dataset will generally be much faster to process.

Installation

To download insect from CRAN and load the package, run

install.packages("insect")
library(insect)

To download the latest development version from GitHub, run:

devtools::install_github("shaunpwilkinson/insect", build_vignettes = TRUE) 
library(insect)

Classifying sequences

Training an insect classifier will be the subject of another tutorial; however, several classifiers are already available for some of the more common metabarcoding primer sets:

Marker Target Primers Source Version Date Download 16S Marine crustaceans Crust16S_F/Crust16S_R (Berry et al 2017) GenBank 4 20180626 RDS (7.1 MB) 16S Marine fish Fish16sF/16s2R (Berry et al 2017; Deagle et al 2007) GenBank 4 20180627 RDS (6.8MB) 18S Marine eukaryotes 18S_1F/18S_400R (Pochon et al 2017) SILVA_132_LSUParc, GenBank 5 20180709 RDS (11.8 MB) 18S Marine eukaryotes 18S_V4F/18S_V4R (Stat et al 2017) GenBank 4 20180525 RDS (11.5 MB) 23S Algae p23SrV_f1/p23SrV_r1 (Sherwood & Presting 2007) SILVA_132_LSUParc 1 20180715 RDS (26.9MB) COI Metazoans (amino) mlCOIintF/jgHCO2198 (Leray et al 2013) Midori, GenBank 4aa 20181009 RDS (47 MB) COI Metazoans (marine only) mlCOIintF/jgHCO2198 (Leray et al 2013) Midori, GenBank 4mo 20181009 RDS (34.1 MB) COI Metazoans (no third codon positions) mlCOIintF/jgHCO2198 (Leray et al 2013) Midori, GenBank 4nt 20181009 RDS (51.9 MB) ITS2 Cnidarians and sponges scl58SF/scl28SR (Wilkinson et al in prep) GenBank 5 20180920 RDS (6.6 MB)

To classify a sequence or set of sequences, first read them into R as a "DNAbin" list object. FASTA files can be parsed as follows:

x <- readFASTA("<path-to-file>.fasta")

Alternatively users may wish to assign taxon IDs to the output from the DADA2 pipeline, in which case the column names of the ouput table can be parsed as in the following example:

data("samoa") 
x <- char2dna(colnames(samoa))
## name the sequences sequentially
names(x) <- paste0("ASV", seq_along(x))

The next step is to download and read in the classifier. It is important to ensure that the classifier was trained using the same primer set as that used to generate the query data. In this example the data were generated from autonomous reef monitoring structures in American Samoa (ARMS) using the COI metabarcoding primers mlCOIintF and jgHCO2198 (Leray et al 2013), and de-noised, filtered and merged following the DADA2 tutorial.

There are currently three classifiers available for this primer set, a marine only version, one for DNA sequences with every third codon position removed (nothirds), and one for translated amino acid sequences (amino).

Here we'll use the amino acid version, which was created by translating the MIDORI UNIQUE 20180221 training dataset using the EBI5 invertebrate mitochondrial translation table.

The 47 MB classifier can be downloaded to the current working directory and read into R as follows:

download.file("https://www.dropbox.com/s/c987wohded37cxg/classifier.rds?dl=1", 
              destfile = "classifier.rds", mode = "wb")
classifier <- readRDS("classifier.rds")

Now we need to translate the amplicon sequence variants (ASVs) to amino acid sequences using the same number 5 translation table. Before this, any sequences whose length is not divisible by three should be discarded. The sequences are then converted to character type for translation with the translate function in the seqinr package, and converted back to binary form object using ape::as.AAbin.

keeps <- sapply(x, length) %% 3 == 1L
x <- x[keeps]
x <- ape::as.character.DNAbin(x)
x <- lapply(x, seqinr::translate, numcode = 5, frame = 1)
x <- ape::as.AAbin(x)

Now we need to check that none of the translated sequences contain stop codons:

keeps <- sapply(x, function(v) !any(v == as.raw(42)))
x <- x[keeps]

Finally, the amino acid sequences are passed to the classifier for taxon ID assignment. There is an option to perform a nearest-neighbor search prior to the computationally-expensive recursive model test procedure, which can save time and improve resolution ('recall') at lower taxonomic ranks. Note that this can be a double-edged sword; if multiple species share an identical or near-identical sequence, and the true taxon of the query sequence is missing from the trainingset, the algorithm may over-classify the sequence and return a congeneric taxon. To perform a nearest-neighbor search with a similarity threshold of say 0.99 (meaning any sequence in the trainingset with a similarity greater than or equal to 0.99 is considered a match), set ping = 0.99. To stay on the safe side, we will set ping = 1 (i.e. only sequences with 100% identity are considered matches).

out <- classify(x, classifier, cores = 2, ping = 1)
representative taxID taxon rank score kingdom phylum class order family genus species ASV1 2806 Florideophyceae class 0.9966 Florideophyceae ASV2 6379 Chaetopterus genus 0.9881 Metazoa Annelida Polychaeta Spionida Chaetopteridae Chaetopterus ASV3 2806 Florideophyceae class 0.9769 Florideophyceae ASV4 116569 Neocopepoda infraclass 0.9986 Metazoa Arthropoda Hexanauplia ASV5 33213 Bilateria no rank 0.9456 Metazoa ASV6 2806 Florideophyceae class 0.9973 Florideophyceae ASV7 39820 Nereididae family 0.9110 Metazoa Annelida Polychaeta Phyllodocida Nereididae ASV8 1 root no rank 1.0000 ASV9 2806 Florideophyceae class 0.9147 Florideophyceae ASV11 2759 Eukaryota superkingdom 1.0000 ASV12 2806 Florideophyceae class 0.9424 Florideophyceae ASV13 2759 Eukaryota superkingdom 0.9999 ASV14 33213 Bilateria no rank 0.9456 Metazoa ASV15 2806 Florideophyceae class 0.9530 Florideophyceae ASV16 39820 Nereididae family NA Metazoa Annelida Polychaeta Phyllodocida Nereididae

Any sequences that return exact hits (or near matches if ping = 0.99 or similar) with at least one training sequence are assigned a score of NA, as in the final row of the table above. Here, the multiple matching sequences have a Nereid polychaete common ancestor, and the query sequence was therefore assigned to the family Nereididae.

Further reading

A more detailed overview of the package and its functions can be found here or by running

vignette("insect-vignette")

Issues

If you experience a problem using this software please feel free to raise it as an issue on GitHub.

Acknowledgements

This software was developed at Victoria University of Wellington with funding from a Rutherford Foundation Postdoctoral Research Fellowship award from the Royal Society of New Zealand.



shaunpwilkinson/insect documentation built on Oct. 16, 2018, 6:13 p.m.