insect is an R package for taxonomic identification of amplicon
sequence variants generated by DNA metabarcoding analysis. The learning
and classification algorithms implemented in the package are based on
full probabilistic models (profile hidden Markov models) and offer
highly accurate taxon IDs, albeit at a relatively high computational
cost.
The package also contains functions for searching and downloading reference sequences and taxonomic information from NCBI, a "virtual PCR" tool for sequence trimming, a function for purging erroneously labeled reference sequences, and several other tools.
insect is designed to be used in conjunction with the dada2 pipeline or
other de-noising tools that produce a list of amplicon sequence variants
(ASVs). While unfiltered sequences can also be processed with high
accuracy, the insect classification algorithm is relatively slow,
since it uses a computationally intensive dynamic programming algorithm
to find the likelihood of each sequence given the models at each
node of the classification tree. Hence filtered input datasets are
generally much faster to process.
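To give a feel for why the likelihood calculation is costly, here is a minimal sketch of the forward algorithm, the standard dynamic programming recursion for scoring a sequence against a hidden Markov model. It is an illustration only and not the package's own code: the two-state models, their probabilities, and the function names (forward_loglik, log_sum_exp) are all made up for this example, and real profile HMMs have many more states.

```r
# Add probabilities that are stored as logs, without numerical underflow
log_sum_exp <- function(v) {
  m <- max(v)
  if (is.infinite(m)) return(m)
  m + log(sum(exp(v - m)))
}

# Forward algorithm: log-likelihood of the symbols in `obs` under an HMM with
# initial probabilities `init`, transition matrix `trans` (rows = from-state)
# and emission matrix `emis` (rows = states, named columns = symbols).
# The nested loops cost (sequence length) x (number of states)^2 operations.
forward_loglik <- function(obs, init, trans, emis) {
  n_states <- length(init)
  alpha <- log(init) + log(emis[, obs[1]])
  for (t in seq_along(obs)[-1]) {
    alpha_new <- numeric(n_states)
    for (j in seq_len(n_states)) {
      alpha_new[j] <- log_sum_exp(alpha + log(trans[, j])) + log(emis[j, obs[t]])
    }
    alpha <- alpha_new
  }
  log_sum_exp(alpha)
}

# Two hypothetical toy models: one AT-rich, one GC-rich
symbols <- c("A", "C", "G", "T")
init <- c(0.5, 0.5)
trans <- matrix(c(0.9, 0.1,
                  0.1, 0.9), nrow = 2, byrow = TRUE)
emis_at <- matrix(c(0.4, 0.1, 0.1, 0.4,
                    0.3, 0.2, 0.2, 0.3), nrow = 2, byrow = TRUE,
                  dimnames = list(NULL, symbols))
emis_gc <- matrix(c(0.1, 0.4, 0.4, 0.1,
                    0.2, 0.3, 0.3, 0.2), nrow = 2, byrow = TRUE,
                  dimnames = list(NULL, symbols))

# Score one query against both models; the higher log-likelihood fits better
query <- strsplit("ATTATAAT", "")[[1]]
ll_at <- forward_loglik(query, init, trans, emis_at)
ll_gc <- forward_loglik(query, init, trans, emis_gc)
```

Because a recursion of this kind has to be repeated for each query sequence and each candidate model as the sequence descends the classification tree, reducing the number of input sequences by de-noising and filtering reduces the run time accordingly.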
To download insect from CRAN and load the package, run:
install.packages("insect")
library(insect)
To download the latest development version from GitHub, run:
devtools::install_github("shaunpwilkinson/insect", build_vignettes = TRUE)
library(insect)
Classifiers for some of the more commonly used metabarcoding primer sets are available here:
| Marker | Target | Primers | Source | Version | Date | Download |
|---|---|---|---|---|---|---|
| 12S | Fish | MiFishUF/MiFishUR (Miya et al 2015) | GenBank | 1 | 20181111 | RDS (9 MB) |
| 16S | Marine crustaceans | Crust16S_F/Crust16S_R (Berry et al 2017) | GenBank | 4 | 20180626 | RDS (7.1 MB) |
| 16S | Marine fish | Fish16sF/16s2R (Berry et al 2017; Deagle et al 2007) | GenBank | 4 | 20180627 | RDS (6.8 MB) |
| 18S | Marine eukaryotes | 18S_1F/18S_400R (Pochon et al 2017) | SILVA_132_LSUParc, GenBank | 5 | 20180709 | RDS (11.8 MB) |
| 18S | Marine eukaryotes | 18S_V4F/18S_V4R (Stat et al 2017) | GenBank | 4 | 20180525 | RDS (11.5 MB) |
| 23S | Algae | p23SrV_f1/p23SrV_r1 (Sherwood & Presting 2007) | SILVA_132_LSUParc | 1 | 20180715 | RDS (26.9 MB) |
| COI | Metazoans | mlCOIintF/jgHCO2198 (Leray et al 2013) | Midori, GenBank | 5 | 20181124 | RDS (140 MB) |
| ITS2 | Cnidarians and sponges | scl58SF/scl28SR (Wilkinson et al in prep) | GenBank | 5 | 20180920 | RDS (6.6 MB) |

To classify a sequence or set of sequences, first read them into R as a "DNAbin" list object. FASTA files can be parsed as follows:
x <- readFASTA("<path-to-file>.fasta")
Alternatively, users may wish to assign taxon IDs to the output from the DADA2 pipeline, in which case the column names of the output table can be parsed as in the following example:
data("samoa")
x <- char2dna(colnames(samoa))
## name the sequences sequentially
names(x) <- paste0("ASV", seq_along(x))
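The layout that this step relies on can be sketched with a toy table. The matrix below is a hypothetical stand-in for a DADA2 sequence table (samples in rows, ASVs in columns, with the sequences themselves as column names); the sample names, sequences and counts are invented for illustration. The call to char2dna is only attempted if insect is installed.

```r
# Toy stand-in for a DADA2 sequence table: rows are samples, columns are
# ASVs, and the column names hold the ASV sequences themselves
seqtab <- matrix(c(12L, 0L, 3L,
                    0L, 7L, 5L),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("site1", "site2"),
                                 c("ACGTACGT", "TTGACCGA", "GGCATTAC")))

# Pull the sequences out of the column names and give them short labels
asv_seqs <- colnames(seqtab)
names(asv_seqs) <- paste0("ASV", seq_along(asv_seqs))

# Relabel the table columns with the same short names, so that read counts
# can later be matched to the taxon IDs by ASV name
colnames(seqtab) <- names(asv_seqs)

# With insect installed, convert the character sequences for classification
if (requireNamespace("insect", quietly = TRUE)) {
  x_toy <- insect::char2dna(asv_seqs)
  names(x_toy) <- names(asv_seqs)
}
```

Keeping the named character vector alongside the relabelled table means the original sequences can still be recovered from the short ASV labels.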
The next step is to download and read in the classifier. It is important to ensure that the classifier was trained using the same primer set as that used to generate the query data. In this example the data were generated from autonomous reef monitoring structures (ARMS) in American Samoa using the COI metabarcoding primers mlCOIintF and jgHCO2198 (Leray et al 2013), and de-noised, filtered and merged following the DADA2 tutorial.
The COI classifier was created using the MIDORI UNIQUE 20180221 training set, supplemented with around 14,000 non-metazoan COI sequences downloaded from GenBank.
The 140 MB classifier can be downloaded to the current working directory and read into R as follows:
download.file("https://www.dropbox.com/s/dvnrhnfmo727774/classifier.rds?dl=1",
destfile = "classifier.rds", mode = "wb")
classifier <- readRDS("classifier.rds")
There is an option to perform a nearest-neighbor search prior to the
computationally expensive recursive model test procedure, which can save
time and improve resolution ('recall') at lower taxonomic ranks. Note
that this can be a double-edged sword: if multiple species share an
identical or near-identical sequence, and the true taxon of the query
sequence is missing from the training set, the algorithm may
over-classify the sequence and return a congeneric taxon. To perform a
nearest-neighbor search with a similarity threshold of 0.99 (meaning any
sequence in the training set with a similarity greater than or equal to
99% is considered a match), set ping = 0.99. To stay on the safe side,
we will set ping = 1 (i.e. only sequences with 100% identity are
considered matches).
out <- classify(x, classifier, threshold = 0.8, ping = 1)
| representative | taxID | taxon | rank | score | kingdom | phylum | class | order | family | genus | species |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ASV1 | 2806 | Florideophyceae | class | 0.9981 | | | Florideophyceae | | | | |
| ASV2 | 6379 | Chaetopterus | genus | 1.0000 | Metazoa | Annelida | Polychaeta | Spionida | Chaetopteridae | Chaetopterus | |
| ASV3 | 2806 | Florideophyceae | class | 0.9989 | | | Florideophyceae | | | | |
| ASV4 | 2172821 | Multicrustacea | superclass | 1.0000 | Metazoa | Arthropoda | | | | | |
| ASV5 | 131567 | cellular organisms | no rank | 0.9952 | | | | | | | |
| ASV6 | 2806 | Florideophyceae | class | 0.9981 | | | Florideophyceae | | | | |
| ASV7 | 39820 | Nereididae | family | 1.0000 | Metazoa | Annelida | Polychaeta | Phyllodocida | Nereididae | | |
| ASV8 | 116571 | Podoplea | superorder | 0.9995 | Metazoa | Arthropoda | Hexanauplia | | | | |
| ASV9 | 2806 | Florideophyceae | class | 0.9482 | | | Florideophyceae | | | | |
| ASV10 | 1 | root | no rank | NA | | | | | | | |
| ASV11 | 115834 | Hesionidae | family | 1.0000 | Metazoa | Annelida | Polychaeta | Phyllodocida | Hesionidae | | |
| ASV12 | 1443949 | Corallinophycidae | subclass | 0.9910 | | | Florideophyceae | | | | |
| ASV13 | 33213 | Bilateria | no rank | 1.0000 | Metazoa | | | | | | |
| ASV14 | 131567 | cellular organisms | no rank | 0.9952 | | | | | | | |
| ASV15 | 2806 | Florideophyceae | class | 0.9993 | | | Florideophyceae | | | | |
| ASV16 | 39820 | Nereididae | family | 1.0000 | Metazoa | Annelida | Polychaeta | Phyllodocida | Nereididae | | |
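The effect of the ping threshold described above can be illustrated with a small sketch. This is not how insect implements its nearest-neighbor search; it simply compares equal-length strings position by position, and the helper names (identity_prop, nearest_hit) and the reference sequences are hypothetical.

```r
# Proportion of matching positions between two sequences
# (assumption: the sequences are already aligned and of equal length)
identity_prop <- function(a, b) {
  a <- strsplit(a, "")[[1]]
  b <- strsplit(b, "")[[1]]
  stopifnot(length(a) == length(b))
  mean(a == b)
}

# Name of the most similar reference if it meets the threshold, otherwise NA
nearest_hit <- function(query, refs, ping = 1) {
  sims <- vapply(refs, identity_prop, numeric(1), b = query)
  best <- which.max(sims)
  if (sims[best] >= ping) names(refs)[best] else NA_character_
}

# Two made-up 100 bp reference sequences
ref_a <- paste(rep("ACGTTGCAAC", 10), collapse = "")
ref_b <- paste(rep("TTGACCGATA", 10), collapse = "")
refs <- c(speciesA = ref_a, speciesB = ref_b)

# A query that differs from speciesA at a single position (99 of 100 match)
query <- ref_a
substr(query, 50, 50) <- "T"

relaxed_hit <- nearest_hit(query, refs, ping = 0.98)  # close match accepted
strict_hit  <- nearest_hit(query, refs, ping = 1)     # no exact match, so NA
exact_hit   <- nearest_hit(ref_a, refs, ping = 1)     # identical sequence
```

Under the relaxed threshold the near-identical query is assigned to speciesA, which is the over-classification risk noted above when the true taxon is absent from the training set. With ping = 1 the same query gets no nearest-neighbor hit and would be left to the model-based procedure.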
A more detailed overview of the package and its functions can be found in the package vignette, which can be opened by running:
vignette("insect-vignette")
If you experience a problem using this software please feel free to raise it as an issue on GitHub.
This software was developed at Victoria University of Wellington with funding from a Rutherford Foundation Postdoctoral Research Fellowship award from the Royal Society of New Zealand. Unpublished COI data care of Molly Timmers (NOAA).