knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
In the microbiome field, 16S rRNA gene sequencing has been a popular method of characterizing the types of bacteria found in a community. The most frequently used tool for classifying these sequences has been the naive Bayesian classifier that used to be accessible through the Ribosomal Database Project and is currently found in tools like mothur and QIIME. For the first time, the {phylotypr}
package provides an R-based implementation of that tool. There are minimal dependencies and the easy to use code executes rapidly.
{phylotypr}
To get started you'll need to install {phylotypr}
using either of these commands:
# devtools::install_github("mothur/phylotypr") #install development version install.packages("phylotypr") # install stable version from CRAN
To get access to the packages function and a demonstration version of the training data, you'll need to run the library()
function. Because the classification algorithm makes use of a random number generator, I strongly encourage you to set the seed to get consistent results:
library(phylotypr) set.seed(19760620) # pat's birtday in YYYYMMDD format
Before we can classify a sequence, we need to generate the database. This can be done using the build_kmer_database()
function and takes several seconds to execute. Unless you save the database object (e.g. db
below) to a Rda
-formatted file, you'll need to run this function each time you load {phylotypr}
db <- build_kmer_database( trainset9_pds$sequence, trainset9_pds$taxonomy )
By default, build_kmer_database()
will use 8-nt long kmers to calculate its probabilities. This size can be altered by the user if needed, but 8-nt performs very well and shouldn't need to be adjusted.
To classify a sequence we need the DNA sequence to be a character object like unknown
is below.
unknown <- "TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGATCGTTAAGTCAGTGGTCAAATTGAGGGGCTCAACCCCTTCCCGCCATTGAAACTGGCGATCTTGAGTGGAAGAGAAGTATGCGGAATGCGTGGTGTAGCGGTGAAATGCATAGATATCACGCAGAACCCCGATTGCGAAGGCAGCATGCCGGCTTCCTACTGACGCTCATGCACGAAAGTGTGGGTAACGAACAGG"
The unknown
object can then be classified using classify_sequence()
along with the database object from above (i.e., db
).
consensus <- classify_sequence(unknown = unknown, database = db)
The output of classify_sequence
is a list with taxonomic levels from kingdom (or domain) down to the genus and the confidence scores for that sequence in each of those taxonomic levels
consensus
In this example, you'll notice that the family and genus assignment have confidence scores below 80%. This is the value commonly used in mothur and which was used on the RDP website. To filter the classification to that level of confidence, you can use the filter_taxonomy()
function:
filtered <- filter_taxonomy(consensus)
You should see that the "Porphyromonadaceae" and "Barnesiella" names are removed along with its confidence score.
filtered
To convert that list format to a string, you can use the print_taxonomy()
function:
print_taxonomy(filtered)
Most likely you will have multiple sequences that you would like to classify. The {phylotypr}
package installs with an example FASTA-formatted file, "miseq_sop.fasta". The code below shows you can read in a FASTA-formatted file and classify the sequences. Below, I show how you can use the map_chr()
function from {purrr}
, or something like it. Note that for classifying your own data you would replace phylotypr_example(...)
with the path to your own data.
library(dplyr) library(purrr) set.seed(19760620) # pat's birtday in YYYYMMDD format miseq <- read_fasta(phylotypr_example("miseq_sop.fasta.gz")) miseq |> dplyr::mutate( classification = purrr::map_chr( sequence, ~ classify_sequence(unknown = .x, database = db) |> filter_taxonomy() |> print_taxonomy(), .progress = TRUE ) )
The {furrr}
package allows you to run classify_sequence()
in parallel. So, if your computer has 8 processors, {furrr}
will split your data into 4 chunks and then classify each chunk separately before pulling it all together. You can alter the number of processors by changing the value assigned to the workers
argument of plan()
. In testing, we found that increasing the number of processors provides diminishing returns. For the example, 4 and 10 processors took about the same amount of time to execute. The value for future.globals.maxSize
may need to be increased if your database gets larger. Finally, because classify_sequence()
uses a random number generator to create its bootstrap replicates, we need to provide the seed to future_map_chr()
. You'll notice that in the furrr_options()
argument. Feel free to adjust it to your desired seed.
library(dplyr) library(furrr) miseq <- read_fasta(phylotypr_example("miseq_sop.fasta.gz")) plan(strategy = multisession, workers = 4) options(future.globals.maxSize = 10000000000) miseq |> mutate( classification = future_map_chr( sequence, ~ classify_sequence(unknown = .x, database = db) |> filter_taxonomy() |> print_taxonomy(), .progress = TRUE, .options = furrr_options(seed = 19760620) ) )
Consider the results of classifying unknown
three times:
map_chr( rep(unknown, 3), ~ classify_sequence(unknown = .x, database = db, num_bootstraps = 100) |> filter_taxonomy() |> print_taxonomy() )
You'll notice that the classification of the three replicates varies slightly. Again, this is due to the use of the random number generator and getting confidence scores that bounce around 80%, which results in different levels of filtering. To get a more stable classification, you should be setting your random number generator seed with set.seed()
. To get a more precise confidence score, you could set num_bootstraps = 1000
in classify_sequence()
. Be forewarned that the function will take 10-times longer to run with 10-times the number of bootstraps!
map_chr( rep(unknown, 3), ~ classify_sequence(unknown = .x, database = db, num_bootstraps = 100) |> filter_taxonomy() |> print_taxonomy() )
The {phylotypr}
package ships with the RDP's v.9 of their training data. This is relatively small and old (2010) relative to their latest versions. You are encouraged to install newer versions of the RDP, greengenes, and SILVA databases from the {phylotyprrefdata}
package on GitHub. Note that installing the package will take about 20 minutes to install. If it sits at "moving datasets to lazyload DB" for a long time, this is normal :)
devtools::install_github("mothur/phylotyprrefdata") library(phylotyprrefdata)
The following will list the references that are available in {phylotyprrefdata}
:
data(package = "phylotyprrefdata")
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.