phylotypr

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

In the microbiome field, 16S rRNA gene sequencing has been a popular method of characterizing the types of bacteria found in a community. The most frequently used tool for classifying these sequences has been the naive Bayesian classifier that used to be accessible through the Ribosomal Database Project and is currently found in tools like mothur and QIIME. For the first time, the {phylotypr} package provides an R-based implementation of that tool. There are minimal dependencies and the easy to use code executes rapidly.

Getting started with {phylotypr}

To get started you'll need to install {phylotypr} using either of these commands:

# devtools::install_github("mothur/phylotypr") #install development version
install.packages("phylotypr") # install stable version from CRAN

To get access to the packages function and a demonstration version of the training data, you'll need to run the library() function. Because the classification algorithm makes use of a random number generator, I strongly encourage you to set the seed to get consistent results:

library(phylotypr)

set.seed(19760620) # pat's birtday in YYYYMMDD format

Building the database

Before we can classify a sequence, we need to generate the database. This can be done using the build_kmer_database() function and takes several seconds to execute. Unless you save the database object (e.g. db below) to a Rda-formatted file, you'll need to run this function each time you load {phylotypr}

db <- build_kmer_database(
  trainset9_pds$sequence,
  trainset9_pds$taxonomy
)

By default, build_kmer_database() will use 8-nt long kmers to calculate its probabilities. This size can be altered by the user if needed, but 8-nt performs very well and shouldn't need to be adjusted.

Classifying a single unkown sequence

To classify a sequence we need the DNA sequence to be a character object like unknown is below.

unknown <- "TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGATCGTTAAGTCAGTGGTCAAATTGAGGGGCTCAACCCCTTCCCGCCATTGAAACTGGCGATCTTGAGTGGAAGAGAAGTATGCGGAATGCGTGGTGTAGCGGTGAAATGCATAGATATCACGCAGAACCCCGATTGCGAAGGCAGCATGCCGGCTTCCTACTGACGCTCATGCACGAAAGTGTGGGTAACGAACAGG"

The unknown object can then be classified using classify_sequence() along with the database object from above (i.e., db).

consensus <- classify_sequence(unknown = unknown, database = db)

The output of classify_sequence is a list with taxonomic levels from kingdom (or domain) down to the genus and the confidence scores for that sequence in each of those taxonomic levels

consensus

In this example, you'll notice that the family and genus assignment have confidence scores below 80%. This is the value commonly used in mothur and which was used on the RDP website. To filter the classification to that level of confidence, you can use the filter_taxonomy() function:

filtered <- filter_taxonomy(consensus)

You should see that the "Porphyromonadaceae" and "Barnesiella" names are removed along with its confidence score.

filtered

To convert that list format to a string, you can use the print_taxonomy() function:

print_taxonomy(filtered)

Classifying multiple unknown sequences

Most likely you will have multiple sequences that you would like to classify. The {phylotypr} package installs with an example FASTA-formatted file, "miseq_sop.fasta". The code below shows you can read in a FASTA-formatted file and classify the sequences. Below, I show how you can use the map_chr() function from {purrr}, or something like it. Note that for classifying your own data you would replace phylotypr_example(...) with the path to your own data.

library(dplyr)
library(purrr)

set.seed(19760620) # pat's birtday in YYYYMMDD format

miseq <- read_fasta(phylotypr_example("miseq_sop.fasta.gz"))

miseq |>
  dplyr::mutate(
    classification = purrr::map_chr(
      sequence,
      ~ classify_sequence(unknown = .x, database = db) |>
        filter_taxonomy() |>
        print_taxonomy(),
      .progress = TRUE
    )
  )

Classifying multiple unknown sequences in parallel

The {furrr} package allows you to run classify_sequence() in parallel. So, if your computer has 8 processors, {furrr} will split your data into 4 chunks and then classify each chunk separately before pulling it all together. You can alter the number of processors by changing the value assigned to the workers argument of plan(). In testing, we found that increasing the number of processors provides diminishing returns. For the example, 4 and 10 processors took about the same amount of time to execute. The value for future.globals.maxSize may need to be increased if your database gets larger. Finally, because classify_sequence() uses a random number generator to create its bootstrap replicates, we need to provide the seed to future_map_chr(). You'll notice that in the furrr_options() argument. Feel free to adjust it to your desired seed.

library(dplyr)
library(furrr)

miseq <- read_fasta(phylotypr_example("miseq_sop.fasta.gz"))

plan(strategy = multisession, workers = 4)
options(future.globals.maxSize = 10000000000)

miseq |>
  mutate(
    classification = future_map_chr(
      sequence,
      ~ classify_sequence(unknown = .x, database = db) |>
        filter_taxonomy() |>
        print_taxonomy(),
      .progress = TRUE,
      .options = furrr_options(seed = 19760620)
    )
  )

Run-to-run variation

Consider the results of classifying unknown three times:

map_chr(
  rep(unknown, 3),
  ~ classify_sequence(unknown = .x, database = db, num_bootstraps = 100) |>
    filter_taxonomy() |>
    print_taxonomy()
)

You'll notice that the classification of the three replicates varies slightly. Again, this is due to the use of the random number generator and getting confidence scores that bounce around 80%, which results in different levels of filtering. To get a more stable classification, you should be setting your random number generator seed with set.seed(). To get a more precise confidence score, you could set num_bootstraps = 1000 in classify_sequence(). Be forewarned that the function will take 10-times longer to run with 10-times the number of bootstraps!

map_chr(
  rep(unknown, 3),
  ~ classify_sequence(unknown = .x, database = db, num_bootstraps = 100) |>
    filter_taxonomy() |>
    print_taxonomy()
)

Alternative databases

The {phylotypr} package ships with the RDP's v.9 of their training data. This is relatively small and old (2010) relative to their latest versions. You are encouraged to install newer versions of the RDP, greengenes, and SILVA databases from the {phylotyprrefdata} package on GitHub. Note that installing the package will take about 20 minutes to install. If it sits at "moving datasets to lazyload DB" for a long time, this is normal :)

devtools::install_github("mothur/phylotyprrefdata")
library(phylotyprrefdata)

The following will list the references that are available in {phylotyprrefdata}:

data(package = "phylotyprrefdata")


Try the phylotypr package in your browser

Any scripts or data that you put into this service are public.

phylotypr documentation built on April 3, 2025, 5:51 p.m.