metagenomeFeatures classes and methods.
In metagenomeFeatures: Exploration of marker-gene sequence taxonomic annotations

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
suppressPackageStartupMessages(library(metagenomeFeatures))

The r Biocpkg("metagenomeFeatures") package defines two classes, MgDb-class for working with 16S rRNA databases, and mgFeatures-class for marker-gene survey feature data. Both classes are S4 object-oriented classes and contain slots for (1) the sequences themselves, (2) taxonomic lineage, and (3) a phylogenetic tree with the evolutionary relationship between features, and (4) metadata. This vignette demonstrates how to explore and subset MgDb-class objects and create new mgFeatures-class objects.

`MgDb-class` Object

The MgDb-class provides a consistent data structure for working with different 16S rRNA databases. 16S rRNA databases contain hundreds of thousands to millions of sequences, therefore an SQLite database is used to store the taxonomic and sequence data. Using an SQLite database prevents the user from loading the full database into memory.
The database connection is managed using the r CRANpkg("RSQLite") R package, and the taxonomic data are accessed using the r CRANpkg("dplyr") and r CRANpkg("dbplyr") packages. The r Biocpkg("DECIPHER") package is used to format the sequence data as an SQLite database and provides functions for working directly with the sequence data in the SQLite database. The phylo-class, from the r CRANpkg("APE") R package, defines the tree slot.

The following examples utilize gg85 as the example MgDb-class object included in the r Biocpkg("metagenomeFeatures") package. MgDb-class objects with full databases are in separate Bioconductor Annotation packages such as the r Biocpkg("greengenesMgDb13.5") package.[^1]

After loading the r Biocpkg("metagenomeFeatures"), the get_gg13.8_85MgDb() is use to load the gg85 database. The taxonomy table is stored in a SQLite database and the taxonomy slot is an open connection to the database.

library(metagenomeFeatures)
gg85 <- get_gg13.8_85MgDb()

`MgDb-class` Accessor Methods

The r Biocpkg("metagenomeFeatures") package includes accessor methods for the four slots mgDb_meta(), mgDb_tree(), mgDb_taxa(), mgDb_seq().

Additional accessor methods, taxa_*() are provided for different taxonomy table elements. These accessors are useful in determining appropriate parameters for the mgDb_select() (See section \@ref(mgdb-select)). The taxa_keytypes() accessor returns a vector with taxonomy table column names and the taxa_columns() return a vector of taxonomy relevant column names. The taxa_keys() method returns a data frame with all values in a database for specified keytype(s).

taxa_keytypes(gg85)

taxa_columns(gg85)

head(taxa_keys(gg85, keytype = c("Kingdom")))

`MgDb-class` Select Methods {#mgdb-select}

The MgDb-class select method is used to retrieve database entries for a specified taxonomic group or id list and can return either taxonomic, sequences, phylogenetic tree, or a combinantion of the three. The type argument defines the type of information the database returns.

## Only returns taxa
mgDb_select(gg85, type = "taxa",
            keys = c("Vibrionaceae", "Enterobacteriaceae"),
            keytype = "Family")

## Only returns seq
mgDb_select(gg85, type = "seq",
            keys = c("Vibrionaceae", "Enterobacteriaceae"),
            keytype = "Family")

## Returns taxa, seq, and tree
mgDb_select(gg85, type = "all",
            keys = c("Vibrionaceae", "Enterobacteriaceae"),
            keytype = "Family")

`mgFeatures-class` Object

mgFeatures-class is used for storing and working with marker-gene survey feature data.
Similar to the MgDb-class, the mgFeatures-class has four slots for taxonomy, sequences, phylogenetic tree, and metadata. As the number of features in a marker-gene survey dataset is significantly smaller than the number of sequences in a reference database, mgFeatures uses common Bioconductor data structures, DataFrames and DNAStringSets to define the taxonomic and sequence slots. Similar to MgDb-class, a phylo class object is used to define the tree slot. Like the MgDb-class there are accessor methods, mgF_*(), for the fours slots.

New mgFeatures-class objects can be defined directly or from a MgDb-class object and user defined values for subsetting the database, including database sequence ids.[^2] The two approaches are demonstrated below.

ve_select <- mgDb_select(gg85, type = "all",
               keys = c("Vibrionaceae", "Enterobacteriaceae"),
               keytype = "Family")

mgFeatures(taxa = ve_select$taxa, 
           tree = ve_select$tree, 
           seq = ve_select$seq,
           metadata = list("gg85 subset Vibrionaceae and Enterobacteriaceae Family"))

annotateFeatures(gg85, query = ve_select$taxa$Keys)

Session info {.unnumbered}

sessionInfo()

[^1]: Other databases are avilable as Bioconductor AnnotationData including; Greengenes Release 13.5: r Biocpkg("greengenesMgDb13.5"), Greengenes Release 13.8 99% OTUs,r Biocpkg("greengenes13.8_99MgDb") Ribosomal Database Project Release 11.5, r Biocpkg("ribosomaldatabaseproject11.5MgDb"), and Silva 128.1: r Biocpkg("silva128.1MgDb"). [^3]: See the "Using metagenomeFeatures to Retrieve Feature Data." vignette, Vignettes("Using metagenomeFeatures to Retrieve Feature Data.") for using the mgFeatues-class with marker-gene survey study data.