RDPutils

RDPutils was originally written to provide a means of constructing phyloseq objects from the output of RDP's web-based tools for clustering and classifying DNA sequences from high-throughput amplicon sequencing projects. It has since been expanded to handle output from RDP's command line tools as well as USEARCH and iTagger (JGI) output.

Phyloseq is an R/Bioconductor package that includes a variety of wrappers for quick exploratory data analysis of sequencing data, but perhaps its most convenient feature is that it enables rapid and flexible subsetting of data from a large experiment. A phyloseq object has slots for an OTU table, a classification table, a sample data table, a tree of the sequences representing each OTU, and the representative sequences themselves. I recommend phyloseq because it organizes all data for an experiment, and R because it provides a flexible means of analyzing that data. The functions in this package reformat RDP, USEARCH, and iTagger outputs so that they may be used to fill phyloseq slots.
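To make the slot structure concrete, here is a minimal sketch that fills three of the slots by hand and then subsets the result. All OTU names, sample names, counts, and taxa are made up for illustration, and the phyloseq calls are guarded so the snippet still runs when phyloseq is not installed.

```r
# Toy OTU table: 3 OTUs (rows) x 3 samples (columns); values are invented.
otu_mat <- matrix(
  c(10L, 0L, 4L,
     3L, 7L, 0L,
     0L, 5L, 9L),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("OTU_1", "OTU_2", "OTU_3"), c("S1", "S2", "S3"))
)

# Classification table: one row per OTU, one column per rank.
tax_mat <- matrix(
  c("Bacteria", "Proteobacteria",
    "Bacteria", "Firmicutes",
    "Archaea",  "Euryarchaeota"),
  nrow = 3, byrow = TRUE,
  dimnames = list(rownames(otu_mat), c("Domain", "Phylum"))
)

# Sample data: row names must match the sample names of the OTU table.
sam_df <- data.frame(
  Treatment = c("control", "amended", "amended"),
  Depth_cm  = c(5, 5, 20),
  row.names = colnames(otu_mat)
)

if (requireNamespace("phyloseq", quietly = TRUE)) {
  ps <- phyloseq::phyloseq(
    phyloseq::otu_table(otu_mat, taxa_are_rows = TRUE),
    phyloseq::tax_table(tax_mat),
    phyloseq::sample_data(sam_df)
  )
  # Keep only the amended samples, then drop OTUs absent from them.
  keep   <- rownames(sam_df)[sam_df$Treatment == "amended"]
  ps_sub <- phyloseq::prune_samples(keep, ps)
  ps_sub <- phyloseq::prune_taxa(phyloseq::taxa_sums(ps_sub) > 0, ps_sub)
  print(ps_sub)
}
```

The tree and representative-sequence slots are left empty here; they can be added later once the sequences carry the same names as the OTUs.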

The Ribosomal Database Project (RDP) provides both web-based and command line tools (RDPTools) for processing rRNA gene sequences from Bacteria, Archaea, and Fungi as well as functional genes. The web-based tools, and tutorials for using them, are available at http://rdp.cme.msu.edu/index.jsp. The command line tools are available at https://github.com/rdpstaff/RDPTools. Processing can take either of two approaches. In the "supervised" approach, sequences from multiple samples are binned by classifying them using a database for Bacteria, Archaea, or Fungi, or a user's own database. Further processing has traditionally been done in a spreadsheet program, but that is no longer necessary. The function hier2phyloseq in this package imports RDP classifier results in hierarchy format into a phyloseq object with OTU and taxonomy tables. In the "unsupervised" approach, sequences are clustered into OTUs based on their degree of similarity. RDP provides additional tools to parse the cluster files into OTU tables that can be imported into R, and to retrieve representative sequences for each cluster. OTU tables can also be parsed from the cluster file with a function in this package.
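Conceptually, the binning step of the supervised approach amounts to counting sequences per taxon per sample. The toy sketch below does this in base R with invented per-sequence assignments. It does not read RDP hierarchy files and is not how hier2phyloseq works internally; the column names `sample` and `genus` are assumptions made for the example.

```r
# Hypothetical classifier output: one row per sequence.
assignments <- data.frame(
  sample = c("S1", "S1", "S1", "S2", "S2", "S3"),
  genus  = c("Bacillus", "Bacillus", "Pseudomonas",
             "Pseudomonas", "Pseudomonas", "Bacillus"),
  stringsAsFactors = FALSE
)

# Cross-tabulate: taxa as rows, samples as columns.
counts <- table(assignments$genus, assignments$sample)

# Convert the table to a plain integer matrix with the same dimnames.
count_mat <- unclass(counts)
print(count_mat)
```

A matrix of this shape (taxa in rows, samples in columns) is what an OTU table slot expects, which is why a spreadsheet step is no longer needed.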

The key to filling the classification table and tree slots is to first rename the representative sequences to correspond to the OTU names; this package includes functions to do this. Once renamed, the representative sequences can be classified and treed, and the results used to fill phyloseq classification table and tree slots. For either approach, supervised or unsupervised, a sample data table is most easily constructed in a spreadsheet program. A vignette in the package gives example workflows for constructing phyloseq objects for both supervised and unsupervised methods.
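The renaming idea can be sketched in base R as follows. This is an illustration only, not the package's own functions: the read IDs, the lookup table, and the zero-padded `OTU_` name format are all assumptions made for the example.

```r
# Hypothetical representative sequences, named by their original read IDs.
rep_seqs <- c(
  read_00417 = "ACGTACGTAC",
  read_01288 = "ACGTTCGTAC",
  read_00052 = "TCGTACGAAC"
)

# Hypothetical lookup: the cluster each representative read belongs to.
rep_to_cluster <- data.frame(
  read_id = c("read_00417", "read_01288", "read_00052"),
  cluster = c(2L, 0L, 1L),
  stringsAsFactors = FALSE
)

# Zero-padded OTU names (assumed format) so that they sort naturally.
otu_names <- sprintf("OTU_%04d", rep_to_cluster$cluster)

# Match on read ID rather than relying on row order.
idx <- match(names(rep_seqs), rep_to_cluster$read_id)
stopifnot(!anyNA(idx))
names(rep_seqs) <- otu_names[idx]

# Interleave headers and sequences into FASTA lines and print them.
fasta_lines <- as.vector(rbind(paste0(">", names(rep_seqs)), unname(rep_seqs)))
writeLines(fasta_lines)
```

Once the sequence names agree with the OTU table's row names, any classification or tree built from these sequences will line up with the OTU table automatically.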

As of May 2014, the RDP command line tools include options to output a biom file with OTU table, classification, and sample data, and to rename the representative sequences to correspond to the OTU names. The biom file can be imported directly into phyloseq, making the unsupervised workflow described in the RDPutils vignette no longer necessary. However, if these renamed representative sequences themselves are to be included in the phyloseq object, the functions trim_fasta_names and unalign_fasta should be applied before importing them into phyloseq.
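The sketch below illustrates the kind of cleanup involved before aligned, renamed sequences can sit alongside an imported biom file. It is not the implementation of trim_fasta_names or unalign_fasta; the header layout, the gap characters handled, and the file name `example.biom` are assumptions. The import itself uses phyloseq's `import_biom` and is guarded so that the snippet runs even when the file or the package is absent.

```r
# Hypothetical aligned representative sequences; headers carry extra fields.
aligned <- c(
  "OTU_0000 seq_id=read_01288;size=57" = "AC-GT.TCG--TAC",
  "OTU_0001 seq_id=read_00052;size=12" = "tc-gtacg..aac-"
)

# Keep only the first whitespace-delimited token of each header.
trimmed_names <- sub("\\s.*$", "", names(aligned))

# Remove alignment gap characters ("-" and ".") and normalise case.
unaligned <- toupper(gsub("[-.]", "", aligned))
names(unaligned) <- trimmed_names
print(unaligned)

# Import a biom file directly into phyloseq (hypothetical path).
biom_file <- "example.biom"
if (file.exists(biom_file) && requireNamespace("phyloseq", quietly = TRUE)) {
  ps <- phyloseq::import_biom(biom_file)
  print(ps)
}
```

The point of both steps is the same: sequence names must match the OTU names exactly, and the sequences must be free of alignment gaps.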

RDPutils version 1.3.0 includes functions to import USEARCH-generated biom files, OTU tables with or without taxonomy, and taxonomy files generated with utax and sintax. The import_itagger_otutab_taxa function converts the tab-delimited iTagger otu.tax.tsv file into a phyloseq object with otu_table and tax_table. A second vignette in the package demonstrates all of these capabilities. Version 1.3.1 fixed a bug related to the OTU name format. Version 1.3.2 added the function make_framebot_tax_table. Version 1.4.0 rewrote the vignettes using rmarkdown.