map_taxa | R Documentation |
Maps taxonomic names to NCBI (RefSeq) or GTDB taxonomy by automatic matching of taxonomic names, with manual mappings for some groups.
map_taxa(taxacounts = NULL, refdb = "GTDB_220",
taxon_AA = NULL, quiet = FALSE)
taxacounts |
data frame with taxonomic name and abundances |
refdb |
character, name of reference database (‘GTDB_220’ or ‘RefSeq_206’) |
taxon_AA |
data frame, amino acid compositions of taxa, used to bypass |
quiet |
logical, suppress printed messages? |
This function maps taxonomic names to the NCBI (RefSeq) or GTDB taxonomy.
taxacounts should be a data frame generated by either read_RDP or ps_taxacounts.
Input names are made by combining the taxonomic rank and name with an underscore separator (e.g. ‘genus_Escherichia/Shigella’).
Input names are then matched to the taxa listed in ‘taxon_AA.csv.xz’ found under ‘RefDB/RefSeq_206’ or ‘RefDB/GTDB_220’.
The protein and organism columns in these files hold the rank and taxon name extracted from the RefSeq or GTDB database.
Only exactly matching names are automatically mapped.
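The name construction and exact matching described above can be sketched in base R. This is only an illustration, not the package's internal code: the two data frames are hypothetical toy stand-ins (the rank and name columns of the input are assumed here, and ‘Unknownaceae’ is a made-up taxon); only the protein and organism column names come from the description above.

```r
# Hypothetical toy input: one row per classified taxon (column names assumed)
taxacounts <- data.frame(
  rank = c("genus", "family", "genus"),
  name = c("Escherichia", "Unknownaceae", "Bacillus")
)
# Hypothetical toy reference table; protein holds the rank, organism the taxon name
taxon_AA <- data.frame(
  protein = c("genus", "genus", "phylum"),
  organism = c("Bacillus", "Escherichia", "Cyanobacteria")
)

# Build rank_name strings for both the input and the reference
input_names <- paste(taxacounts$rank, taxacounts$name, sep = "_")
ref_names <- paste(taxon_AA$protein, taxon_AA$organism, sep = "_")

# Exact matching: row numbers in the reference, or NA where nothing matches
imap <- match(input_names, ref_names)
imap
```

Because match() compares whole strings, a name that differs in rank or spelling stays unmapped (NA) unless it is handled by a manual mapping.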
For mapping to the NCBI (RefSeq) taxonomy, some group names are manually mapped as follows (see Dick and Tan, 2023):
RDP training set | NCBI |
genus_Escherichia/Shigella | genus_Escherichia |
phylum_Cyanobacteria/Chloroplast | phylum_Cyanobacteria |
genus_Marinimicrobia_genera_incertae_sedis | species_Candidatus Marinimicrobia bacterium |
class_Cyanobacteria | phylum_Cyanobacteria |
genus_Spartobacteria_genera_incertae_sedis | species_Spartobacteria bacterium LR76 |
class_Planctomycetacia | class_Planctomycetia |
class_Actinobacteria | phylum_Actinobacteria |
order_Rhizobiales | order_Hyphomicrobiales |
genus_Gp1 | genus_Acidobacterium |
genus_Gp6 | genus_Luteitalea |
genus_GpI | genus_Nostoc |
genus_GpIIa | genus_Synechococcus |
genus_GpVI | genus_Pseudanabaena |
family_Family II | family_Synechococcaceae |
genus_Subdivision3_genera_incertae_sedis | family_Verrucomicrobia subdivision 3 |
order_Clostridiales | order_Eubacteriales |
family_Ruminococcaceae | family_Oscillospiraceae |
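One way to picture the manual mappings in this table is as a lookup applied to input names before exact matching. The sketch below assumes a named character vector as the lookup; the object names are hypothetical, and only two rows of the table are reproduced.

```r
# Two rows of the table above as a named vector (names: RDP, values: NCBI)
manual <- c(
  "genus_Escherichia/Shigella" = "genus_Escherichia",
  "order_Rhizobiales" = "order_Hyphomicrobiales"
)

# Hypothetical input names; the second one needs no manual mapping
input_names <- c("genus_Escherichia/Shigella", "genus_Bacillus", "order_Rhizobiales")

# Replace only the names that have a manual mapping
has_manual <- input_names %in% names(manual)
mapped_names <- input_names
mapped_names[has_manual] <- unname(manual[input_names[has_manual]])
mapped_names
```

Names without an entry in the lookup pass through unchanged and are then subject to exact matching only.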
To avoid manual mapping, GTDB can be used for both taxonomic assignments and reference proteomes.
16S rRNA sequences from GTDB release 220 are available for the RDP Classifier (doi:10.5281/zenodo.7633099) and dada2 (doi:10.5281/zenodo.13984843).
Example files created using the RDP Classifier are provided under ‘extdata/RDP-GTDB_220’.
An example dataset created with DADA2 is data(mouse.GTDB_220); this is a phyloseq-class object that can be processed with functions described at physeq.
Change quiet to TRUE to suppress printing of messages about manual mappings, most abundant unmapped groups, and overall percentage of mapped names.
Integer vector with length equal to the number of rows of taxacounts.
Values are row numbers in the data frame generated by reading taxon_AA.csv.xz, or NA for no matching taxon.
Attributes unmapped_groups and unmapped_percent have the input names of unmapped groups and their percentage of the total classification count.
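As a rough illustration of how such a return value might be used, the sketch below builds a hand-made stand-in with the same shape (an integer vector with NA entries and the two attributes) and reads it back. The values and the counts are hypothetical and are not output from map_taxa.

```r
# Hypothetical classification counts for three input groups
counts <- c(60, 15, 25)

# Hand-built stand-in for the return value: reference row numbers, NA if unmapped
map <- c(2L, NA, 1L)
attr(map, "unmapped_groups") <- "family_Unknownaceae"
attr(map, "unmapped_percent") <- 100 * counts[is.na(map)] / sum(counts)

# Reference rows for the mapped groups only
is_mapped <- !is.na(map)
ref_rows <- map[is_mapped]

# Total percentage of classifications left unmapped
total_unmapped <- sum(attr(map, "unmapped_percent"))
total_unmapped
```

Dropping the NA entries before indexing the reference table avoids introducing all-NA rows in the result.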
Dick JM, Tan J. 2023. Chemical links between redox conditions and estimated community proteomes from 16S rRNA and reference protein sequences. Microbial Ecology 85: 1338–1355. doi:10.1007/s00248-022-01988-9
# Partial mapping from RDP training set to NCBI taxonomy
file <- system.file("extdata/RDP/SMS+12.tab.xz", package = "chem16S")
# Use drop.groups = TRUE to exclude root- and domain-level
# classifications and certain non-prokaryotic groups
RDP <- read_RDP(file, drop.groups = TRUE)
map <- map_taxa(RDP, refdb = "RefSeq_206")
# About 24% of classifications are unmapped
sum(attributes(map)$unmapped_percent)
# 100% mapping from GTDB training set to GTDB taxonomy
file <- system.file("extdata/RDP-GTDB_220/SMS+12.tab.xz", package = "chem16S")
RDP.GTDB <- read_RDP(file)
map.GTDB <- map_taxa(RDP.GTDB)
stopifnot(all.equal(sum(attributes(map.GTDB)$unmapped_percent), 0))