tax_fix: Replace unknown, NA, or short tax_table values

View source: R/tax_fix.R

tax_fixR Documentation

Replace unknown, NA, or short tax_table values

Description

Identifies phyloseq tax_table values as unknown or uninformative and replaces them with the first informative value from a higher taxonomic rank.

  • Short values in phyloseq tax_table are typically empty strings or " ", or "g__" etc. so it is helpful to replace them. (If this is unwanted: set min_length = 0 to avoid filtering on length.)

  • Values in unknowns are also removed, even if longer than min_length. It is up to the user to specify sensible values in unknowns if their dataset has other unwanted values.

  • NA values are also replaced.

See this article for an extended discussion of tax_table fixing. https://david-barnett.github.io/microViz/articles/web-only/tax-fixing.html

Usage

tax_fix(
  ps,
  min_length = 4,
  unknowns = NA,
  suffix_rank = "classified",
  sep = " ",
  anon_unique = TRUE,
  verbose = TRUE
)

Arguments

ps

phyloseq or tax_table (taxonomyTable)

min_length

replace strings shorter than this, must be integer > 0

unknowns

also replace strings matching any in this vector, NA default vector shown in details!

suffix_rank

"classified" (default) or "current", when replacing an entry, should the suffix be taken from the lowest classified rank for that taxon, "classified", or the "current" unclassified rank?

sep

character(s) separating new name and taxonomic rank level suffix (see suffix_rank)

anon_unique

make anonymous taxa unique by replacing unknowns with taxa_name? otherwise they are replaced with paste("unknown", first_rank_name), which is therefore the same for every anonymous taxon, meaning they will be merged if tax_agg is used. (anonymous taxa are taxa with all unknown values in their tax_table row, i.e. cannot be classified even at highest rank available)

verbose

emit warnings when cannot replace with informative name?

Details

By default (unknowns = NA), unknowns is set to a vector containing:

's__' 'g__' 'f__' 'o__' 'c__' 'p__' 'k__' 'S__' 'G__' 'F__' 'O__' 'C__' 'P__' 'K__' 'NA' 'NaN' ' ' ” 'unknown' 'Unknown' 's__unknown' 's__Unknown' 's__NA' 'g__unknown' 'g__Unknown' 'g__NA' 'f__unknown' 'f__Unknown' 'f__NA' 'o__unknown' 'o__Unknown' 'o__NA' 'c__unknown' 'c__Unknown' 'c__NA' 'p__unknown' 'p__Unknown' 'p__NA' 'k__unknown' 'k__Unknown' 'k__NA' 'S__unknown' 'S__Unknown' 'S__NA' 'G__unknown' 'G__Unknown' 'G__NA' 'F__unknown' 'F__Unknown' 'F__NA' 'O__unknown' 'O__Unknown' 'O__NA' 'C__unknown' 'C__Unknown' 'C__NA' 'P__unknown' 'P__Unknown' 'P__NA' 'K__unknown' 'K__Unknown' 'K__NA'

Value

object same class as ps

See Also

tax_fix_interactive for interactive tax_fix help

Examples

library(dplyr)
library(phyloseq)

data(dietswap, package = "microbiome")
ps <- dietswap

# create unknowns to test filling
tt <- tax_table(ps)
ntax <- ntaxa(ps)
set.seed(123)
g <- sample(1:ntax, 30)
f <- sample(g, 10)
p <- sample(f, 3)
tt[g, 3] <- "g__"
tt[f, 2] <- "f__"
tt[p, 1] <- "p__"
tt[sample(1:ntax, 10), 3] <- "unknown"
# create a row with only NAs
tt[1, ] <- NA

tax_table(ps) <- tax_table(tt)

ps
# tax_fix with defaults should solve most problems
tax_table(ps) %>% head(50)

# this will replace "unknown"s as well as short values including "g__" and "f__"
tax_fix(ps) %>%
  tax_table() %>%
  head(50)

# This will only replace short entries, and so won't replace literal "unknown" values
ps %>%
  tax_fix(unknowns = NULL) %>%
  tax_table() %>%
  head(50)

# Change rank suffix and separator settings
tax_fix(ps, suffix_rank = "current", sep = " - ") %>%
  tax_table() %>%
  head(50)

# by default, completely unclassified (anonymous) taxa are named by their
# taxa_names / rownames at all ranks.
# This makes anonymous taxa distinct from each other,
# and so they won't be merged on aggregation with tax_agg.
# If you think your anonymous taxa should merge on tax_agg,
# or you just want them to be named the all same for another reason,
# set anon_unique = FALSE (compare the warning messages)
tax_fix(ps, anon_unique = FALSE)
tax_fix(ps, anon_unique = TRUE)

# here's a larger example tax_table shows its still fast with 1000s rows,
# from microbiomeutilities package
# library(microbiomeutilities)
# data("hmp2")
# system.time(tax_fix(hmp2, min_length = 1))

david-barnett/microViz documentation built on April 17, 2025, 4:25 a.m.