TaxaUtils: Utils to preprocess taxa table
In walterxie/ComMA: Community Matrix Analysis


Utilities to preprocess a taxa table and make it easier to visualize.

subsetTaxaTable(taxa.table, taxa.group = "assigned", rank = "kingdom",
  include = TRUE, ignore.case = TRUE)

subsetCM(community.matrix, taxa.table, taxa.group = NA, rank = NA,
  include = TRUE, ignore.case = TRUE, verbose = TRUE, drop.taxa = TRUE,
  merged.by = "row.names", ...)

prepTaxonomy(taxa.table, col.ranks = c("kingdom", "phylum", "class", "order",
  "family"), txt.unclassified = "unclassified", verbose = TRUE,
  pattern = "(\\s\\[|\\()(\\=|\\.|\\,|\\s|\\w|\\?)*(\\]|\\))")

mergeCMTaxa(community.matrix, taxa.table, classifier = c("MEGAN", "RDP"),
  min.conf = 0.8, has.total = 1, sort = TRUE, preprocess = TRUE,
  verbose = TRUE, mv.row.names = T,
  pattern = "(\\s\\[|\\()(\\=|\\.|\\,|\\s|\\w|\\?)*(\\]|\\))",
  col.ranks = c("kingdom", "phylum", "class", "order", "family"))

assignTaxaByRank(cm.taxa, unclassified = 0, aggre.FUN = sum,
  pattern = "(\\s\\[|\\()(\\=|\\.|\\,|\\s|\\w|\\?)*(\\]|\\))")

summaryTaxaAssign(ta.list, ta.OTU.list = list(), exclude.rank = c(-1),
  exclude.unclassified = TRUE, sort.rank = getRanks())

combineTaxaAssign(ta.list, keywords = c("Eukaryota"), ignore.case = TRUE,
  replace.to = c(), min.row.comb = 2)

summaryRank(ta.list, rank = "kingdom", exclude.unclassified = TRUE)

groupsTaxaMembers(taxa.assign, cm.taxa, rank = "phylum",
  rm.unclassified = TRUE, regex1 = "(\\|[0-9]+)", regex2 = "",
  ignore.case = TRUE, verbose = TRUE)

`taxa.table`	A data frame containing taxonomic classifications of OTUs. Columns are the taxonomy at each rank (or the lineage); rows are OTUs, which must match the rows of the community matrix. Use `readTaxaTable` to read it from a file.
`taxa.group`	The taxonomic group. The value can be 'all', 'assigned', or a taxon name. Group 'all' includes everything. Group 'assigned' removes all uncertain classifications, including 'root', 'cellular organisms', 'No hits', and 'Not assigned'. Alternatively, any high-ranking taxon in your taxonomy file can be used as a group, or several as multiple groups (separated by "|"), such as 'BACTERIA', 'Proteobacteria', etc., but they have to be in the same rank column in the file. By default, all uncertain classifications are removed, even when group(s) are given.
`rank`	The rank to specify which column name in `taxa.table` to search.
`include`	Whether to include or exclude the given `taxa.group`. Default to TRUE.
`ignore.case`	If TRUE (the default), matching of taxon names is case-insensitive.
`community.matrix`	Community matrix (OTU table), where rows are OTUs or individual species and columns are sites or samples. See `ComMA`.
`verbose`	If TRUE (the default), print more details.
`drop.taxa`	TRUE, as default, to drop all taxonomy columns, and only keep `community.matrix` samples.
`col.ranks`	A vector or string of column name(s) of taxonomic ranks in the taxa table, which determine the aggregated abundance matrix. They have to be the full set or a subset of `c("superkingdom", "kingdom", "phylum", "class", "order", "family", "genus", "species")`. Default to `c("kingdom", "phylum", "class", "order", "family")`.
`txt.unclassified`	The key word to represent unclassified taxonomy.
`pattern`	The pattern passed to `gsub` with `perl = TRUE`. Default to `"(\\s\\[|\\()(\\=|\\.|\\,|\\s|\\w|\\?)*(\\]|\\))"`. Set to NA to skip it.
`classifier`	The classifier used to generate `taxa.table`, either MEGAN or RDP. Default to MEGAN.
`min.conf`	The confidence threshold to drop rows < min.conf.
`has.total`	If 0, then only return abundance by samples (columns) of community matrix. If 1, then only return total abundance. If 2, then return abundance by samples (columns) and total. Default to 1.
`sort`	Sort the taxonomy rank by rank. Default to TRUE.
`preprocess`	If TRUE (the default), replace "root|cellular organisms|No hits|Not assigned|unclassified sequences" in the MEGAN result with "unclassified", or mark OTUs whose confidence < `min.conf` as 'unclassified' in the RDP result.
`mv.row.names`	Default to TRUE to move the column 'Row.names' created by `merge` into data frame row.names, in order to keep the 1st column same as community matrix. Suggest not to change it.
`cm.taxa`	The data frame combining the community matrix with taxonomic classifications, generated by `mergeCMTaxa`. The row.names are OTUs, the 1st column is the start of the community matrix, column `ncol.cm` is the end of the abundance data, and the last `length(col.ranks)` columns are the taxonomy at different ranks. It should have the attributes `ncol.cm` and `col.ranks`. Note: among columns 1 to `ncol.cm`, the last column may be 'total', that is rowSums(cm), as determined by `has.total` in `mergeCMTaxa`.
`unclassified`	An integer that determines how to deal with "unclassified" taxa. Default to 0, which keeps all "unclassified" but moves them to the last rows. If 1, remove the row whose taxon name is exactly "unclassified". See the details. If 2, remove the row whose taxon name is exactly "unclassified", and also merge all remaining "unclassified ???" into "unclassified rank", such as "unclassified family". If 3, remove every row containing "unclassified". If 4, do nothing.
`aggre.FUN`	A function for `FUN` in `aggregate`. Default to `sum`, which gives the read abundance. Set `aggre.FUN=function(x) sum(x>0)` to get the OTU abundance instead.
`ta.list, ta.OTU.list`	The list of taxonomic assignments created by `assignTaxaByRank` based on either number of reads or OTUs, where 'total' column is required. If ta.OTU.list is an empty list, as default, then do not count OTUs. See `has.total` in `mergeCMTaxa` for the detail to get the total.
`exclude.rank`	The first n elements (ranks) to exclude from the summary, default to -1, which is normally the kingdom.
`exclude.unclassified`	Default to TRUE, not to count the taxonomy having the "unclassified" keyword.
`sort.rank`	The order used to sort the summary dataframe by "rank" column, default to `c("superkingdom", "kingdom", "phylum", "class", "order", "family", "genus", "species")`.
`keywords`	The vector of keywords for `grep`. The combined row will use the keyword as new row name.
`replace.to`	The new names used for the combined rows, which should be either empty or the same length as the vector `keywords`. If replace.to=c() (the default), the `keywords` are used as the new row names.
`min.row.comb`	The minimum number of rows matched by `grep` that are needed to combine. Default to 2, which ignores a single row selected by `grep` for a given keyword. Set to 1 to include it in the combination process.
`taxa.assign`	The data frame of taxonomic assignments with abundance at the `rank`, where rownames are taxonomy at that rank, and columns are the sample names (may include total). It can be one element of the list generated by `assignTaxaByRank`. See the detail.
`rm.unclassified`	Drop all unclassified rows (OTUs). Default to TRUE.
`regex1, regex2`	Used in `gsub(regex1, regex2, row.names)` to remove or replace annotation from the original labels. Default to `regex1="(\\|[0-9]+)", regex2=""`, which removes the size annotation separated by "|".
`ignore.case`	Default to TRUE, same as `ignore.case` in `grep`.
`rank`	The rank given to select the list of taxa assignments produced by `assignTaxaByRank`.

subsetTaxaTable takes or excludes a subset of a given taxa table at a given rank.
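The selection logic can be sketched in base R as follows. This is only an illustration under stated assumptions, not the package's implementation: the toy table `tt` and the helper `subset_by_group` are made up for the example, and the sketch does not remove uncertain classifications the way the real function does by default.

```r
# Toy taxa table (hypothetical data): rows are OTUs, columns are ranks.
tt <- data.frame(
  kingdom = c("Bacteria", "Bacteria", "Fungi", "Not assigned"),
  phylum  = c("Proteobacteria", "Firmicutes", "Ascomycota", "Not assigned"),
  row.names = c("OTU_1", "OTU_2", "OTU_3", "OTU_4"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep (include = TRUE) or drop (include = FALSE) the rows
# whose value in the `rank` column matches `taxa.group`, read as a regular
# expression so that "A|B" selects several groups at once.
subset_by_group <- function(taxa.table, taxa.group, rank,
                            include = TRUE, ignore.case = TRUE) {
  hit <- grepl(taxa.group, taxa.table[[rank]], ignore.case = ignore.case)
  if (!include) hit <- !hit
  taxa.table[hit, , drop = FALSE]
}

# Keep two phyla; matching is case-insensitive here.
kept <- subset_by_group(tt, "proteobacteria|firmicutes", rank = "phylum")

# Exclude one kingdom instead.
dropped <- subset_by_group(tt, "BACTERIA", rank = "kingdom", include = FALSE)
```

Both groups in a multi-group pattern must live in the same rank column, which is why the helper searches a single column only.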

subsetCM returns a subset of the community matrix for taxa.group at a given rank column in taxa.table. It is also an alternative to mergeCMTaxa when only a simple merge is required. If either taxa.group or rank is NA (the default), the whole taxa.table is used; otherwise the subset of taxa.table is taken by subsetTaxaTable.

prepTaxonomy replaces repeated high-rank taxa with unclassified high-rank taxa in the MEGAN result, or replaces blank values with unclassified in the RDP result, so that the names in the taxonomy table taxa.table (which can be cm.taxa) look nice. The col.ranks vector has to contain rank column names in taxa.table.
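To see what the default `pattern` argument removes, it can be applied directly with `gsub`. The labels below are hypothetical and only meant to exercise the regular expression; how the package applies the pattern internally is not shown here.

```r
# Default pattern from the usage section: a bracketed or parenthesised
# annotation made of word characters, whitespace and a few punctuation marks.
pattern <- "(\\s\\[|\\()(\\=|\\.|\\,|\\s|\\w|\\?)*(\\]|\\))"

# Hypothetical taxon labels carrying annotations.
labels <- c("Clostridiales [order]", "Bacteria (kingdom)", "Fungi")

# Strip the annotation. The "(" branch does not consume the preceding space,
# so trimws() tidies up what is left behind.
cleaned <- trimws(gsub(pattern, "", labels, perl = TRUE))
```

Labels without an annotation pass through unchanged, so the same call is safe on a mixed column.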

mergeCMTaxa creates a data frame cm.taxa combining the community matrix with the taxonomic classification table. The 1st column is "row.names", which holds the OTUs/individuals; the next "ncol.cm" columns are abundance, which can be sample-based or total; and the last "length(col.ranks)" columns are the ranks.

All sequences either classified as "root|cellular organisms|No hits|Not assigned|unclassified sequences" from BLAST + MEGAN, or confidence < min.conf threshold from RDP, are changed to "unclassified", which will be moved to the last row.
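The merge and relabelling described above can be approximated in base R. This is a rough sketch on made-up data, not the code of mergeCMTaxa: the exact matching rules, the handling of `has.total` and the RDP confidence filter are left out, and moving rows by the first rank only is an assumption made for brevity.

```r
# Toy community matrix and taxa table sharing OTU row names (hypothetical data).
cm <- data.frame(site1 = c(10, 0, 3), site2 = c(2, 5, 0),
                 row.names = c("OTU_1", "OTU_2", "OTU_3"))
tt <- data.frame(kingdom = c("Bacteria", "Not assigned", "Fungi"),
                 phylum  = c("Firmicutes", "Not assigned", "Ascomycota"),
                 row.names = c("OTU_1", "OTU_2", "OTU_3"),
                 stringsAsFactors = FALSE)
ranks <- c("kingdom", "phylum")

# merge() on row names adds a "Row.names" column; move it back into row.names.
merged <- merge(cm, tt, by = "row.names")
rownames(merged) <- merged$Row.names
merged$Row.names <- NULL

# Relabel uncertain classifications as "unclassified".
uncertain <- "^(root|cellular organisms|No hits|Not assigned|unclassified sequences)$"
merged[ranks] <- lapply(merged[ranks], function(x) sub(uncertain, "unclassified", x))

# Move rows that are unclassified at the first rank to the end.
is.uncl <- merged$kingdom == "unclassified"
merged <- rbind(merged[!is.uncl, , drop = FALSE], merged[is.uncl, , drop = FALSE])

# Record how many columns are abundance and which columns are ranks.
attr(merged, "ncol.cm") <- ncol(cm)
attr(merged, "col.ranks") <- ranks
```

Storing `ncol.cm` and `col.ranks` as attributes lets later steps split the data frame into its abundance and taxonomy parts without guessing column positions.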

assignTaxaByRank provides a list of taxonomic assignments with abundance from community matrix at different rank levels, where rownames are taxonomy at that rank, and columns are the sample names (may include total). The function is iterated through col.ranks, and aggregates abundance into taxonomy based on the rank in col.ranks.
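The aggregation at a single rank can be illustrated with `aggregate` from base R. The data frame below is a made-up stand-in for cm.taxa, and the sketch covers only one rank; it also shows how swapping the aggregation function turns read counts into OTU counts, as described for `aggre.FUN`.

```r
# Toy stand-in for cm.taxa: two abundance columns plus one rank column.
cm.taxa <- data.frame(
  site1  = c(10, 0, 3, 7),
  site2  = c(2, 5, 0, 1),
  phylum = c("Firmicutes", "Firmicutes", "Ascomycota", "Proteobacteria"),
  row.names = paste0("OTU_", 1:4),
  stringsAsFactors = FALSE
)
samples <- c("site1", "site2")

# Reads per phylum and sample: sum the abundance of all OTUs in each phylum.
reads <- aggregate(cm.taxa[samples], by = list(phylum = cm.taxa$phylum), FUN = sum)

# OTUs per phylum and sample: count the OTUs with non-zero abundance.
otus <- aggregate(cm.taxa[samples], by = list(phylum = cm.taxa$phylum),
                  FUN = function(x) sum(x > 0))
```

Repeating the same call for each entry of col.ranks would give one table per rank, which is the shape of the list the function is described as returning.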

summaryTaxaAssign summarises the number of reads, OTUs, and taxonomy from the result of assignTaxaByRank.

combineTaxaAssign combines the totals of the taxa whose row names match each of the given keywords, in the taxonomy assignment list from assignTaxaByRank. At the moment, the function only works for taxonomy assignments having a single column "total".
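A minimal sketch of combining rows by keyword, assuming a single "total" column as stated above. The helper `combine_rows` and the toy table are hypothetical and handle one keyword at a time; they are not the package's code.

```r
# Toy assignment table at one rank with a single "total" column.
ta <- data.frame(total = c(120, 30, 15, 8),
                 row.names = c("Bacteria", "Eukaryota",
                               "unclassified Eukaryota", "Fungi"))

# Hypothetical helper: sum the rows whose names match `keyword` and replace
# them with one row named `replace.to`. Fewer than `min.row.comb` matches
# leave the table untouched.
combine_rows <- function(ta, keyword, replace.to = keyword,
                         min.row.comb = 2, ignore.case = TRUE) {
  idx <- grep(keyword, rownames(ta), ignore.case = ignore.case)
  if (length(idx) < min.row.comb) return(ta)
  combined <- data.frame(total = sum(ta$total[idx]), row.names = replace.to)
  rbind(ta[-idx, , drop = FALSE], combined)
}

# Two rows match "Eukaryota", so they are merged into one.
combined.ta <- combine_rows(ta, "Eukaryota")

# Only one row matches "Fungi", so nothing changes with min.row.comb = 2.
unchanged.ta <- combine_rows(ta, "Fungi")
```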

summaryRank directly converts the result of assignTaxaByRank at a given rank into a data frame as the summary.

groupsTaxaMembers groups the members (rows, i.e. OTUs) from cm.taxa for each taxon in taxa.assign at the given rank, and returns a list of members (OTUs) grouped by taxonomy. By default, all unclassified members (OTUs) are dropped.

It is impossible to trace back members after assignTaxaByRank, so this function offers only one option besides the default: assigning the remaining members (OTUs) not picked up by any other taxon to "unclassified". The result relies on using the identical cm.taxa in both assignTaxaByRank and groupsTaxaMembers.
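Grouping OTUs by taxon can be sketched with `split` from base R. The data frame is a made-up stand-in for cm.taxa with size-annotated row names, and the sketch ignores taxa.assign entirely, so it only illustrates the idea of the grouping and of the default `regex1`/`regex2` clean-up.

```r
# Toy stand-in for cm.taxa; row names carry a "|size" annotation.
cm.taxa <- data.frame(
  site1  = c(10, 0, 3, 7),
  phylum = c("Firmicutes", "Firmicutes", "unclassified", "Proteobacteria"),
  row.names = c("OTU_1|120", "OTU_2|45", "OTU_3|9", "OTU_4|3"),
  stringsAsFactors = FALSE
)

# Default regex1/regex2: strip the size annotation from the labels.
otu.names <- gsub("(\\|[0-9]+)", "", rownames(cm.taxa))

# Group the OTU names by their phylum.
members <- split(otu.names, cm.taxa$phylum)

# Drop the unclassified group, mirroring rm.unclassified = TRUE.
members <- members[names(members) != "unclassified"]
```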

ncol.cm and col.ranks are attributes of cm.taxa generated by mergeCMTaxa.

ncol.cm indicates how many columns in cm.taxa are abundance.

col.ranks records which rank columns are in cm.taxa, and is also an input of mergeCMTaxa.

tt.sub <- subsetTaxaTable(tt.megan, taxa.group="Proteobacteria", rank="phylum")
tt.sub <- subsetTaxaTable(tt.megan, taxa.group="Cnidaria|Brachiopoda|Echinodermata|Porifera", rank="phylum", include=FALSE)

sub.cm <- subsetCM(cm, tt, taxa.group="BACTERIA", rank="kingdom")

tt <- prepTaxonomy(taxa.table, col.ranks=c("kingdom", "phylum", "class"))

cm.taxa <- mergeCMTaxa(community.matrix, tt.megan) 
ta.megan <- assignTaxaByRank(cm.taxa)

cm.taxa <- mergeCMTaxa(community.matrix, tt.rdp, classifier="RDP", has.total=0)
ta.rdp <- assignTaxaByRank(cm.taxa, unclassified=2)
colSums(ta.rdp[["phylum"]])

summary.ta.df <- summaryTaxaAssign(ta.list, ta.OTU.list)

combined.ta.list <- combineTaxaAssign(ta.list, c("Fungi", "Eukaryota", "Streptophyta|Viridiplantae", "Bacteria"))
combined.ta.list <- combineTaxaAssign(ta.list, c("Streptophyta|Viridiplantae"), replace.to=c("Plant"))

summary.kingdom.df <- summaryRank(ta.list, rank="kingdom")

taxa.members <- groupsTaxaMembers(ta.rdp[["phylum"]], cm.taxa)
taxa.members <- groupsTaxaMembers(ta.rdp[["family"]], cm.taxa, rank="family")