Nothing
#' Maps an input taxonomy table onto a different taxonomic nomenclature.
#'
#' @author Dylan Catlett
#' @author Kevin Son
#'
#' @param tt The input taxonomy table you would like to map onto a new
#' taxonomic nomenclature. Should be a dataframe of type char or list (no
#' factors).
#' @param tt.ranks A character vector of the column names where taxonomic
#' names are found in tt. Supply them heirarchically (e.g. kingdom --> species)
#' @param tax2map2 The taxonomic nomenclature you would like to map onto. pr2
#' v4.12.0, Silva SSU v138 nr, GreenGenes v13.8 clustered at 97% similarity, and
#' the RDP train set v16 are included in the ensembleTax package. You can map to
#' these by specifying "pr2", "Silva", "gg", or "rdp". Otherwise should be a
#' dataframe of type character or list (no factors) with each column
#' corresponding to a taxonomic rank.
#' @param exceptions A character vector of taxonomic names at the basal/root
#' rank of tt that will be propagated onto the mapped taxonomy. ASVs assigned
#' to these names will retain these names at their basal/root rank in the mapped
#' taxonomy. All other ranks are assigned NA.
#' @param ignore.format If TRUE, the algorithm modifies taxonomic names in tt to
#' account for common variations in taxonomic name syntax and/or formatting
#' commonly encountered in reference databases (e.g. Pseudo-nitzschia will map
#' to Pseudonitzschia). If FALSE, formatting issues may preclude mapping of
#' synonymous taxonomic names (e.g. Pseudo-nitzschia will NOT map to
#' Pseudonitzschia). An exhaustive list of formatting details is included in
#' Details. Note that formatting variants are only generated for the names in
#' tt. This can cause some issues for mapping in the other direction (e.g.
#' Pseudonitzschia in tt will NOT map to Pseudo-nitzschia in tax2map2 whether or
#' not ignore.format is TRUE).
#' @param synonym.file If "default", taxmapper uses taxonomic synonyms included
#' with the ensembleTax package. If a custom taxonomic synonym file is
#' preferred, a string corresponding to the name of the csv file should be
#' supplied. Taxonomic synonyms are searched when exact name matches are not
#' found in tax2map2. ignore.format applies to synonyms if TRUE. Specify NULL if
#' you wish to forego synonym searches.
#' @param streamline If TRUE, only the mapped version of tt is returned as a
#' dataframe. If FALSE, a 3-element list is returned where element 1 is the
#' mapping key returned as a dataframe, element 2 is a character vector of all
#' names that could not be mapped (no exact matches found in tax2map2), and
#' element 3 is the mapped version of tt (a dataframe).
#' @param outfilez If NULL, mapping files are not saved to the current working
#' directory. Otherwise should be a 3-element character vector including, in
#' this order, the name of the file to store the taxonomic mapping key, the name
#' of the file to store the names that could not be mapped, and the name of the
#' file to store the ASVs supplied with tt with their mapped taxonomic
#' assignments. Each element of the vector should end in csv (only csv files
#' may be saved)
#'
#' @details Exceptions should be used when the user knows a particular taxonomic
#' group is not found in tax2map2. The user is responsible for supplying valid
#' taxonomic names as these must be found in tt and will be propagated as
#' given to all ASVs that are assigned this name in tt. This should only be
#' used for high-level taxonomic groups that are not found in a database (e.g.
#' for retaining Eukaryota when mapping onto a prokaryote-only taxonomic
#' nomenclature).
#'
#' When ignore.format = TRUE, names for which taxmapper cannot find exact
#' matches in tax2map2 are altered in case an exact match was not found due to
#' formatting issues. To do this taxmapper first removes square brackets ("[]").
#' It then checks for hyphens "-", underscores "_", and single spaces " ". If
#' these are found, variants of the name with the hyphen/underscore/spaces
#' replaced by each of the other two, as well as all subnames spearated by these
#' characters, and all subnames pasted together with none of these special
#' characters, are searched against tax2map2 for exact matches. It also creates
#' all-lower and all-upper case versions of these elements and again searches
#' for exact name matches in tax2map2. Words generated by this process that are
#' 2 characters or less are not searched for matches in tax2map2. All
#' alternative names created when ignore.format = TRUE are also searched for
#' synonyms in synonym.file if specified.
#'
#' To prevent matching of arbitrary names often used in reference databases (eg,
#' "Clade_X"), and after creating all of the above alternative names if
#' ignore.format = TRUE, those that BEGIN with any of the words below are
#' are not use in exact name matching. Instead, the lowest assigned
#' non-ambiguous name is determined (any name that begins with a word NOT
#' included in the list below) and is appended to the ambiguous name separated
#' by a hyphen. The words taxmapper flags as ambiguous are: "Clade", "CLADE",
#' "clade", "Group", "GROUP", "group", "Class", "CLASS", "class",
#' "Subclass", "SubClass", "SUBCLASS", "subclass", "Subclade", "SubClade",
#' "SUBCLADE", "subclade", "Subgroup", "SubGroup", "SUBGROUP", "subgroup",
#' "Sub group", "Sub Group", "SUB GROUP", "sub group", "Sub clade", "Sub Clade",
#' "SUB CLADE", "sub clade", "Sub class", "Sub Class", "SUB CLASS", "sub class",
#' "Sub_group", "Sub_Group", "SUB_GROUP", "sub_group", "Sub_clade", "Sub_Clade",
#' "SUB_CLADE", "sub_clade", "Sub_class", "Sub_Class", "SUB_CLASS", "sub_class",
#' "Sub-group", "Sub-Group", "SUB-GROUP", "sub-group", "Sub-clade", "Sub-Clade",
#' "SUB-CLADE", "sub-clade", "Sub-class", "Sub-Class", "SUB-CLASS", "sub-class",
#' "incertae sedis", "INCERTAE SEDIS", "Incertae sedis", "Incertae Sedis",
#' "incertae-sedis", "INCERTAE-SEDIS", "Incertae-sedis", "Incertae-Sedis",
#' "incertae_sedis", "INCERTAE_-SEDIS", "Incertae_sedis", "Incertae_Sedis",
#' "incertaesedis", "INCERTAESEDIS", "Incertaesedis", "IncertaeSedis",
#' "unclassified", "UNCLASSIFIED", "Unclassified", "Novel", "novel", "NOVEL",
#' "sp", "sp.", "spp", "spp.", "lineage", "Lineage", "LINEAGE"
#'
#' For high-throughput implementation of taxmapper, it's recommended to set
#' streamline = TRUE.
#'
#' @return If streamline = TRUE, a dataframe formatted for use with ensembleTax
#' that contains mapped taxonomic assignments for each ASV/OTU in the data set.
#'
#' If streamline = FALSE, a 3-element list where the first element is a
#' dataframe that contains all unique input taxonomic assignments and their
#' corresponding mapped outputs, the second element is a character vector that
#' contains all taxonomic names that could not be mapped, and the third element
#' contains mapped taxonomic assignments for each ASV in the data set.
#'
#' If is.null(outfilez) = FALSE, three csv files are saved in the current
#' working directory containing each of the three list elements above.
#'
#' @seealso idtax2df, bayestax2df, ensembleTax
#'
#' @examples
#' fake.silva <- data.frame(ASV = c("AAAA", "ATCG", "GCGC", "TATA", "TCGA"),
#' domain = c("Bacteria", "Eukaryota", "Eukaryota", "Eukaryota", "Eukaryota"),
#' phylum = c("Firmicutes", "Diatomea", "Retaria", "MAST-12", "Diatomea"),
#' class = c(NA, "Coscinodiscophytina_cl", "Polycystinea", "MAST-12A",
#' "Mediophyceae"),
#' order = c(NA, "Fragilariales", "Collodaria", NA, NA),
#' family = c(NA, "Fragilariales_fa", "Collodaria_fa", NA, NA),
#' genus = c(NA, "Podocystis", "Collophidium", NA, NA),
#' stringsAsFactors = FALSE)
#' head(fake.silva)
#' mapped.silva <- taxmapper(fake.silva,
#' tt.ranks = colnames(fake.silva)[2:ncol(fake.silva)],
#' tax2map2 = "pr2",
#' exceptions = c("Archaea", "Bacteria"),
#' ignore.format = FALSE,
#' synonym.file = "default",
#' streamline = TRUE,
#' outfilez = NULL)
#'
#' @export
taxmapper <- function(tt,
tt.ranks = colnames(tt),
tax2map2 = "pr2",
exceptions = c("Archaea", "Bacteria"),
ignore.format = FALSE,
synonym.file = "default",
streamline = TRUE,
outfilez = NULL) {
if (length(tt.ranks) == ncol(tt)) {
stop("You have not included any ASV-identifying data in your input
taxonomy table. Please do this and try again.")
}
if (is.data.frame(tax2map2)){
# do nothing
} else if (tax2map2 == "pr2") {
tax2map2 <- ensembleTax::pr2v4.12.0
} else if (tax2map2 == "Silva") {
tax2map2 <- ensembleTax::silva.nr.v138
} else if (tax2map2 == "rdp") {
tax2map2 <- ensembleTax::rdp_train_set_16
} else if (tax2map2 == "gg") {
tax2map2 <- ensembleTax::gg_13_8_train_set_97
} else {
stop("No valid tax2map2 object supplied.")
}
tax2map2.ranks <- colnames(tax2map2)
# function to test for ambiguous names (clade, group, subgroup, etc)
testAmbigNames <- function(taxonomy) {
# designate ambiguous names:
ambigu <- c("Clade", "CLADE", "clade", # variants of clade
"Group", "GROUP", "group", # variants of group
"Class", "CLASS", "class", # variants of class
"Subgroup", "SubGroup", "SUBGROUP", "subgroup", # subgroup
"Subclade", "SubClade", "SUBCLADE", "subclade", # subclade
"Subclass", "SubClass", "SUBCLASS", "subclass", # subclass
"Sub group", "Sub Group", "SUB GROUP", "sub group",
"Sub clade", "Sub Clade", "SUB CLADE", "sub clade",
"Sub class", "Sub Class", "SUB CLASS", "sub class",
"Sub_group", "Sub_Group", "SUB_GROUP", "sub_group",
"Sub_clade", "Sub_Clade", "SUB_CLADE", "sub_clade",
"Sub_class", "Sub_Class", "SUB_CLASS", "sub_class",
"Sub-group", "Sub-Group", "SUB-GROUP", "sub-group",
"Sub-clade", "Sub-Clade", "SUB-CLADE", "sub-clade",
"Sub-class", "Sub-Class", "SUB-CLASS", "sub-class",
"incertae sedis", "INCERTAE SEDIS", "Incertae sedis", "Incertae Sedis", # Incertae sedis
"incertae-sedis", "INCERTAE-SEDIS", "Incertae-sedis", "Incertae-Sedis", # more...
"incertae_sedis", "INCERTAE_-SEDIS", "Incertae_sedis", "Incertae_Sedis", # more...
"incertaesedis", "INCERTAESEDIS", "Incertaesedis", "IncertaeSedis",
"unclassified", "UNCLASSIFIED", "Unclassified",
"Novel", "novel", "NOVEL",
"sp", "sp.", "spp.", "spp",
"lineage", "Lineage", "LINEAGE")
# get start location of all ambiguous names within taxonomy (returns NA if not found)
loki <- stringr::str_locate(taxonomy, ambigu)[ , "start"]
test.val <- base::any(loki == 1) # TRUE if taxonomy starts with an ambiguous name. This means name should not be mapped
if (base::is.na(test.val)) { test.val <- FALSE } # account for NA's
return(test.val)
}
# function to remove hyphens, underscores, upper case of name
preprocessTax <- function(taxonomy) {
if (stringr::str_detect(taxonomy, c("\\[|\\]"))) {
taxonomy <- stringr::str_replace_all(taxonomy, c("\\[|\\]"), "")
}
alt.full <- c(stringr::str_replace_all(taxonomy, "-"," "),
stringr::str_replace_all(taxonomy, "-","_"),
stringr::str_replace_all(taxonomy, " ","-"),
stringr::str_replace_all(taxonomy, " ","_"),
stringr::str_replace_all(taxonomy, "_","-"),
stringr::str_replace_all(taxonomy, "_"," "),
stringr::str_replace_all(taxonomy, c("_|-"), " "),
stringr::str_replace_all(taxonomy, c("_| "), "-"),
stringr::str_replace_all(taxonomy, c(" |-"), "_"))
# split terms by hyphens
no.hyphen <- base::strsplit(taxonomy, "-")
# split terms by underscores
no.underscore <- base::strsplit(taxonomy, "_")
# split by space:
no.spc <- base::strsplit(taxonomy, " ")
# combine previous splits
taxs <- c(taxonomy,
alt.full,
no.hyphen[[1]], no.underscore[[1]], no.spc[[1]],
paste(no.hyphen[[1]], sep = '', collapse = ''),
paste(no.underscore[[1]], sep = '', collapse = ''),
paste(no.spc[[1]], sep = '', collapse = ''))
# remove duplicates
taxs <- base::unique(taxs)
# convert all to lower and uppercase
no.upper <- base::tolower(taxs)
no.lower <- base::toupper(taxs)
# create alternative suffixes for certain taxonomies
final.taxs <- unique(c(taxs, no.upper, no.lower))
ambig.ones <- base::sapply(final.taxs, FUN = testAmbigNames)
ll <- stringr::str_length(final.taxs)
rm.rows <- c(which(ambig.ones),
which(ll <= 2))
if (length(rm.rows) == 0) {
# do nothing to final.taxs
} else if (length(rm.rows > 0)) {
final.taxs <- final.taxs[-rm.rows]
}
# this ensures that the original name is always the first one mapped,
# and that longer names take priority over shorter names (thought being that
# longer names are more informative)
ii <- base::sort(stringr::str_length(final.taxs), decreasing = TRUE, index.return = TRUE)
ft.sorted <- final.taxs[ii$ix]
final.taxs <- c(taxonomy, ft.sorted)
return(final.taxs)
}
# function to search through the tax2map2 to find a match for the taxonomy name inputted
findMapping <- function(taxonomy, tax2map2) {
# iterate through the most specific ranking to the most generic ranking
cols <- base::rev(names(tax2map2))
for (i in 1:length(cols)) {
matchings <- tax2map2[which(tax2map2[, cols[i]] == taxonomy), ] # find rows that match at that rank
if (nrow(matchings) != 0) {
# create respective row for the match
# make everything downstream of the rank found to be NA's
matched.row <- base::data.frame(matrix(rep(NA, length(cols)), ncol = length(cols), nrow = 1))
colnames(matched.row) <- base::rev(cols)
# grab only the first match found
matched.row[1:(length(cols)-i+1)] <- matchings[1, ][1:(length(cols)-i+1)]
return(matched.row)
}
}
return(NA)
}
# function to search through the synonyms data frame to find synonyms for given taxonomy name
getSynonyms <- function(taxonomy, syn.df) {
if (is.null(syn.df)) {
return(c(taxonomy))
}
found.rows <- syn.df[which(syn.df == taxonomy, arr.ind=TRUE)[,'row'],] # find rows for synonym
if (length(found.rows) > 0) {
# populate the taxonomy with its synonyms
v <- as.character(as.matrix(found.rows))
return (unique(c(taxonomy, v[!is.na(v)])))
}
else {
# if no synonyms found, just return the taxonomy
return(c(taxonomy))
}
}
# rename tax2map2 columns for uniqueness
colnames(tax2map2) <- base::paste("tax2map2", colnames(tax2map2), sep="_")
# grab only the taxonomies part of the dataframes
taxin.u <- base::unique(tt[, (names(tt) %in% tt.ranks)])
tax2map2.u <- base::unique(tax2map2)
# read in the synonyms file
if (is.null(synonym.file)) {
synonyms <- NULL
} else if (synonym.file != "default") {
synonyms <- utils::read.csv(synonym.file, stringsAsFactors = FALSE)
synonyms <- synonyms[, colnames(synonyms)[startsWith(colnames(synonyms), "Name")]]
} else if (synonym.file == "default") {
synonyms <- ensembleTax::synonyms_v2
synonyms <- synonyms[, colnames(synonyms)[startsWith(colnames(synonyms), "Name")]]
}
taxin.cols <- base::rev(names(taxin.u))
# keep track of the taxonomy names that are not mapped
not.mapped <- vector()
# finialized mapping table from taxin to tax2map2 with only the taxonomy names
mapped <- base::data.frame(matrix(ncol=(ncol(taxin.u) + ncol(tax2map2.u)),nrow=0, dimnames=list(NULL, c(names(taxin.u), names(tax2map2.u)))))
# iterate through each row and column of taxin data frame
for (row in 1:nrow(taxin.u)) {
# keep track of the most generic taxonomy name
highest.tax <- taxin.u[row, taxin.cols[ncol(taxin.u)]]
# see if it is in the exceptions to skip the row
if (base::is.element(highest.tax, exceptions)) {
# create a NA row assignment since part of exceptions
null.row <- base::data.frame(matrix(rep(NA, ncol(tax2map2.u)), ncol = ncol(tax2map2.u), nrow = 1, dimnames=list(NULL, names(tax2map2.u))))
null.row[1] <- highest.tax
combined <- base::cbind(taxin.u[row, ], null.row)
mapped <- base::rbind(mapped, combined)
}
else {
for (col in 1:ncol(taxin.u)) {
# keep track of the original taxonomy name
orig.tax <- taxin.u[row, taxin.cols[col]]
# new addition: test for ambiguous names and make unambiguous if found
if (testAmbigNames(orig.tax) & !(is.na(orig.tax))) {
# if ambiguous name is found, grab the lowest annotated name that is not ambiguous
realnm.finder <- base::sapply(taxin.u[row , names(taxin.u)[base::seq(from = 1, to = (ncol(taxin.u) - col), by = 1)]], FUN = testAmbigNames)
realnm <- taxin.u[row , max(which(!(realnm.finder)))]
# then, append this name to the ambiguous name. This should make it unambiguous
orig.tax <- base::paste(realnm, orig.tax, sep = "-")
}
# process the name to get alternatives by igorning its format
if (ignore.format & !(is.na(orig.tax))) {
pos.taxs <- preprocessTax(orig.tax)
for(tax in pos.taxs) {
pos.taxs <- c(pos.taxs, getSynonyms(tax, synonyms))
}
pos.taxs <- base::unique(c(orig.tax, getSynonyms(orig.tax, synonyms), pos.taxs))
} else {
pos.taxs <- c(orig.tax, getSynonyms(orig.tax, synonyms))
}
# flag to keep track of when the row is already matched
matched <- FALSE
# counter to keep track what column number we are on
counter <- 1
# iterate through all alternatives of the taxonomy name with original one first
for (tax2map in pos.taxs) {
last <- FALSE
if (counter == length(pos.taxs)) {
last <- TRUE
}
if (!is.na(tax2map)) {
# find matching
match <- findMapping(taxonomy = tax2map, tax2map2 = tax2map2.u)
if (is.data.frame(match)) {
combined <- cbind(taxin.u[row, ], match)
mapped <- rbind(mapped, combined)
matched <- TRUE
break
} else { # if no matching is found, add to not.mapped
if (last) {
not.mapped <- c(not.mapped, orig.tax)
}
}
}
counter <- counter + 1
}
# if a match is found, we can move onto the next row
if (matched) {
break
}
}
}
}
# use the mapped table created to left join the original data frame with metadata
asv.mapped <- base::merge(x=tt, y=mapped, by=colnames(taxin.u), all.x=TRUE)
asv.mapped <- asv.mapped[ , !(colnames(asv.mapped) %in% colnames(taxin.u))]
# remove the unique addition of tax2map2 column names
colnames(asv.mapped) <- base::gsub("tax2map2_", "", colnames(asv.mapped))
# filter out duplicates for not mapped taxonomy names
not.mapped <- base::unique(not.mapped)
zz <- base::apply(asv.mapped, MARGIN = 2, FUN = as.character)
df <- base::as.data.frame(zz, stringsAsFactors = FALSE)
asv.mapped <- df
asv.mapped <- sort_my_taxtab(asv.mapped, ranknames = tax2map2.ranks)
rownames(asv.mapped) <- NULL
if (!(base::is.null(outfilez))) {
utils::write.csv(mapped, outfilez[1], row.names=FALSE)
not.mapped.df <- base::as.data.frame(not.mapped)
utils::write.table(not.mapped.df, outfilez[2], row.names=FALSE, col.names=FALSE)
utils::write.csv(asv.mapped, outfilez[3], row.names=FALSE)
}
if (streamline) {
return(asv.mapped)
} else {
return(list(mapped, not.mapped, asv.mapped))
}
}
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.