D Catlett 8/26/2020
This vignette demonstrates how to create a custom tax2map2 object for use with taxmapper. A tax2map2 object is just a collection of all unique taxonomic assignments that can be made using a particular reference database.
The files this particular script works with are pulled from the dada2 webpage (https://benjjneb.github.io/dada2/training.html), and are standard fasta files with taxonomic names separated by semi-colons. The input files were too big for me to push onto Github, but if you want to follow along go grab them from the dada2 page and just change the file paths below.
We'll do the RDP training set first, starting with reading it in as a DNAStringSet object using Biostrings.
rm(list = ls())
library("Biostrings")
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, basename, cbind, colnames,
## dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
## grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
## union, unique, unsplit, which, which.max, which.min
## Loading required package: S4Vectors
## Loading required package: stats4
##
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:base':
##
## expand.grid
## Loading required package: IRanges
## Loading required package: XVector
##
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
##
## strsplit
ff <- "~/Documents/R/rdp_train_set_16.fa" # CHANGE to your fasta file name
fastaFile <- Biostrings::readDNAStringSet(ff)
head(fastaFile)
## A DNAStringSet instance of length 6
## width seq names
## [1] 1445 GAACGCTGGCGGCGTGCTTAACA...ATCGGCGATTGGGACGAAGTCGT Bacteria;Actinoba...
## [2] 1536 GTTTGATCCTGGCTCAGATTGAA...CTAGGGGAACCTGCGGCTGGATC Bacteria;Proteoba...
## [3] 1482 TAGAGTTTGATCCTGGCTCAGGA...GAAGTCGTAACAAGGTAGCCGTA Bacteria;Actinoba...
## [4] 1386 AAGAGTTTGATCCTGGCTCAGAG...TTGACCTTAANCCGGNGAGCGAA Bacteria;Proteoba...
## [5] 1451 ATTGAACGCTGGCGGCATGCCTT...CGGCAGGGTTCGTGACTGGGGTG Bacteria;Proteoba...
## [6] 1470 GACGAACGCTGGCGGCGTGCTTA...CAAGGTAGCGTACCGGAAGGTGC Bacteria;Actinoba...
Looks good, now let's work up the sequence names and split them by the ";". We'll get help from the stringr package.
library("stringr")
seq_name = names(fastaFile)
eh <- stringr::str_split(seq_name, pattern = ";", simplify = TRUE)
eh <- base::as.data.frame(eh[, -ncol(eh)], stringsAsFactors = FALSE)
head(eh)
## V1 V2 V3 V4
## 1 Bacteria Actinobacteria Actinobacteria Actinomycetales
## 2 Bacteria Proteobacteria Gammaproteobacteria Vibrionales
## 3 Bacteria Actinobacteria Actinobacteria Actinomycetales
## 4 Bacteria Proteobacteria Alphaproteobacteria Rhodospirillales
## 5 Bacteria Proteobacteria Betaproteobacteria Burkholderiales
## 6 Bacteria Actinobacteria Actinobacteria Actinomycetales
## V5 V6
## 1 Mycobacteriaceae Mycobacterium
## 2 Vibrionaceae Vibrio
## 3 Mycobacteriaceae Mycobacterium
## 4 Acetobacteraceae Acetobacter
## 5 Burkholderiales_incertae_sedis Aquincola
## 6 Micromonosporaceae Couchioplanes
We've successfully created a dataframe with ranks along the columns and rows corresponding to taxonomic assignments. Now we'll just assign rank names, make sure there are no NA's (these will create bugs in taxmapper) and make sure the taxonomic assignments are unique (to speed up taxmapper).
r <- c("domain", "phylum", "class", "order", "family", "genus")
colnames(eh) <- r
eh <- base::unique(eh, MARGIN = 1)
any(is.na(eh))
## [1] FALSE
And one last check of the data frame:
head(eh)
## domain phylum class order
## 1 Bacteria Actinobacteria Actinobacteria Actinomycetales
## 2 Bacteria Proteobacteria Gammaproteobacteria Vibrionales
## 4 Bacteria Proteobacteria Alphaproteobacteria Rhodospirillales
## 5 Bacteria Proteobacteria Betaproteobacteria Burkholderiales
## 6 Bacteria Actinobacteria Actinobacteria Actinomycetales
## 7 Bacteria Proteobacteria Alphaproteobacteria Rhodobacterales
## family genus
## 1 Mycobacteriaceae Mycobacterium
## 2 Vibrionaceae Vibrio
## 4 Acetobacteraceae Acetobacter
## 5 Burkholderiales_incertae_sedis Aquincola
## 6 Micromonosporaceae Couchioplanes
## 7 Rhodobacteraceae Maritimibacter
class(eh)
## [1] "data.frame"
typeof(eh)
## [1] "list"
Alright. This data set is now ready to plug into taxmapper.
We'll repeat for greengenes:
ff <- "~/Documents/R/gg_13_8_train_set_97.fa"
fastaFile2 <- Biostrings::readDNAStringSet(ff)
seq_name2 = names(fastaFile2)
eh2 <- stringr::str_split(seq_name2, pattern = ";", simplify = TRUE)
eh2 <- base::as.data.frame(eh2[, -ncol(eh2)], stringsAsFactors = FALSE)
r2 <- c("domain", "phylum", "class", "order", "family", "genus","species")
colnames(eh2) <- r2
eh2 <- base::unique(eh2, MARGIN = 1)
head(eh2)
## domain phylum class order
## 1 k__Bacteria p__Gemmatimonadetes c__Gemm-1 o__
## 2 k__Bacteria p__Bacteroidetes c__Flavobacteriia o__Flavobacteriales
## 3 k__Bacteria p__Chloroflexi c__Chloroflexi o__[Roseiflexales]
## 4 k__Bacteria p__Proteobacteria c__Gammaproteobacteria o__Enterobacteriales
## 5 k__Bacteria p__TM6 c__SJA-4 o__
## 6 k__Bacteria p__OP11 c__OP11-4 o__
## family genus species
## 1 f__ g__ s__
## 2 f__Flavobacteriaceae g__Flavobacterium s__
## 3 f__[Roseiflexaceae] g__Roseiflexus s__
## 4 f__Enterobacteriaceae g__ s__
## 5 f__ g__ s__
## 6 f__ g__ s__
Ohh this one's different. But the stringr package is so dope* this is no biggie:
*I have no affiliation with the creator of the stringr package.
replace.me.plz <- c("k__", "p__", "c__", "o__", "f__", "g__", "s__")
for (i in 1:ncol(eh2)) {
eh2[ , i] <- str_replace_all(eh2[ , i], replace.me.plz[i], "")
}
any(is.na(eh2))
## [1] FALSE
head(eh2)
## domain phylum class order
## 1 Bacteria Gemmatimonadetes Gemm-1
## 2 Bacteria Bacteroidetes Flavobacteriia Flavobacteriales
## 3 Bacteria Chloroflexi Chloroflexi [Roseiflexales]
## 4 Bacteria Proteobacteria Gammaproteobacteria Enterobacteriales
## 5 Bacteria TM6 SJA-4
## 6 Bacteria OP11 OP11-4
## family genus species
## 1
## 2 Flavobacteriaceae Flavobacterium
## 3 [Roseiflexaceae] Roseiflexus
## 4 Enterobacteriaceae
## 5
## 6
class(eh2)
## [1] "data.frame"
typeof(eh2)
## [1] "list"
Now it looks just like the other RDP tax2map2 object we made above. Yeah!
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.