The BiMiCo (Biomap Microbiome Core-tools) package offers single-step preprocessing of 16S microbial marker gene sequencing raw data files for 454 and Illumina platforms. Using as input fastq files, sample metadata and a taxonomy database of choice, it outputs: a) ASV table b) synchronized metadata c) taxonomy table; or alternatively, a phyloseq object containing all three. The current tutorial and example data is aimed at the simple preprocessing of single-end Illumina reads, from swab samples of a healthy cohort from the Dorrenstein 3D Metabolic Map study (SRA:PRJEB5758).
evpar <- FALSE
library(BiMiCo) library(Biobase)
The workflow requires three types of input (in the same directory for the below example): 1) fastq files for analysis 2) Sample metadata table in .csv format (samples in rows, variables in columns) 3) Taxonomy database to use with DADA2, e.g., SILVA, RDP, GTDB etc...
Working directory (add trailing separator / or \, depending on OS):
wdir <- "/Your/working/dir/"
Folder containing fastq files:
fqdir <- "Zen_Mouse_PE/" rawfqs <- paste0(wdir, fqdir)
Metadata .csv table, ROWNAMES should contain the file basename without read 1/2 or file extension (see example):
mfile <- "mouse.dpw.csv" phedat <- read.csv(paste0(wdir,fqdir,mfile)) # Match actual filenames with phenodata rownames rownames(phedat) <- phedat$Filename phedat <- phedat[ order(rownames(phedat)), ] phedat$fastq <- phedat$Filename
directory to write quality-filtered fastq files to:
fdir <- "dada_filtfqs" filt_fqs_dir <- paste0(wdir,fdir)
directory to write various outputs (figures & graphs) to:
rdir <- "tutorial_results" results_dir <- paste0(wdir,rdir)
File format: please specify pattern= if other than ".fastq"
fastqs <- sort(list.files(rawfqs, pattern=".fastq")) fwd_fqs <- sort(list.files(rawfqs, pattern="_R1.fastq")) rev_fqs <- sort(list.files(rawfqs, pattern="_R2.fastq"))
Check if files can be located correctly
file.exists(paste0(rawfqs,fastqs)) file.exists(paste0(rawfqs,fwd_fqs)) file.exists(paste0(rawfqs,rev_fqs))
After checks, specify input files
infqs <- paste0(rawfqs,fastqs) in_fwd <- paste0(rawfqs,fwd_fqs) in_rev <- paste0(rawfqs,rev_fqs)
Please explicitly specify PRIMER TYPE (amplified region) of your study as a character string column of phenodata (e.g. "V3-V4" region for the demo data):
phedat$primer_type <- "V4"
Sync phenodata and input fastq files
phedat <- phedat[ phedat$fastq %in% fastqs, ] summary(fastqs==phedat$fastq)
For tutorial purposes, we create a random sequencing batch variable as a column of phenodata
phedat$batch <- c(rep("b1",4),rep("b2",4),rep("b3",4))
Subset phenodata table to represent single samples instead of PE reads
phedat_filt <- phedat[phedat$read=="_R1.fastq", ] rownames(phedat_filt) <- phedat_filt$Sample
taxonomy database to use with DADA2 (Greengenes v.13.8 used in the tutorial for speed, but not recommended otherwise, since it is outdated):
txset <- paste0(wdir,"gg_13_8_train_set_97.fa.gz")
# Filter all samples in one go prep_illPE(fwd_reads = in_fwd, rev_reads = in_rev, rawext = "_R1.fastq", filt_out = filt_fqs_dir, mtthread = T) # DADA2, error modeling per batch asvtab_pl01 <- asvtab_illPE_v2(filt_out = filt_fqs_dir, batch = phedat_filt$batch) ## Match names btw metadata and asv table, can be use case-specific colnames(asvtab_pl01) <- gsub('_df_R1.fastq.gz','',colnames(asvtab_pl01)) summary(colnames(asvtab_pl01)==rownames(phedat_filt)) #save write.csv(asvtab_pl01, file = "asvtab_tutorial.csv") write.csv(phedat, file = "phedat_tutorial.csv")
Assign taxonomy
taxtab_pl01 <- asgntax(asvtab_pl01, taxdat = txset, revcomp = T, mtthread = T) # Check if names match summary(rownames(taxtab_pl01)==rownames(asvtab_pl01)) #save write.csv(taxtab_pl01, file = "taxtab_tutorial.csv")
Create and save as Phyloseq object
epl01 <- create_phylo(asvtab_pl01, taxtab_pl01, phedat_filt) saveRDS(epl01, file = "rawasvs_tutorial_PE.Rds")
dc1 <- MicrobIEM_decontamination(as.data.frame(t1), SAMPLE = colnames(sp_set)[sp_set$env_biome=="ENVO:human-associated habitat"], NEG1 = colnames(sp_set)[sp_set$Description=="Blank sample"], NEG2 = colnames(sp_set)[sp_set$Description=="Control sample"], ratio_NEG1_threshold = 0.1, ratio_NEG2_threshold = 0.1, span_NEG1_threshold = 0.1, span_NEG2_threshold = 0.1) # save decontam table write.csv(dc1, file = "./Tutorial_decontam.csv")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.