The BiMiCo (Biomap Microbiome Core-tools) package offers single-step preprocessing of 16S microbial marker gene sequencing raw data files for 454 and Illumina platforms. Using as input fastq files, sample metadata and a taxonomy database of choice, it outputs: a) ASV table b) synchronized metadata c) taxonomy table; or alternatively, a phyloseq object containing all three. The current tutorial and example data is aimed at the simple preprocessing of single-end Illumina reads, from swab samples of a healthy cohort from the Dorrenstein 3D Metabolic Map study (SRA:PRJEB5758).


evpar <- FALSE

Load required packages

library(BiMiCo)
library(Biobase)



Input files

The workflow requires three types of input (in the same directory for the below example): 1) fastq files for analysis 2) Sample metadata table in .csv format (samples in rows, variables in columns) 3) Taxonomy database to use with DADA2, e.g., SILVA, RDP, GTDB etc...

Working directory (add trailing separator / or \, depending on OS):

wdir <- "/Your/working/dir/"

Folder containing fastq files:

fqdir <- "Zen_Mouse_PE/"
rawfqs <- paste0(wdir, fqdir)

Metadata .csv table, ROWNAMES should contain the file basename without read 1/2 or file extension (see example):

mfile <- "mouse.dpw.csv"
phedat <- read.csv(paste0(wdir,fqdir,mfile))
# Match actual filenames with phenodata rownames
rownames(phedat) <- phedat$Filename
phedat <- phedat[ order(rownames(phedat)), ]
 phedat$fastq <- phedat$Filename

directory to write quality-filtered fastq files to:

fdir <- "dada_filtfqs"
filt_fqs_dir <- paste0(wdir,fdir)

directory to write various outputs (figures & graphs) to:

rdir <- "tutorial_results"
results_dir <- paste0(wdir,rdir)

File format: please specify pattern= if other than ".fastq"

fastqs <- sort(list.files(rawfqs, pattern=".fastq"))
fwd_fqs <- sort(list.files(rawfqs, pattern="_R1.fastq"))
rev_fqs <- sort(list.files(rawfqs, pattern="_R2.fastq"))

Check if files can be located correctly

file.exists(paste0(rawfqs,fastqs))
file.exists(paste0(rawfqs,fwd_fqs))
file.exists(paste0(rawfqs,rev_fqs))

After checks, specify input files

infqs <- paste0(rawfqs,fastqs)
in_fwd <- paste0(rawfqs,fwd_fqs)
in_rev <- paste0(rawfqs,rev_fqs)

Please explicitly specify PRIMER TYPE (amplified region) of your study as a character string column of phenodata (e.g. "V3-V4" region for the demo data):

phedat$primer_type <- "V4"

Sync phenodata and input fastq files

 phedat <- phedat[ phedat$fastq %in% fastqs, ]
summary(fastqs==phedat$fastq)

For tutorial purposes, we create a random sequencing batch variable as a column of phenodata

phedat$batch <- c(rep("b1",4),rep("b2",4),rep("b3",4))

Subset phenodata table to represent single samples instead of PE reads

phedat_filt <- phedat[phedat$read=="_R1.fastq", ]
rownames(phedat_filt) <- phedat_filt$Sample

taxonomy database to use with DADA2 (Greengenes v.13.8 used in the tutorial for speed, but not recommended otherwise, since it is outdated):

txset <- paste0(wdir,"gg_13_8_train_set_97.fa.gz")



Preprocess fastqs per batch with dada2
# Filter all samples in one go
prep_illPE(fwd_reads = in_fwd,
           rev_reads = in_rev,
           rawext = "_R1.fastq",
           filt_out = filt_fqs_dir,
           mtthread = T)

# DADA2, error modeling per batch
asvtab_pl01 <- asvtab_illPE_v2(filt_out = filt_fqs_dir, 
                            batch = phedat_filt$batch)

## Match names btw metadata and asv table, can be use case-specific
colnames(asvtab_pl01) <- gsub('_df_R1.fastq.gz','',colnames(asvtab_pl01))
summary(colnames(asvtab_pl01)==rownames(phedat_filt))

#save
write.csv(asvtab_pl01, file = "asvtab_tutorial.csv")
write.csv(phedat, file = "phedat_tutorial.csv")

Assign taxonomy

taxtab_pl01 <- asgntax(asvtab_pl01,
                  taxdat = txset,
                  revcomp = T,
                  mtthread = T)
# Check if names match
summary(rownames(taxtab_pl01)==rownames(asvtab_pl01))

#save
write.csv(taxtab_pl01, file = "taxtab_tutorial.csv")

Create and save as Phyloseq object

epl01 <- create_phylo(asvtab_pl01, taxtab_pl01, phedat_filt)

saveRDS(epl01, file = "rawasvs_tutorial_PE.Rds")

Decontamination single dataset, variables highly specific to use case

dc1 <- MicrobIEM_decontamination(as.data.frame(t1), 
                                 SAMPLE = colnames(sp_set)[sp_set$env_biome=="ENVO:human-associated habitat"],
                                 NEG1 = colnames(sp_set)[sp_set$Description=="Blank sample"],
                                 NEG2 = colnames(sp_set)[sp_set$Description=="Control sample"],
                                 ratio_NEG1_threshold = 0.1,
                                 ratio_NEG2_threshold = 0.1,
                                 span_NEG1_threshold = 0.1,
                                 span_NEG2_threshold = 0.1)

# save decontam table
write.csv(dc1, file = "./Tutorial_decontam.csv")


peterolah001/BiMiCo documentation built on April 24, 2023, 3:35 a.m.