
FASTQ2OTU

:warning: This package is still under active development

FASTQ2OTU was developed as an easy and effective tool for downloading, analyzing, and processing large-scale microbiome rRNA gene data obtained from NCBI's SRA database. The package uses many functions from DADA2 to analyze sequence data. The primary objectives of FASTQ2OTU are to (1) increase the reproducibility of microbiome analysis and (2) encourage the analysis of archived data to obtain new knowledge.

FASTQ2OTU's workflow can be broken down into multiple stages:

1. Get Sequences - Sequences can be downloaded using fastq-dump or wget (if FTP links are available).
2. Plot Quality Distribution - Generate a figure that shows the quality distribution of the dataset. This step can be run independently if necessary.
3. Filter and Trim
4. Learn Errors and Denoise
5. Find and Remove Chimeric Sequences
6. Merge Paired-End Sequences
7. Assign Taxonomy
8. Merge OTU Tables - Merges individual OTU tables to create a single table that can be used in downstream analyses.

Advantages of using FASTQ2OTU

Getting Started

After installing FASTQ2OTU, the following input files and/or directories will be required to begin processing data:

- A YML-formatted config file containing all parameters (more information about the config file can be found below).
- A directory of single or paired-end FASTQ files OR a text file containing SRA ids to download from NCBI.
- A bit of knowledge about the sequences
  - Ideal trimming parameters
  - Forward and reverse primer lengths
  - If merging, the desired overlap length
- Working knowledge of DADA2 workflow

Execute pipeline

# Load package into environment
library("FASTQ2OTU")

# Path to config file
paired_config <- "path/to/my_paired-example_config.yml"

# Run pipeline
runPipeline(configFile = paired_config, isPaired = TRUE, getQuality = TRUE, getMergedSamples = TRUE, getDownloadedSeqs = TRUE, getGeneratedReport = FALSE)

The runPipeline() function runs the entire DADA2 pipeline. Its parameters allow users to specify which steps of the pipeline they would like to execute. The following table provides a description of each parameter and the action(s) it controls.

| Parameter | Description | Directions |
| --------------- |-------------|------------|
| configFile | Path to YML file containing all user inputs. | The file must be formatted with the correct variable names (please refer to the template). |
| isPaired | TRUE if handling paired-end data and FALSE if handling single-end data. | Please note that paired-end and single-end data must be processed separately (the package cannot analyze both data types simultaneously). |
| getQuality | TRUE if you would like to generate a quality distribution plot and FALSE if you would like to skip the step. | This step can be run independently. |
| getMergedSamples | TRUE if you would like to generate a merged sample table and FALSE if you would like to skip the step. | Generates a single table containing data from all samples. |
| getDownloadedSeqs | TRUE if you would like to use fastq-dump or wget to download data directly from NCBI's SRA database. | Requires a text file containing all SRA sample IDs or FTP download links. |
| getGeneratedReport | If TRUE, a FASTQC report is generated using the FASTQCR R package. | This step can also be run independently. |

Quick Start Guide

DADA2 is an R package that allows users to perform high-resolution taxonomic analyses from FASTQ files. This package will allow most users to analyze datasets using the DADA2 pipeline. This procedure will cover some basics of R programming, installing and running the package on an R server, and interpreting some of the outputs generated. There are two objectives for this document:

1. Introduce new users to DADA2’s functions.
2. Set up a pipeline for 16S rRNA analyses of target bacterial isolates.

Plot Quality Distribution

DADA2’s plotQualityProfile() function creates one or more plots that visualize the overall distribution of quality scores within a dataset. Users can use the plots to make informed decisions about how they would like their data to be processed (e.g. filtering and trimming). The generated quality graphs show colored lines that signify different statistics:

- Green is the mean quality score for all reads in a single dataset.
- Orange is the median.
- Dashed orange lines demarcate the 25th and 75th quantiles.
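To make the statistics behind those lines concrete, here is a minimal base-R sketch that computes the same per-position summaries (mean, median, 25th and 75th quantiles) from a small, made-up matrix of quality scores. It does not call plotQualityProfile(); the helper name summarise_quality and the threshold-based truncation rule are assumptions for illustration only.

```r
# Toy quality matrix (made-up scores): rows are reads, columns are positions.
qual <- rbind(
  c(38, 37, 35, 30, 22),
  c(39, 38, 36, 28, 20),
  c(37, 36, 30, 25, 15),
  c(38, 38, 34, 31, 24)
)

# Hypothetical helper: per-position summary statistics, one column per position.
summarise_quality <- function(q) {
  rbind(
    mean   = colMeans(q),
    q25    = apply(q, 2, quantile, probs = 0.25, names = FALSE),
    median = apply(q, 2, median),
    q75    = apply(q, 2, quantile, probs = 0.75, names = FALSE)
  )
}

qual_summary <- summarise_quality(qual)
print(round(qual_summary, 2))

# Illustrative rule (an assumption, not a DADA2 default): truncate just before
# the first position whose mean quality falls below a chosen threshold.
threshold <- 30
below <- which(qual_summary["mean", ] < threshold)
trunc_len <- if (length(below) > 0) below[1] - 1 else ncol(qual)
print(trunc_len)
```

In practice the truncation length is usually chosen by eye from the plot; a rule like the one above is only a starting point.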

Merging Samples

Sequence tables generated by DADA2’s makeSequenceTable() function for a single sample are formatted as single-row matrices, with consensus sequences as column headings and read counts as elements in the row. OTU tables (given by DADA2's assignTaxonomy() or assignSpecies() function) contain taxonomic assignments and amplicon sequence variants (ASVs). FASTQ2OTU's mergeSamples() function merges data from sequence and OTU tables obtained from different samples to generate a single table. The final table can be used to make inter-sample comparisons that may inform downstream analyses.
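The idea of merging single-row sequence tables can be sketched in base R as follows. This is not the implementation of mergeSamples(); the helper merge_seq_tables and the toy sequences and counts are hypothetical, and the sketch only shows how tables with different column sets can be combined, with zeros for sequences a sample does not contain.

```r
# Two toy single-sample tables: one row each, sequences as column names.
s1 <- matrix(c(120L, 30L), nrow = 1,
             dimnames = list("sampleA", c("ACGT", "TTGA")))
s2 <- matrix(c(75L, 10L), nrow = 1,
             dimnames = list("sampleB", c("TTGA", "GGCC")))

# Hypothetical helper: combine single-row count matrices into one table.
merge_seq_tables <- function(tables) {
  all_seqs <- unique(unlist(lapply(tables, colnames)))
  sample_names <- vapply(tables, rownames, character(1))
  merged <- matrix(0L, nrow = length(tables), ncol = length(all_seqs),
                   dimnames = list(sample_names, all_seqs))
  for (tab in tables) {
    # Copy each sample's counts into the columns that match its sequences.
    merged[rownames(tab), colnames(tab)] <- tab[1, ]
  }
  merged
}

merged <- merge_seq_tables(list(s1, s2))
print(merged)
```

For real DADA2 output, the dada2 package also provides mergeSequenceTables(), which serves a similar purpose for sequence tables.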

Downloading Data from NCBI

Public data can be accessed from NCBI’s SRA website. To view datasets, enter a project ID (e.g. PRJEB8073), click "Search", and select the "Send results to Run Selector" link to view the results interactively. The Run Selector tool can also be accessed directly. To obtain a list of all SRA accession IDs within a given project, click the "Accession List" button in the middle of the "Select" panel and wait for the text file to be downloaded.
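Before handing an accession list to the pipeline, it can help to check that the file contains what you expect. The sketch below assumes the file holds one run accession per line (SRR, ERR, or DRR followed by digits); the helper read_accessions and the placeholder IDs are hypothetical and not part of FASTQ2OTU.

```r
# Hypothetical helper: read an accession list, drop blanks and duplicates,
# and keep only lines that look like SRA run accessions.
read_accessions <- function(path) {
  ids <- trimws(readLines(path, warn = FALSE))
  ids <- ids[nzchar(ids)]
  ok <- grepl("^[SED]RR[0-9]+$", ids)
  if (any(!ok)) {
    warning("Skipping lines that do not look like run accessions: ",
            paste(ids[!ok], collapse = ", "))
  }
  unique(ids[ok])
}

# Demo with a temporary file and placeholder (made-up) accession numbers.
acc_file <- tempfile(fileext = ".txt")
writeLines(c("ERR0000001", "ERR0000002", "", "not_an_id", "ERR0000001"), acc_file)
ids <- suppressWarnings(read_accessions(acc_file))
print(ids)
```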

Using FASTQ-DUMP

To download datasets using NCBI's fastq-dump utility, download the sra-toolkit from NCBI and obtain the path to the fastq-dump tool. Record the paths to the fastq-dump script and the SRA accession list in the config file (described below). Make sure to set the getDownloadedSeqs parameter to TRUE when executing the runPipeline() function.
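For orientation, this is roughly what a fastq-dump call for a single accession looks like when assembled from R. It is a sketch under assumptions: the helper build_fastq_dump_args, the output directory, and the accession are hypothetical, and FASTQ2OTU may invoke the tool with different options. The flags --split-files, --gzip, and --outdir are standard fastq-dump options.

```r
# Hypothetical helper: assemble command-line arguments for one accession.
build_fastq_dump_args <- function(accession, out_dir, paired = TRUE) {
  args <- c("--gzip", "--outdir", shQuote(out_dir))
  if (paired) {
    # Write forward and reverse reads to separate files.
    args <- c("--split-files", args)
  }
  c(args, accession)
}

dump_args <- build_fastq_dump_args("ERR0000001", "raw_reads", paired = TRUE)
cat("fastq-dump", dump_args, "\n")

# To actually run the download (requires a local sra-toolkit installation):
# system2("path/to/fastq-dump", dump_args)
```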

Using WGET

To download datasets from SRA using wget, navigate to SRA-Explorer and input your project ID. Once you click the search icon, a table should appear at the bottom of the window. Select all rows in the table and store the results by clicking the blue "Add to Collection" button on the right. Please note that the search only outputs a certain number of results each time (with the maximum being 500). To obtain data on more than 500 samples, you must update the "Start at Record" text box after each search. Once you have stored all your samples in your collection, click the shopping cart icon on the top right. Click the tab that says "Raw FastQ Download URLs" and select the download link. Record the path to the downloaded text file in the config file and make sure to set the getDownloadedSeqs parameter to TRUE when executing the runPipeline() function.
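Because paired-end and single-end data must be processed separately, it can be useful to inspect the URL list before downloading. The sketch below assumes the common naming convention where paired files end in _1.fastq.gz and _2.fastq.gz; the host and accession numbers are placeholders, not real download links.

```r
# Placeholder URLs (made up for illustration); a real list would be read
# with readLines() from the downloaded text file.
urls <- c(
  "ftp://example.org/fastq/ERR0000001/ERR0000001_1.fastq.gz",
  "ftp://example.org/fastq/ERR0000001/ERR0000001_2.fastq.gz",
  "ftp://example.org/fastq/ERR0000002/ERR0000002.fastq.gz"
)

files <- basename(urls)

# Strip the optional read suffix and the extension to recover the sample ID.
sample_id <- sub("(_[12])?\\.fastq(\\.gz)?$", "", files)

# Label each file by read direction based on its suffix.
read <- ifelse(grepl("_1\\.fastq", files), "forward",
               ifelse(grepl("_2\\.fastq", files), "reverse", "single"))

manifest <- data.frame(sample = sample_id, read = read, file = files,
                       stringsAsFactors = FALSE)
print(manifest)

# A sample counts as paired when both a forward and a reverse file are listed.
is_paired <- tapply(manifest$read, manifest$sample,
                    function(r) all(c("forward", "reverse") %in% r))
print(is_paired)

# Files could then be fetched one by one, for example:
# download.file(urls[1], destfile = files[1], method = "wget")
```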

Installation

This application is designed to be lightweight and simple to use. The intended use is via a remote server; however, it can also be run using RStudio (the package was written in R 3.5.3). It can be downloaded from GitHub using devtools.

if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("Biostrings")
BiocManager::install("ShortRead")
BiocManager::install("dada2")
BiocManager::install("gtools")

# Install package into environment
install.packages("devtools")
library(devtools)
install_github("ananata/fastq2otu")
library("FASTQ2OTU")

Using a config file

| Variable | Type | Default | Description |
| --------------- |--------------|----------|-------------|
| projectPrefix | Character | "myproject" | Prefix to append to newly created files (e.g. _filtered_files/ is created to store filtered files). |
| outDir | Character | Current working directory | Path to the output directory that will contain all output files and documents. |
| pathToData | Character | N/A | Path to directory storing all input data. |
| verbose | Logical | FALSE | Sets the verbose parameter for all functions. |
| multithread | Logical | FALSE | Sets the multithread parameter for all functions. |
| pathToSampleIDs | Character | N/A | The path to a text file containing SRA Accession IDs. |
| fastaPattern | Character | ^.*[1,2]?.fastq(.gz)?$ | Regex pattern to use when parsing directories for FASTQ files. |
| aggregateQual | Logical | N/A | Provide TRUE if you would like to aggregate your quality profile diagram. |
| qualN | Numeric | 0 | Enter the number of bases to sample to learn sequence error rates. |
| useFastqDump | Logical | FALSE | Provide TRUE if you would like to download sequences using a locally installed version of SRA's fastq-dump. |
| pathToFastqDump | Character | N/A | Path to fastq-dump script. Required if the useFastqDump parameter is TRUE. |
| pathToSampleURLs | Character | N/A | Path to text file containing FTP download links. |
| pathToFastqc | Character | N/A | Path to FASTQC software. Required to use FASTQCR. |
| installFastqc | Logical | FALSE | If TRUE, FASTQC will be automatically downloaded into the user's home directory. Unless a value for pathToFastqc is provided, the new download will overwrite the older version. |
| pathToFastqcResults | Character | N/A | Path to the directory storing the FASTQC reports. |
| taxDatabase | Character | N/A | Required. Path to reference taxonomy database. |
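To see which file names the default fastaPattern accepts, the pattern can be tried against a few candidate names with grepl(). The file names below are made up. Note that, as printed in the table, the dots in the pattern are unescaped and therefore match any character; the stricter variant shown is an assumption for comparison, not a package default.

```r
# Default pattern exactly as printed in the table above.
fasta_pattern <- "^.*[1,2]?.fastq(.gz)?$"

candidates <- c("sample_1.fastq.gz", "sample_2.fastq", "reads.fastq",
                "sample.fasta", "notes.txt", "sample_fastq")

matches <- grepl(fasta_pattern, candidates)
print(setNames(matches, candidates))

# Hypothetical stricter variant: escaped dots and an explicit _1/_2 suffix.
strict_pattern <- "^.*(_[12])?\\.fastq(\\.gz)?$"
strict <- grepl(strict_pattern, candidates)
print(setNames(strict, candidates))
```

The two patterns differ only on "sample_fastq", which the default pattern accepts because the unescaped dot matches the underscore.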

Please refer to a template config file for a more comprehensive list of the available parameters.
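The requirements stated in the table (taxDatabase is required; pathToFastqDump is required when useFastqDump is TRUE) can be checked up front. The sketch below works on a plain R list, such as the one a YAML reader like yaml::read_yaml() would return; the helper check_config is hypothetical and is not part of FASTQ2OTU, and the defaults are copied from the table.

```r
# Hypothetical helper: fill in documented defaults and collect problems.
check_config <- function(cfg) {
  problems <- character(0)
  if (is.null(cfg$taxDatabase)) {
    problems <- c(problems, "taxDatabase is required")
  }
  if (isTRUE(cfg$useFastqDump) && is.null(cfg$pathToFastqDump)) {
    problems <- c(problems,
                  "pathToFastqDump is required when useFastqDump is TRUE")
  }
  # Defaults taken from the table above.
  defaults <- list(projectPrefix = "myproject", verbose = FALSE,
                   multithread = FALSE, useFastqDump = FALSE,
                   installFastqc = FALSE)
  missing_keys <- setdiff(names(defaults), names(cfg))
  cfg[missing_keys] <- defaults[missing_keys]
  list(config = cfg, problems = problems)
}

# Example: an incomplete configuration (paths are placeholders).
result <- check_config(list(pathToData = "path/to/fastq", useFastqDump = TRUE))
print(result$problems)
print(result$config$projectPrefix)
```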

Authors

License

This project is licensed under the GNU GPLv3 License. This license restricts the use of this application in non-open-source systems. Please contact the authors with questions about relicensing this software for use in non-open-source systems.

Acknowledgments



ananata/fastq2otu documentation built on Feb. 2, 2022, 4:20 p.m.