An Ultra-Fast All-in-One FASTQ preprocessor


title: "An Ultra-Fast All-in-One FASTQ preprocessor"
author: "Wei Wang periwinkle.david@gmail.com"
date: "r format(Sys.Date(), '%m/%d/%Y')"
package: Rfastp
output:
  BiocStyle::html_document:
    number_sections: yes
    toc: true
vignette: >
  %\VignetteIndexEntry{Rfastp}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
  %\usepackage[utf8]{inputenc}
bibliography:
  - fastp.bib


knitr::opts_chunk$set(tidy=FALSE, cache=FALSE,
                      #dev="png",
                      message=FALSE, error=FALSE, warning=TRUE)
options(width=100)

Introduction

The Rfastp package provides an R interface to fastp, an all-in-one preprocessing toolkit for FASTQ files [@10.1093/bioinformatics/bty560].

Installation

Use the BiocManager package to download and install the package from Bioconductor as follows:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("Rfastp")

If required, the latest development version of the package can also be installed from GitHub.

BiocManager::install("remotes")
BiocManager::install("RockefellerUniversity/Rfastp")

Once the package is installed, load it into your R session:

library(Rfastp)

FastQ Quality Control with rfastp

The package contains three example FASTQ files used in this section: one single-end file and a pair of paired-end files.

se_read1 <- system.file("extdata","Fox3_Std_small.fq.gz",package="Rfastp")
pe_read1 <- system.file("extdata","reads1.fastq.gz",package="Rfastp")
pe_read2 <- system.file("extdata","reads2.fastq.gz",package="Rfastp")
outputPrefix <- tempfile(tmpdir = tempdir())

A normal QC run for a single-end FASTQ file.

Rfastp supports multithreading; set the number of threads with the thread parameter.

se_json_report <- rfastp(read1 = se_read1, 
    outputFastq = paste0(outputPrefix, "_se"), thread = 4)
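
As a hedged sketch, one way to choose a value for thread is to derive it from the number of cores on the machine. It relies on parallel::detectCores() from R's bundled parallel package. The helper name pick_threads and the cap of 4 are illustrative assumptions, not Rfastp recommendations.

```r
# Hypothetical helper: choose a thread count from the available cores,
# leaving one core free and never exceeding an arbitrary cap.
pick_threads <- function(cap = 4L) {
    cores <- parallel::detectCores()
    if (is.na(cores)) {
        return(1L)  # detectCores() can return NA; fall back to one thread
    }
    as.integer(max(1L, min(cap, cores - 1L)))
}

n_threads <- pick_threads()
n_threads
```

The result can then be passed as thread = n_threads in the rfastp() call above.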

A normal QC run for paired-end FASTQ files.

pe_json_report <- rfastp(read1 = pe_read1, read2 = pe_read2,
    outputFastq = paste0(outputPrefix, "_pe"))

Merge paired-end FASTQ files after QC.

pe_merge_json_report <- rfastp(read1 = pe_read1, read2 = pe_read2, merge = TRUE,
    outputFastq = paste0(outputPrefix, '_unpaired'),
    mergeOut = paste0(outputPrefix, "_merged.fastq.gz"))
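
To make the idea of merging concrete, here is a toy base R sketch of how an overlapping read pair can be joined into one sequence. It is not fastp's algorithm: it requires an exact overlap, ignores base qualities, and the names revcomp, merge_pair and min_overlap are assumptions made for this illustration only.

```r
# Reverse-complement a DNA string (upper-case A/C/G/T only).
revcomp <- function(x) {
    comp <- chartr("ACGT", "TGCA", x)
    paste(rev(strsplit(comp, "")[[1]]), collapse = "")
}

# Toy merge: find the longest exact overlap between the end of read1 and the
# start of the reverse-complemented read2. `min_overlap` is a hypothetical
# parameter for this sketch; exact matching is a simplification.
merge_pair <- function(r1, r2, min_overlap = 4L) {
    rc2 <- revcomp(r2)
    max_k <- min(nchar(r1), nchar(rc2))
    if (max_k < min_overlap) return(NA_character_)
    for (k in seq(max_k, min_overlap)) {
        tail_r1 <- substr(r1, nchar(r1) - k + 1L, nchar(r1))
        head_r2 <- substr(rc2, 1L, k)
        if (identical(tail_r1, head_r2)) {
            return(paste0(r1, substr(rc2, k + 1L, nchar(rc2))))
        }
    }
    NA_character_  # no overlap found: the pair stays unmerged
}

# Made-up reads from a 16 bp fragment, overlapping by 4 bases.
merged <- merge_pair("ACGTACGTAA", "AACCGGTTAC")
merged  # "ACGTACGTAACCGGTT"
```

In this sketch a pair without a sufficient overlap yields NA, which corresponds to the reads that rfastp writes to the outputFastq files rather than to mergeOut.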

UMI processing

A normal UMI processing run for a 10X single-cell library.

umi_json_report <- rfastp(read1 = pe_read1, read2 = pe_read2, 
    outputFastq = paste0(outputPrefix, '_umi1'), umi = TRUE, umiLoc = "read1",
    umiLength = 16)

Set a customized UMI prefix and location in sequence name.

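The following example adds a prefix string before the UMI sequence in the sequence name. By default, an "_" is added between the prefix string and the UMI sequence, and the UMI sequence is inserted into the sequence name before the first space. Here, umiNoConnection = TRUE omits the "_" and umiIgnoreSeqNameSpace = TRUE ignores spaces in the sequence name.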

umi_json_report <- rfastp(read1 = pe_read1, read2 = pe_read2, 
    outputFastq = paste0(outputPrefix, '_umi2'), umi = TRUE, umiLoc = "read1",
    umiLength = 16, umiPrefix = "#", umiNoConnection = TRUE, 
    umiIgnoreSeqNameSpace = TRUE)
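
The effect of these options on a read name can be sketched in base R. This is an illustration under stated assumptions, not the fastp implementation: the helper tag_umi and its arguments are hypothetical, the ":" separator and the append-at-end behaviour when spaces are ignored are assumptions, and the read is made up.

```r
# Hypothetical helper: move the first `umi_length` bases of a read into its
# name. `prefix`, `connect` and `ignore_space` loosely mirror umiPrefix,
# umiNoConnection and umiIgnoreSeqNameSpace; the ":" separator is an assumption.
tag_umi <- function(name, seq, umi_length, prefix = "",
                    connect = TRUE, ignore_space = FALSE) {
    umi <- substr(seq, 1L, umi_length)
    tag <- if (nzchar(prefix)) {
        paste0(prefix, if (connect) "_" else "", umi)
    } else {
        umi
    }
    space_at <- as.integer(regexpr(" ", name, fixed = TRUE))
    if (ignore_space || space_at < 0L) {
        new_name <- paste0(name, ":", tag)  # append after the whole name
    } else {
        # insert the tag just before the first space
        new_name <- paste0(substr(name, 1L, space_at - 1L), ":", tag,
                           substr(name, space_at, nchar(name)))
    }
    list(name = new_name, seq = substr(seq, umi_length + 1L, nchar(seq)))
}

default_style <- tag_umi("@read1 1:N:0", "AACCGGTTACGT", 4L, prefix = "UMI")
custom_style  <- tag_umi("@read1 1:N:0", "AACCGGTTACGT", 4L, prefix = "#",
                         connect = FALSE, ignore_space = TRUE)
default_style$name  # "@read1:UMI_AACC 1:N:0"
custom_style$name   # "@read1 1:N:0:#AACC"
```

In both cases the UMI bases are removed from the start of the read, so default_style$seq is "GGTTACGT".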

A QC example with customized cutoffs and adapter sequence.

Trim low-quality bases from the 3' end one base at a time (window size 1) until the quality reaches 5; trim low-quality bases from the 5' end with a 29 bp sliding window until the window's mean quality reaches 20; discard reads shorter than 29 bp; disable polyG trimming; and specify the adapter sequence for read1.

clipr_json_report <- rfastp(read1 = se_read1, 
    outputFastq = paste0(outputPrefix, '_clipr'),
    disableTrimPolyG = TRUE,
    cutLowQualFront = TRUE,
    cutFrontWindowSize = 29,
    cutFrontMeanQual = 20,
    cutLowQualTail = TRUE,
    cutTailWindowSize = 1,
    cutTailMeanQual = 5,
    minReadLength = 29,
    adapterSequenceRead1 = 'GTGTCAGTCACTTCCAGCGG'
)
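
The sliding-window idea behind the cutFront and cutTail parameters can be sketched in a few lines of base R. This is a simplified illustration, not fastp's exact procedure: the function names cut_front and cut_tail and the quality scores are made up for the example.

```r
# Toy sliding-window trimming from the 5' end: drop leading bases while the
# mean quality of the current window is below `mean_qual`.
# Returns the index of the first base that is kept (NA if none survive).
cut_front <- function(qual, window_size, mean_qual) {
    n <- length(qual)
    start <- 1L
    while (start + window_size - 1L <= n) {
        window <- qual[start:(start + window_size - 1L)]
        if (mean(window) >= mean_qual) {
            return(start)
        }
        start <- start + 1L
    }
    NA_integer_
}

# Trimming from the 3' end is the same scan applied to the reversed scores.
# Returns the index of the last base that is kept (NA if none survive).
cut_tail <- function(qual, window_size, mean_qual) {
    first_kept_rev <- cut_front(rev(qual), window_size, mean_qual)
    if (is.na(first_kept_rev)) NA_integer_ else length(qual) - first_kept_rev + 1L
}

# Made-up Phred scores for one 10-base read, for illustration only.
qual <- c(2, 3, 10, 30, 32, 35, 34, 33, 4, 2)
first_kept <- cut_front(qual, window_size = 3L, mean_qual = 20)
last_kept  <- cut_tail(qual, window_size = 1L, mean_qual = 5)
c(first_kept, last_kept)  # 3 8
```

With a window size of 1, as in the cutTailWindowSize setting above, the scan reduces to removing bases one at a time until a base meets the threshold.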

Multiple input files for read1/2 in a vector.

rfastp accepts multiple input files; it concatenates them into one file and then runs fastp.

pe001_read1 <- system.file("extdata","splited_001_R1.fastq.gz",
    package="Rfastp")
pe002_read1 <- system.file("extdata","splited_002_R1.fastq.gz",
    package="Rfastp")
pe003_read1 <- system.file("extdata","splited_003_R1.fastq.gz",
    package="Rfastp")
pe004_read1 <- system.file("extdata","splited_004_R1.fastq.gz",
    package="Rfastp")
inputfiles <- c(pe001_read1, pe002_read1, pe003_read1, pe004_read1)
cat_rjson_report <- rfastp(read1 = inputfiles, 
    outputFastq = paste0(outputPrefix, "_merged1"))
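
Because the four file names follow one pattern, the same vector can also be built programmatically. The sketch below uses only base R; the names split_names, inputfiles_alt and all_found are illustrative.

```r
# Build the four R1 file names from their shared naming pattern.
split_names <- sprintf("splited_%03d_R1.fastq.gz", 1:4)

# Resolve each name inside the package's extdata directory.
# system.file() returns "" for a file it cannot find (for example when the
# package is not installed), so check the result before using it.
inputfiles_alt <- vapply(
    split_names,
    function(f) system.file("extdata", f, package = "Rfastp"),
    character(1),
    USE.NAMES = FALSE
)
all_found <- all(nzchar(inputfiles_alt))
all_found
```

When all_found is TRUE, inputfiles_alt holds the same paths as inputfiles and can be passed to read1 in the same way.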

Concatenate multiple FASTQ files.

catfastq concatenates all the input files into a new file.

pe001_read2 <- system.file("extdata","splited_001_R2.fastq.gz",
    package="Rfastp")
pe002_read2 <- system.file("extdata","splited_002_R2.fastq.gz",
    package="Rfastp")
pe003_read2 <- system.file("extdata","splited_003_R2.fastq.gz",
    package="Rfastp")
pe004_read2 <- system.file("extdata","splited_004_R2.fastq.gz",
    package="Rfastp")
inputR2files <- c(pe001_read2, pe002_read2, pe003_read2, pe004_read2)
catfastq(output = paste0(outputPrefix,"_merged2_R2.fastq.gz"), 
    inputFiles = inputR2files)

Generate report tables/plots

A data frame for the summary.

dfsummary <- qcSummary(pe_json_report)

A ggplot2 object of the base quality plot.

p1 <- curvePlot(se_json_report)
p1

A ggplot2 object of the GC content plot.

p2 <- curvePlot(se_json_report, curves = "content_curves")
p2

A data frame for the trimming summary.

dfTrim <- trimSummary(pe_json_report)

Miscellaneous helper functions

Usage of rfastp:

?rfastp

Usage of catfastq:

?catfastq

Usage of qcSummary:

?qcSummary

Usage of trimSummary:

?trimSummary

Usage of curvePlot:

?curvePlot

Acknowledgments

Thank you to Ji-Dung Luo for testing/vignette review/critical feedback, Doug Barrows for critical feedback/vignette review and Ziwei Liang for their support.

Session info

sessionInfo()

References




Rfastp documentation built on Nov. 8, 2020, 5:52 p.m.