data.proc.paired: Data processing - paired ends

View source: R/data.proc.paired.R

data.proc.pairedR Documentation

Data processing - paired ends

Description

data.proc.paired is a first attempt to develop a function to process (quality checking, error and chimeras filtering) paired end reads from a NGS run after these have been deconvoluted.

Usage

data.proc.paired(
  dir.in = NULL,
  dir.out = NULL,
  bp = 0,
  truncQ = 2,
  qrep = FALSE,
  dada = TRUE,
  pool = FALSE,
  plot.err = FALSE,
  chim = TRUE,
  minOverlap = 12,
  maxMismatch = 0,
  trimOverhang = FALSE,
  orderBy = "abundance",
  paired = TRUE,
  verbose = TRUE
)

Arguments

dir.in

The directory where the fastq files are located. It needs a vector of length=2

dir.out

The directory where to save the results. If NULL (default) then dir.out <- dir.in

bp

An integer indicating the expected length (base-pairs) of the reads. If zero (default) no truncation is applied

truncQ

Truncate reads at the first instance of a quality score less than or equal to truncQ when conducting quality filtering. See fastqFilter for details

qrep

Logical. Should the quality report be generated? (default FALSE)

dada

Logical. Should the dada analysis be conducted? (default TRUE)

pool

Logical. Should samples be pooled together prior to sample inference? (default FALSE). See dada for details

plot.err

Logical. Whether error rates obtained from dada should be plotted

chim

Logical. Should the bimera search and removal be performed? (default TRUE)

minOverlap

When merging paired ends, what is the minimum of basis that should overlap

maxMismatch

When merging paired ends, the maximum number of allowed mismatches

orderBy

Character vector specifying how the returned sequence table should be sorted. Default "abundance". See makeSequenceTable for details

verbose

Logical. Whether information on progress should be outputted (default: TRUE)

trimOverhang

When merging paired ends, whether the overhanging extreme of the sequence should be trimmed off

Details

data.proc.paired locates the .fastq files (can be compressed) in the directory indicated in dir.in. Because the location of the fwd and rev reads need to be provided it will fail if this argument is missing

This is the first implementation to deal with paired-end reads and assumes that adapters, primers and indexes have been already removed and that each file represents a sample. Because of this, this function cannot be run in the wrapper raw2data.proc.

Also, it is assumed that the file names follow the convention where "R1" and "R2" identify forward and reverse reads (with the root of the file being identical).

The data.proc.paired pipeline is as follows: fastq files are read in. A filter is applied to each pair that truncates reads at the first instance of a quality score less than truncQ, remove reads that are of low quality (currently the threshold is hard-coded and reads are discarded if the expected errors is higher than 3 - from documentation in the R package dada2, the expected errors are calculated from the nominal definition of the quality score: EE = sum(10^(-Q/10)) - and remove reads that (after truncation) do not match the target length. A quality report can be (optionally) generated with the R package ShortReads to verify the quality of the reads retained after this step. Reads are then dereplicated. Optionally, the dada (Callahan et al 2015) algorithm is applied and bimeras are searched and removed with default settings of the relative functions in the package dada2. The sequences that were retained at completion of data.proc.paired are saved in fasta files in the subfolder "Final_seqs" and a .csv with a summary of the number of reads that have been retained in each step is also written. These two outputs are also returned at the end of the function.

The sequence data handling is done by using functionalities from the packages dada2 and ShortRead, so make sure to cite them (in addition to amplicR of course!) if you report your results in a paper or report.

Value

Return a list with several elements:

  • $luniseqsFinal: A list with unique sequences (names) that were retained at completion of data.proc and their abundance (values).

  • $lsummary: A list where each element is a summary of the number of reads that were retained in each step. This can be converted in a data.frame by running the following

    summary <- plyr::join_all(lsummary, by="Sample", type="left")

  • $stable: The sequence table

  • $seq_list: The sequences and matching sequence IDs

  • $call: The function call

Several files are also returned, these include:

  • Seq_table.csv Sequence table

  • Seq_list.csv List of sequences and their matching IDs

  • data.proc.summary.csv A summary of the number of reads that were retained in each step

  • data.proc.rda R data file containing the list returned by data.proc (see above)

References

Benjamin J Callahan, Paul J McMurdie, Michael J Rosen, Andrew W Han, Amy J Johnson, Susan P Holmes (2015). DADA2: High resolution sample inference from amplicon data.

See Also

dada, makeSequenceTable

Examples

# Select the directory where the example data are stored
example.data <- system.file("extdata", "HTJ", package="amplicR")
# Select a temporary directory where to store the outputs
out <- tempdir()

HTJ.test <- data.proc(example.data, out, bp=140)

# To clean up the temp directory
unlink(out, recursive=TRUE)

carlopacioni/amplicR documentation built on Aug. 19, 2023, 7:59 p.m.