View source: R/data.proc.paired.R
data.proc.paired | R Documentation |
data.proc.paired
is a first attempt to develop a function to process
(quality checking, error and chimeras filtering) paired end reads from a NGS
run after these have been deconvoluted.
data.proc.paired(
dir.in = NULL,
dir.out = NULL,
bp = 0,
truncQ = 2,
qrep = FALSE,
dada = TRUE,
pool = FALSE,
plot.err = FALSE,
chim = TRUE,
minOverlap = 12,
maxMismatch = 0,
trimOverhang = FALSE,
orderBy = "abundance",
paired = TRUE,
verbose = TRUE
)
dir.in |
The directory where the fastq files are located. It needs a vector of length=2 |
dir.out |
The directory where to save the results. If NULL (default)
then |
bp |
An integer indicating the expected length (base-pairs) of the reads. If zero (default) no truncation is applied |
truncQ |
Truncate reads at the first instance of a quality score less
than or equal to truncQ when conducting quality filtering. See
|
qrep |
Logical. Should the quality report be generated? (default
|
dada |
Logical. Should the dada analysis be conducted? (default
|
pool |
Logical. Should samples be pooled together prior to sample
inference? (default |
plot.err |
Logical. Whether error rates obtained from |
chim |
Logical. Should the bimera search and removal be performed?
(default |
minOverlap |
When merging paired ends, what is the minimum of basis that should overlap |
maxMismatch |
When merging paired ends, the maximum number of allowed mismatches |
orderBy |
Character vector specifying how the returned sequence table
should be sorted. Default "abundance". See
|
verbose |
Logical. Whether information on progress should be outputted (default: TRUE) |
trimOverhang |
When merging paired ends, whether the overhanging extreme of the sequence should be trimmed off |
data.proc.paired
locates the .fastq files (can be compressed) in the
directory indicated in dir.in
. Because the location of the fwd and rev
reads need to be provided it will fail if this argument is missing
This is the first implementation to deal with paired-end reads and assumes
that adapters, primers
and indexes have been already removed and that each file represents a sample.
Because of this, this function cannot be run in the wrapper raw2data.proc
.
Also, it is assumed that the file names follow the convention where "R1" and "R2" identify forward and reverse reads (with the root of the file being identical).
The data.proc.paired
pipeline is as follows: fastq files are read in. A
filter is applied to each pair that truncates reads at the first instance of
a quality score
less than truncQ
, remove reads that are of low quality (currently the
threshold is hard-coded and reads are discarded if the expected errors is
higher than 3 - from documentation in the R package dada2
, the
expected errors are calculated from the nominal definition of the quality
score: EE = sum(10^(-Q/10)) - and remove reads that (after truncation) do not
match the target length. A quality report can be (optionally) generated with the
R package ShortReads
to verify the quality of the reads retained
after this step. Reads are then dereplicated. Optionally, the dada (Callahan
et al 2015) algorithm is applied and bimeras are searched and removed with
default settings of the relative functions in the package dada2
. The
sequences that were retained at completion of data.proc.paired
are saved in
fasta files in the subfolder "Final_seqs" and a .csv with a summary of the
number of reads that have been retained in each step is also written. These
two outputs are also returned at the end of the function.
The sequence data handling is done by using functionalities from the packages
dada2
and ShortRead
, so make sure to cite them (in addition to
amplicR
of course!) if you report your results in a paper or report.
Return a list with several elements:
$luniseqsFinal: A list with unique sequences (names) that
were retained at completion of data.proc
and their abundance
(values).
$lsummary: A list where each element is a summary of the
number of reads that were retained in each step. This can be converted in a
data.frame
by running the following
summary <- plyr::join_all(lsummary, by="Sample", type="left")
$stable: The sequence table
$seq_list: The sequences and matching sequence IDs
$call: The function call
Several files are also returned, these include:
Seq_table.csv Sequence table
Seq_list.csv List of sequences and their matching IDs
data.proc.summary.csv A summary of the number of reads that were retained in each step
data.proc.rda R data file containing the list returned by data.proc (see above)
Benjamin J Callahan, Paul J McMurdie, Michael J Rosen, Andrew W Han, Amy J Johnson, Susan P Holmes (2015). DADA2: High resolution sample inference from amplicon data.
dada
, makeSequenceTable
# Select the directory where the example data are stored
example.data <- system.file("extdata", "HTJ", package="amplicR")
# Select a temporary directory where to store the outputs
out <- tempdir()
HTJ.test <- data.proc(example.data, out, bp=140)
# To clean up the temp directory
unlink(out, recursive=TRUE)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.