raw2data.proc: From raw data to data.proc()

View source: R/clean.deconv.R

Description

This function is a wrapper for rmEndAdapter, deconv and data.proc. It takes a raw fastq file, removes the end adapter (if rmEnd = TRUE) and separates the reads based on their forward primers. Within each identified group, it then separates the reads based on their barcodes (indexes) and finally calls data.proc to process (quality checking, denoising and chimera filtering) the data retained from the NGS run.

Usage

raw2data.proc(
  fn,
  nRead = 1e+08,
  rmEnd = FALSE,
  EndAdapter = "P7_last10",
  adapter.mismatch = 0,
  info.file,
  sample.IDs = "Sample_IDs",
  Fprimer = "F_Primer",
  Rprimer = "R_Primer",
  primer.mismatch = 0,
  Find = "F_ind",
  Rind = "R_ind",
  index.mismatch = 0,
  gene = "Gene",
  amplic.size = "Amplicon",
  truncQ = 2,
  qrep = FALSE,
  dada = TRUE,
  pool = FALSE,
  plot.err = FALSE,
  chim = TRUE,
  orderBy = "abundance"
)

Arguments

fn

Fully qualified name (i.e. the complete path) of the fastq file

nRead

The number of bytes or characters to be read at one time. See FastqStreamer for details

rmEnd

Logical. Should rmEndAdapter be run? (default FALSE)

EndAdapter

A character vector with the sequence of the end adapter, "P7" or "P7_last10" (See details)

adapter.mismatch

The maximum number of mismatches allowed when matching the end adapter (see Details)

info.file

Fully qualified name (i.e. the complete path) of the CSV file with the information needed on primers, indexes etc. (See details)

sample.IDs

A character vector with the name of the column in info.file containing the sample IDs

Fprimer, Rprimer

Character vectors with the names of the columns in info.file containing the forward and reverse primer sequences, respectively

primer.mismatch

The maximum number of mismatches allowed in the primers

Find, Rind

Character vectors with the names of the columns in info.file containing the forward and reverse index sequences, respectively

index.mismatch

The maximum number of mismatches allowed in the indexes

gene

A character vector with the name of the column in info.file containing the name of the gene or other group identifiers (see Details)

amplic.size

A character vector with the name of the column in info.file containing the amplicon size of the PCR product

truncQ

Truncate reads at the first instance of a quality score less than or equal to truncQ when conducting quality filtering. See fastqFilter for details

qrep

Logical. Should the quality report be generated? (default FALSE)

dada

Logical. Should the dada analysis be conducted? (default TRUE)

pool

Logical. Should samples be pooled together prior to sample inference? (default FALSE). See dada for details

plot.err

Logical. Should the error rates obtained from dada be plotted? (default FALSE)

chim

Logical. Should the bimera search and removal be performed? (default TRUE)

orderBy

Character vector specifying how the returned sequence table should be sorted. Default "abundance". See makeSequenceTable for details

Details

Note that the amplicon size for data.proc is obtained from the comma-delimited file info.file, from the column whose heading is given in amplic.size. Zeros can be used in this column if no truncation is wanted. For each entry in the column indicated by the argument gene, the function uses the first entry found in amplic.size for the relevant gene. (If the same gene identifier is used for multiple forward primers, refer to the documentation for deconv to see how multiple PCR products can be grouped together using the gene column.) Note that, within each gene, raw2data.proc uses the same amplicon length. To use different amplicon sizes within a gene, run the three functions (rmEndAdapter, deconv and data.proc) manually rather than through raw2data.proc.
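The "first entry per gene" rule described above can be illustrated with a minimal base R sketch. This is not the package's own code: the table, sample IDs and gene names are hypothetical, and only the column names (the defaults shown in Usage) come from this page.

```r
# Hypothetical info table, using the default column names from Usage.
info <- data.frame(
  Sample_IDs = c("S1", "S2", "S3", "S4"),
  Gene       = c("geneA", "geneA", "geneB", "geneB"),
  Amplicon   = c(250, 300, 0, 0),   # 0 = no truncation wanted
  stringsAsFactors = FALSE
)

# Keep the first row seen for each gene, then read its amplicon size.
# Assumption: this mimics the rule stated in Details, not amplicR internals.
first.rows <- info[!duplicated(info$Gene), ]
amplic.by.gene <- setNames(first.rows$Amplicon, first.rows$Gene)
amplic.by.gene
```

In this sketch the second geneA entry (300) is ignored, which is why rows that share a gene identifier should carry the same amplicon size.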

By default, dir.out is set to the directory containing the input file, and verbose = FALSE.

Please see the documentation for each function for more information.

Value

A list whose elements are the outputs of data.proc, one for each PCR product

In addition to the output files described in the documentation for rmEndAdapter, deconv and data.proc, a text file named "summary_nReads.txt" is saved in the same location as the raw data, summarising the number of reads retained at each step of the analysis
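A possible call might look like the sketch below. It assumes that amplicR is installed and that info.file needs only the columns named in Usage. The sample sheet, the primer and index sequences, and the fastq path are hypothetical placeholders, so the call is guarded and is skipped when the package or the file is missing.

```r
# Hypothetical sample sheet; sequences are placeholders, not real primers or indexes.
info <- data.frame(
  Sample_IDs = c("S1", "S2"),
  F_Primer   = c("ACGTACGTACGT", "ACGTACGTACGT"),
  R_Primer   = c("TGCATGCATGCA", "TGCATGCATGCA"),
  F_ind      = c("AAAAAA", "CCCCCC"),
  R_ind      = c("GGGGGG", "TTTTTT"),
  Gene       = c("geneA", "geneA"),
  Amplicon   = c(250, 250),
  stringsAsFactors = FALSE
)
info.file <- file.path(tempdir(), "info.csv")
write.csv(info, info.file, row.names = FALSE)

# Hypothetical path to the raw fastq file from the NGS run.
fn <- file.path(tempdir(), "run.fastq")

# Run the wrapper with all other arguments left at their defaults.
if (requireNamespace("amplicR", quietly = TRUE) && file.exists(fn)) {
  res <- amplicR::raw2data.proc(fn = fn, info.file = info.file)
  str(res, max.level = 1)  # one element per PCR product
}
```

Because the column names above match the defaults, sample.IDs, Fprimer, Rprimer, Find, Rind, gene and amplic.size do not need to be passed explicitly.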


carlopacioni/amplicR documentation built on Aug. 19, 2023, 7:59 p.m.