deconv | R Documentation |
deconv takes a fastq file and searches for the forward primer, which it uses
to separate the reads. That is, different PCR products ('genes') are
separated based on the forward primer. Within each gene, reads are then
separated based on the forward and/or reverse index (if present). The end
products are several fastq files, one for each sample, from which primers and
indexes have been removed. These are saved in as many folders as there are
gene identifiers in info.file.
deconv(
fn,
nRead = 1e+08,
info.file,
sample.IDs = "Sample_IDs",
Fprimer = "F_Primer",
Rprimer = "R_Primer",
primer.mismatch = 0,
Find = "F_ind",
Rind = "R_ind",
index.mismatch = 0,
gene = "Gene",
dir.out = NULL,
verbose = FALSE
)
fn | Fully qualified name (i.e. the complete path) of the fastq file |
nRead | The number of bytes or characters to be read at one time. See |
info.file | Fully qualified name (i.e. the complete path) of the CSV file with the information needed on primers, indexes etc. (see details) |
sample.IDs | A character vector with the name of the column in info.file containing the sample IDs |
Fprimer, Rprimer | Character vectors with the names of the columns in info.file containing the forward and reverse primer sequences, respectively |
primer.mismatch | The maximum number of mismatches allowed when searching for the primers |
Find, Rind | Character vectors with the names of the columns in info.file containing the forward and reverse index sequences, respectively |
index.mismatch | The maximum number of mismatches allowed when searching for the indexes |
gene | A character vector with the name of the column in info.file containing the name of the gene or other group identifiers (see details) |
dir.out | The directory where the results are saved. If NULL (default), it is set to the location of the input data |
verbose | Whether to print out information on hits (default: FALSE) |
If a search for indexes is conducted, this function applies only to reads with in-line indexes. That is, where the architecture of the reads is as follows:
F_index—F_primer—Target_sequence—R_primer—R_index
Note that the P7 adapter can be removed with rmEndAdapter, although, because
deconv scans the whole length of the reads, removing the end adapter
beforehand is not compulsory.
It is possible to control the number of mismatches. IUPAC ambiguity codes can
be used only in the search for primers: the search for the primers is
conducted with fixed=FALSE, which means (from Biostrings): "an IUPAC
ambiguity code in the pattern can match any letter in the subject that is
associated with the code, and vice versa". Indexes, in contrast, are searched
with fixed=TRUE.
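The difference between the two search modes can be illustrated with a minimal base R sketch. This is not how deconv or Biostrings perform the search: the helper primer_to_regex, the primer and the reads are hypothetical, mismatches are not handled, and only the direction "ambiguity code in the pattern matches a letter in the subject" is covered.

```r
# IUPAC nucleotide codes expressed as regular expression character classes
iupac <- c(A = "A", C = "C", G = "G", T = "T",
           R = "[AG]", Y = "[CT]", S = "[CG]", W = "[AT]",
           K = "[GT]", M = "[AC]", B = "[CGT]", D = "[AGT]",
           H = "[ACT]", V = "[ACG]", N = "[ACGT]")

# Hypothetical helper (an assumption, not part of the documented package):
# translate a primer that may contain ambiguity codes into a regex
primer_to_regex <- function(primer) {
  codes <- strsplit(toupper(primer), "")[[1]]
  stopifnot(all(codes %in% names(iupac)))
  paste(iupac[codes], collapse = "")
}

# Made-up reads and primer; Y stands for C or T
reads  <- c("ACGTTGACCTAGGA", "ACGTTGATCTAGGA", "ACGTTGAGCTAGGA")
primer <- "TGAYCT"

# Ambiguity-aware search (similar in spirit to fixed=FALSE)
ambiguous_hits <- grepl(primer_to_regex(primer), reads)

# Literal search (similar in spirit to fixed=TRUE): "Y" only matches "Y"
literal_hits <- grepl(primer, reads, fixed = TRUE)

print(ambiguous_hits)
print(literal_hits)
```

In this sketch the first two reads are found by the ambiguity-aware search, while the literal search finds none, which is why ambiguity codes are only useful in the primer columns.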
Information about the reads is passed with a comma separated file (CSV),
whose path and name are passed with info.file. This must contain a column for
each of: the forward index, the reverse index, the forward primer, the
reverse primer, the sample IDs, and an identifier of the PCR product being
amplified, typically the gene's name. The headings of the columns where this
information is stored in info.file are passed with the function arguments.
Users can include other columns to record additional information, but these
are ignored. It is not mandatory to use both the forward and the reverse
index, but if one is not used, info.file must still include a blank column
for it, and its column heading must still be indicated. Note that, when
importing info.file, R automatically converts illegal characters (e.g.
spaces, parentheses) to dots ('.'), so it is probably safer to use only
alphanumeric characters, dots and/or underscores ('_') in the column headings
and in the corresponding function arguments.
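As an illustration, a file with the layout described above could be prepared as in the following sketch. Only the column headings (the defaults of the function arguments) come from this page; the sample IDs, sequences, gene name and file names are made up. The second part shows the heading conversion, assuming the file is imported with read.csv and its default settings.

```r
# Hypothetical content for info.file; headings match the argument defaults
info <- data.frame(
  Sample_IDs = c("S1", "S2"),
  F_Primer   = c("TGAYCT", "TGAYCT"),
  R_Primer   = c("GGRTAC", "GGRTAC"),
  F_ind      = c("ACGT", "TGCA"),
  R_ind      = c("", ""),          # reverse index not used: blank column
  Gene       = c("geneA", "geneA"),
  stringsAsFactors = FALSE
)
info_path <- file.path(tempdir(), "info.csv")
write.csv(info, info_path, row.names = FALSE)

# Headings made of alphanumeric characters and underscores survive the import
good_names <- names(read.csv(info_path))
print(good_names)

# Headings with spaces or parentheses are altered on import
bad_path <- file.path(tempdir(), "info_bad_headings.csv")
writeLines(c("Sample IDs,F ind (i5)", "S1,ACGT"), bad_path)
bad_names <- names(read.csv(bad_path))
print(bad_names)
```

With headings such as those in the second file, the names passed to sample.IDs or Find would no longer match the imported column names.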
All index and primer sequences are passed (as character vectors) in the 5' to 3' direction and are internally reversed and complemented when necessary.
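For readers unfamiliar with the operation, reversing and complementing a sequence can be sketched in base R as follows. The helper rev_comp and the example primer are hypothetical, and this is not necessarily how deconv does it internally.

```r
# Hypothetical helper: reverse complement of DNA strings, including IUPAC
# ambiguity codes (S, W and N are their own complement and are left unchanged)
rev_comp <- function(x) {
  complemented <- chartr("ACGTRYKMBDHV", "TGCAYRMKVHDB", toupper(x))
  vapply(strsplit(complemented, ""),
         function(chars) paste(rev(chars), collapse = ""),
         character(1))
}

r_primer <- "GGRTAC"          # made-up primer, written 5' to 3'
rc <- rev_comp(r_primer)      # the form expected on the opposite strand
print(rc)
```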
deconv initially searches for the forward primer and separates the reads,
creating (if it does not already exist) a folder named after the relevant
value in the column gene, which is typically an identifier of the targeted
gene. It is possible to use the column gene to group PCR products in some
logical way other than by gene, but all rows with an identical forward primer
should have the same gene value. This is because, for efficiency, deconv uses
only the first row for each unique forward primer to identify where the
processed data should be saved. If multiple gene codes are used for the same
forward primer, all but the first are ignored.
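The consequence of this rule can be checked on info.file before running the function. The sketch below uses a made-up table with the default column headings. It keeps the first row for each unique forward primer and lists primers that carry more than one gene code; it mimics the behaviour described above and is not the function's actual code.

```r
# Made-up table: the first primer appears with two different gene codes
info <- data.frame(
  F_Primer = c("TGAYCT", "TGAYCT", "CCAGGT"),
  Gene     = c("geneA", "geneB", "geneC"),
  stringsAsFactors = FALSE
)

# First row for each unique forward primer: "geneB" never becomes a folder
first_rows <- info[!duplicated(info$F_Primer), ]
print(first_rows)

# Forward primers associated with more than one gene code
genes_per_primer <- vapply(split(info$Gene, info$F_Primer),
                           function(g) length(unique(g)),
                           integer(1))
conflicting <- names(genes_per_primer)[genes_per_primer > 1]
print(conflicting)
```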
After reads are separated based on the forward primer, indexes and primers
are removed and the processed samples are written to fastq files. Only reads
in which both primers (forward and reverse) and both indexes (if there is
information for both in info.file) were found are retained.
If verbose=TRUE, a warning is reported if reads have multiple hits in the
search for the pattern (indexes or primers). When there are multiple hits,
the most external one is used if no mismatches are allowed. If mismatches are
allowed, then the match with the lowest edit distance (calculated using
srdistance) between the pattern and the match is used (e.g. if there is a
match with zero mismatches and a match with one mismatch, the match with zero
mismatches is used). When there are multiple matches with the same lowest
edit distance, the most external one is used. Because measuring the edit
distance is computationally demanding, allowing mismatches may slow down data
processing several fold.
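The selection rule can be sketched in base R as follows. Several things here are assumptions made for illustration: the helper pick_hit and the example data are hypothetical, adist from base R stands in for srdistance, and "most external" is taken to mean the hit closest to the 5' end of the read, which applies to forward-side patterns only.

```r
# Hypothetical helper: index of the hit to keep, given the start position of
# each candidate hit and its edit distance from the pattern
pick_hit <- function(starts, distances) {
  best <- which(distances == min(distances))   # lowest edit distance
  best[which.min(starts[best])]                # tie: smallest start position
}

pattern    <- "TGACCT"
candidates <- c("TGATCT", "TGACCT", "TGACCT")  # made-up matched substrings
starts     <- c(3, 20, 41)                     # made-up start positions

# Edit distance of each candidate from the pattern (adist is a stand-in)
distances <- drop(adist(candidates, pattern))

chosen <- pick_hit(starts, distances)
print(starts[chosen])
```

Here the first hit is the most external but has one mismatch, so it is passed over in favour of the outermost of the two exact matches.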
Fastq files with the reads that were retained after removing the indexes (with the suffix "_IndRm") and after removing the primers (with the suffix "_Ind_primerRm") are saved. A list with the total number of reads that were processed and retained is also returned.
When relevant, a text file with the IDs of the sequences that had multiple hits in the preliminary search for the forward primer is also saved.