deconv: Separate reads by genes and deconvolute them based on...

View source: R/clean.deconv.R

deconv R Documentation

Separate reads by genes and deconvolute them based on barcodes

Description

deconv takes a fastq file and searches for the forward primer, which it uses to separate the reads. That is, different PCR products ('genes') are separated based on the forward primer. Within each gene, reads are then separated based on the forward and/or reverse index (if present). The end products are several fastq files, one for each sample, from which primers and indexes have been removed. They are saved in as many folders as there are gene identifiers in info.file.

Usage

deconv(
  fn,
  nRead = 1e+08,
  info.file,
  sample.IDs = "Sample_IDs",
  Fprimer = "F_Primer",
  Rprimer = "R_Primer",
  primer.mismatch = 0,
  Find = "F_ind",
  Rind = "R_ind",
  index.mismatch = 0,
  gene = "Gene",
  dir.out = NULL,
  verbose = FALSE
)

Arguments

fn

Fully qualified name (i.e. the complete path) of the fastq file

nRead

The number of bytes or characters to be read at one time. See FastqStreamer for details

info.file

Fully qualified name (i.e. the complete path) of the CSV file with the information needed on primers, indexes etc. (See details)

sample.IDs

A character vector with the name of the column in info.file containing the sample IDs

Fprimer, Rprimer

A character vector with the name of the column in info.file containing the forward and reverse primer sequence, respectively

primer.mismatch

The maximum number of mismatches allowed in the search for primers

Find, Rind

A character vector with the name of the column in info.file containing the forward and reverse index sequence respectively

index.mismatch

The maximum number of mismatches allowed in the search for indexes

gene

A character vector with the name of the column in info.file containing the name of the gene or other group identifiers (see details)

dir.out

The directory where the results are saved. If NULL (default), it is set to the location of the input data

verbose

Whether to print information on hits (default: FALSE)

Details

If a search for indexes is conducted, this function applies only to reads with in-line indexes. That is, where the architecture of the reads is as follows:

F_index—F_primer—Target_sequence—R_primer—R_index

Note that the P7 adapter can be removed with rmEndAdapter, although, because deconv scans the whole length of the reads, removing the end adapter beforehand is not compulsory.

It is possible to control the number of mismatches. IUPAC ambiguity codes can be used only in the search for primers: that search is conducted with fixed=FALSE, which means (from Biostrings) "an IUPAC ambiguity code in the pattern can match any letter in the subject that is associated with the code, and vice versa". Indexes are searched with fixed=TRUE.
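As a rough illustration of what matching with ambiguity codes means, the base R sketch below expands a primer into a regular expression. It is not how deconv performs the search (which relies on Biostrings), it covers only ambiguity codes in the pattern and not the "vice versa" case, and the helper iupac_to_regex, the primer and the reads are all made up for this example.

```r
# Hypothetical helper (not part of amplicR): expand IUPAC ambiguity codes
# in a primer into a regular expression.
iupac_to_regex <- function(primer) {
  codes <- c(A = "A", C = "C", G = "G", T = "T",
             R = "[AG]", Y = "[CT]", S = "[CG]", W = "[AT]",
             K = "[GT]", M = "[AC]", B = "[CGT]", D = "[AGT]",
             H = "[ACT]", V = "[ACG]", N = "[ACGT]")
  bases <- strsplit(toupper(primer), "")[[1]]
  paste(codes[bases], collapse = "")
}

primer <- "GGWACWGG"          # made-up primer with two ambiguity codes (W = A or T)
reads  <- c("TTGGAACTGGCC",   # both W positions are satisfied (A, then T)
            "TTGGCACTGGCC")   # first W position holds a C: no match

pattern      <- iupac_to_regex(primer)
primer_found <- grepl(pattern, reads)
```

Here primer_found flags only the first read, because the second has a base at an ambiguous position that the code W does not allow.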

Information about the reads is passed with a comma-separated file (CSV), whose path and name are passed with info.file. This must contain a column for each of the following: the forward index, the reverse index, the forward primer, the reverse primer, the sample IDs, and an identifier of the PCR product being amplified, typically the gene's name. The headings of the columns where this information is stored in info.file are passed with the function arguments. Users can include other columns to record additional information, but these are ignored. It is not mandatory to have both the forward and the reverse index, but if one is not used, info.file must still include a blank column for it and the column heading must still be indicated. Note that, when importing info.file, R automatically converts illegal characters (e.g. spaces, parentheses) to dots ('.'), so it is safer to use only alphanumeric characters, dots and underscores ('_') in the column headings and in the corresponding function arguments.
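A minimal sketch of how such a file could be prepared in base R is given below. The sample names, sequences and file names are invented for illustration; only the column headings are taken from the defaults listed under Usage. The second part shows the heading conversion described above, assuming the file is imported with read.csv and its default settings.

```r
# Made-up sample sheet that uses the default column headings of deconv
info <- data.frame(
  Sample_IDs = c("S1", "S2"),
  F_ind      = c("ACGTAC", "TGCATG"),
  F_Primer   = c("GGWACWGG", "GGWACWGG"),
  R_Primer   = c("TANACYTC", "TANACYTC"),
  R_ind      = c("", ""),        # reverse index not used: keep a blank column
  Gene       = c("COI", "COI"),
  stringsAsFactors = FALSE
)
info_path <- file.path(tempdir(), "info.csv")
write.csv(info, info_path, row.names = FALSE)
clean_names <- names(read.csv(info_path))

# Headings containing spaces are altered when the file is imported
bad_path <- file.path(tempdir(), "info_bad_headings.csv")
writeLines(c("Sample IDs,F primer", "S1,GGWACWGG"), bad_path)
imported_names <- names(read.csv(bad_path))   # spaces have become dots
```

Because the headings in this sketch match the defaults, info_path could be passed as info.file without setting sample.IDs, Fprimer, Rprimer, Find, Rind or gene.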

All sequences for indexes and primers are passed (as character vector) in 5' to 3' direction and are internally reversed and complemented when necessary.
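For readers unfamiliar with the operation, the following base R sketch shows what reversing and complementing a sequence involves, including IUPAC ambiguity codes. The function revcomp is a hypothetical stand-in written for this example, not the routine deconv uses internally.

```r
# Hypothetical helper: reverse complement of DNA strings in base R.
# Ambiguity codes are complemented too (R<->Y, K<->M, B<->V, D<->H);
# S, W and N are their own complements and are left unchanged by chartr.
revcomp <- function(x) {
  complemented <- chartr("ACGTRYKMBDHV", "TGCAYRMKVHDB", toupper(x))
  vapply(strsplit(complemented, ""),
         function(bases) paste(rev(bases), collapse = ""),
         character(1))
}

rc_plain     <- revcomp("ATGC")       # plain bases
rc_ambiguous <- revcomp("TANACYTC")   # made-up primer with N and Y
```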

deconv initially searches for the forward primer and separates the reads, creating (if it does not already exist) a folder named after the value in the column gene, which is typically an identifier of the targeted gene. The column gene can be used to group PCR products in logical ways other than by gene, but all rows with the same forward primer should carry the same gene value. This is because, for efficiency, deconv uses only the first row for each unique forward primer to decide where the processed data are saved; if several gene values are given for the same forward primer, all but the first are ignored.
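The consequence of this rule can be previewed on a sample sheet before running the function. The sketch below uses duplicated() on made-up data to list the gene value found on the first row of each unique forward primer. It mimics the behaviour described above and is not the package's own code.

```r
# Made-up sample sheet where one forward primer carries two gene labels
info <- data.frame(
  Sample_IDs = c("S1", "S2", "S3"),
  F_Primer   = c("GGWACWGG", "GGWACWGG", "CCTACGGG"),
  Gene       = c("COI", "COI_batch2", "16S"),
  stringsAsFactors = FALSE
)

# Keep the first row for each unique forward primer
first_rows <- info[!duplicated(info$F_Primer), c("F_Primer", "Gene")]
folders    <- first_rows$Gene

# Gene labels that would not give rise to a folder of their own
ignored <- setdiff(info$Gene, folders)
```

In this example the label "COI_batch2" ends up in ignored, which signals that the sample sheet should be corrected before processing.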

After reads are separated based on the forward primer, indexes and primers are removed and processed samples are written to fastq files. Only reads where both primers (forward and reverse) and indexes (if there is information for both in info.file) were found are retained.
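The retention rule can be illustrated with a toy read built according to the architecture shown earlier. The sketch assumes that the reverse primer and reverse index appear on the read as reverse complements, and it uses exact matching throughout for simplicity, whereas deconv supports ambiguity codes and mismatches. All sequences and the helpers revcomp and strip_read are hypothetical.

```r
# Hypothetical helper: reverse complement of a single plain DNA string
revcomp <- function(x) {
  paste(rev(strsplit(chartr("ACGT", "TGCA", x), "")[[1]]), collapse = "")
}

# Made-up index and primer sequences, all written 5' to 3'
f_ind    <- "ACGTAC"
f_primer <- "GGAACTGG"
r_primer <- "TAGACTTC"
r_ind    <- "TGCATG"
target   <- "AAACCCGGGTTT"

# F_index, F_primer, target, then reverse primer and index as they appear on the read
complete_read  <- paste0(f_ind, f_primer, target, revcomp(r_primer), revcomp(r_ind))
truncated_read <- paste0(f_ind, f_primer, target)   # reverse primer and index missing

strip_read <- function(read) {
  patterns <- c(f_ind, f_primer, revcomp(r_primer), revcomp(r_ind))
  hits <- vapply(patterns,
                 function(p) regexpr(p, read, fixed = TRUE)[1],
                 integer(1), USE.NAMES = FALSE)
  if (any(hits == -1)) return(NA_character_)   # a pattern is missing: read discarded
  # keep what lies between the forward primer and the reverse primer
  substr(read, hits[2] + nchar(f_primer), hits[3] - 1)
}

kept      <- strip_read(complete_read)
discarded <- strip_read(truncated_read)
```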

If verbose=TRUE, a warning is reported if reads have multiple hits in the search for the pattern (indexes or primers). When there are multiple hits, the most external one is used if no mismatches are allowed. If mismatches are allowed, the match with the lowest edit distance (calculated using srdistance) between the pattern and the match is used (e.g. if there is a match with zero mismatches and a match with one mismatch, the match with zero mismatches is used). When there are multiple matches with the same lowest edit distance, the most external one is used. Because measuring the edit distance is computationally demanding, allowing mismatches may slow down data processing several-fold.
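The selection among multiple hits can be sketched in base R as follows. This is an approximation under stated assumptions: utils::adist stands in for srdistance, candidate matches are limited to windows of the same length as the pattern, and "most external" is taken to mean closest to the 5' end of the read for a forward pattern. The pattern and read are made up.

```r
pattern      <- "ACGTAC"               # made-up index
read         <- "ACGAACTTTTACGTACGG"   # contains one inexact and one exact copy
max_mismatch <- 1

# Edit distance between the pattern and every window of the same length
starts  <- seq_len(nchar(read) - nchar(pattern) + 1)
windows <- substring(read, starts, starts + nchar(pattern) - 1)
dists   <- as.vector(utils::adist(pattern, windows))

# Candidate hits within the allowed number of mismatches
hits <- which(dists <= max_mismatch)

# Prefer the lowest edit distance, then the most external position
# (assumption: smallest start position)
best   <- hits[dists[hits] == min(dists[hits])]
chosen <- min(best)
```

Although the inexact copy at the start of the read is more external, the exact copy further along is chosen because its edit distance is lower.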

Value

Fastq files with the reads that were retained after removing the indexes (with the suffix "_IndRm") and after removing the primers (with the suffix "_Ind_primerRm"), where the end adapter was found (and removed), are saved. A list with the total number of reads that were processed and retained is also returned.

When relevant, a text file with the IDs of the sequences that had multiple hits in the preliminary search for the forward primer is also saved.


carlopacioni/amplicR documentation built on Aug. 19, 2023, 7:59 p.m.