demultiplex: De-multiplexing fastq files
In larssnip/midiv: MiDiv-lab bioinformatics

demultiplex

R Documentation

De-multiplexing fastq files

Description

De-multiplexes Illumina data based on the extra forward barcode used by the MiDiv lab.

Usage

demultiplex(metadata.tbl, in.folder, out.folder, trim.primers = TRUE)

Arguments

`metadata.tbl`	Table with data for each sample, see Details below.
`in.folder`	Name of folder where raw fastq files are located.
`out.folder`	Name of folder to output de-multiplexed fastq files.
`trim.primers`	Logical indicating whether PCR primers should be trimmed from the start of the R1 and R2 reads.
`compress.out`	Logical indicating whether the output files should be compressed.
`pattern`	The pattern used to distinguish the raw fastq files from other files.

Details

The input metadata.tbl must be a table (tibble or data.frame) with one row for each sample. It must follow the MiDiv metadata table standard format. The columns used by this function are:

* ProjectID
* SequencingRunID
* SampleID
* Rawfile_R1
* Rawfile_R2
* Barcode
* Forward_primer
* Reverse_primer
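As a minimal sketch of such a table, the code below builds a two-sample data.frame with the eight columns listed above and checks that none is missing. All values (IDs, file names, barcodes, primer sequences) are made up for illustration, and `required.cols` and `missing.cols` are hypothetical helper names, not part of the package.

```r
# Hypothetical two-sample metadata table; every value is an example only
metadata.tbl <- data.frame(
  ProjectID       = c("ProjA", "ProjA"),
  SequencingRunID = c("Run01", "Run01"),
  SampleID        = c("1", "2"),
  Rawfile_R1      = c("raw_R1.fastq.gz", "raw_R1.fastq.gz"),
  Rawfile_R2      = c("raw_R2.fastq.gz", "raw_R2.fastq.gz"),
  Barcode         = c("ACGTAC", "TGCATG"),
  Forward_primer  = c("CCTACGGGNGGCWGCAG", "CCTACGGGNGGCWGCAG"),
  Reverse_primer  = c("GACTACHVGGGTATCTAATCC", "GACTACHVGGGTATCTAATCC"),
  stringsAsFactors = FALSE
)

# Columns the documentation says demultiplex() uses
required.cols <- c("ProjectID", "SequencingRunID", "SampleID",
                   "Rawfile_R1", "Rawfile_R2", "Barcode",
                   "Forward_primer", "Reverse_primer")

# Any required column that is absent from the table
missing.cols <- setdiff(required.cols, colnames(metadata.tbl))
if (length(missing.cols) > 0) {
  stop("Missing columns: ", paste(missing.cols, collapse = ", "))
}
```

Both samples point to the same pair of raw files here, which is the typical situation that de-multiplexing is meant to resolve.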

The ProjectID, SequencingRunID and SampleID should each be a short text (SampleID may be just an integer). The names of the de-multiplexed fastq files follow the format ProjectID_SequencingRunID_SampleID_Rx.fastq.gz, where x is 1 or 2, so avoid symbols that are not recommended in filenames (e.g. space, slash).
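The naming scheme can be sketched in a few lines of base R. This only illustrates the documented format and is not the package's own code; the ID values and the set of characters treated as safe are assumptions.

```r
# Example IDs (made up); SampleID may be a plain integer
ProjectID       <- "ProjA"
SequencingRunID <- "Run01"
SampleID        <- 7

# ProjectID_SequencingRunID_SampleID, then the _R1/_R2 extensions
base.name <- paste(ProjectID, SequencingRunID, SampleID, sep = "_")
out.files <- paste0(base.name, c("_R1", "_R2"), ".fastq.gz")

# Assumed rule of thumb: flag anything other than letters, digits,
# dot, underscore and hyphen (this catches spaces and slashes)
has.bad.symbol <- grepl("[^A-Za-z0-9._-]", base.name)
```

With these values `out.files` holds `ProjA_Run01_7_R1.fastq.gz` and `ProjA_Run01_7_R2.fastq.gz`, and `has.bad.symbol` is `FALSE`.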

De-multiplexing means extracting subsets of reads from raw fastq files, those named in columns Rawfile_R1 and Rawfile_R2 (if single-end reads, only Rawfile_R1). The subset of read-pairs for each sample is identified by a barcode sequence, and this must be listed in the Barcode column. The Barcode sequence is matched at the start of the R1-reads only.
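The selection step described above can be sketched on toy data as follows. The sketch assumes exact matching of the barcode against the start of each R1 read; whether the function tolerates mismatches is not stated here. The sequences and variable names are made up.

```r
# Toy paired reads: element i of r1.seq pairs with element i of r2.seq
r1.seq <- c("ACGTACGGGTTTAAAC", "TGCATGCCCAAATTTG",
            "ACGTACTTTTGGGGCC", "GGGGGGAAAACCCCTT")
r2.seq <- c("TTTTAAAACCCC", "GGGGCCCCAAAA",
            "CCCCGGGGTTTT", "AAAATTTTGGGG")

# Barcode of one sample, as it would appear in the Barcode column
barcode <- "ACGTAC"

# The barcode is matched at the start of the R1 reads only
is.sample <- startsWith(r1.seq, barcode)

# The same logical index selects the R2 mates, keeping pairs together
r1.sample <- r1.seq[is.sample]
r2.sample <- r2.seq[is.sample]
n.readpairs <- sum(is.sample)
```

Reads 1 and 3 start with the barcode, so two read-pairs are assigned to this sample; the R2 reads are never inspected, only carried along with their R1 mates.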

If trim.primers = TRUE, the start of each R1 read is trimmed by the length of Forward_primer, and the start of each R2 read is trimmed by the length of Reverse_primer. NOTE: There is no primer-matching here. No reads are discarded, only trimmed by primer lengths.
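A small sketch of length-based trimming, assuming the quality string of a fastq record is trimmed by the same amount as the sequence (which it must be to stay a valid record). The primer and reads are made up, and how the barcode itself is handled is not covered here.

```r
# Made-up forward primer of length 8
forward.primer <- "CCTACGGG"
n.trim <- nchar(forward.primer)

# Two toy R1 reads: the first starts with the primer, the second does not
r1.seq  <- c("CCTACGGGAAATTTCCC", "GGGGGGGGAAATTTCCC")
r1.qual <- c("IIIIIIIIHHHHHHGGG", "IIIIIIIIHHHHHHGGG")

# Trim by primer length only; the primer sequence is never compared
r1.seq.trimmed  <- substring(r1.seq,  n.trim + 1)
r1.qual.trimmed <- substring(r1.qual, n.trim + 1)
```

Both reads end up as `AAATTTCCC`, even though only the first one actually began with the primer. This is the behaviour the NOTE above warns about: nothing is matched and nothing is discarded.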

The files listed in Rawfile_R1 and Rawfile_R2 must all be in the in.folder. These files may be compressed (.gz).
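Before calling the function it may be worth checking this requirement yourself. The sketch below is one way to do that in base R; the folder, file names and the `not.found` variable are hypothetical, and a temporary folder with a single dummy file stands in for real data.

```r
# Stand-in for a real in.folder: a fresh temporary folder with one dummy file
in.folder <- file.path(tempdir(), "rawdata_demo")
dir.create(in.folder, showWarnings = FALSE)
file.create(file.path(in.folder, "raw_R1.fastq.gz"))

# Only the two file-name columns are needed for this check
metadata.tbl <- data.frame(
  Rawfile_R1 = c("raw_R1.fastq.gz", "raw_R1.fastq.gz"),
  Rawfile_R2 = c("raw_R2.fastq.gz", "raw_R2.fastq.gz"),
  stringsAsFactors = FALSE
)

# Each distinct raw file should exist inside in.folder
raw.files <- unique(c(metadata.tbl$Rawfile_R1, metadata.tbl$Rawfile_R2))
not.found <- raw.files[!file.exists(file.path(in.folder, raw.files))]
if (length(not.found) > 0) {
  message("Not found in in.folder: ", paste(not.found, collapse = ", "))
}
```

In this toy setup only the R1 file was created, so the R2 file is reported as not found.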

Value

The function will output the de-multiplexed fastq-files to the out.folder. The name of each file consists of the corresponding ProjectID_SequencingRunID_SampleID, with the extensions _R1.fastq.gz or _R2.fastq.gz.

The function returns a table with the number of read-pairs for each sample. You may then add this as a new column to the existing metadata.tbl by full_join(metadata.tbl, demultiplex.tbl, by = c("ProjectID", "SequencingRunID", "SampleID")), where demultiplex.tbl is the output from this function.
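The join can be sketched with two small stand-in tables. The `demultiplex.tbl` below is hand-made rather than real output, and the column name `n_readpairs` and the counts are assumptions made for illustration; check the actual column names in the returned table. `full_join()` is from the dplyr package.

```r
library(dplyr)

# Reduced metadata table (made-up values)
metadata.tbl <- data.frame(
  ProjectID       = c("ProjA", "ProjA"),
  SequencingRunID = c("Run01", "Run01"),
  SampleID        = c("1", "2"),
  Barcode         = c("ACGTAC", "TGCATG"),
  stringsAsFactors = FALSE
)

# Stand-in for the table returned by demultiplex();
# 'n_readpairs' is a hypothetical column name
demultiplex.tbl <- data.frame(
  ProjectID       = c("ProjA", "ProjA"),
  SequencingRunID = c("Run01", "Run01"),
  SampleID        = c("1", "2"),
  n_readpairs     = c(15234, 9876),
  stringsAsFactors = FALSE
)

# The key columns must have the same type in both tables
metadata.tbl <- full_join(metadata.tbl, demultiplex.tbl,
                          by = c("ProjectID", "SequencingRunID", "SampleID"))
```

After the join, metadata.tbl still has one row per sample, now with the read-pair count as an extra column.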