demultiplex: De-multiplexing fastq files

View source: R/demultiplex.R

demultiplex R Documentation

De-multiplexing fastq files

Description

De-multiplexing Illumina data based on the extra forward barcode used by the MiDiv lab.

Usage

demultiplex(metadata.tbl, in.folder, out.folder, trim.primers = TRUE)

Arguments

metadata.tbl

Table with data for each sample, see Details below.

in.folder

Name of folder where raw fastq files are located.

out.folder

Name of folder to output de-multiplexed fastq files.

trim.primers

Logical indicating whether PCR primers should be trimmed from the start of the R1 and R2 reads.

compress.out

Logical indicating whether the output files should be compressed.

pattern

The pattern used to distinguish the raw fastq files from other files.

Details

The input metadata.tbl must be a table (tibble or data.frame) with one row for each sample. It must follow the MiDiv metadata table standard format. The columns used by this function are:

* ProjectID
* SequencingRunID
* SampleID
* Rawfile_R1
* Rawfile_R2
* Barcode
* Forward_primer
* Reverse_primer
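As a minimal sketch of what such a table could look like, the example below builds a two-sample data.frame with the columns listed above and checks that none are missing. All values (IDs, file names, barcodes, primer sequences) are invented for illustration, and the column check is only a suggested precaution, not part of the function itself.

```r
# Hypothetical metadata table; every value below is made up for illustration.
metadata.tbl <- data.frame(
  ProjectID       = c("ProjA", "ProjA"),
  SequencingRunID = c("Run01", "Run01"),
  SampleID        = c("1", "2"),
  Rawfile_R1      = c("raw_R1.fastq.gz", "raw_R1.fastq.gz"),
  Rawfile_R2      = c("raw_R2.fastq.gz", "raw_R2.fastq.gz"),
  Barcode         = c("ACGTAC", "TGCATG"),
  Forward_primer  = c("AAAACCCCGGGGTTTT", "AAAACCCCGGGGTTTT"),
  Reverse_primer  = c("TTTTGGGGCCCCAAAA", "TTTTGGGGCCCCAAAA"),
  stringsAsFactors = FALSE
)

# Columns used by demultiplex(), as listed in the Details section.
required <- c("ProjectID", "SequencingRunID", "SampleID",
              "Rawfile_R1", "Rawfile_R2",
              "Barcode", "Forward_primer", "Reverse_primer")

# Report any required column that is absent from the table.
missing.cols <- setdiff(required, colnames(metadata.tbl))
if (length(missing.cols) > 0) {
  stop("Missing columns: ", paste(missing.cols, collapse = ", "))
}
```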

The ProjectID, SequencingRunID and SampleID should each be a short text (SampleID may be just an integer). The names of the de-multiplexed fastq files will follow the format ProjectID_SequencingRunID_SampleID_Rx.fastq.gz, where x is 1 or 2, so avoid symbols that are not recommended in filenames (e.g. space, slash).
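The naming format can be illustrated with a few lines of base R. The ID values are hypothetical, and the rule used to flag problematic symbols (anything other than letters, digits, dot, dash and underscore) is an assumption chosen for this sketch, not a rule taken from the package.

```r
# Hypothetical ID values, for illustration only.
ProjectID       <- "ProjA"
SequencingRunID <- "Run01"
SampleID        <- 7

# Build the file name stem and the two output file names.
stem      <- paste(ProjectID, SequencingRunID, SampleID, sep = "_")
out.files <- paste0(stem, "_R", 1:2, ".fastq.gz")
print(out.files)

# Assumed rule: flag anything other than letters, digits, dot, dash, underscore.
has.bad.symbol <- grepl("[^A-Za-z0-9._-]", stem)
```

With these values, out.files holds ProjA_Run01_7_R1.fastq.gz and ProjA_Run01_7_R2.fastq.gz, and has.bad.symbol is FALSE.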

De-multiplexing means extracting subsets of reads from the raw fastq files named in columns Rawfile_R1 and Rawfile_R2 (for single-end reads, only Rawfile_R1). The subset of read-pairs for each sample is identified by a barcode sequence, which must be listed in the Barcode column. The Barcode sequence is matched at the start of the R1-reads only.
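The idea of selecting read-pairs by a barcode at the start of R1 can be sketched on toy data as follows. This is not the package's implementation: exact prefix matching with startsWith() is an assumption, and the reads, barcodes and sample names are invented.

```r
# Toy read pairs; R1[i] and R2[i] belong together. All sequences are invented.
R1 <- c("ACGTACTTTTGGGG", "TGCATGAAAACCCC", "ACGTACCCCCAAAA", "GGGGGGTTTTAAAA")
R2 <- c("CCCCAAAA", "GGGGTTTT", "TTTTGGGG", "AAAACCCC")

# Hypothetical barcodes, one per sample.
barcodes <- c(S1 = "ACGTAC", S2 = "TGCATG")

# For each barcode, find the read pairs whose R1 read starts with it.
# Only R1 is inspected; the R2 reads follow along by index.
idx <- lapply(barcodes, function(bc) which(startsWith(R1, bc)))

# Read pairs assigned to sample S1.
S1.R1 <- R1[idx$S1]
S1.R2 <- R2[idx$S1]
```

Here read pairs 1 and 3 go to S1, read pair 2 goes to S2, and read pair 4 matches no barcode.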

If trim.primers=TRUE, the start of each R1 read is trimmed by the length of Forward_primer, and the start of each R2 read is trimmed by the length of Reverse_primer. NOTE: There is no primer-matching here. No reads are discarded, only trimmed by primer lengths.
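Length-based trimming can be sketched as below. The helper trim_start() is hypothetical and not part of the package. Trimming the quality string along with the sequence is an assumption of this sketch, made because a fastq record needs both to have the same length. Note that the toy read does not start with the primer, yet it is trimmed all the same.

```r
# Hypothetical helper: remove as many leading characters as the primer is long.
# No comparison between the read and the primer sequence takes place.
trim_start <- function(reads, quals, primer) {
  n <- nchar(primer)
  list(reads = substring(reads, n + 1),
       quals = substring(quals, n + 1))
}

# Invented read, quality string and primer.
trimmed <- trim_start(reads  = "TTTTCCCCGGGG",
                      quals  = "IIIIHHHHGGGG",
                      primer = "ACGT")
print(trimmed$reads)
```

The first four bases are removed, leaving CCCCGGGG, even though TTTT is not the primer sequence.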

The files listed in Rawfile_R1 and Rawfile_R2 must all be in the in.folder. These files may be compressed (.gz).
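Before running the function it may be worth confirming that every listed raw file is present in the in.folder. The check below is a suggested precaution, not something the documentation says the function does. The folder and file names are invented, and a temporary folder stands in for a real in.folder.

```r
# Temporary stand-in for in.folder, containing only one of the two raw files.
in.folder <- file.path(tempdir(), "raw_demo")
dir.create(in.folder, showWarnings = FALSE)
file.create(file.path(in.folder, "raw_R1.fastq.gz"))

# File names as they would appear in Rawfile_R1 and Rawfile_R2 (hypothetical).
raw.files <- unique(c("raw_R1.fastq.gz", "raw_R2.fastq.gz"))

# Which of the listed files are absent from in.folder?
found         <- file.exists(file.path(in.folder, raw.files))
missing.files <- raw.files[!found]
if (length(missing.files) > 0) {
  message("Not found in in.folder: ", paste(missing.files, collapse = ", "))
}
```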

Value

The function will output the de-multiplexed fastq-files to the out.folder. The name of each file consists of the corresponding ProjectID_SequencingRunID_SampleID, with the extensions _R1.fastq.gz or _R2.fastq.gz.

The function will return in R a table with the number of read-pairs for each sample. You may then add this as a new column to the existing metadata.tbl by full_join(metadata.tbl, demultiplex.tbl, by = c("ProjectID", "SequencingRunID", "SampleID")), where demultiplex.tbl indicates the output from this function.
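A sketch of that join, using the dplyr package, is shown below. The data.frame demultiplex.tbl is a hand-made stand-in for the function's return value; the count column name n_readpairs and all values are hypothetical, since the actual column name is not stated here.

```r
library(dplyr)

# Reduced, invented metadata table (only the key columns plus Barcode).
metadata.tbl <- data.frame(
  ProjectID       = c("ProjA", "ProjA"),
  SequencingRunID = c("Run01", "Run01"),
  SampleID        = c("1", "2"),
  Barcode         = c("ACGTAC", "TGCATG"),
  stringsAsFactors = FALSE
)

# Stand-in for the table returned by demultiplex().
# The column name n_readpairs is hypothetical.
demultiplex.tbl <- data.frame(
  ProjectID       = c("ProjA", "ProjA"),
  SequencingRunID = c("Run01", "Run01"),
  SampleID        = c("1", "2"),
  n_readpairs     = c(1200L, 950L),
  stringsAsFactors = FALSE
)

# Add the read-pair counts to the metadata, matching on the three ID columns.
metadata.tbl <- full_join(metadata.tbl, demultiplex.tbl,
                          by = c("ProjectID", "SequencingRunID", "SampleID"))
```

The key columns need compatible types in both tables; for example, a character SampleID in one table and an integer SampleID in the other would make the join fail.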

Author(s)

Lars Snipen.


larssnip/midiv documentation built on Jan. 20, 2025, 6:22 p.m.