docs/FILEFORMATS.md


File Formats

Experimental Design File

REQUIRED: DiMSum requires a table (e.g. created using Microsoft Excel) describing the experimental design, saved as a tab-separated plain text file (see arguments). You can download this file to use as a template.

Your file must have the following columns:

* sample_name: A sensible sample name, e.g. 'input1' (alphanumeric characters only).
* experiment_replicate: A strictly positive integer identifying distinct experiments (e.g. distinct plasmid library transformations), i.e. a set of input and output replicates originating from the same input biological replicate.
* selection_id: An integer indicating whether samples were sequenced before (0) or after (1) selection.
* selection_replicate: (Output samples only) A strictly positive integer denoting distinct replicate selections (or biological output replicates), each derived from the same input sample. Leave entries blank (empty string) for all input samples (each input sample corresponds to a unique experiment). NOTE: DiMSum simply sums variant counts over selection replicates (i.e. there is no error modelling of replicate selections associated with the same input sample).
* technical_replicate: A strictly positive integer denoting technical replicates, which correspond to sample re-sequencing, i.e. extracted DNA originating from the same sample split between separate sequencing lanes or files. Leave this column blank (empty string) when no technical replicates are present.
* pair1: (WRAP only) FASTQ file name of the first read in a given pair.
* pair2: (WRAP only) FASTQ file name of the second read in a given pair (omit for single-end library designs, see arguments).
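The column rules above can be checked before running DiMSum. The following is a minimal Python sketch, not part of DiMSum itself: the function names `is_positive_int` and `check_design`, the `EXAMPLE` table and the wording of the messages are all hypothetical, and only the rules stated in the column descriptions are tested (the WRAP-only `pair1` and `pair2` columns are not checked).

```python
import csv
import io
import re

# Columns described as mandatory above (pair1/pair2 are WRAP-only and not checked here).
REQUIRED_COLUMNS = [
    "sample_name", "experiment_replicate", "selection_id",
    "selection_replicate", "technical_replicate",
]


def is_positive_int(text):
    """True for strings such as '1' or '12'; False for '', '0', '-1' or '1.5'."""
    return re.fullmatch(r"[0-9]+", text) is not None and int(text) > 0


def check_design(text):
    """Return a list of problems found in tab-separated design text (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return ["missing columns: " + ", ".join(missing)]

    problems = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        # Short rows yield None for absent fields; treat those as blank.
        value = {c: (row[c] or "") for c in REQUIRED_COLUMNS}
        if not re.fullmatch(r"[A-Za-z0-9]+", value["sample_name"]):
            problems.append(f"line {line_no}: sample_name must be alphanumeric")
        if not is_positive_int(value["experiment_replicate"]):
            problems.append(f"line {line_no}: experiment_replicate must be a strictly positive integer")
        if value["selection_id"] not in ("0", "1"):
            problems.append(f"line {line_no}: selection_id must be 0 (input) or 1 (output)")
        elif value["selection_id"] == "0" and value["selection_replicate"] != "":
            problems.append(f"line {line_no}: selection_replicate must be blank for input samples")
        elif value["selection_id"] == "1" and not is_positive_int(value["selection_replicate"]):
            problems.append(f"line {line_no}: selection_replicate must be a strictly positive integer for output samples")
        if value["technical_replicate"] != "" and not is_positive_int(value["technical_replicate"]):
            problems.append(f"line {line_no}: technical_replicate must be blank or a strictly positive integer")
    return problems


# Illustrative table: one input sample and two replicate selections derived from it.
EXAMPLE = (
    "sample_name\texperiment_replicate\tselection_id\tselection_replicate\ttechnical_replicate\n"
    "input1\t1\t0\t\t1\n"
    "output1A\t1\t1\t1\t1\n"
    "output1B\t1\t1\t2\t1\n"
)
print(check_design(EXAMPLE) or "no problems found")
```

To check a real file, read its contents as text and pass them to `check_design`. An empty list means none of the rules listed above were violated; it does not guarantee that DiMSum will accept the file.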

Optional columns for growth-rate based assays (download template file here):

* generations: (Output samples only) An estimate of the number of generations, used to normalize fitness and error estimates accordingly.
* cell_density: An estimate of the cell density (optical density or similar), used to estimate variant growth rates.
* selection_time: (Output samples only) The selection time in hours, used to estimate variant growth rates.

Below is a schematic of a generic deep mutational scanning experiment indicating the corresponding entries which should be made in the experimental design file (red text).

In addition to these mandatory columns, further columns may be included to specify Stage 2-specific options (see arguments), which relate to constant region trimming. This allows sample-specific trimming behaviour where necessary. Options specified by columns in the experimental design file override global arguments provided on the command line.

FASTQ Files

OPTIONAL: If processing of raw sequencing reads is required (with WRAP), DiMSum requires FASTQ files saved in a common directory and with a common file extension (see arguments). Either FASTQ files or a Variant Count File can be supplied (not both).

Variant Count File

OPTIONAL: If raw sequencing reads have already been processed independently of DiMSum, processing and analysis of variant counts (with DiMSum STEAM) requires a table (e.g. created using Microsoft Excel) with variant sequences and counts for all samples (see arguments). You can download this file to use as a template. Either FASTQ Files or a Variant Count File can be supplied (not both).

Barcode Design File

OPTIONAL: If FASTQ files contain multiplexed samples, DiMSum requires a table (e.g. created using Microsoft Excel) describing how index tags map to samples, saved as a tab-separated plain text file (see arguments). You can download this file to use as a template.

Your file must have the following columns:

* pair1: FASTQ file name of the first read in a given pair.
* pair2: FASTQ file name of the second read in a given pair (omit for single-end library designs, see arguments).
* barcode: Sample index tag (A/C/G/T characters only).
* new_pair_prefix: FASTQ file prefix of demultiplexed sample reads, i.e. excluding file extension (alphanumeric and underscore characters only).

When including a Barcode Design File, ensure that all 'new_pair_prefix' column entries correspond to 'pair1' and 'pair2' column entries in the Experimental Design File by appending '1.fastq' and '2.fastq' to the prefix for the first and second read respectively.
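The naming rule above can be illustrated with a short Python sketch. It assumes a paired-end design and the literal suffixes '1.fastq' and '2.fastq' quoted above; the function names `demultiplexed_names` and `unmatched_prefixes` are hypothetical and not part of DiMSum.

```python
import re


def demultiplexed_names(new_pair_prefix):
    """Return the (pair1, pair2) file names expected for one 'new_pair_prefix' entry."""
    if not re.fullmatch(r"[A-Za-z0-9_]+", new_pair_prefix):
        raise ValueError(f"prefix {new_pair_prefix!r} must contain only alphanumeric and underscore characters")
    return new_pair_prefix + "1.fastq", new_pair_prefix + "2.fastq"


def unmatched_prefixes(prefixes, design_pairs):
    """List prefixes whose expected file names are absent from the design file.

    design_pairs is a set of (pair1, pair2) tuples taken from the
    Experimental Design File.
    """
    return [p for p in prefixes if demultiplexed_names(p) not in design_pairs]


# Illustrative values only: the design file lists reads for 'input1_' but not 'output1_'.
design_pairs = {("input1_1.fastq", "input1_2.fastq")}
print(unmatched_prefixes(["input1_", "output1_"], design_pairs))
```

Any prefix returned by `unmatched_prefixes` points to a Barcode Design File entry with no corresponding 'pair1' and 'pair2' entries in the Experimental Design File.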

Variant Identity File

OPTIONAL: If the supplied sequences (in the FASTQ Files or Variant Count File) contain variant barcodes, DiMSum requires a table (e.g. created using Microsoft Excel) describing how barcodes map to variants, saved as a tab-separated plain text file (see arguments). You can download this file to use as a template.

Your file must have the following columns:

* barcode: DNA barcode (A/C/G/T characters only).
* variant: Associated DNA variant (A/C/G/T characters only).
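As an illustration of the two-column layout, the sketch below reads such a table into a barcode-to-variant dictionary. It is not DiMSum code: the function name `read_variant_identity` is hypothetical, and two behaviours are assumptions of this sketch rather than documented rules, namely that only uppercase A/C/G/T is accepted and that a barcode mapped to two different variants is treated as an error.

```python
import csv
import io
import re

DNA = re.compile(r"[ACGT]+")  # assumption: uppercase characters only


def read_variant_identity(text):
    """Parse tab-separated text with 'barcode' and 'variant' columns into a dict."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    mapping = {}
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        barcode = row.get("barcode") or ""
        variant = row.get("variant") or ""
        if not DNA.fullmatch(barcode) or not DNA.fullmatch(variant):
            raise ValueError(f"line {line_no}: barcode and variant must contain only A/C/G/T")
        # Assumption of this sketch: conflicting entries for one barcode are rejected.
        if barcode in mapping and mapping[barcode] != variant:
            raise ValueError(f"line {line_no}: barcode {barcode} maps to more than one variant")
        mapping[barcode] = variant
    return mapping


# Illustrative table with two barcodes.
EXAMPLE_IDENTITY = "barcode\tvariant\nACGT\tATGGCC\nTTGA\tATGGCA\n"
print(read_variant_identity(EXAMPLE_IDENTITY))
```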

Synonym Sequences File

OPTIONAL: In order to obtain fitness and error estimates for synonymous substitution variants corresponding to additional reference variants (other than the wild-type), simply include them in a plain text file with one coding nucleotide sequence per line (single column, no header, A/C/G/T characters only). You can download this file to use as a template.
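A quick way to confirm that a file follows this layout is sketched below in Python. The function name `read_synonym_sequences` is hypothetical, and skipping blank lines and accepting only uppercase characters are assumptions of this sketch, not documented DiMSum behaviour.

```python
import re


def read_synonym_sequences(text):
    """Return the sequences in a single-column, header-free file, one per line."""
    sequences = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        sequence = line.strip()
        if not sequence:
            continue  # assumption: blank lines (e.g. a trailing newline) are ignored
        if not re.fullmatch(r"[ACGT]+", sequence):
            # A header row such as 'sequence' is rejected here as well.
            raise ValueError(f"line {line_no}: expected only A/C/G/T, got {sequence!r}")
        sequences.append(sequence)
    return sequences


# Illustrative content: two reference sequences, no header.
print(read_synonym_sequences("ATGGCC\nATGGCA\n"))
```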

Output Files

Primary output files:

Additional output files:



lehner-lab/DiMSum documentation built on April 10, 2024, 4:15 a.m.