DiMSum: A pipeline for processing deep mutational scanning data

File Formats

Experimental Design File
FASTQ Files
Variant Count File
Barcode Design File
Variant Identity File
Synonym Sequences File
Output Files

Experimental Design File

REQUIRED: DiMSum requires a table (e.g. created using Microsoft Excel) describing the experimental design, saved as a tab-separated plain text file (see arguments). You can download this file to use as a template.

Your file must have the following columns:

sample_name A sensible sample name e.g. 'input1' (alphanumeric characters only).
experiment_replicate An integer identifier denoting distinct experiments (e.g. distinct plasmid library transformations) i.e. a set of input and output replicates originating from the same input biological replicate (strictly positive integer).
selection_id An integer indicating whether samples were sequenced before (0) or after (1) selection.
selection_replicate (Output samples only) An integer denoting distinct replicate selections (or biological output replicates) each derived from the same input sample (strictly positive integer). Entries should be blank (empty string) for all input samples (each input sample corresponds to a unique experiment). NOTE: DiMSum simply sums variant counts over selection replicates (i.e. no error modelling of replicate selections associated with the same input sample).
technical_replicate An integer denoting technical replicates (strictly positive integer) corresponding to sample re-sequencing i.e. extracted DNA originating from the same sample split between separate sequencing lanes or files. Leave this column blank (empty string) when no technical replicates are present.
pair1 (WRAP only) FASTQ file name of the first read in a given pair.
pair2 (WRAP only) FASTQ file name of the second read in a given pair (omit for single-end library designs, see arguments).
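The column rules above can be checked before running the pipeline. The following is a minimal sketch in Python, not part of DiMSum itself: the function names `check_design` and `is_positive_int` are hypothetical, and the sketch only covers the five columns whose constraints are stated above (the WRAP-only `pair1` and `pair2` columns are not checked).

```python
import csv
import io

# Columns whose constraints are described in the text above.
REQUIRED_COLUMNS = (
    "sample_name",
    "experiment_replicate",
    "selection_id",
    "selection_replicate",
    "technical_replicate",
)


def is_positive_int(value):
    """Return True if value is a digit string encoding an integer > 0."""
    return value.isascii() and value.isdigit() and int(value) > 0


def check_design(text):
    """Return a list of problems found in a tab-separated design table."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        return ["missing columns: " + ", ".join(missing)]

    problems = []
    for row in reader:
        line = reader.line_num
        # Treat absent trailing cells as blank (empty string).
        cell = {c: (row.get(c) or "") for c in REQUIRED_COLUMNS}
        name = cell["sample_name"]
        if not (name.isascii() and name.isalnum()):
            problems.append(f"line {line}: sample_name must be alphanumeric")
        if not is_positive_int(cell["experiment_replicate"]):
            problems.append(f"line {line}: experiment_replicate must be a positive integer")
        if cell["selection_id"] == "0":
            # Input samples: selection_replicate must be left blank.
            if cell["selection_replicate"] != "":
                problems.append(f"line {line}: selection_replicate must be blank for input samples")
        elif cell["selection_id"] == "1":
            # Output samples: selection_replicate is required.
            if not is_positive_int(cell["selection_replicate"]):
                problems.append(f"line {line}: selection_replicate must be a positive integer for output samples")
        else:
            problems.append(f"line {line}: selection_id must be 0 or 1")
        tech = cell["technical_replicate"]
        if tech != "" and not is_positive_int(tech):
            problems.append(f"line {line}: technical_replicate must be blank or a positive integer")
    return problems


# Illustrative table: the last output sample is missing its selection_replicate.
example = (
    "sample_name\texperiment_replicate\tselection_id\tselection_replicate\ttechnical_replicate\n"
    "input1\t1\t0\t\t1\n"
    "output1A\t1\t1\t1\t1\n"
    "output1B\t1\t1\t\t1\n"
)
problems = check_design(example)
for problem in problems:
    print(problem)
```

To check a real file, read its contents into a string first (for example with `open(path, newline="").read()`) and pass that string to `check_design`.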

Optional columns for growth-rate based assays (download template file here):

generations (Output samples only) An estimate of the number of generations in order to normalize fitness and error estimates accordingly.
cell_density An estimate of the cell density (optical density or similar) in order to estimate variant growth rates.
selection_time (Output samples only) The selection time in hours in order to estimate variant growth rates.

Below is a schematic of a generic deep mutational scanning experiment indicating the corresponding entries which should be made in the experimental design file (red text).

In addition to these mandatory columns, additional columns may be included to specify Stage 2-specific options (see arguments), which relate to constant region trimming. This allows sample-specific trimming behaviour if necessary. Options specified by columns in the experimental design file override global arguments provided on the command-line.

FASTQ Files

OPTIONAL: If processing of raw sequencing reads is required (with WRAP), DiMSum requires FASTQ files saved in a common directory and with a common file extension (see arguments). Either FASTQ files or a Variant Count File can be supplied (not both).

Variant Count File

OPTIONAL: If raw sequencing reads have already been processed independently of DiMSum, processing and analysis of variant counts (with DiMSum STEAM) requires a table (e.g. created using Microsoft Excel) with variant sequences and counts for all samples (see arguments). You can download this file to use as a template. Either FASTQ Files or a Variant Count File can be supplied (not both).

Barcode Design File

OPTIONAL: If FASTQ files contain multiplexed samples, DiMSum requires a table (e.g. created using Microsoft Excel) describing how index tags map to samples, saved as a tab-separated plain text file (see arguments). You can download this file to use as a template.

Your file must have the following columns:

pair1 FASTQ file name of the first read in a given pair.
pair2 FASTQ file name of the second read in a given pair (omit for single-end library designs, see arguments).
barcode Sample index tag (A/C/G/T characters only).
new_pair_prefix FASTQ file prefix of demultiplexed sample reads i.e. excluding file extension (alphanumeric and underscore characters only).

When including a Barcode Design File, ensure that all 'new_pair_prefix' column entries correspond to 'pair1' and 'pair2' column entries in the Experimental Design File by appending '1.fastq' and '2.fastq' to the prefix for the first and second read respectively.
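As a rough illustration of this naming rule, the sketch below derives the file names expected in the Experimental Design File from each 'new_pair_prefix' and reports prefixes with no matching 'pair1'/'pair2' entry. The helper names `expected_fastq_names` and `prefixes_without_match` are hypothetical, and the literal suffixes '1.fastq' and '2.fastq' are taken from the sentence above; adjust them if your FASTQ file extension differs.

```python
import re

# new_pair_prefix may contain alphanumeric and underscore characters only.
PREFIX_PATTERN = re.compile(r"[A-Za-z0-9_]+")


def expected_fastq_names(prefix, paired_end=True):
    """Return the design-file FASTQ names implied by a new_pair_prefix."""
    if not PREFIX_PATTERN.fullmatch(prefix):
        raise ValueError(f"invalid new_pair_prefix: {prefix!r}")
    names = [prefix + "1.fastq"]
    if paired_end:
        names.append(prefix + "2.fastq")
    return names


def prefixes_without_match(prefixes, design_pairs):
    """Return prefixes whose derived names are absent from the design file.

    design_pairs is an iterable of (pair1, pair2) tuples read from the
    Experimental Design File (paired-end designs assumed here).
    """
    listed = set(design_pairs)
    return [p for p in prefixes if tuple(expected_fastq_names(p)) not in listed]


# Illustrative values only.
design_pairs = [("input1_1.fastq", "input1_2.fastq")]
unmatched = prefixes_without_match(["input1_", "output1_"], design_pairs)
print(unmatched)
```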

Variant Identity File

OPTIONAL: If the supplied sequences (in the FASTQ Files or Variant Count File) contain variant barcodes, DiMSum requires a table (e.g. created using Microsoft Excel) describing how barcodes map to variants, saved as a tab-separated plain text file (see arguments). You can download this file to use as a template.

Your file must have the following columns:

barcode DNA barcode (A/C/G/T characters only).
variant Associated DNA variant (A/C/G/T characters only).
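A table of this shape can be loaded and sanity-checked with a few lines of Python. This is a sketch under stated assumptions rather than DiMSum's own parsing logic: the function name `load_barcode_map` is hypothetical, sequences are assumed to be upper case, and rejecting a barcode that maps to two different variants is an extra check that the text above does not require.

```python
import csv
import io

DNA_ALPHABET = set("ACGT")


def load_barcode_map(text):
    """Parse a tab-separated barcode/variant table into a dict."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    mapping = {}
    for row in reader:
        barcode, variant = row["barcode"], row["variant"]
        for label, seq in (("barcode", barcode), ("variant", variant)):
            # Reject empty cells and any character other than A/C/G/T.
            if not seq or set(seq) - DNA_ALPHABET:
                raise ValueError(f"line {reader.line_num}: invalid {label}: {seq!r}")
        # Assumption: each barcode should identify exactly one variant.
        if mapping.setdefault(barcode, variant) != variant:
            raise ValueError(f"line {reader.line_num}: barcode {barcode} maps to two variants")
    return mapping


# Illustrative table with made-up sequences.
example = "barcode\tvariant\nAAAC\tATGGCT\nAAAG\tATGGCA\n"
barcode_map = load_barcode_map(example)
print(barcode_map)
```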

Synonym Sequences File

OPTIONAL: In order to obtain fitness and error estimates for synonymous substitution variants corresponding to additional reference variants (other than the wild-type), simply include them in a plain text file with one coding nucleotide sequence per line (single column, no header, A/C/G/T characters only). You can download this file to use as a template.

Output Files

Primary output files:

report.html DiMSum pipeline summary report and diagnostic plots in html format.
DiMSum_Project_fitness_replicates.RData R data object with replicate (and merged) fitness scores and associated errors.
DiMSum_Project_variant_data_merge.RData R data object with variant counts and statistics.

Additional output files:

fitness_wildtype.txt Wild-type fitness score and associated error.
fitness_singles.txt Single amino acid or nucleotide substitution variant fitness scores and associated errors.
fitness_doubles.txt Double amino acid or nucleotide substitution variant fitness scores and associated errors.
fitness_synonymous.txt Synonymous substitution variant fitness scores and associated errors (for coding sequences only).
fitness_singles_MaveDB.csv MaveDB compatible .csv file with single amino acid or nucleotide substitution variant fitness scores and associated errors.
DiMSum_Project_variant_data_merge.tsv Tab-separated plain text file with variant counts and statistics.
DiMSum_Project_nobarcode_variant_data_merge.tsv Tab-separated plain text file with sequenced barcodes that were not found in the variant identity file.
DiMSum_Project_indel_variant_data_merge.tsv Tab-separated plain text file with rejected indel variants.
DiMSum_Project_rejected_variant_data_merge.tsv Tab-separated plain text file with remaining rejected variants (internal constant region mutants, mutations inconsistent with the library design or variants with too many substitutions).
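The tab-separated output files listed above can be inspected outside R. The sketch below reads such a table with Python's standard `csv` module; it assumes the file has a header row, and the function name `read_tsv` and the column names in the inline example (`nt_seq`, `count`) are placeholders for illustration, not a statement of the actual output columns.

```python
import csv
import io


def read_tsv(handle):
    """Return (column_names, rows) from an open tab-separated text file."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = list(reader)
    return (reader.fieldnames or []), rows


# Inline stand-in for a file; column names here are hypothetical.
sample = io.StringIO("nt_seq\tcount\nATGGCT\t120\nATGGCA\t7\n")
columns, rows = read_tsv(sample)
print(columns)
print(len(rows))
```

For a real run, open the file instead of using the inline stand-in, e.g. `with open("DiMSum_Project_variant_data_merge.tsv", newline="") as handle:` and pass `handle` to `read_tsv`. All values are returned as strings, so convert numeric columns explicitly before use.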