DiMSum: A pipeline for processing deep mutational scanning data

Pipeline Stages

Stage 0: DEMULTIPLEX raw reads
Stage 1: QC raw reads
Stage 2: TRIM constant regions
Stage 3: ALIGN paired-end reads
Stage 4: PROCESS variants
Stage 5: ANALYSE counts

Demultiplex samples and trim read barcodes using Cutadapt (optional). This stage is run if a Barcode Design File is supplied (see arguments).

Produce raw read quality reports using FastQC (and unzip and split FASTQ files if necessary).

Remove constant region sequences from read 5' and 3' ends using Cutadapt. By default the sequences of 3' constant regions are assumed to be the reverse complement of 5' constant region sequences (see stage-specific arguments).
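The default relationship between 5' and 3' constant regions can be illustrated with a short Python sketch. The sequence used here is made up, and the function name is hypothetical; it only shows what "reverse complement" means and is not how DiMSum or Cutadapt computes it internally.

```python
# Translation table mapping each base to its complement (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


# Hypothetical 5' constant region, for illustration only.
five_prime_constant = "GGCTAGCA"

# Under the default assumption, the matching 3' constant region
# is the reverse complement of the 5' sequence.
three_prime_constant = reverse_complement(five_prime_constant)
```

If the 3' constant regions in a library do not follow this pattern, they presumably need to be given explicitly through the stage-specific arguments.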

Align overlapping read pairs using VSEARCH and filter resulting variants according to base quality, expected number of errors and constituent read length (see stage-specific arguments). Unique variant sequences are then tallied using Starcode. For Trans library designs, read pairs are simply concatenated. For single-end libraries, reads are only filtered.
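As a rough illustration of the filtering criteria, the sketch below computes the expected number of errors of a read from its FASTQ quality string, using the standard Phred definition (error probability 10^(-Q/10)). It assumes Phred+33 encoding. The function names and thresholds are hypothetical, and the checks are applied to a single read for simplicity; this is not the VSEARCH or DiMSum implementation.

```python
def phred_scores(quality: str, offset: int = 33) -> list:
    """Decode a FASTQ quality string into integer Phred scores."""
    return [ord(ch) - offset for ch in quality]


def expected_errors(quality: str, offset: int = 33) -> float:
    """Sum of per-base error probabilities, 10 ** (-Q / 10)."""
    return sum(10 ** (-q / 10) for q in phred_scores(quality, offset))


def passes_filters(quality: str, min_base_quality: int,
                   max_expected_errors: float, min_length: int) -> bool:
    """Apply hypothetical length, base quality and expected error thresholds."""
    scores = phred_scores(quality)
    if not scores or len(scores) < min_length:
        return False
    if min(scores) < min_base_quality:
        return False
    return expected_errors(quality) <= max_expected_errors


# 'I' encodes Q40 (error probability 1e-4); '5' encodes Q20 (1e-2).
good_read = passes_filters("IIII", min_base_quality=30,
                           max_expected_errors=0.01, min_length=4)
low_quality_read = passes_filters("II5I", min_base_quality=30,
                                  max_expected_errors=0.5, min_length=4)
```

Here `good_read` is `True`, while `low_quality_read` is `False` because one base falls below the minimum base quality.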

Combine sample-wise variant counts and statistics to produce a unified results data.table. After aggregating counts across technical replicates, variants are processed and filtered according to user specifications (see stage-specific arguments):

4.1 For Barcoded library designs, read counts are aggregated at the variant level for barcode/variant mappings specified in the Variant Identity File. Undefined/misread barcodes are ignored.
4.2 Indel variants (defined as those not matching the wild-type nucleotide sequence length) are removed if necessary (see '--indels' argument).
4.3 If internal constant region(s) are specified, these are excised from all substitution variants if a perfect match is found (see '--wildtypeSequence' argument).
4.4 Substitution variants with mutations inconsistent with the library design are removed (see '--permittedSequences' argument).
4.5 Substitution variants with more substitutions than desired are also removed (see '--maxSubstitutions' argument).
4.6 Finally, nonsynonymous substitution variants with synonymous substitutions in other codons are removed if necessary (see '--mixedSubstitutions' argument).
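Steps 4.2, 4.5 and 4.6 can be sketched in Python as follows. This is a simplified illustration, not the DiMSum implementation (which operates on a data.table in R). All function and parameter names are hypothetical, sequences are assumed to be uppercase, in-frame coding sequences, and substitutions are counted at the nucleotide level here, which may differ from how the pipeline counts them. Steps 4.1, 4.3 and 4.4 are not covered.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
_BASES = "TCAG"
_AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(_BASES, repeat=3), _AMINO_ACIDS)}


def is_indel(variant: str, wildtype: str) -> bool:
    """Step 4.2: a variant is an indel if its length differs from wild type."""
    return len(variant) != len(wildtype)


def count_substitutions(variant: str, wildtype: str) -> int:
    """Step 4.5: number of mismatching nucleotide positions."""
    return sum(v != w for v, w in zip(variant, wildtype))


def has_mixed_substitutions(variant: str, wildtype: str) -> bool:
    """Step 4.6: True if both synonymous and nonsynonymous codon changes occur."""
    synonymous = nonsynonymous = False
    for i in range(0, len(wildtype) - 2, 3):
        wt_codon, var_codon = wildtype[i:i + 3], variant[i:i + 3]
        if wt_codon == var_codon:
            continue
        if CODON_TABLE[wt_codon] == CODON_TABLE[var_codon]:
            synonymous = True
        else:
            nonsynonymous = True
    return synonymous and nonsynonymous


def keep_variant(variant: str, wildtype: str, max_substitutions: int,
                 keep_indels: bool = False, keep_mixed: bool = False) -> bool:
    """Combine the three filters; thresholds are user-supplied."""
    if is_indel(variant, wildtype):
        return keep_indels
    if count_substitutions(variant, wildtype) > max_substitutions:
        return False
    if not keep_mixed and has_mixed_substitutions(variant, wildtype):
        return False
    return True


wildtype = "ATGGCTAAA"  # hypothetical wild type encoding M-A-K
candidates = [
    "ATGGCCAAA",   # one synonymous substitution
    "ATGGCTAGA",   # one nonsynonymous substitution (K to R)
    "ATGGCCAGA",   # synonymous plus nonsynonymous: mixed
    "ATGGCTAAAT",  # length differs from wild type: indel
    "CTGGCGAGA",   # three substitutions
]
kept = [v for v in candidates if keep_variant(v, wildtype, max_substitutions=2)]
```

With these toy inputs, only the first two candidates are retained.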

Calculate fitness and error estimates for a user-specified subset of variants (see stage-specific arguments):

5.1 Optionally remove low count variants according to user-specified soft/hard thresholds to minimise the impact of "fictional" variants from sequencing errors.
5.2 Calculate replicate normalisation parameters (scale and shift) to minimise inter-replicate fitness differences.
5.3 Fit the error model to a high confidence subset of variants to determine additive and multiplicative error terms.
5.4 Aggregate variant fitness and error at the amino acid level if the target molecule is a coding sequence.
5.5 Optionally normalise fitness and error estimates by the number of generations in the case of a growth-rate based assay (see Experiment Design File).
5.6 Merge fitness scores between replicates in a weighted manner that takes into account their respective errors.
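One common way to merge replicate estimates while accounting for their errors (step 5.6) is an inverse-variance weighted mean. The sketch below assumes this weighting scheme purely for illustration; the text above does not specify the exact formula DiMSum uses, and the function name and inputs are hypothetical.

```python
import math


def merge_replicates(fitness: list, sigma: list) -> tuple:
    """Inverse-variance weighted mean of replicate fitness scores.

    fitness: per-replicate fitness estimates for one variant
    sigma:   per-replicate error estimates (standard deviations, all > 0)
    Returns (merged_fitness, merged_sigma).
    """
    if not fitness or len(fitness) != len(sigma):
        raise ValueError("fitness and sigma must be non-empty and equal length")
    if any(s <= 0 for s in sigma):
        raise ValueError("all error estimates must be positive")

    # Replicates with smaller error receive larger weight.
    weights = [1.0 / s ** 2 for s in sigma]
    total_weight = sum(weights)

    merged_fitness = sum(w * f for w, f in zip(weights, fitness)) / total_weight
    merged_sigma = math.sqrt(1.0 / total_weight)
    return merged_fitness, merged_sigma


# Toy example: the second replicate is twice as uncertain as the first,
# so the merged score lies closer to the first replicate.
merged_fitness, merged_sigma = merge_replicates([1.0, 2.0], [1.0, 2.0])
```

In this toy example the merged fitness is 1.2 rather than the unweighted mean of 1.5, and the merged error (about 0.89) is smaller than either replicate's error.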