digestFastqs: Read, filter and digest sequences from fastq file(s).
In fmicompbio/mutscan: Preprocessing and Analysis of Deep Mutational Scanning Data

digestFastqs

R Documentation

Read, filter and digest sequences from fastq file(s).

Description

Read sequences for one or a pair of fastq files and digest them (extract umis, constant and variable parts, filter, extract mismatch information from constant and count the observed unique variable parts). Alternatively, primer sequences could be specified, in which case the sequence immediately following the primer will be considered the variable sequence.

Usage

digestFastqs(
  fastqForward,
  fastqReverse = NULL,
  mergeForwardReverse = FALSE,
  minOverlap = 0,
  maxOverlap = 0,
  minMergedLength = 0,
  maxMergedLength = 0,
  maxFracMismatchOverlap = 1,
  greedyOverlap = TRUE,
  revComplForward = FALSE,
  revComplReverse = FALSE,
  adapterForward = "",
  adapterReverse = "",
  elementsForward = "",
  elementLengthsForward = numeric(0),
  elementsReverse = "",
  elementLengthsReverse = numeric(0),
  primerForward = c(""),
  primerReverse = c(""),
  wildTypeForward = "",
  wildTypeReverse = "",
  constantForward = c(""),
  constantReverse = c(""),
  avePhredMinForward = 20,
  avePhredMinReverse = 20,
  variableNMaxForward = 0,
  variableNMaxReverse = 0,
  umiNMax = 0,
  nbrMutatedCodonsMaxForward = 1,
  nbrMutatedCodonsMaxReverse = 1,
  nbrMutatedBasesMaxForward = -1,
  nbrMutatedBasesMaxReverse = -1,
  forbiddenMutatedCodonsForward = "",
  forbiddenMutatedCodonsReverse = "",
  useTreeWTmatch = FALSE,
  collapseToWTForward = FALSE,
  collapseToWTReverse = FALSE,
  mutatedPhredMinForward = 0,
  mutatedPhredMinReverse = 0,
  mutNameDelimiter = ".",
  constantMaxDistForward = -1,
  constantMaxDistReverse = -1,
  variableCollapseMaxDist = deprecated(),
  variableCollapseMinReads = deprecated(),
  variableCollapseMinRatio = deprecated(),
  umiCollapseMaxDist = 0,
  filteredReadsFastqForward = "",
  filteredReadsFastqReverse = "",
  maxNReads = -1,
  verbose = FALSE,
  nThreads = 1,
  chunkSize = 1e+05,
  maxReadLength = 1024
)

Arguments

`fastqForward`, `fastqReverse`	Character vectors, paths to gzipped FASTQ files corresponding to forward and reverse reads, respectively. If more than one forward/reverse sequence file is given, they need to be provided in the same order. Note that if multiple fastq files are provided, they are all assumed to correspond to the same sample, and will effectively be concatenated.
`mergeForwardReverse`	Logical scalar, whether to fuse the forward and reverse variable sequences.
`minOverlap`, `maxOverlap`	Numeric scalar, the minimal and maximal allowed overlap between the forward and reverse reads when merging. Only used if `mergeForwardReverse` is `TRUE`. If set to 0, only overlaps covering the full length of the shortest of the two reads will be considered.
`minMergedLength`, `maxMergedLength`	Numeric scalar, the minimal and maximal allowed total length of the merged product (if `mergeForwardReverse` is `TRUE`). If set to 0, any length is allowed.
`maxFracMismatchOverlap`	Numeric scalar, maximal mismatch rate in the overlap. Only used if `mergeForwardReverse` is `TRUE`.
`greedyOverlap`	Logical scalar. If `TRUE`, the first overlap satisfying `minOverlap`, `maxOverlap`, `minMergedLength`, `maxMergedLength` and `maxFracMismatchOverlap` will be retained. If `FALSE`, all valid overlaps will be scored and the one with the highest score (largest number of matches) will be retained.
`revComplForward`, `revComplReverse`	Logical scalar, whether to reverse complement the forward/reverse variable and constant sequences, respectively.
`adapterForward`, `adapterReverse`	Character scalars, the adapter sequence for forward/reverse reads, respectively. If a forward/reverse read contains the corresponding adapter sequence, the sequence pair will be filtered out. If set to `NULL`, no adapter filtering is performed. The number of filtered read pairs are reported in the return value.
`elementsForward`, `elementsReverse`	Character scalars representing the composition of the forward and reverse reads, respectively. The strings should consist only of the letters S (skip), C (constant), U (umi), P (primer), V (variable), and cover the full extent of the read. Most combinations are allowed (and a given letter can appear multiple times), but there can be at most one occurrence of P. If a given letter is included multiple times, the corresponding sequences will be concatenated in the output.
`elementLengthsForward`, `elementLengthsReverse`	Numeric vectors containing the lengths of each read component from `elementsForward`/`elementsReverse`, respectively. If the length of one element is set to -1, it will be inferred from the other lengths (as the remainder of the read). At most one number (or one number on each side of the primer P) can be set to -1. The indicated length of the primer is not used (instead it's inferred from the provided primer sequence) and can also be set to -1.
`primerForward`, `primerReverse`	Character vectors, representing the primer sequence(s) for forward/reverse reads, respectively. Only read pairs that contain perfect matches to both the forward and reverse primers (if given) will be retained. Multiple primers can be specified - they will be considered in order and the first match will be used.
`wildTypeForward`, `wildTypeReverse`	Character scalars or named character vectors, the wild type sequence for the forward and reverse variable region. If given as a single string, the reference sequence will be named 'f' (for forward) or 'r' (for reverse).
`constantForward`, `constantReverse`	Character vectors giving, the expected constant forward and reverse sequences. If more than one sequence is provided, they must all have the same length.
`avePhredMinForward`, `avePhredMinReverse`	Numeric scalar, the minimum average Phred score in the variable region for a read to be retained. If a read pair contains both forward and reverse variable regions, the minimum average Phred score has to be achieved in both for a read pair to be retained.
`variableNMaxForward`, `variableNMaxReverse`	Numeric scalar, the maximum number of Ns allowed in the variable region for a read to be retained.
`umiNMax`	Numeric scalar, the maximum number of Ns allowed in the UMI for a read to be retained.
`nbrMutatedCodonsMaxForward`, `nbrMutatedCodonsMaxReverse`	Numeric scalar, the maximum number of mutated codons that are allowed. Note that for the forward and reverse sequence, respectively, exactly one of `nbrMutatedCodonsMax` and `nbrMutatedBasesMax` must be -1, and the other must be a non-negative number. The one that is not -1 will be used to filter and name the identified mutants.
`nbrMutatedBasesMaxForward`, `nbrMutatedBasesMaxReverse`	Numeric scalar, the maximum number of mutated bases that are allowed. Note that for the forward and reverse sequence, respectively, exactly one of `nbrMutatedCodonsMax` and `nbrMutatedBasesMax` must be -1, and the other must be a non-negative number. The one that is not -1 will be used to filter and name the identified mutants.
`forbiddenMutatedCodonsForward`, `forbiddenMutatedCodonsReverse`	Character vector of codons (can contain ambiguous IUPAC characters, see `IUPAC_CODE_MAP`). If a read pair contains a mutated codon matching this pattern, it will be filtered out.
`useTreeWTmatch`	Logical scalar. Should a tree-based matching to wild type sequences be used if possible? If the number of allowed mismatches is small, and the number of wild type sequences is large, this is typically faster.
`collapseToWTForward`, `collapseToWTReverse`	Logical scalar, indicating whether to just represent the observed variable sequence by the closest wildtype sequence rather than retaining the information about the mutations.
`mutatedPhredMinForward`, `mutatedPhredMinReverse`	Numeric scalar, the minimum Phred score of a mutated base for the read to be retained. If any mutated base has a Phred score lower than `mutatedPhredMin`, the read (pair) will be discarded.
`mutNameDelimiter`	Character scalar, the delimiter used in the naming of mutants. Generally, mutants will be named as XX.YY.NNN, where XX is the closest provided reference sequence, YY is the mutated base or codon number (depending on whether `nbrMutatedBases` or `nbrMutatedCodons` is specified), and NNN is the mutated base or codon. Here, . is the provided `mutNameDelimiter`. The delimiter must be a single character (not "_"), and can not appear in any of the provided reference sequence names.
`constantMaxDistForward`, `constantMaxDistReverse`	Numeric scalars, the maximum allowed Hamming distance between the extracted and expected constant sequence. If multiple constant sequences are provided, the most similar one is used. Reads with a larger distance to the expected constant sequence are discarded. If set to -1, no filtering is done.
`variableCollapseMaxDist`, `variableCollapseMinReads`, `variableCollapseMinRatio`	Deprecated. Collapsing of variable sequences is no longer performed in `digestFastqs`. Please use `collapseMutantsBySimilarity` instead.
`umiCollapseMaxDist`	Numeric scalar defining the tolerances for collapsing similar UMI sequences. If the value is in [0, 1), it defines the maximal Hamming distance in terms of a fraction of sequence length: (`round(umiCollapseMaxDist * nchar(umiSeq))`). A value greater or equal to 1 is rounded and directly used as the maximum allowed Hamming distance.
`filteredReadsFastqForward`, `filteredReadsFastqReverse`	Character scalars, the names of a (pair of) FASTQ file(s) where filtered-out reads will be written. The name(s) should end in .gz (the output will always be compressed). If empty, filtered reads will not be written to a file.
`maxNReads`	Integer scalar, the maximum number of reads to process. The first `maxNReads` read (pairs) in the FASTQ file(s) will be used. If set to -1, all reads in the FASTQ file(s) will be processed.
`verbose`	Logical scalar, whether to print out progress messages.
`nThreads`	Numeric scalar, the number of threads to use for parallel processing.
`chunkSize`	Numeric scalar, the number of read (pairs) to keep in memory for parallel processing. Reduce from the default value if you run out of memory.
`maxReadLength`	Numeric scalar, the maximum allowed read length. Longer read lengths lead to higher memory allocation, and may require the `chunkSize` to be decreased.

Details

The processing of a read pair goes as follows:

Search for perfect matches to forward/reverse adapter sequences, filter out the read pair if a match is found in either the forward or reverse read.
If primer sequences are provided, search for perfect matches, and filter out the read pair if not all provided primer sequences can be found.
Extract the UMI, constant and variable sequence from forward and reverse reads, based on the definition of the respective read composition.
If requested, collapse forward and reverse variable regions by retaining, for each position, the base with the highest reported base quality.
Filter out the read (pair) if the average quality in the variable region is below avePhredMinForward/avePhredMinReverse, in either the forward or reverse read (or the merged read).
Filter out the read (pair) if the number of Ns in the variable region exceeds variableNMaxForward/variableNMaxReverse.
Filter out the read (pair) if the number of Ns in the combined forward and reverse UMI sequence exceeds umiNMax
If one or more wild type sequences (for the variable region) are provided, find the mismatches between the (forward/reverse) variable region and the provided wild type sequence (if more than one wild type sequence is provided, first find the one that is closest to the read).
Filter out the read (pair) if any mutated base has a quality below mutatedPhredMinForward/mutatedPhredMinReverse.
Filter out the read (pair) if the number of mutated codons exceeds nbrMutatedCodonsMaxForward/nbrMutatedCodonsMaxReverse.
Filter out the read (pair) if any of the mutated codons match any of the codons encoded by forbiddenMutatedCodonsForward/forbiddenMutatedCodonsReverse.
Assign a 'mutation name' to the read (pair). This name is a combination of parts of the form XX.YY.NNN, where XX is the name of the most similar reference sequence, YY is the mutated codon number, and NNN is the mutated codon. . is a delimiter, specified via mutNameDelimiter. If no wildtype sequences are provided, the variable sequence will be used as the mutation name'.

Based on the retained reads following this filtering process, count the number of reads, and the number of unique UMIs, for each variable sequence (or pair of variable sequences).

Value

A list with four entries:

summaryTable: A data.frame that contains, for each observed mutation combination, the corresponding variable region sequences (or pair of sequences), the number of observed such sequences, and the number of unique UMIs observed for the sequence. It also has additional columns: 'maxNbrReads' contains the number of reads for the most frequent observed sequence represented by the feature (only relevant if similar variable regions are collapsed). 'nbrMutBases', 'nbrMutCodons' and 'nbrMutAAs' give the number of mutated bases, codons or amino acids in each variant. Alternative variant names based on base, codon or amino acid sequence are provided in columns 'mutantNameBase', 'mutantNameCodon', 'mutantNameAA'. In addition, 'mutantNameBaseHGVS' and 'mutantNameAAHGVS' give base- and amino acid-based names following the HGVS nomenclature (https://varnomen.hgvs.org/). Please note that the provided reference sequence names are used for the HGVS sequence identifiers. It is up to the user to use appropriately named reference sequences in order to obtain valid HGVS variant names.
filterSummary: A data.frame that contains the number of input reads, the number of reads filtered out in the processing, and the number of retained reads. The filters are named according to the convention "fxx_filter", where "xx" indicates the order in which the filters were applied, and "filter" indicates the type of filter. Note that filters are applied successively, and the reads filtered out in one step are not considered for successive filtering steps.
errorStatistics: A data.frame that contains, for each Phred quality score between 0 and 99, the number of bases in the extracted constant sequences with that quality score that match/mismatch with the provided reference constant sequence.
parameters: A list with all parameter settings that were used in the processing. Also contains the version of the package and the time of processing.

Examples

## See the vignette for complete worked-out examples for different types of 
## data sets

## ----------------------------------------------------------------------- ## 
## Process a single-end data set, assume that the full read represents  
## the variable region
out <- digestFastqs(
    fastqForward = system.file("extdata", "cisInput_1.fastq.gz", 
                               package = "mutscan"), 
    elementsForward = "V", elementLengthsForward = -1
)
## Table with read counts and mutant information
head(out$summaryTable)
## Filter summary
out$filterSummary

## ----------------------------------------------------------------------- ## 
## Process a single-end data set, specify the read as a combination of 
## UMI, constant region and variable region (skip the first base)
out <- digestFastqs(
    fastqForward = system.file("extdata", "cisInput_1.fastq.gz", 
                               package = "mutscan"), 
    elementsForward = "SUCV", elementLengthsForward = c(1, 10, 18, 96), 
    constantForward = "AACCGGAGGAGGGAGCTG"
)
## Table with read counts and mutant information
head(out$summaryTable)
## Filter summary
out$filterSummary
## Error statistics
out$errorStatistics

## ----------------------------------------------------------------------- ## 
## Process a single-end data set, specify the read as a combination of 
## UMI, constant region and variable region (skip the first base), provide 
## the wild type sequence to compare the variable region to and limit the 
## number of allowed mutated codons to 1
out <- digestFastqs(
    fastqForward = system.file("extdata", "cisInput_1.fastq.gz", 
                               package = "mutscan"), 
    elementsForward = "SUCV", elementLengthsForward = c(1, 10, 18, 96), 
    constantForward = "AACCGGAGGAGGGAGCTG", 
    wildTypeForward = c(FOS = paste0("ACTGATACACTCCAAGCGGAGACAGACCAACTAGAAGATGAGAAGTC", 
                                     "TGCTTTGCAGACCGAGATTGCCAACCTGCTGAAGGAGAAGGAAAAACTA")),
    nbrMutatedCodonsMaxForward = 1
)
## Table with read counts and mutant information
head(out$summaryTable)
## Filter summary
out$filterSummary
## Error statistics
out$errorStatistics

## ----------------------------------------------------------------------- ## 
## Process a paired-end data set where both the forward and reverse reads 
## contain the same variable region and thus should be merged to generate 
## the final variable sequence, specify the reads as a combination of 
## UMI, constant region and variable region (skip the first and/or last
## base), provide the wild type sequence to compare the variable region to 
## and limit the number of allowed mutated codons to 1
out <- digestFastqs(
    fastqForward = system.file("extdata", "cisInput_1.fastq.gz", 
                               package = "mutscan"),
    fastqReverse = system.file("extdata", "cisInput_2.fastq.gz",
                               package = "mutscan"), 
    mergeForwardReverse = TRUE, revComplForward = FALSE, revComplReverse = TRUE, 
    elementsForward = "SUCV", elementLengthsForward = c(1, 10, 18, 96),
    elementsReverse = "SUCVS", elementLengthsReverse = c(1, 7, 17, 96, -1),
    constantForward = "AACCGGAGGAGGGAGCTG", 
    constantReverse = "GAGTTCATCCTGGCAGC",
    wildTypeForward = c(FOS = paste0("ACTGATACACTCCAAGCGGAGACAGACCAACTAGAAGATGAGAAGTC", 
                                     "TGCTTTGCAGACCGAGATTGCCAACCTGCTGAAGGAGAAGGAAAAACTA")),
    nbrMutatedCodonsMaxForward = 1
)
## Table with read counts and mutant information
head(out$summaryTable)
## Filter summary
out$filterSummary
## Error statistics
out$errorStatistics

## ----------------------------------------------------------------------- ## 
## Process a paired-end data set where the forward and reverse reads 
## contain variable regions corresponding to different proteins, and thus 
## should not be merged, specify the reads as a combination of 
## UMI, constant region and variable region (skip the first base), provide 
## the wild type sequence to compare the variable region to and limit the 
## number of allowed mutated codons to 1
out <- digestFastqs(
    fastqForward = system.file("extdata", "transInput_1.fastq.gz", 
                               package = "mutscan"),
    fastqReverse = system.file("extdata", "transInput_2.fastq.gz",
                               package = "mutscan"), 
    mergeForwardReverse = FALSE,  
    elementsForward = "SUCV", elementLengthsForward = c(1, 10, 18, 96),
    elementsReverse = "SUCV", elementLengthsReverse = c(1, 8, 20, 96),
    constantForward = "AACCGGAGGAGGGAGCTG", 
    constantReverse = "GAAAAAGGAAGCTGGAGAGA",
    wildTypeForward = c(FOS = paste0("ACTGATACACTCCAAGCGGAGACAGACCAACTAGAAGATGAGAAGTC", 
                                     "TGCTTTGCAGACCGAGATTGCCAACCTGCTGAAGGAGAAGGAAAAACTA")),
    wildTypeReverse = c(JUN = paste0("ATCGCCCGGCTGGAGGAAAAAGTGAAAACCTTGAAAGCTCAGAACTC", 
                                     "GGAGCTGGCGTCCACGGCCAACATGCTCAGGGAACAGGTGGCACAGCTT")), 
    nbrMutatedCodonsMaxForward = 1,
    nbrMutatedCodonsMaxReverse = 1
)
## Table with read counts and mutant information
head(out$summaryTable)
## Filter summary
out$filterSummary
## Error statistics
out$errorStatistics

fmicompbio/mutscan documentation built on Feb. 22, 2025, 11:47 a.m.