knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
The STRMPS
package is designed to extract and collect the short tandem repeat (STR) information from the fastq
files produced by massively parallel sequencing (MPS).
The STRMPS
-package depends on R
(>= 4.4), methods
, utils
, tidyr
, tibble
, dplyr
, stringr
, purrr
, parallel
, as well as the bioconductor packages Biostrings
, pwalign
, ShortRead
, and IRanges
.
Using the readFastq
function, we can load a data file into R
.
library("Biostrings") library("pwalign") library("ShortRead") readPath <- system.file('extdata', 'sampleSequences.fastq', package = 'STRMPS') sequences <- readFastq(readPath)
sequences@sread
We are interested in extracting the STR regions of these sequences. They are extracted by searching for marker specific sequences in the regions surrounding the STR region -- called the flanking regions. The following is an example of a tibble
containing the necessary flanking region information. NOTE: this is just an example, the following loaded flanking region tibble
is not usable for actual applications.
library("STRMPS") data("flankingRegions")
head(flankingRegions, 5)
The columns of the tibble
contains the following information:
Marker
and Type
are the name and chromosomal type (autosomal, X, or Y) of each marker.ForwardFlank
and ReverseFlank
contain the marker specific sequences used to identify the STR regions. The forward and reverse flanks should occur before and after the STR regions, respectively.Motif
and MotifLength
are the structure and length of the STR regions motif, respectively.Offset
is used to adjust the numeric allele designation so it can be compared to the corresponding CE allele designation. ForwardShift
and ReverseShift
(not shown) are used to trim the extracted sequences to just include the STR regions. If these are set to zero, then the extracted regions contain some useful flanking region information.Using the flankingRegions
file, we can identify the STR regions of the sequences by calling the identifySTRRegions
function. The function takes four arguments:
reads
: The sequences to search through.flankingRegions
: A tibble
or data.frame
in the same style as shown above.numberOfMutation
: The number of allowed mismatches, when searching the sequences.control
: A control object setting additional parameters (to see more type ?identifySTRRegions.control
).identifiedSTRs <- identifySTRRegions( reads = sequences, flankingRegions = flankingRegions, numberOfMutation = 1, control = identifySTRRegions.control( numberOfThreads = 1, includeReverseComplement = FALSE ) )
The function returns a list with the following:
n_reads
: The total number of reads in the supplied reads
object.reverseComplement
: TRUE/FALSE value -- did we search for the reverse complementary flanking regions?identifiedMarkers
: A list containing indecies, and the start and end positions of STR regions for each of the markers in the flankingRegiions
object. identifiedMarkersSequencesUniquelyAssigned
: Almost identical to identifiedMarkers
, but if multiple markers have been found in a sequences it is disregarded.remainingSequences
: A vector of indecies of sequences where no marker was identified.An example of what is stored in the identifiedMarkersSequencesUniquelyAssigned
list
names(identifiedSTRs$identifiedMarkersSequencesUniquelyAssigned$CSF1PO)
That is, we find the name of the marker, the indecies of the sequences identified as belonging to the marker, the start of the forward flank, the end of the reverse flank, and then trimmed versions of the sequences, where 'junk' has been removed (any information outside the searched for flanking sequences).
Given the identified STR regions, we can aggregate the identified strings, to get their coverage. When doing so, we also supply the length of the motifs of every marker, and their type, i.e.\ whether the marker is located on an autosomal, the X, or the Y chromosome. In order to supply the motif length and type information, we need to know which markers were actually identified, and in which order they were identified.
This could be done as follows:
sortedIncludedMarkers <- sapply( names(identifiedSTRs$identifiedMarkersSequencesUniquelyAssigned), function(m) which(m == flankingRegions$Marker) )
The aggregation is then performed by calling the stringCoverage
function with the arguments extractedReadsListObject
, motifLength
, Type
, and control
. In the control
argument, we pass an argument called simpleReturn
. If this argument is set to FALSE
, the aggregation is performed based on both the STR region and the forward and reverse flanks. If it is set to TRUE
, the aggregation is just performed based on the STR regions. That is, if it is FALSE
, we will find more distinct strings, with lower coverage. Note that when setting simpleReturn = TRUE
, the quality is also summarised by the geometric average for each base.
stringCoverageList <- stringCoverage( extractedReadsListObject = identifiedSTRs, flankingRegions = flankingRegions, control = stringCoverage.control( numberOfThreads = 1, trace = FALSE, simpleReturn = TRUE ) )
The stringCoverage
returns a list of tibbles
one for each of the identified markers. The result can be seen here:
stringCoverageList$CSF1PO
We see on marker CSF1PO
, we have found three distinct markers two with a coverage above 950, and one with a coverage of 37. If the sample only contained a single contributor, we would expect that the two strings with a coverage of above 950 to be the alleles of the individual. While the string with low coverage is either a stutter (as it is in the stutter position), or an error.
If the aggregated stringCoverageList
-object, contains DNA from a single contributor and was made with a large amount of input DNA, we can determine the genotype of the contributor. A way of doing so is to identify the string with the largest coverage, and determine whether or not another allele is within some pre-defined heterozygous threshold (by default this is set to $0.35$).
We determine the genotype of each marker, by using the getGenotype
function.
genotypeList <- getGenotype(stringCoverageList)
The created genotypeList
-object contains the genotypes determined by the thresholdHeterozygosity
. Continuing the example from before, the genotypes of the marker CSF1PO
are:
genotypeList$CSF1PO
We see that the getGenotype
function has identified the two strings with coverage above 950, as the potential alleles of the marker for the contributor of the sample. Furthermore, note that the function adds a column called Indices
containing the indices of the corresponding strings in the stringCoverageList
-object. That is, in this example, Indices
contains the values 1 and 3.
In a similar way to the genotype identification, we implemented simple rules for noise identification. A string is classified as noise if it is less than the coverage of the most prevalent string times thresholdSignal
. By default this factor is set to $1\%$.
noiseList <- identifyNoise(stringCoverageList, thresholdSignal = 0.03)
The identifyNoise
function returns a list with an element for every marker in the stringCoverageList
-object. If an observation is classified as noise, it is then removed from the data. Returning to our example from before, marker CSF1PO
has no strings with less than $1\%$ of the coverage of the most prevalent string. That is, the following still contains three distinct sequences:
noiseList$CSF1PO
Note that as with the getGenotype
function, the identifyNoise
function adds a column called Indices
containing the indices of the corresponding strings in the stringCoverageList
-object. That is, in this example, Indices
contains the values 1, 2, and 3.
Stuttering refers to the loss (or gain) of a motif in the allele sequence during PCR amplification. A predictor of the rate of stuttering (quantified by the stutter ratio or stutter proportion), is the Block Length of the Missing Motif (BLMM). If we compare the allele to a stutter sequence, then we can identify the approximate location (down to the nearest block) of the motif which stuttered. The BLMM is the length of the block (sub-sequence), which has lost a motif.
The findStutter
function for every called allele, found with the getGenotype
function, finds all potential stutters of the alleles and calculates the stutter ratio and stutter proportions between the allele and potential stutter. It also identifies the missing motif and finds the length of the corresponding block, as well as the longest uninterrupted stretch (LUS), i.e. the length of the longest block. It takes a merge of the stringCoverageList
and the genotypeList
created using the mergeGenotypeStringCoverage
function.
stringCoverageGenotypeList <- mergeGenotypeStringCoverage(stringCoverageList, genotypeList) stutterList <- findStutter(stringCoverageGenotypeList) stutterTibble <- subset(do.call("rbind", stutterList), !is.na(Genotype)) head(stutterTibble, 5)
Note that the output also contains a series of flags which could be useful:
FLAGStutterIdentifiedMoreThanOnce
: Is the stutter string been identified as a stutter for more than one allele?FLAGMoreThanTwoAlleles
: Were more than two alleles provided by the merged stringCoverageGenotypeList
object?FLAGAlleleDifferenceOne
: Was the difference between the two alleles one motif? If TRUE
the shorter will appear as a stutter of the longer and, thereby, skew the results of any analysis. These cases should be removed. FLAGMoreThanOneBlock
: Was more than one block identified when comparing the potenetial stutter and allele sequences?FLAGBlocksWithDifferentLengths
: If FLAGMoreThanOneBlock
was TRUE
, did the identified blocks have different lengths? Note: This should not be possible in theory, but if it happens it is at most a difference of one and the longest is always chosen (as it should be more likely).Instead of calling all the above functions, we can call the workflow function STRMPSWorkflow
. The function takes a path to a file, and either returns a series of .RData
files in the provided output
directory, or it returns the stringCoverageList
object created by the stringCoverage
function. The function can also continue from previously created files using the continueCheckpoint
argument (if the files are placed in the output
directory).
STRMPSWorkflow( read_path, control = workflow.control( restrictType = "Autosomal", numberOfThreads = 1 ) )
There also exists a batch version of the function called STRMPSWorkflowBatch
which takes an input
directory containing the .fastq
files to be analysed.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.