TrimDNA: Trims DNA Sequences to the High Quality Region Between...

Description Usage Arguments Details Value Author(s) See Also Examples

View source: R/TrimDNA.R

Description

Aids in trimming DNA sequences to the high quality region between a set of patterns that are potentially present on the left and right sides.

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
TrimDNA(myDNAStringSet,
        leftPatterns,
        rightPatterns,
        type = "ranges",
        quality = NULL,
        maxDistance = 0.1,
        minOverlap = 5,
        allowInternal = TRUE,
        alpha = 0.1,
        threshold = 0.01,
        maxAverageError = threshold,
        maxAmbiguities = 0.1,
        minWidth = 36,
        verbose = TRUE)

Arguments

myDNAStringSet

A DNAStringSet object containing the sequences to be trimmed.

leftPatterns

A DNAStringSet or character vector of patterns to remove from the left side of myDNAStringSet, or "" to prevent trimming patterns on the left.

rightPatterns

A DNAStringSet or character vector of patterns to remove from the right side of myDNAStringSet, or "" to prevent trimming patterns on the right.

type

Character string indicating the type of results desired. This should be (an abbreviation of) either "ranges", "sequences" or "both".

quality

Either NULL (the default) to skip quality trimming, or a PhredQuality, SolexaQuality, or IlluminaQuality object containing the quality scores corresponding to myDNAStringSet.

maxDistance

Numeric between zero and one giving the maximum distance of a match from the leftPatterns and rightPatterns to initiate trimming. For example, 0.1 (the default) would allow up to 10% mismatches between a pattern and sequence.

minOverlap

Integer specifying the minimum number of nucleotides the leftPatterns and rightPatterns must overlap a sequence to initiate trimming.

allowInternal

Logical initiating whether to search for the leftPatterns and rightPatterns within myDNAStringSet, or (FALSE for) only overlapping the ends.

alpha

Numeric between zero and one giving the smoothing parameter for an exponential moving average that is applied to the quality scores before trimming. Higher values result in less smoothing than lower values.

threshold

Numeric between zero and one specifying the threshold above which to trim the poor quality regions of the sequence. Higher values allow more sequence to be preserved at the expense of a greater error rate.

maxAverageError

Numeric between zero and threshold indicating the maximum average error rate of the trimmed region of the sequence. Trimmed sequences with average error rates above maxAverageError will be rejected. Note that the expected number of errors in a sequence is equal to the average error rate multiplied by the length of the sequence.

maxAmbiguities

Numeric between zero and one giving the maximum fraction of ambiguous (e.g., "N") positions that are tolerated within the trimmed region of the sequence. Trimmed sequences with a greater fraction of ambiguities than maxAmbiguities will be rejected.

minWidth

Integer giving the minimum number of nucleotides a pattern must overlap the sequence to initiate trimming.

verbose

Logical indicating whether to display progress.

Details

After a sequencing run, it is often necessary to trim the resulting sequences to the high quality region located between a set of patterns. TrimDNA works as follows: first left and right patterns are identified within the sequences if allowInternal is TRUE (the default). If the patterns are not found internally, then a search is conducted at the flanking ends for patterns that partially overlap the sequence. The region between the leftPatterns and rightPatterns is then returned, unless quality information is provided. Note that the patterns must be in the same orientation as the sequence, which may require using the reverseComplement of a PCR primer.

If quality contains quality scores, these are converted to error probabilities and an exponential moving average is applied to smooth the signal. The longest region between the leftPatterns and rightPatterns where the average error probability is below threshold is then returned, so long as it has an average error rate of at most maxAverageError. Note that it is possible to only filter by maxAverageError by setting threshold to 1, or vise-versa by setting maxAverageError to threshold.

Value

TrimDNA can return two types of results: IRanges that can be used for trimming myDNAStringSet, or a trimmed DNAStringSet containing only those sequences over minWidth nucleotides after trimming. Note that ambiguity codes (IUPAC_CODE_MAP) are supported in the leftPatterns and rightPatterns, but not in myDNAStringSet to prevent trivial matches (e.g., runs of N's).

If type is "ranges" (the default) the output is an IRanges object with the start, end, and width of every sequence. This information can be accessed with the corresponding accessor function (see examples below). Note that the start will be 1 and the end will be 0 for sequences that were not at least minWidth nucleotides after trimming.

If type is "sequences" then the trimmed sequences are returned that are at least minWidth nucleotides in length.

If type is "both" the output is a list of two components, the first containing the ranges and the second containing the sequences.

Author(s)

Erik Wright eswright@pitt.edu

See Also

CorrectFrameshifts

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
# simple example of trimming a single sequence
dna <- DNAStringSet("AAAAAAAAAATTACTTCCCCCCCCCC")
qscores <- PhredQuality("0000000000AAAAAAAAAAAAAAAA")

x <- TrimDNA(dna,
            leftPatterns="AAAAAA",
            rightPatterns="CCCCCC",
            quality=qscores,
            minWidth=1,
            allowInternal=TRUE,
            type="both")

x[[1]]
start(x[[1]])
end(x[[1]])
width(x[[1]])

x[[2]]

# example of trimming a FASTQ file by quality scores
fpath <- system.file("extdata",
	"s_1_sequence.txt",
	package="Biostrings")
reads <- readDNAStringSet(fpath, "fastq", with.qualities=TRUE)
TrimDNA(reads,
	leftPatterns="",
	rightPatterns="",
	type="sequences",
	quality=PhredQuality(mcols(reads)$qualities))

Example output

Loading required package: Biostrings
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package:BiocGenericsThe following objects are masked frompackage:parallel:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked frompackage:stats:

    IQR, mad, sd, var, xtabs

The following objects are masked frompackage:base:

    anyDuplicated, append, as.data.frame, basename, cbind, colnames,
    dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
    grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
    order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
    rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
    union, unique, unsplit, which.max, which.min

Loading required package: S4Vectors
Loading required package: stats4

Attaching package:S4VectorsThe following object is masked frompackage:base:

    expand.grid

Loading required package: IRanges
Loading required package: XVector

Attaching package:BiostringsThe following object is masked frompackage:base:

    strsplit

Loading required package: RSQLite
Finding left pattern: 100% internal, 0% flanking

Finding right pattern: 100% internal, 0% flanking

Trimming by quality score: 100% left, 0% right

Time difference of 0.06 secs

IRanges object with 1 range and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]        15        16         2
[1] 15
[1] 16
[1] 2
DNAStringSet object of length 1:
    width seq
[1]     2 TT
Trimming by quality score: 0% left, 0% right

Time difference of 0.05 secs

DNAStringSet object of length 256:
      width seq                                             names               
  [1]    36 GGACTTTGTAGGATACCCTCGCTTTCCTTCTCCTGT            HWI-EAS88_1_1_1_1...
  [2]    36 GATTTCTTACCTATTAGTGGTTGAACAGCATCGGAC            HWI-EAS88_1_1_1_8...
  [3]    36 GCGGTGGTCTATAGTGTTATTAATATCAATTTGGGT            HWI-EAS88_1_1_1_9...
  [4]    36 GTTACCATGATGTTATTTCTTCATTTGGAGGTAAAA            HWI-EAS88_1_1_1_8...
  [5]    36 GTATGTTTCTCCTGCTTATCACCTTCTTGAAGGCTT            HWI-EAS88_1_1_1_9...
  ...   ... ...
[252]    36 GTTTAGATATGAGTCACATTTTGTTCATGGTAGAGT            HWI-EAS88_1_1_1_6...
[253]    36 GTTTTACAGACACCTAAAGCTACATCGTCAACGTTA            HWI-EAS88_1_1_1_1...
[254]    36 GATGAACTAAGTCAACCTCAGCACTAACCTTGCGAG            HWI-EAS88_1_1_1_3...
[255]    36 GTTTGGTTCGCTTTGAGTCTTCTTCGGTTCCGACTA            HWI-EAS88_1_1_1_8...
[256]    36 GCAATCTGCCGACCACTCGCGATTCAATCATGACTT            HWI-EAS88_1_1_1_8...

DECIPHER documentation built on Nov. 8, 2020, 8:30 p.m.