srFilter: Functions for user-created and built-in ShortRead filters
In ShortRead: FASTQ input and manipulation


These functions create user-defined (srFilter) or built-in instances of SRFilter objects. Filters can be applied to objects from ShortRead, returning a logical vector that can be used to subset the object so that it includes only those components satisfying the filter.

srFilter(fun, name = NA_character_, ...)
## S4 method for signature 'missing'
srFilter(fun, name=NA_character_, ...)
## S4 method for signature 'function'
srFilter(fun, name=NA_character_, ...)

compose(filt, ..., .name)

idFilter(regex=character(0), fixed=FALSE, exclude=FALSE,
         .name="idFilter")
occurrenceFilter(min=1L, max=1L,
                 withSread=c(NA, TRUE, FALSE),
                 duplicates=c("head", "tail", "sample", "none"),
                 .name=.occurrenceName(min, max, withSread,
                                       duplicates))
nFilter(threshold=0L, .name="CleanNFilter")
polynFilter(threshold=0L, nuc=c("A", "C", "T", "G", "other"),
           .name="PolyNFilter")
dustyFilter(threshold=Inf, batchSize=NA, .name="DustyFilter")
srdistanceFilter(subject=character(0), threshold=0L,
                 .name="SRDistanceFilter")

##
## legacy filters for ungapped alignments
##

chromosomeFilter(regex=character(0), fixed=FALSE, exclude=FALSE,
                 .name="ChromosomeFilter")
positionFilter(min=-Inf, max=Inf, .name="PositionFilter")
strandFilter(strandLevels=character(0), .name="StrandFilter")
alignQualityFilter(threshold=0L, .name="AlignQualityFilter")
alignDataFilter(expr=expression(), .name="AlignDataFilter")

`fun`	An object of class `function` to be used as a filter. `fun` must accept a single named argument `x`, and is expected to return a logical vector such that `x[fun(x)]` selects only those elements of `x` satisfying the conditions of `fun`.
`name`	A `character(1)` object to be used as the name of the filter. The `name` is useful for debugging and reference.
`filt`	An `SRFilter` object, to be used with additional arguments to create a composite filter.
`.name`	An optional `character(1)` object used to override the name applied to default filters.
`regex`	Either `character(0)` or a `character(1)` regular expression used as `grep(regex, chromosome(x))` (`chromosomeFilter`) or `grep(regex, id(x))` (`idFilter`) to filter based on chromosome or read identifier. The default (`character(0)`) performs no filtering.
`fixed`	`logical(1)` passed to `grep`, influencing how pattern matching occurs.
`exclude`	`logical(1)` which, when `TRUE`, uses `regex` to exclude, rather than include, reads.
`min`	`numeric(1)`
`max`	`numeric(1)`. For `positionFilter`, `min` and `max` define the closed interval in which position must be found `min <= position <= max`. For `occurrenceFilter`, `min` and `max` define the minimum and maximum number of times a read occurs after the filter.
`strandLevels`	Either `character(0)` or `character(1)` containing strand levels to be selected. `ShortRead` objects have standard strand levels `NA, "+", "-", ""`, with `NA` meaning strand information not available and `""` meaning strand information not relevant.
`withSread`	A `logical(1)` indicating whether uniqueness includes the read sequence (`withSread=TRUE`), is based only on chromosome, position, and strand (`withSread=FALSE`), or only the read sequence (`withSread=NA`), as described for `occurrenceFilter` below.
`duplicates`	Either `character(1)`, a function name, or a function taking a single argument. Influences how duplicates are handled, as described for `occurrenceFilter` below.
`threshold`	A `numeric(1)` value representing a minimum (`srdistanceFilter`, `alignQualityFilter`) or maximum (`nFilter`, `polynFilter`, `dustyFilter`) criterion for the filter. The minima and maxima are closed-interval (i.e., `x >= threshold`, `x <= threshold` for some property `x` of the object being filtered).
`nuc`	A `character` vector containing IUPAC symbols for nucleotides or the value `"other"` corresponding to all non-nucleotide symbols, e.g., `N`.
`batchSize`	`NA` or an `integer(1)` vector indicating the number of DNA sequences to be processed simultaneously by `dustyFilter`. By default, all reads are processed simultaneously. Smaller values use less memory but are computationally less efficient.
`subject`	A `character()` of any length, to be used as the corresponding argument to `srdistance`.
`expr`	An `expression` to be evaluated with `pData(alignData(x))`.
`...`	Additional arguments for subsequent methods; these arguments are not currently used.

srFilter allows users to construct their own filters. The fun argument to srFilter must be a function accepting a single argument x and returning a logical vector that can be used to select elements of x satisfying the filter with x[fun(x)].

The signature(fun="missing") method creates a default filter that returns a vector of TRUE values with length equal to length(x).
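As an illustration, a minimal sketch of a user-defined filter, assuming the ShortRead package is installed. The toy reads, the `minWidth` cutoff, and the filter name "MinWidthFilter" are made up for this example and do not come from the package.

```r
library(ShortRead)

# Three toy reads of different lengths (made-up data)
reads <- ShortRead(
    sread = DNAStringSet(c("ACGT", "ACGTACGTAC", "ACGTAC")),
    id    = BStringSet(c("r1", "r2", "r3")))

# Hypothetical custom filter: keep reads at least `minWidth` bases long.
# The function takes a single argument named `x`, as srFilter requires.
minWidth <- 6L
longEnough <- srFilter(function(x) width(x) >= minWidth,
                       name = "MinWidthFilter")

keep <- longEnough(reads)   # one logical value per read
kept <- reads[keep]         # subset to the reads passing the filter
id(kept)

# With `fun` missing, the default filter passes every read
passAll <- srFilter()
length(reads[passAll(reads)]) == length(reads)
```

Giving the filter a `name` makes it easier to recognise when it is printed or combined with other filters.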

compose constructs a new filter from one or more existing filters. The result is a filter that returns a logical vector with indices corresponding to components of x that pass all filters. If not provided, the name of the filter consists of the names of all component filters, each separated by " o ".
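A short sketch of composing a built-in filter with a custom one, assuming the ShortRead package is installed. The toy reads and the "MinWidthFilter" filter are hypothetical and serve only to show the call pattern.

```r
library(ShortRead)

# Toy reads (made-up data): r1 contains an 'N', r3 is short
reads <- ShortRead(
    sread = DNAStringSet(c("ACGTNCGT", "ACGTACGTAC", "ACGT")),
    id    = BStringSet(c("r1", "r2", "r3")))

noN  <- nFilter()   # built-in filter with its default threshold
long <- srFilter(function(x) width(x) >= 6L,   # hypothetical custom filter
                 name = "MinWidthFilter")

# A read passes the composite only if it passes every component
both <- compose(noN, long)
name(both)                  # default name built from the component names
passed <- reads[both(reads)]
id(passed)
```

Supplying `.name` to compose replaces the default composite name with one of your choosing.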

The remaining functions documented on this page are built-in filters that accept an argument x and return a logical vector of length(x) indicating which components of x satisfy the filter.

idFilter selects elements satisfying grep(regex, id(x), fixed=fixed).

chromosomeFilter selects elements satisfying grep(regex, chromosome(x), fixed=fixed).

positionFilter selects elements satisfying min <= position(x) <= max.

strandFilter selects elements satisfying match(strand(x), strandLevels, nomatch=0) > 0.

occurrenceFilter selects elements that occur >=min and <=max times. withSread determines which fields define an occurrence: TRUE uses the sread, chromosome, strand, and position; FALSE uses chromosome, strand, and position; NA uses only the sread. The default is withSread=NA. duplicates determines how reads that occur more than max times are treated: head selects the first max reads of each set of duplicates, tail the last max reads, and sample a random sample of max reads, while none removes all reads represented more than max times. The user can also provide a function (as used by tapply) of a single argument to select amongst reads.
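The head, tail, and none options can be pictured with a small base R model. This is only a sketch of the selection rule described above, not the ShortRead implementation; the `reads` vector and the names `maxN`, `idx`, and `n` are made up for the illustration.

```r
# Toy read sequences: "ACGT" occurs 3 times, "GGCC" twice, "TTGA" once
reads <- c("ACGT", "ACGT", "ACGT", "TTGA", "GGCC", "GGCC")
maxN  <- 1L

# Position of each read within its set of duplicates, and the set size
idx <- ave(seq_along(reads), reads, FUN = seq_along)
n   <- ave(seq_along(reads), reads, FUN = length)

keepHead <- idx <= maxN       # first maxN reads of each set
keepTail <- idx > n - maxN    # last maxN reads of each set
keepNone <- n <= maxN         # drop every read seen more than maxN times

reads[keepHead]
reads[keepTail]
reads[keepNone]
```

With `maxN` equal to 1, head and tail each retain one representative per distinct read, whereas none retains only reads that were unique to begin with.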

nFilter selects elements with no more than threshold 'N' symbols in each element of sread(x). With the default threshold of 0, only reads without any 'N' pass.

polynFilter selects elements with no more than threshold copies of any nucleotide indicated by nuc.
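A brief sketch of both filters on toy data, assuming the ShortRead package is installed and that threshold is a closed upper bound as stated under Arguments. The reads and the threshold of 4 are made up for the example.

```r
library(ShortRead)

# Toy reads (made-up data): r2 contains two 'N's, r3 is mostly 'A'
reads <- ShortRead(
    sread = DNAStringSet(c("ACGTACGT", "ACGNNCGT", "AAAAAAGT")),
    id    = BStringSet(c("r1", "r2", "r3")))

# Default threshold of 0: keep only reads with no 'N' at all
clean <- reads[nFilter()(reads)]
id(clean)

# Keep reads containing at most 4 'A' nucleotides
notPolyA <- reads[polynFilter(threshold = 4L, nuc = "A")(reads)]
id(notPolyA)
```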

dustyFilter selects elements with high sequence complexity, that is, elements whose dustyScore is no greater than threshold (low-complexity sequences receive high scores). This emulates the dust command from the WindowMasker software. Calculations can be memory intensive; use batchSize to process the argument to dustyFilter in batches of the specified size.

srdistanceFilter selects elements at an edit distance of at least threshold from all sequences in subject.

alignQualityFilter selects elements with alignQuality(x) of at least threshold.

alignDataFilter selects elements with pData(alignData(x)) satisfying expr. expr should be formulated as though it were to be evaluated as eval(expr, pData(alignData(x))).
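The evaluation rule can be tried without any alignment data. A minimal base R sketch, assuming a made-up data frame `df` that stands in for `pData(alignData(x))`; the column names `x` and `y` and their values are hypothetical.

```r
# Stand-in for pData(alignData(x)): one row per read
df <- data.frame(x = c(100, 800, 450),
                 y = c(900, 100, 520))

# The same kind of expression that would be passed as `expr`
expr <- expression(abs(x - 500) > 200 & abs(y - 500) > 200)

# Column names in the expression are looked up in the data frame
keep <- eval(expr, df)
keep
```

Because the columns are resolved inside the data frame, `expr` can refer to them by bare name, just as in the filter.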

srFilter returns an object of class SRFilter.

Built-in filters return a logical vector of length(x), with TRUE indicating components that pass the filter.

Martin Morgan <mtmorgan@fhcrc.org>

SRFilter.

library(ShortRead)

sp <- SolexaPath(system.file("extdata", package="ShortRead"))
aln <- readAligned(sp, "s_2_export.txt") # Solexa export file, as example

# a 'chromosome 5' filter
filt <- chromosomeFilter("chr5.fa")
aln[filt(aln)]
# filter during input
readAligned(sp, "s_2_export.txt", filter=filt)

# x- and y- coordinates stored in alignData, when source is SolexaExport
xy <- alignDataFilter(expression(abs(x-500) > 200 & abs(y-500) > 200))
aln[xy(aln)]

# both filters as a single filter
chr5xy <- compose(filt, xy)
aln[chr5xy(aln)]

# both filters as a collection
filters <- c(filt, xy)
subsetByFilter(aln, filters)
summary(filters, aln)

# read, chromosome, strand, position tuples occurring exactly once
aln[occurrenceFilter(withSread=TRUE, duplicates="none")(aln)]
# reads occurring exactly once
aln[occurrenceFilter(withSread=NA, duplicates="none")(aln)]
# chromosome, strand, position tuples occurring exactly once
aln[occurrenceFilter(withSread=FALSE, duplicates="none")(aln)]

# custom filter: minimum calibrated base call quality >20
goodq <- srFilter(function(x) {
    apply(as(quality(x), "matrix"), 1, min, na.rm=TRUE) > 20
}, name="GoodQualityBases")
goodq
aln[goodq(aln)]