selex.seqfilter: Create a sequence filter
In SELEX: Functions for analyzing SELEX-seq data


Creates a sequence filter object that includes or excludes sequences from being counted or displayed. Filters are written as Java regular expressions and are accepted by several functions within the package.

selex.seqfilter(variableRegionIncludeRegex=NULL, 
  variableRegionExcludeRegex=NULL, variableRegionGroupRegex=NULL, 
  kmerIncludeRegex=NULL, kmerExcludeRegex=NULL, kmerIncludeOnly=NULL,
  viewIncludeRegex=NULL, viewExcludeRegex=NULL, viewIncludeOnly=NULL)

`variableRegionIncludeRegex`	Include reads with variable regions containing this regular expression.
`variableRegionExcludeRegex`	Exclude reads with variable regions containing this regular expression.
`variableRegionGroupRegex`	Select subsequences of variable regions matching this regular expression.
`kmerIncludeRegex`	Perform K-mer counting on variable regions containing this regular expression.
`kmerExcludeRegex`	Perform K-mer counting on variable regions not containing this regular expression.
`kmerIncludeOnly`	Perform K-mer counting on variable regions exactly matching this regular expression.
`viewIncludeRegex`	Display K-mers containing this regular expression.
`viewExcludeRegex`	Display K-mers not containing this regular expression.
`viewIncludeOnly`	Display K-mers exactly matching this regular expression.
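The difference between "containing" and "exactly matching" in the descriptions above can be illustrated with base R. This is only a sketch: it uses `grepl` rather than the package's Java regex engine, the K-mers are made up, and emulating an exact match with `^...$` anchors is an assumption about how the `IncludeOnly` options behave.

```r
# Hypothetical K-mers, for illustration only.
kmers <- c("TGTA", "ATGC", "TGT")

# "Containing": the pattern may match anywhere inside the sequence.
contains_tgt <- grepl("TGT", kmers)

# "Exactly matching": the whole sequence must match the pattern.
# Emulated here with anchors (an assumption, not the package's code).
exactly_tgt <- grepl("^TGT$", kmers)

print(data.frame(kmer = kmers, contains = contains_tgt, exact = exactly_tgt))
```
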

The filters created by selex.seqfilter act at three different stages of the K-mer counting process: read filtering, variable region filtering, and K-mer filtering. The table below lists the options that belong to each stage.

Read Filtering	Variable Region Filtering	K-mer Filtering
`variableRegionIncludeRegex`	`kmerIncludeRegex`	`viewIncludeRegex`
`variableRegionExcludeRegex`	`kmerExcludeRegex`	`viewExcludeRegex`
`variableRegionGroupRegex`	`kmerIncludeOnly`	`viewIncludeOnly`

Read filtering includes or excludes reads from the FASTQ file, acting in addition to the filters used to extract the variable regions. For example, consider an experimental design where the left barcode is TGG, the right barcode is TTAGC, and the variable region length is 10. FASTQ reads are rejected unless they have the correct format; the sequences below represent hypothetical FASTQ reads:

5' TGG NNNNNNNNNN TTAGC 3'		template
5' TCG ATCAGTGGAC TTAGC 3'		fails (left match failed)
5' TGG NAGGTCAGAC TTAGC 3'		fails (indeterminate base in variable region)
5' TGG ATCAGTGGAC TTAGC 3'		passes
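The template check on the hypothetical reads above can be approximated in base R. This is a minimal sketch, not the package's extraction code: the anchored pattern and the variable names (`reads`, `template`, `passes`, `variable_regions`) are assumptions made for illustration, and base R regular expressions stand in for Java ones.

```r
# The three hypothetical reads from the example, with spaces removed.
reads <- c("TCGATCAGTGGACTTAGC",
           "TGGNAGGTCAGACTTAGC",
           "TGGATCAGTGGACTTAGC")

# Left barcode TGG, a 10-bp variable region made of determinate bases,
# right barcode TTAGC. The variable region is the capture group.
template <- "^TGG([ACGT]{10})TTAGC$"

# TRUE for reads that have the expected format.
passes <- grepl(template, reads)

# Keep only the captured variable region of the reads that pass.
variable_regions <- sub(template, "\\1", reads[passes])

print(data.frame(read = reads, passes = passes))
print(variable_regions)
```

Only the third read passes, and its variable region ATCAGTGGAC is what the read filter options would then be applied to.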

The read filter options then act as additional filters on the 10-bp variable region. variableRegionGroupRegex additionally selects a substring from the variable region itself. Using the same example, variableRegionGroupRegex could be used to select 5-mer regions flanked by AA on the left and the right (that is, AA NNNNN AA):

5' TCG ATCAGTGGAC TTAGC 3'		fails template (left match failed)
5' TGG ATCAGTGGAC TTAGC 3'		passes template, fails filter (no match)
5' TGG TAAGTGCCAA TTAGC 3'		passes template and filter

When the variableRegionGroupRegex filter matches, only the subsequence will be used in future counting. In the above example, this would be GTGCC.
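The selection of the flanked 5-mer can be sketched in base R as follows. The pattern `AA([ACGT]{5})AA` is one possible way to write the AA NNNNN AA filter and is an assumption, as are the variable names; the package evaluates the expression with Java's regex engine, which agrees with base R for a pattern this simple.

```r
# Variable regions of the two reads that pass the template in the example.
variable_regions <- c("ATCAGTGGAC", "TAAGTGCCAA")

# A 5-mer flanked by AA on both sides; the capture group is the 5-mer.
group_regex <- "AA([ACGT]{5})AA"

# regexec() reports the full match followed by each capture group;
# a variable region without a match yields character(0).
matches <- regmatches(variable_regions, regexec(group_regex, variable_regions))

# Take the first capture group, or NA when the filter does not match.
selected <- vapply(matches,
                   function(m) if (length(m) >= 2) m[2] else NA_character_,
                   character(1))
names(selected) <- variable_regions
print(selected)
```

Here ATCAGTGGAC yields NA (no match), while TAAGTGCCAA yields GTGCC.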

After the variable regions have been extracted from the FASTQ file, the next step involves K-mer counting. Variable region filtering comes into play here, allowing or preventing K-mer counting on these sequences. Lastly, K-mer filtering determines what K-mers are returned or displayed in tables.
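The order of the last two stages can be mimicked in base R. This is a rough sketch under several assumptions: the variable regions are invented, K-mers are counted with a simple sliding window, and `grepl` stands in for the `kmerIncludeRegex` and `viewIncludeRegex` options. The helper `kmers_of` is hypothetical and not part of the package.

```r
# Variable regions assumed to have already passed read filtering.
variable_regions <- c("ATCAGTGGAC", "TAAGTGCCAA", "ATCAGTTTAC")

# Variable region filtering (analogous to kmerIncludeRegex):
# only count K-mers on variable regions that start with A.
keep <- grepl("^A", variable_regions)

# Hypothetical helper: all K-mers of a sequence via a sliding window.
kmers_of <- function(s, k) {
  starts <- seq_len(nchar(s) - k + 1)
  substring(s, starts, starts + k - 1)
}

# K-mer counting on the variable regions that were kept.
k <- 4
counts <- table(unlist(lapply(variable_regions[keep], kmers_of, k = k)))

# K-mer filtering (analogous to viewIncludeRegex):
# only display K-mers that contain CAG.
shown <- counts[grepl("CAG", names(counts))]
print(shown)
```

The variable region filter changes which sequences contribute to the counts, whereas the K-mer filter only changes which rows of the finished table are shown.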

Any function that uses sequence filters will recompute its results if, for a given sample, new values are provided for the read filtering or variable region filtering options.

selex.seqfilter returns a sequence filter object.

selex.affinities, selex.counts, selex.getSeqfilter, selex.infogain, selex.kmax, selex.mm

# r0 is assumed to be a sample handle created beforehand, e.g. with selex.sample

# Raw K-mer counts
my.counts1 = selex.counts(sample=r0, k=16, top=100)

# Include reads whose variable regions begin with TGTA
regex = selex.seqfilter(variableRegionIncludeRegex="^TGTA") 
my.counts2 = selex.counts(sample=r0, k=16, top=100, seqfilter=regex)

# Exclude reads whose variable regions begin with TGT
regex = selex.seqfilter(variableRegionExcludeRegex="^TGT")
my.counts3 = selex.counts(sample=r0, k=16, top=100, seqfilter=regex)

# Extract 13-bp substring from reads whose variable regions begin with TGT
regex = selex.seqfilter(variableRegionGroupRegex="^TGT([ACGT]{13})")
my.counts4 = selex.counts(sample=r0, k=13, top=100, seqfilter=regex)

# Extract 5-bp substring from reads whose variable regions begin with TGT
regex = selex.seqfilter(variableRegionGroupRegex="^TGT([ACGT]{5})")
my.counts5 = selex.counts(sample=r0, k=5, top=100, seqfilter=regex)

# Select variable regions beginning with A and ending with G
regex = selex.seqfilter(kmerIncludeRegex="^A.{14}G") 
my.counts6 = selex.counts(sample=r0, k=16, top=100, seqfilter=regex)

# Exclude variable regions beginning with A and ending with G 
regex = selex.seqfilter(kmerExcludeRegex="^A.{14}G") 
my.counts7 = selex.counts(sample=r0, k=16, top=100, seqfilter=regex)

# Exclude variable regions beginning with A and ending with G, and display
# 16-mers that start and end with T
regex = selex.seqfilter(kmerExcludeRegex="^A.{14}G", 
  viewIncludeRegex="^T[ACTG]{14}T") 
my.counts8 = selex.counts(sample=r0, k=16, top=100, seqfilter=regex)

# Exclude variable regions beginning with A and ending with G, and display
# 16-mers that do not start and end with T
regex = selex.seqfilter(kmerExcludeRegex="^A.{14}G", 
  viewExcludeRegex="^T[ACTG]{14}T") 
my.counts9 = selex.counts(sample=r0, k=16, top=100, seqfilter=regex)

# Only count variable regions exactly matching TGTAAAATCAGTGCTG or TGTAAGTGGACTCTCG
regex = selex.seqfilter(kmerIncludeOnly=c('TGTAAAATCAGTGCTG', 
  'TGTAAGTGGACTCTCG')) 
my.counts10 = selex.counts(sample=r0, k=16, top=100, seqfilter=regex)

# Only display results for the K-mers TGTAAAATCAGTGCTG and TGTAAGTGGACTCTCG
regex = selex.seqfilter(viewIncludeOnly=c('TGTAAAATCAGTGCTG', 
  'TGTAAGTGGACTCTCG')) 
my.counts11 = selex.counts(sample=r0, k=16, top=100, seqfilter=regex)