selex.seqfilter: Create a sequence filter

Description

A function used to create a sequence filter object to conveniently and precisely include or exclude sequences from being counted or displayed. The filters are formed using Java regular expressions and can be used by a variety of functions within the package.

Usage

selex.seqfilter(variableRegionIncludeRegex=NULL, 
  variableRegionExcludeRegex=NULL, variableRegionGroupRegex=NULL, 
  kmerIncludeRegex=NULL, kmerExcludeRegex=NULL, kmerIncludeOnly=NULL,
  viewIncludeRegex=NULL, viewExcludeRegex=NULL, viewIncludeOnly=NULL)

Arguments

variableRegionIncludeRegex

Include reads with variable regions containing this regular expression.

variableRegionExcludeRegex

Exclude reads with variable regions containing this regular expression.

variableRegionGroupRegex

Select subsequences of variable regions matching this regular expression.

kmerIncludeRegex

Perform K-mer counting on variable regions containing this regular expression.

kmerExcludeRegex

Perform K-mer counting on variable regions not containing this regular expression.

kmerIncludeOnly

Perform K-mer counting on variable regions exactly matching this regular expression.

viewIncludeRegex

Display K-mers containing this regular expression.

viewExcludeRegex

Display K-mers not containing this regular expression.

viewIncludeOnly

Display K-mers exactly matching this regular expression.
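
The arguments above distinguish between sequences that contain a pattern and sequences that exactly match it. The following is a minimal sketch of that distinction using base R's grepl, not the package itself. The package evaluates filters as Java regular expressions, so this assumes the simple pattern below behaves the same in both engines. The sequences and the helper exact_match are hypothetical and exist only for illustration.

```r
# Hypothetical sequences for illustration only.
seqs <- c("TGTAAAAT", "ATGTAAAA", "TGTA")
pattern <- "TGTA"

# "Containing": the pattern may match anywhere in the sequence.
contains <- grepl(pattern, seqs)

# "Exactly matching": the whole sequence must match the pattern,
# emulated here by anchoring the pattern at both ends.
exact_match <- function(pattern, x) {
  grepl(paste0("^(?:", pattern, ")$"), x, perl = TRUE)
}
exact <- exact_match(pattern, seqs)

print(data.frame(seqs, contains, exact))
```

All three sequences contain TGTA, but only the last one matches it exactly.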

Details

The filters described by selex.seqfilter are used to filter sequences in the different stages of the K-mer counting process: read filtering, variable region filtering, and K-mer filtering.

Read Filtering               Variable Region Filtering   K-mer Filtering
---------------------------  --------------------------  ----------------
variableRegionIncludeRegex   kmerIncludeRegex            viewIncludeRegex
variableRegionExcludeRegex   kmerExcludeRegex            viewExcludeRegex
variableRegionGroupRegex     kmerIncludeOnly             viewIncludeOnly

Read filtering includes or excludes reads from the FASTQ file, acting in addition to the filters used to extract the variable regions. For example, consider an experimental design where the left barcode is TGG, the right barcode is TTAGC, and the variable region length is 10. FASTQ reads are rejected unless they have the correct format; the sequences below represent hypothetical FASTQ reads:

5' TGG NNNNNNNNNN TTAGC 3' template
5' TCG ATCAGTGGAC TTAGC 3' fails (left match failed)
5' TGG NAGGTCAGAC TTAGC 3' fails (indeterminate base in variable region)
5' TGG ATCAGTGGAC TTAGC 3' passes

The read filter options then act as additional filters on the 10-bp variable region. variableRegionGroupRegex has the added functionality of selecting substrings from the variable region itself. Using the same example, variableRegionGroupRegex could be used to select 5-mer regions flanked by AA on the left and the right (or AA NNNNN AA):

5' TCG ATCAGTGGAC TTAGC 3' fails template (left match failed)
5' TGG ATCAGTGGAC TTAGC 3' passes template, fails filter (no match)
5' TGG TAAGTGCCAA TTAGC 3' passes template and filter

When the variableRegionGroupRegex filter matches, only the subsequence will be used in future counting. In the above example, this would be GTGCC.
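
The template check and group selection described above can be sketched in base R. This only illustrates the logic and is not how the package extracts reads. The names template and group_regex and the exact pattern strings are assumptions, written with a capture group in the style of the package examples, and the reads are the hypothetical ones from the text with the spaces removed.

```r
# Hypothetical reads from the example above, without spaces.
reads <- c("TCGATCAGTGGACTTAGC",  # wrong left barcode
           "TGGNAGGTCAGACTTAGC",  # indeterminate base in the variable region
           "TGGATCAGTGGACTTAGC",  # passes template, no AA NNNNN AA match
           "TGGTAAGTGCCAATTAGC")  # passes template and filter

# Template: left barcode TGG, 10-bp variable region of A/C/G/T,
# right barcode TTAGC.
template <- "^TGG([ACGT]{10})TTAGC$"
passes_template <- grepl(template, reads)

# Keep only the variable region of the reads that fit the template.
variable_regions <- sub(template, "\\1", reads[passes_template])

# Group filter: capture the 5-mer flanked by AA on both sides.
group_regex <- "AA([ACGT]{5})AA"
hits <- regmatches(variable_regions, regexec(group_regex, variable_regions))

# Each hit holds the full match followed by the captured group;
# regions without a match contribute nothing.
selected <- unlist(lapply(hits, function(h) {
  if (length(h) == 2) h[2] else character(0)
}))

print(selected)
```

Only the last read survives both steps, and the subsequence kept for counting is GTGCC.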

After the variable regions have been extracted from the FASTQ file, the next step involves K-mer counting. Variable region filtering comes into play here, allowing or preventing K-mer counting on these sequences. Lastly, K-mer filtering determines what K-mers are returned or displayed in tables.
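
The order of the last two stages can be sketched as follows. This is a base R illustration under stated assumptions, not the package's implementation: it assumes K-mers are counted with a simple sliding window over each variable region, and the variable regions, the value of k, both patterns, and the helper kmers_of are hypothetical.

```r
# Hypothetical variable regions already extracted from the reads.
variable_regions <- c("ATCAGTGGAC", "TAAGTGCCAA", "ATCAGTGGAC")
k <- 4

# Variable region filtering: only count K-mers on regions starting with A.
counted_regions <- variable_regions[grepl("^A", variable_regions)]

# K-mer counting: slide a window of width k across each kept region.
kmers_of <- function(s, k) {
  starts <- seq_len(nchar(s) - k + 1)
  substring(s, starts, starts + k - 1)
}
counts <- table(unlist(lapply(counted_regions, kmers_of, k = k)))

# K-mer filtering: only display K-mers that start with G.
shown <- counts[grepl("^G", names(counts))]

print(shown)
```

The variable region filter changes which regions contribute to the counts, while the K-mer filter only changes which of the counted K-mers are reported.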

Any function that uses sequence filters will recompute its results if, for a given sample, new values are provided for the read filtering or variable region filtering options.

Value

selex.seqfilter returns a sequence filter object.

See Also

selex.affinities, selex.counts, selex.getSeqfilter, selex.infogain, selex.kmax, selex.mm

Examples

# Raw K-mer counts
my.counts1 = selex.counts(sample=r0, k=16, top=100)

# Include reads whose variable regions begin with TGTA
regex = selex.seqfilter(variableRegionIncludeRegex="^TGTA") 
my.counts2 = selex.counts(sample=r0, k=16, top=100, seqfilter=regex)

# Exclude reads whose variable regions begin with TGT
regex = selex.seqfilter(variableRegionExcludeRegex="^TGT")
my.counts3 = selex.counts(sample=r0, k=16, top=100, seqfilter=regex)

# Extract 13-bp substring from reads whose variable regions begin with TGT
regex = selex.seqfilter(variableRegionGroupRegex="^TGT([ACGT]{13})")
my.counts4 = selex.counts(sample=r0, k=13, top=100, seqfilter=regex)

# Extract 5-bp substring from reads whose variable regions begin with TGT
regex = selex.seqfilter(variableRegionGroupRegex="^TGT([ACGT]{5})")
my.counts5 = selex.counts(sample=r0, k=5, top=100, seqfilter=regex)

# Select variable regions beginning with A and ending with G
regex = selex.seqfilter(kmerIncludeRegex="^A.{14}G") 
my.counts6 = selex.counts(sample=r0, k=16, top=100, seqfilter=regex)

# Exclude variable regions beginning with A and ending with G 
regex = selex.seqfilter(kmerExcludeRegex="^A.{14}G") 
my.counts7 = selex.counts(sample=r0, k=16, top=100, seqfilter=regex)

# Exclude variable regions beginning with A and ending with G, and display
# 16-mers that start and end with T
regex = selex.seqfilter(kmerExcludeRegex="^A.{14}G", 
  viewIncludeRegex="^T[ACTG]{14}T") 
my.counts8 = selex.counts(sample=r0, k=16, top=100, seqfilter=regex)

# Exclude variable regions beginning with A and ending with G, and display
# 16-mers that do not start and end with T
regex = selex.seqfilter(kmerExcludeRegex="^A.{14}G", 
  viewExcludeRegex="^T[ACTG]{14}T") 
my.counts9 = selex.counts(sample=r0, k=16, top=100, seqfilter=regex)

# Only count variable regions exactly matching TGTAAAATCAGTGCTG or TGTAAGTGGACTCTCG
regex = selex.seqfilter(kmerIncludeOnly=c('TGTAAAATCAGTGCTG', 
  'TGTAAGTGGACTCTCG')) 
my.counts10 = selex.counts(sample=r0, k=16, top=100, seqfilter=regex)

# Only display results for the K-mers TGTAAAATCAGTGCTG and TGTAAGTGGACTCTCG
regex = selex.seqfilter(viewIncludeOnly=c('TGTAAAATCAGTGCTG', 
  'TGTAAGTGGACTCTCG')) 
my.counts11 = selex.counts(sample=r0, k=16, top=100, seqfilter=regex)

SELEX documentation built on Nov. 8, 2020, 5:22 p.m.