Sampling and streaming records from fastq files

Description

FastqFile represents a path and connection to a fastq file. FastqFileList is a list of such connections.

FastqSampler draws a subsample from a fastq file. yield is the method used to extract the sample from the FastqSampler instance; a short illustration is in the example below. FastqSamplerList is a list of FastqSampler elements.

FastqStreamer draws successive subsets from a fastq file; a short illustration is in the example below. FastqStreamerList is a list of FastqStreamer elements.

Usage

## FastqFile and FastqFileList
FastqFile(con, ...)
FastqFileList(..., class="FastqFile")
## S3 method for class 'ShortReadFile'
open(con, ...)
## S3 method for class 'ShortReadFile'
close(con, ...)
## S4 method for signature 'FastqFile'
readFastq(dirPath, pattern=character(), ...)

## FastqSampler and FastqStreamer
FastqSampler(con, n=1e6, readerBlockSize=1e8, verbose=FALSE,
    ordered = FALSE)
FastqSamplerList(..., n=1e6, readerBlockSize=1e8, verbose=FALSE,
    ordered = FALSE)
FastqStreamer(con, n, readerBlockSize=1e8, verbose=FALSE)
FastqStreamerList(..., n, readerBlockSize=1e8, verbose=FALSE)
yield(x, ...)

Arguments

con, dirPath

A character string naming a connection, or (for con) an R connection (e.g., file, gzfile).

n

For FastqSampler, the size of the sample (number of records) to be drawn. For FastqStreamer, either a numeric(1) (set to 1e6 when n is missing) providing the number of successive records to be returned on each yield, or an IRanges-class instance delimiting the (1-based) indices of records returned by each yield; entries in n must have non-zero width and must not overlap.

readerBlockSize

The number of bytes or characters to be read at one time; smaller readerBlockSize reduces memory requirements but is less efficient.

verbose

Display progress.

ordered

logical(1) indicating whether sampled reads should be returned in the same order as they were encountered in the file.

x

An instance from the FastqSampler or FastqStreamer class.

...

Additional arguments. For FastqFileList, FastqSamplerList, or FastqStreamerList, this can either be a single character vector of paths to fastq files, or several instances of the corresponding FastqFile, FastqSampler, or FastqStreamer objects.

pattern

Ignored.

class

For developer use, to specify the underlying class contained in the FastqFileList.
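The constraint on n for FastqStreamer (ranges of non-zero width that do not overlap) can be checked before a streamer is constructed. The following is a minimal sketch in base R; ranges_ok is a hypothetical helper, not part of ShortRead or IRanges, and it tests only the two conditions stated above.

```r
## Hypothetical helper (not part of ShortRead): check that ranges given by
## 1-based 'start' and 'width' have non-zero width and do not overlap.
ranges_ok <- function(start, width) {
    stopifnot(length(start) == length(width))
    if (any(width <= 0)) return(FALSE)
    o <- order(start)
    s <- start[o]
    e <- s + width[o] - 1              # inclusive end of each range
    ## each range must end before the next one starts
    all(head(e, -1) < tail(s, -1))
}

ok_example  <- ranges_ok(c(50, 100, 200), 10:8)   # ranges used in Examples
bad_overlap <- ranges_ok(c(50, 55), c(10, 10))    # 50-59 overlaps 55-64
bad_width   <- ranges_ok(50, 0)                   # zero-width range
```

With the IRanges package loaded, the same start and width vectors could be passed to IRanges(start, width=width) once the check succeeds.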

Objects from the class

Available classes include:

FastqFile

A file path and connection to a fastq file.

FastqFileList

A list of FastqFile instances.

FastqSampler

Uniformly sample records from a fastq file.

FastqStreamer

Iterate over a fastq file, returning successive parts of the file.
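The iteration pattern used with FastqStreamer relies on yield eventually returning an object of length zero. The sketch below imitates that contract in base R on an in-memory character vector; make_streamer is a hypothetical stand-in for illustration only and says nothing about how ShortRead reads the file.

```r
## Hypothetical stand-in for a streamer (not ShortRead code): returns
## successive chunks of at most 'n' records, then a zero-length result.
make_streamer <- function(records, n) {
    pos <- 0L                                   # last record already returned
    list(yield = function() {
        if (pos >= length(records)) return(records[0])
        idx <- seq(pos + 1L, min(pos + n, length(records)))
        pos <<- max(idx)
        records[idx]
    })
}

s <- make_streamer(sprintf("read_%03d", 1:120), 50)
sizes <- integer()
while (length(chunk <- s$yield())) {
    sizes <- c(sizes, length(chunk))            # do work on 'chunk' here
}
```

The final chunk is shorter than n when the number of records is not a multiple of n, and the loop ends on the first zero-length result.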

Methods

The following methods are available to users:

readFastq,FastqFile-method:

see also ?readFastq.

writeFastq,ShortReadQ,FastqFile-method:

see also ?writeFastq, ?"writeFastq,ShortReadQ,FastqFile-method".

yield:

Draw a single sample from the instance. Operationally this requires that the underlying data (e.g., file) represented by the Sampler instance be visited; this may be time consuming.
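One way to see why drawing a sample requires visiting all of the underlying data is reservoir sampling, which produces a uniform sample in a single pass over the records. The sketch below is a conceptual illustration in base R; reservoir_sample is a hypothetical function, and nothing here is claimed to be ShortRead's actual implementation. The ordered argument mimics the documented behaviour of returning sampled records in file order.

```r
## Conceptual sketch (not ShortRead code): uniform sample of 'n' records
## obtained in a single pass over all records.
reservoir_sample <- function(records, n, ordered = FALSE) {
    total <- length(records)
    if (total <= n) return(records)
    keep <- seq_len(n)                  # indices currently in the reservoir
    for (i in seq(n + 1, total)) {
        j <- sample.int(i, 1)           # uniform draw from 1..i
        if (j <= n) keep[j] <- i        # record i enters with probability n / i
    }
    if (ordered) keep <- sort(keep)     # restore original record order
    records[keep]
}

set.seed(1)
ids <- sprintf("read_%03d", 1:256)
smp <- reservoir_sample(ids, 50)
smp_ordered <- reservoir_sample(ids, 50, ordered = TRUE)
```

Every record must be considered before the sample is final, which is consistent with yield being potentially time consuming on large files.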

Note

FastqSampler and FastqStreamer use OpenMP threads (when available) during creation of the return value. This may sometimes create problems when a process is already running on multiple threads, e.g., with an error message like

    libgomp: Thread creation failed: Resource temporarily unavailable
  

A solution is to disable threading by preceding the problematic code with the following snippet:

    nthreads <- .Call(ShortRead:::.set_omp_threads, 1L)
    on.exit(.Call(ShortRead:::.set_omp_threads, nthreads))
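Because on.exit registers its expression for the enclosing function call, the snippet above is most useful inside a function. A minimal sketch of such a wrapper follows; with_single_thread is a hypothetical name, and the sketch assumes, as the snippet implies, that .set_omp_threads returns the previous thread count.

```r
## Hypothetical wrapper (not part of ShortRead): run 'fun(...)' with OpenMP
## threading disabled, restoring the previous setting on exit.
with_single_thread <- function(fun, ...) {
    nthreads <- .Call(ShortRead:::.set_omp_threads, 1L)
    on.exit(.Call(ShortRead:::.set_omp_threads, nthreads))
    fun(...)
}

## Usage, assuming ShortRead is installed and 'fl' names a fastq file:
## fq <- with_single_thread(function(path) {
##     f <- FastqSampler(path, 50)
##     on.exit(close(f))
##     yield(f)
## }, fl)
```

The previous thread setting is restored even if fun signals an error, since on.exit handlers run when the wrapper exits for any reason.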
  

See Also

readFastq, writeFastq, yield.

Examples

library(ShortRead)

sp <- SolexaPath(system.file('extdata', package='ShortRead'))
fl <- file.path(analysisPath(sp), "s_1_sequence.txt")

f <- FastqFile(fl)
rfq <- readFastq(f)
close(f)

f <- FastqSampler(fl, 50)
yield(f)    # sample of size n=50
yield(f)    # independent sample of size 50
close(f)

## Return sample as ordered in original file
f <- FastqSampler(fl, 50, ordered=TRUE)
yield(f)
close(f)

f <- FastqStreamer(fl, 50)
yield(f)    # records 1 to 50
yield(f)    # records 51 to 100
close(f)

## iterating over an entire file
f <- FastqStreamer(fl, 50)
while (length(fq <- yield(f))) {
    ## do work here
    print(length(fq))
}
close(f)

## iterating over IRanges
rng <- IRanges(c(50, 100, 200), width=10:8)
f <- FastqStreamer(fl, rng)
while (length(fq <- yield(f))) {
    print(length(fq))
}
close(f)

## Internal fields, methods, and help; for developers
ShortRead:::.FastqSampler_g$methods()
ShortRead:::.FastqSampler_g$fields()
ShortRead:::.FastqSampler_g$help("yield")
