View source: R/reduceByYield.R
reduceByYield | R Documentation |
Rsamtools files can be created with a ‘yieldSize’ argument that
influences the number of records (chunk size) input at one time (see,
e.g., BamFile). reduceByYield iterates
through the file, processing each chunk and reducing it with previously
input chunks. This is a memory-efficient way to process large data files,
especially when the final result fits in memory.
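The chunk-wise yield/map/reduce pattern described above can be sketched in base R without any Bioconductor packages. This is only an illustration of the control flow, not the package's implementation: `make_chunker()` and `toy_reduce_by_yield()` are hypothetical helpers, and a plain integer vector stands in for a file opened with a 'yieldSize'.

```r
## Hypothetical stand-in for a file with a 'yieldSize': a closure that hands
## out successive chunks of a vector and a zero-length vector when exhausted.
make_chunker <- function(data, yieldSize) {
    pos <- 0L
    function() {
        if (pos >= length(data))
            return(data[0])
        idx <- seq(pos + 1L, min(pos + yieldSize, length(data)))
        pos <<- max(idx)
        data[idx]
    }
}

## Hypothetical helper: yield a chunk, stop when DONE, otherwise MAP the
## chunk and fold it into the running result with REDUCE.
toy_reduce_by_yield <- function(YIELD, MAP = identity, REDUCE = `+`,
        DONE = function(x) is.null(x) || length(x) == 0L)
{
    result <- NULL
    repeat {
        chunk <- YIELD()
        if (DONE(chunk))
            break
        value <- MAP(chunk)
        result <- if (is.null(result)) value else REDUCE(result, value)
    }
    result
}

## Sum 1:10 three records at a time; only one chunk is held at once.
next_chunk <- make_chunker(1:10, yieldSize = 3L)
total <- toy_reduce_by_yield(next_chunk, MAP = sum)
```

Only the current chunk and the running result are kept, which is why the approach suits inputs that are large while the reduced result is small.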
reduceByYield(X, YIELD, MAP = identity, REDUCE = `+`,
              DONE = function(x) is.null(x) || length(x) == 0L,
              ..., parallel = FALSE, iterate = TRUE, init)
REDUCEsampler(sampleSize=1000000, verbose=FALSE)
X |
A |
YIELD |
A function name or user-supplied function that operates
on
|
MAP |
A function of one or more arguments that operates on
the chunk of data from
|
REDUCE |
A function of one (
|
DONE |
A function of one argument, the |
... |
Additional arguments, passed to |
iterate |
logical(1) determines whether the call to
|
parallel |
logical(1) determines if the |
init |
(Optional) Initial value used for |
sampleSize |
Initial value used for |
verbose |
logical(1) determines if total records sampled are
reported at each iteration. Applicable to |
reduceByYield: When iterate=TRUE, REDUCE requires 2 arguments and is
invoked with init and the output from the first call to MAP. If init is
missing, it operates on the first two outputs from MAP.
When iterate=FALSE, REDUCE requires 1 argument and is invoked with a
list containing all results from MAP.
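The difference between the two calling conventions can be illustrated with base R alone. In this sketch the list `mapped` is made-up data standing in for successive MAP outputs, and base `Reduce()` stands in for the iterative fold; it shows the shape of the REDUCE calls only, under those assumptions.

```r
## Made-up values standing in for three successive MAP outputs.
mapped <- list(c(a = 1, b = 2), c(a = 10, b = 20), c(a = 100, b = 200))

## iterate=TRUE style: a two-argument REDUCE folds results pairwise.
## Without an initial value the fold starts from the first two outputs.
pairwise <- Reduce(`+`, mapped)

## With an initial value the first call is REDUCE(init, <first output>).
with_init <- Reduce(`+`, mapped, c(a = 1000, b = 2000))

## iterate=FALSE style: a one-argument REDUCE receives all results at once,
## here binding them into a matrix with one row per chunk.
reduce_all <- function(results) do.call(rbind, results)
all_at_once <- reduce_all(mapped)
```

A two-argument REDUCE keeps memory use bounded by the size of the running result, whereas a one-argument REDUCE is convenient when the combined operation needs every chunk's result together.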
REDUCEsampler: REDUCEsampler creates a function that can be used as the
REDUCE argument to reduceByYield.
Invoking REDUCEsampler with sampleSize returns a function (call it myfun)
that takes two arguments, x and y. As with any iterative REDUCE function,
x represents records that have been yielded and y is the new chunk of
records. myfun samples records from consecutive chunks returned by the
YIELD function. (Re)sampling takes into consideration the total number of
records yielded, the sampleSize, and the size of the new chunk.
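To make the idea of a sampling REDUCE function concrete, here is a minimal sketch using plain reservoir sampling (Algorithm R) on atomic vectors. It is swapped in for illustration only: `toy_sampler()` is a hypothetical name, and the resampling scheme used by REDUCEsampler itself may differ.

```r
## Hypothetical closure factory: returns a two-argument function(x, y) that
## keeps a uniform sample of at most 'sampleSize' records across chunks.
toy_sampler <- function(sampleSize) {
    seen <- 0L                          # records seen so far, all chunks
    function(x, y) {
        if (seen == 0L) {
            ## First call: 'x' is the first chunk; trim it to sampleSize.
            seen <<- length(x)
            if (length(x) > sampleSize)
                x <- x[sample(length(x), sampleSize)]
        }
        for (record in y) {
            seen <<- seen + 1L
            if (length(x) < sampleSize) {
                x <- c(x, record)       # reservoir not yet full
            } else if (runif(1) < sampleSize / seen) {
                ## Replace a random slot with probability sampleSize / seen.
                x[sample.int(sampleSize, 1L)] <- record
            }
        }
        x
    }
}

## Ten chunks of 100 records each, reduced to a sample of 50.
set.seed(1)
chunks <- split(1:1000, rep(1:10, each = 100))
smp <- Reduce(toy_sampler(50L), chunks)
```

Because the closure tracks the running record count, each call needs only the current sample and the new chunk, matching the x and y roles described above.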
The value returned by the final invocation of REDUCE, or init if provided
and no data were yielded, or list() if init is missing and no data were
yielded.
Martin Morgan and Valerie Obenchain
BamFile and TabixFile for examples of 'X'.
reduceByFile and reduceByRange
if (all(require(RNAseqData.HNRNPC.bam.chr14) &&
        require(GenomicAlignments))) {
## -----------------------------------------------------------------------
## Nucleotide frequency of mapped reads
## -----------------------------------------------------------------------
## In this example nucleotide frequency of mapped reads is computed
## for a single file. The MAP step is run in parallel and REDUCE
## is iterative.
## Create a BamFile and set a 'yieldSize'.
fl <- system.file(package="Rsamtools", "extdata", "ex1.bam")
bf <- BamFile(fl, yieldSize=500)
## Define 'YIELD', 'MAP' and 'REDUCE' functions.
YIELD <- function(X, ...) {
    flag <- scanBamFlag(isUnmappedQuery=FALSE)
    param <- ScanBamParam(flag=flag, what="seq")
    scanBam(X, param=param, ...)[[1]][['seq']]
}
MAP <- function(value, ...) {
    requireNamespace("Biostrings", quietly=TRUE) ## for alphabetFrequency()
    Biostrings::alphabetFrequency(value, collapse=TRUE)
}
REDUCE <- `+` # add successive alphabetFrequency matrices
## 'parallel=TRUE' runs the MAP step in parallel and is currently
## implemented for Unix/Mac only.
register(MulticoreParam(3))
reduceByYield(bf, YIELD, MAP, REDUCE, parallel=TRUE)
## -----------------------------------------------------------------------
## Coverage
## -----------------------------------------------------------------------
## If sufficient resources are available coverage can be computed
## across several large BAM files by combining reduceByYield() with
## bplapply().
## Create a BamFileList with a few sample files and a Snow cluster
## with the same number of workers as files.
bfl <- BamFileList(RNAseqData.HNRNPC.bam.chr14_BAMFILES[1:3])
bpparam <- SnowParam(length(bfl))
## 'FUN' is run on each worker. Because these are Snow workers each
## variable used in 'FUN' must be explicitly passed. (This is not the case
## when using Multicore.)
FUN <- function(bf, YIELD, MAP, REDUCE, parallel, ...) {
    requireNamespace("GenomicFiles", quietly=TRUE) ## for reduceByYield()
    GenomicFiles::reduceByYield(bf, YIELD, MAP, REDUCE, parallel=parallel)
}
## Passing parallel=FALSE to reduceByYield() runs the MAP step in serial on
## each worker. In this example, parallel dispatch is at the file-level
## only (bplapply()).
YIELD <- `readGAlignments`
MAP <- function(value, ...) {
    requireNamespace("GenomicAlignments", quietly=TRUE)
    GenomicAlignments::coverage(value)[["chr14"]]
}
bplapply(bfl, FUN, YIELD=YIELD, MAP=MAP, REDUCE=`+`,
         parallel=FALSE, BPPARAM = bpparam)
## -----------------------------------------------------------------------
## Sample records from a Bam file
## -----------------------------------------------------------------------
fl <- system.file(package="Rsamtools", "extdata", "ex1.bam")
bf <- BamFile(fl, yieldSize=1000)
yield <- function(x)
    readGAlignments(x, param=ScanBamParam(what=c( "qwidth", "mapq" )))
map <- identity
## Samples records from successive chunks of aligned reads.
reduceByYield(bf, yield, map, REDUCEsampler(1000, TRUE))
}