reduceByYield: Iterate through a BAM (or other) file, reducing output to a...
In Bioconductor/GenomicFiles: Distributed computing by file or by range

View source: R/reduceByYield.R

reduceByYield

R Documentation

Iterate through a BAM (or other) file, reducing output to a single result.

Description

Rsamtools files can be created with a ‘yieldSize’ argument that influences the number of records (chunk size) input at one time (see, e.g,. BamFile). reduceByYield iterates through the file, processing each chunk and reducing it with previously input chunks. This is a memory efficient way to process large data files, especially when the final result fits in memory.

Usage

reduceByYield(X, YIELD,  MAP = identity, REDUCE = `+`, 
              DONE = function(x) is.null(x) || length(x) == 0L, 
              ..., parallel = FALSE, iterate = TRUE, init)

REDUCEsampler(sampleSize=1000000, verbose=FALSE)

Arguments

`X`	A `BamFile` instance (or other class for which `isOpen`, `open`, `close` methods are defined, and which support extraction of sequential chunks).
`YIELD`	A function name or user-supplied function that operates on `X` to produce a `VALUE` that is passed to `DONE` and `MAP`. Generally `YIELD` will be a data extractor such as `readGAlignments`, `scanBam`, `yield`, etc. and `VALUE` is a chunk of data. YIELD(X)
`MAP`	A function of one or more arguments that operates on the chunk of data from `YIELD`. MAP(VALUE, ...)
`REDUCE`	A function of one (`iterate=FALSE` or two (`iterate=TRUE`) arguments, returning the reduction (e.g., sum, mean, concatenate) of the arguments. REDUCE(mapped, ...) ## iterate=FALSE REDUCE(x, y, ...) ## iterate=TRUE
`DONE`	A function of one argument, the `VALUE` output of the most recent call to `YIELD(X, ...)`. If missing, `DONE` is `function(VALUE) length(VALUE) == 0`.
`...`	Additional arguments, passed to `MAP`.
`iterate`	logical(1) determines whether the call to `REDUCE` is iterative (`iterate=TRUE`) or cumulative (`iterate=FALSE`).
`parallel`	logical(1) determines if the `MAP` step is run in parallel. `bpiterate` is used under the hood and is currently supported for Unix/Mac only. For Windows machines, `parallel` is ignored.
`init`	(Optional) Initial value used for `REDUCE` when `iterate=TRUE`.
`sampleSize`	Initial value used for `REDUCEsampler`.
`verbose`	logical(1) determines if total records sampled are reported at each iteration. Applicable to `REDUCEsampler` only.

Details

reduceByYield:

When iterate=TRUE, REDUCE requires 2 arguments and is invoked with init and the output from the first call to MAP. If init is missing, it operates on the first two outputs from MAP.

When iterate=FALSE, REDUCE requires 1 argument and is is invoked with a list containing a list containing all results from MAP.

REDUCEsampler:

REDUCEsampler creates a function that can be used as the REDUCE argument to reduceByYield.

Invoking REDUCEsampler with sampleSize returns a function (call it myfun) that takes two arguments, x and y. As with any iterative REDUCE function, x represents records that have been yield'ed and y is the new chunk of records. myfun samples records from consecutive chunks returned by the YIELD function. (Re)sampling takes into consideration the total number of records yield'ed, the sampleSize, and the size of the new chunk.

Value

The value returned by the final invocation of REDUCE, or init if provided and no data were yield'ed, or list() if init is missing and no data were yield'ed.

Author(s)

Martin Morgan and Valerie Obenchain

Examples


if (all(require(RNAseqData.HNRNPC.bam.chr14) &&
        require(GenomicAlignments))) { 

  ## -----------------------------------------------------------------------
  ## Nucleotide frequency of mapped reads
  ## -----------------------------------------------------------------------
 
  ## In this example nucleotide frequency of mapped reads is computed
  ## for a single file. The MAP step is run in parallel and REDUCE 
  ## is iterative.

  ## Create a BamFile and set a 'yieldSize'.
  fl <- system.file(package="Rsamtools", "extdata", "ex1.bam")
  bf <- BamFile(fl, yieldSize=500)

  ## Define 'YIELD', 'MAP' and 'REDUCE' functions.
  YIELD <- function(X, ...) {
      flag = scanBamFlag(isUnmappedQuery=FALSE)
      param = ScanBamParam(flag=flag, what="seq")
      scanBam(X, param=param, ...)[[1]][['seq']]
  }
  MAP <- function(value, ...) {
      requireNamespace("Biostrings", quietly=TRUE)  ## for alphabetFrequency()
      Biostrings::alphabetFrequency(value, collapse=TRUE)
  }
  REDUCE <- `+`        # add successive alphabetFrequency matrices

  ## 'parallel=TRUE' runs the MAP step in parallel and is currently
  ## implemented for Unix/Mac only.
  register(MulticoreParam(3))
  reduceByYield(bf, YIELD, MAP, REDUCE, parallel=TRUE)
 
  ## -----------------------------------------------------------------------
  ## Coverage
  ## -----------------------------------------------------------------------
 
  ## If sufficient resources are available coverage can be computed
  ## across several large BAM files by combining reduceByYield() with
  ## bplapply().

  ## Create a BamFileList with a few sample files and a Snow cluster
  ## with the same number of workers as files.
  bfl <- BamFileList(RNAseqData.HNRNPC.bam.chr14_BAMFILES[1:3])
  bpparam <- SnowParam(length(bfl))

  ## 'FUN' is run on each worker. Because these are Snow workers each
  ## variable used in 'FUN' must be explicitly passed. (This is not the case
  ## when using Multicore.)
  FUN <- function(bf, YIELD, MAP, REDUCE, parallel, ...) {
    requireNamespace("GenomicFiles", quietly=TRUE)      ## for reduceByYield()
    GenomicFiles::reduceByYield(bf, YIELD, MAP, REDUCE, parallel=parallel)
  }
 
  ## Passing parallel=FALSE to reduceByYield() runs the MAP step in serial on 
  ## each worker. In this example, parallel dispatch is at the file-level 
  ## only (bplapply()).
  YIELD <- `readGAlignments`
  MAP <- function(value, ...) {
      requireNamespace("GenomicAlignments", quietly=TRUE)
      GenomicAlignments::coverage(value)[["chr14"]]
  }
  bplapply(bfl, FUN, YIELD=YIELD, MAP=MAP, REDUCE=`+`,
           parallel=FALSE, BPPARAM = bpparam) 


  ## -----------------------------------------------------------------------
  ## Sample records from a Bam file
  ## -----------------------------------------------------------------------

  fl <- system.file(package="Rsamtools", "extdata", "ex1.bam")
  bf <- BamFile(fl, yieldSize=1000)

  yield <- function(x)
      readGAlignments(x, param=ScanBamParam(what=c( "qwidth", "mapq" )))
  map <- identity

  ## Samples records from successive chunks of aligned reads.
  reduceByYield(bf, yield, map, REDUCEsampler(1000, TRUE))
}

Bioconductor/GenomicFiles documentation built on June 2, 2025, 7:19 a.m.