h5dapply: h5dapply
In h5vc: Managing alignment tallies using a hdf5 backend

Description Usage Arguments Details Value Author(s) Examples

This is the central function of the h5vc package, allows an apply operation along common dimensions of datasets in a tally file.

## S4 method for signature 'numeric'
h5dapply( ..., blocksize, range)
## S4 method for signature 'GRanges'
h5dapply( ..., group, range)
## S4 method for signature 'IRanges'
h5dapply( ..., range)

`blocksize`	The size of the blocks in which to process the data (integer)
`...`	Further parameters to be handed over to `FUN`
`range`	The range along the specified dimensions which should be processed, this allows for limiting the apply to a specific region or set of samples, etc. - optional (defaults to the whole chromosome); This can be a `GRanges`, `IRanges` or numerical vector of length 2 (i.e [start, stop])
`group`	The group (location) within the HDF5 file, note that when range is `numeric` or `IRanges` this has to point to the location of the chromosome, e.g. `/ExampleTally/Chr7`. When range is a `GRanges` object, the chromosome information is encoded in the `GRanges` directly and `group` should only point to the root-group of the study, i.e. `/ExampleTally`

Additional function parameters are:

filename: The name of a tally file to process
group: The name of a group in that tally file
FUN: The function to apply to each block, defaults to function(x) x, which returns the data as is (a list of arrays)
names: The names of the datasets to extract, e.g. c("Counts","Coverages") - optional (defaults to all datasets)
dims: The dimension to apply along for each dataset in the same order as names, these should correspond to compatible dimensions between the datsets. - optional (defaults to the genomic position dimension)
samples: Character vector of sample names - must match contents of sampleData stored in the tallyFile
sampleDimMap: A list mapping dataset names to their respective sample dimensions - default provides values for "Counts", "Coverages", "Deletions" and "Reference"
verbose: Boolean flag that controls the amount of messages being printed by h5dapply
BPPARAM: BPPARAM object to be passed to the bplapply call used to apply FUN to the blocks - see BiocParallel documentation for details; if this is NULL a normal lapply will be used instead of bplapply.

This function applys parameter FUN to blocks along a specified axis within the tally file, group and specified datasets. It creates a list of arrays (one for each dataset) and processes that list with the function FUN.

This is by far the most essential and powerful function within this package since it allows the user to execute their own analysis functions on the tallies stored within the HDF5 tally file.

The supplied function FUN must have a parameter data or ... (the former is the expected behaviour), which will be supplied to FUN from h5dapply for each block. This structure is a list with one slot for each dataset specified in the names argument to h5dapply containing the array corresponding to the current block in the given dataset. Furthemore the slot h5dapplyInfo is reserved and contains another list with the following content:

Blockstart is an integer specifying the starting position of the current block (in the dimension specified by the dims argument to h5dapply)

Blockend is an integer specifying the end position of the current block (in the dimension specified by the dims argument to h5dapply)

Datasets Contains a data.frame as it is returned by h5ls listing all datasets present in the other slots of data with their group, name, dimensions, number of dimensions (DimCount) and the dimension that is used for splitting into blocks (PosDim)

Group contains the name of the group as specified by the group argument to h5dapply

A list with one entry per block, which is the result of applying FUN to the datasets specified in the parameter names within the block.

Paul Pyl

  # loading library and example data
  library(h5vc)
  tallyFile <- system.file( "extdata", "example.tally.hfs5", package = "h5vcData" )
  sampleData <- getSampleData( tallyFile, "/ExampleStudy/16" )
  # check the available samples and sampleData
  print(sampleData)
  data <- h5dapply( #extracting coverage using h5dapply
    filename = tallyFile,
    group = "/ExampleStudy/16",
    blocksize = 1000,
    FUN = function(x) rowSums(x$Coverages),
    names = c( "Coverages" ),
    range = c(29000000,29010000),
    verbose = TRUE
    )
  coverages <- do.call( rbind, data )
  colnames(coverages) <- sampleData$Sample[order(sampleData$Column)]
  coverages
  #Subsetting by Sample
  sampleData <- sampleData[sampleData$Patient == "Patient5",]
  data <- h5dapply( #extracting coverage using h5dapply
    filename = tallyFile,
    group = "/ExampleStudy/16",
    blocksize = 1000,
    FUN = function(x) rowSums(x$Coverages),
    names = c( "Coverages" ),
    range = c(29000000,29010000),
    samples = sampleData$Sample,
    verbose = TRUE
    )
  coverages <- do.call( rbind, data )
  colnames(coverages) <- sampleData$Sample[order(sampleData$Column)]
  coverages
  #Using GRanges and IRanges
  library(GenomicRanges)
  library(IRanges)
  granges <- GRanges(
  c(rep("16", 10), rep("22", 10)),
  ranges = IRanges(
    start = c(seq(29000000,29009000, 1000), seq(39000000,39009000, 1000)),
    width = 1000
  ))
  data <- h5dapply( #extracting coverage using h5dapply
    filename = tallyFile,
    group = "/ExampleStudy",
    blocksize = 1000,
    FUN = function(x) rowSums(x$Coverages),
    names = c( "Coverages" ),
    range = granges,
    verbose = TRUE
    )
  lapply( data, function(x) do.call(rbind, x) )