h5dapply: h5dapply

Description Usage Arguments Details Value Author(s) Examples

Description

This is the central function of the h5vc package, allows an apply operation along common dimensions of datasets in a tally file.

Usage

1
2
3
4
5
6
## S4 method for signature 'numeric'
h5dapply( ..., blocksize, range)
## S4 method for signature 'GRanges'
h5dapply( ..., group, range)
## S4 method for signature 'IRanges'
h5dapply( ..., range)

Arguments

blocksize

The size of the blocks in which to process the data (integer)

...

Further parameters to be handed over to FUN

range

The range along the specified dimensions which should be processed, this allows for limiting the apply to a specific region or set of samples, etc. - optional (defaults to the whole chromosome); This can be a GRanges, IRanges or numerical vector of length 2 (i.e [start, stop])

group

The group (location) within the HDF5 file, note that when range is numeric or IRanges this has to point to the location of the chromosome, e.g. /ExampleTally/Chr7. When range is a GRanges object, the chromosome information is encoded in the GRanges directly and group should only point to the root-group of the study, i.e. /ExampleTally

Details

Additional function parameters are:

filename

The name of a tally file to process

group

The name of a group in that tally file

FUN

The function to apply to each block, defaults to function(x) x, which returns the data as is (a list of arrays)

names

The names of the datasets to extract, e.g. c("Counts","Coverages") - optional (defaults to all datasets)

dims

The dimension to apply along for each dataset in the same order as names, these should correspond to compatible dimensions between the datsets. - optional (defaults to the genomic position dimension)

samples

Character vector of sample names - must match contents of sampleData stored in the tallyFile

sampleDimMap

A list mapping dataset names to their respective sample dimensions - default provides values for "Counts", "Coverages", "Deletions" and "Reference"

verbose

Boolean flag that controls the amount of messages being printed by h5dapply

BPPARAM

BPPARAM object to be passed to the bplapply call used to apply FUN to the blocks - see BiocParallel documentation for details; if this is NULL a normal lapply will be used instead of bplapply.

This function applys parameter FUN to blocks along a specified axis within the tally file, group and specified datasets. It creates a list of arrays (one for each dataset) and processes that list with the function FUN.

This is by far the most essential and powerful function within this package since it allows the user to execute their own analysis functions on the tallies stored within the HDF5 tally file.

The supplied function FUN must have a parameter data or ... (the former is the expected behaviour), which will be supplied to FUN from h5dapply for each block. This structure is a list with one slot for each dataset specified in the names argument to h5dapply containing the array corresponding to the current block in the given dataset. Furthemore the slot h5dapplyInfo is reserved and contains another list with the following content:

Blockstart is an integer specifying the starting position of the current block (in the dimension specified by the dims argument to h5dapply)

Blockend is an integer specifying the end position of the current block (in the dimension specified by the dims argument to h5dapply)

Datasets Contains a data.frame as it is returned by h5ls listing all datasets present in the other slots of data with their group, name, dimensions, number of dimensions (DimCount) and the dimension that is used for splitting into blocks (PosDim)

Group contains the name of the group as specified by the group argument to h5dapply

Value

A list with one entry per block, which is the result of applying FUN to the datasets specified in the parameter names within the block.

Author(s)

Paul Pyl

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
  # loading library and example data
  library(h5vc)
  tallyFile <- system.file( "extdata", "example.tally.hfs5", package = "h5vcData" )
  sampleData <- getSampleData( tallyFile, "/ExampleStudy/16" )
  # check the available samples and sampleData
  print(sampleData)
  data <- h5dapply( #extracting coverage using h5dapply
    filename = tallyFile,
    group = "/ExampleStudy/16",
    blocksize = 1000,
    FUN = function(x) rowSums(x$Coverages),
    names = c( "Coverages" ),
    range = c(29000000,29010000),
    verbose = TRUE
    )
  coverages <- do.call( rbind, data )
  colnames(coverages) <- sampleData$Sample[order(sampleData$Column)]
  coverages
  #Subsetting by Sample
  sampleData <- sampleData[sampleData$Patient == "Patient5",]
  data <- h5dapply( #extracting coverage using h5dapply
    filename = tallyFile,
    group = "/ExampleStudy/16",
    blocksize = 1000,
    FUN = function(x) rowSums(x$Coverages),
    names = c( "Coverages" ),
    range = c(29000000,29010000),
    samples = sampleData$Sample,
    verbose = TRUE
    )
  coverages <- do.call( rbind, data )
  colnames(coverages) <- sampleData$Sample[order(sampleData$Column)]
  coverages
  #Using GRanges and IRanges
  library(GenomicRanges)
  library(IRanges)
  granges <- GRanges(
  c(rep("16", 10), rep("22", 10)),
  ranges = IRanges(
    start = c(seq(29000000,29009000, 1000), seq(39000000,39009000, 1000)),
    width = 1000
  ))
  data <- h5dapply( #extracting coverage using h5dapply
    filename = tallyFile,
    group = "/ExampleStudy",
    blocksize = 1000,
    FUN = function(x) rowSums(x$Coverages),
    names = c( "Coverages" ),
    range = granges,
    verbose = TRUE
    )
  lapply( data, function(x) do.call(rbind, x) )

h5vc documentation built on Nov. 8, 2020, 4:56 p.m.