dump-management: HDF5 dump management

HDF5-dump-management    R Documentation

HDF5 dump management

Description

A set of utilities to control the location and physical properties of automatically created HDF5 datasets.

Usage

setHDF5DumpDir(dir)
setHDF5DumpFile(filepath)
setHDF5DumpName(name)
setHDF5DumpChunkLength(length=1000000L)
setHDF5DumpChunkShape(shape="scale")
setHDF5DumpCompressionLevel(level=6L)

getHDF5DumpDir()
getHDF5DumpFile()
getHDF5DumpName(for.use=FALSE)
getHDF5DumpChunkLength()
getHDF5DumpChunkShape()
getHDF5DumpCompressionLevel()

lsHDF5DumpFile()

showHDF5DumpLog()

## For developers:
getHDF5DumpChunkDim(dim)
appendDatasetCreationToHDF5DumpLog(filepath, name, dim, type,
                                   chunkdim, level)

Arguments

dir

The path (as a single string) to the current HDF5 dump directory, that is, to the (new or existing) directory where HDF5 dump files with automatic names will be created. This is ignored if the user specified an HDF5 dump file with setHDF5DumpFile. If dir is missing, then the HDF5 dump directory is set back to its default value, i.e. to some directory under tempdir() (call getHDF5DumpDir() to get the exact path).

filepath

For setHDF5DumpFile: The path (as a single string) to the current HDF5 dump file, that is, to the (new or existing) HDF5 file where the next automatic HDF5 datasets will be written. If filepath is missing, then a new file with an automatic name will be created (in getHDF5DumpDir()) and used for each new dataset.

For appendDatasetCreationToHDF5DumpLog: See the Note TO DEVELOPERS below.

name

For setHDF5DumpName: The name of the next automatic HDF5 dataset to be written to the current HDF5 dump file.

For appendDatasetCreationToHDF5DumpLog: See the Note TO DEVELOPERS below.

length

The maximum length of the physical chunks of the next automatic HDF5 dataset to be written to the current HDF5 dump file.

shape

A string specifying the shape of the physical chunks of the next automatic HDF5 dataset to be written to the current HDF5 dump file. See makeCappedVolumeBox in the DelayedArray package for a description of the supported shapes.

level

For setHDF5DumpCompressionLevel: The compression level to use for writing automatic HDF5 datasets to disk. See the level argument in ?rhdf5::h5createDataset (in the rhdf5 package) for more information about this.

For appendDatasetCreationToHDF5DumpLog: See the Note TO DEVELOPERS below.

for.use

Whether the returned dataset name is for use by the caller or not. See the Details section and the Note TO DEVELOPERS below.

dim

The dimensions of the HDF5 dataset to be written to disk, that is, an integer vector of length one or more giving the maximal indices in each dimension. See the dims argument in ?rhdf5::h5createDataset (in the rhdf5 package) for more information about this.

type

The type (a.k.a. storage mode) of the data to be written to disk. Can be obtained with type() on an array-like object (which is equivalent to storage.mode() or typeof() on an ordinary array). This is typically what an application writing datasets to the HDF5 dump should pass to the storage.mode argument of its call to rhdf5::h5createDataset. See the Note TO DEVELOPERS below for more information.

chunkdim

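The dimensions of the physical chunks of the dataset, that is, the value passed to the chunk argument of rhdf5::h5createDataset. See the Note TO DEVELOPERS below.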

Details

Calling getHDF5DumpFile() and getHDF5DumpName() with no argument is meant to be informative only, i.e. it is a means for the user to find out where the next automatic HDF5 dataset will be written. Since a given file/name combination can be used only once, the user should be careful not to use that combination to explicitly create an HDF5 dataset, because that would get in the way of the creation of the next automatic HDF5 dataset. See the Note TO DEVELOPERS below if you actually need to use this file/name combination.

lsHDF5DumpFile() is just a convenience wrapper for h5ls(getHDF5DumpFile()).
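
A minimal sketch of the informational use described above, assuming the HDF5Array and rhdf5 packages are installed. The variable names (next_file, name1, name2) are illustrative only. Because for.use is left at its default (FALSE), the two calls to getHDF5DumpName() are expected to report the same name without reserving it.

```r
library(HDF5Array)

## Informational only: where would the next automatic dataset be written?
next_file <- getHDF5DumpFile()
name1 <- getHDF5DumpName()
name2 <- getHDF5DumpName()

next_file
name1

## Asking twice without 'for.use=TRUE' should not consume the name.
identical(name1, name2)

## List the current content of the dump file. Per the text above, this
## is a shortcut for rhdf5::h5ls(getHDF5DumpFile()).
lsHDF5DumpFile()
```

Code that intends to actually write to this file/name combination should follow the steps in the Note TO DEVELOPERS instead.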

Value

getHDF5DumpDir returns the absolute path to the directory where HDF5 dump files with automatic names will be created. Only meaningful if the user did NOT specify an HDF5 dump file with setHDF5DumpFile.

getHDF5DumpFile returns the absolute path to the HDF5 file where the next automatic HDF5 dataset will be written.

getHDF5DumpName returns the name of the next automatic HDF5 dataset.

getHDF5DumpCompressionLevel returns the compression level currently used for writing automatic HDF5 datasets to disk.

showHDF5DumpLog returns the dump log invisibly, as a data frame.

getHDF5DumpChunkDim returns the dimensions of the physical chunks that will be used to write the dataset to disk.
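
The following sketch illustrates how the setters and getters listed in Usage might be combined to inspect the chunk geometry and compression level that would apply to the next automatic dataset. It assumes the HDF5Array package is installed. The dataset dimensions (150 x 4) and the chosen values (a chunk length of 100 and a compression level of 9) are arbitrary illustrations, not recommendations.

```r
library(HDF5Array)

## Hypothetical dimensions of a dataset about to be written.
dataset_dim <- c(150L, 4L)

## Cap the chunk length at 100 array elements and raise the
## compression level (both values chosen only for illustration).
setHDF5DumpChunkLength(100L)
setHDF5DumpCompressionLevel(9L)

getHDF5DumpChunkLength()
getHDF5DumpChunkShape()
getHDF5DumpCompressionLevel()

## Chunk dimensions derived from the current chunk length and shape.
chunkdim <- getHDF5DumpChunkDim(dataset_dim)
chunkdim

## Restore the default values shown in the Usage section.
setHDF5DumpChunkLength(1000000L)
setHDF5DumpCompressionLevel(6L)
```

The returned chunk dimensions should never exceed the dataset dimensions, and their product should stay within the current chunk length.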

Note

TO DEVELOPERS:

If your application needs to write its own dataset to the HDF5 dump then it should:

  1. Get a file/dataset name combination by calling getHDF5DumpFile() and getHDF5DumpName(for.use=TRUE).

  2. [OPTIONAL] Call getHDF5DumpChunkDim(dim) to get reasonable chunk dimensions to use for writing the dataset to disk. Or choose your own chunk dimensions.

  3. Add an entry to the dump log by calling appendDatasetCreationToHDF5DumpLog. Typically, this should be done right after creating the dataset (e.g. with rhdf5::h5createDataset) and before starting to write the dataset to disk. The values passed to appendDatasetCreationToHDF5DumpLog via the filepath, name, dim, type, chunkdim, and level arguments should be those that were passed to rhdf5::h5createDataset via the file, dataset, dims, storage.mode, chunk, and level arguments, respectively. Note that appendDatasetCreationToHDF5DumpLog uses a lock mechanism, so it is safe to use in the context of parallel execution.

This is actually what the coercion method to HDF5Array does internally.
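
The three steps above can be sketched as follows. This is only an illustration under stated assumptions, not the package's internal implementation: it assumes the HDF5Array and rhdf5 packages are installed, the dataset dimensions and type are made up, and the file.exists() check is a defensive addition in case the dump file has not been created yet.

```r
library(HDF5Array)
library(rhdf5)

## Hypothetical dataset to write: a 150 x 4 integer matrix.
dataset_dim <- c(150L, 4L)
dataset_type <- "integer"

## Step 1: reserve a file/dataset name combination.
filepath <- getHDF5DumpFile()
name <- getHDF5DumpName(for.use=TRUE)
if (!file.exists(filepath))  # defensive, may be unnecessary
    h5createFile(filepath)

## Step 2 (optional): ask for reasonable chunk dimensions.
chunkdim <- getHDF5DumpChunkDim(dataset_dim)
level <- getHDF5DumpCompressionLevel()

## Step 3: create the dataset, then log its creation with the same
## values that were passed to h5createDataset().
h5createDataset(filepath, name, dims=dataset_dim,
                storage.mode=dataset_type, chunk=chunkdim, level=level)
appendDatasetCreationToHDF5DumpLog(filepath, name, dataset_dim,
                                   dataset_type, chunkdim, level)

## Only now write the data to the newly created dataset.
h5write(array(1:600, dataset_dim), filepath, name)

## Wrap the on-disk dataset in an HDF5Array object.
A <- HDF5Array(filepath, name)
A
```

Logging before writing keeps the dump log consistent even if the write itself is slow or is performed block by block.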

See Also

  • writeHDF5Array for writing an array-like object to an HDF5 file.

  • HDF5Array objects.

  • The h5ls function on which lsHDF5DumpFile is based.

  • makeCappedVolumeBox in the DelayedArray package.

  • type in the DelayedArray package.

Examples

getHDF5DumpDir()
getHDF5DumpFile()

## Use setHDF5DumpFile() to change the current HDF5 dump file.
## If the specified file exists, then it must be in HDF5 format or
## an error will be raised. If it doesn't exist, then it will be
## created.
#setHDF5DumpFile("path/to/some/HDF5/file")

lsHDF5DumpFile()

a <- array(1:600, c(150, 4))
A <- as(a, "HDF5Array")
lsHDF5DumpFile()
A

b <- array(runif(1200), c(4, 2, 150))  # 4 * 2 * 150 = 1200 values
B <- as(b, "HDF5Array")
lsHDF5DumpFile()
B

C <- (log(2 * A + 0.88) - 5)^3 * t(B[ , 1, ])
as(C, "HDF5Array")  # realize C on disk
lsHDF5DumpFile()

## Matrix multiplication is not delayed: the output matrix is realized
## block by block. The current "realization backend" controls where
## realization happens e.g. in memory if set to NULL or in an HDF5 file
## if set to "HDF5Array". See '?realize' in the DelayedArray package for
## more information about "realization backends".
setAutoRealizationBackend("HDF5Array")
m <- matrix(runif(20), nrow=4)
P <- C %*% m
lsHDF5DumpFile()

## See all the HDF5 datasets created in the current session so far:
showHDF5DumpLog()

## Wrap the call in suppressMessages() if you are only interested in the
## data frame version of the dump log:
dump_log <- suppressMessages(showHDF5DumpLog())
dump_log

Bioconductor/HDF5Array documentation built on Oct. 31, 2024, 9:16 a.m.