dump-management: HDF5 dump management
In Bioconductor/HDF5Array: HDF5 datasets as array-like objects in R

HDF5-dump-management

R Documentation

HDF5 dump management

Description

A set of utilities to control the location and physical properties of automatically created HDF5 datasets.

Usage

setHDF5DumpDir(dir)
setHDF5DumpFile(filepath)
setHDF5DumpName(name)
setHDF5DumpChunkLength(length=1000000L)
setHDF5DumpChunkShape(shape="scale")
setHDF5DumpCompressionLevel(level=6L)

getHDF5DumpDir()
getHDF5DumpFile()
getHDF5DumpName(for.use=FALSE)
getHDF5DumpChunkLength()
getHDF5DumpChunkShape()
getHDF5DumpCompressionLevel()

lsHDF5DumpFile()

showHDF5DumpLog()

## For developers:
getHDF5DumpChunkDim(dim)
appendDatasetCreationToHDF5DumpLog(filepath, name, dim, type,
                                   chunkdim, level)

Arguments

`dir`	The path (as a single string) to the current HDF5 dump directory, that is, to the (new or existing) directory where HDF5 dump files with automatic names will be created. This is ignored if the user specified an HDF5 dump file with `setHDF5DumpFile`. If `dir` is missing, then the HDF5 dump directory is set back to its default value i.e. to some directory under `tempdir()` (call `getHDF5DumpDir()` to get the exact path).
`filepath`	For `setHDF5DumpFile`: The path (as a single string) to the current HDF5 dump file, that is, to the (new or existing) HDF5 file where the next automatic HDF5 datasets will be written. If `filepath` is missing, then a new file with an automatic name will be created (in `getHDF5DumpDir()`) and used for each new dataset. For `appendDatasetCreationToHDF5DumpLog`: See the Note TO DEVELOPERS below.
`name`	For `setHDF5DumpName`: The name of the next automatic HDF5 dataset to be written to the current HDF5 dump file. For `appendDatasetCreationToHDF5DumpLog`: See the Note TO DEVELOPERS below.
`length`	The maximum length of the physical chunks of the next automatic HDF5 dataset to be written to the current HDF5 dump file.
`shape`	A string specifying the shape of the physical chunks of the next automatic HDF5 dataset to be written to the current HDF5 dump file. See `makeCappedVolumeBox` in the DelayedArray package for a description of the supported shapes.
`level`	For `setHDF5DumpCompressionLevel`: The compression level to use for writing automatic HDF5 datasets to disk. See the `level` argument in `?rhdf5::h5createDataset` (in the rhdf5 package) for more information about this. For `appendDatasetCreationToHDF5DumpLog`: See the Note TO DEVELOPERS below.
`for.use`	Whether the returned dataset name is for use by the caller or not. See below for the details.
`dim`	The dimensions of the HDF5 dataset to be written to disk, that is, an integer vector of length one or more giving the maximal indices in each dimension. See the `dims` argument in `?rhdf5::h5createDataset` (in the rhdf5 package) for more information about this.
`type`	The type (a.k.a. storage mode) of the data to be written to disk. Can be obtained with `type()` on an array-like object (which is equivalent to `storage.mode()` or `typeof()` on an ordinary array). This is typically what an application writing datasets to the HDF5 dump should pass to the `storage.mode` argument of its call to `rhdf5::h5createDataset`. See the Note TO DEVELOPERS below for more information.
`chunkdim`	The dimensions of the chunks.
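
As a minimal sketch of how the setters and getters above fit together, the snippet below changes each dump setting, reads it back, and then restores the defaults. The directory name `my_hdf5_dump` and the chosen chunk length, shape, and compression level are illustrative assumptions, not recommended values.

```r
library(HDF5Array)

## Hypothetical scratch directory, used only for this illustration.
my_dump_dir <- file.path(tempdir(), "my_hdf5_dump")
setHDF5DumpDir(my_dump_dir)

## Illustrative values: smaller chunks, a different chunk shape, and
## lighter compression than the defaults shown in the Usage section.
setHDF5DumpChunkLength(250000L)
setHDF5DumpChunkShape("first-dim-grows-first")
setHDF5DumpCompressionLevel(3L)

## Read the settings back.
getHDF5DumpDir()
getHDF5DumpChunkLength()
getHDF5DumpChunkShape()
getHDF5DumpCompressionLevel()

## Calling the setters with no argument restores the defaults.
setHDF5DumpDir()
setHDF5DumpChunkLength()
setHDF5DumpChunkShape()
setHDF5DumpCompressionLevel()
```

These settings apply to automatic datasets created after the call, so it is usually simplest to set them once at the start of a session.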

Details

Calling getHDF5DumpFile() and getHDF5DumpName() with no argument is meant to be informative only, i.e. it is a means for the user to know where the next automatic HDF5 dataset will be written. Since a given file/name combination can be used only once, the user should be careful not to use that combination to explicitly create an HDF5 dataset, because that would get in the way of the creation of the next automatic HDF5 dataset. See the Note TO DEVELOPERS below if you actually need to use this file/name combination.
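
The following sketch illustrates the informative use described above: it peeks at the next file/name combination, lets the package create the automatic dataset itself, and then prints both side by side for comparison. The small input matrix is an arbitrary example, and reading the dataset name from the seed's `name` slot is an assumption about the object's structure rather than a documented accessor.

```r
library(HDF5Array)

## Peek at where the next automatic dataset would go. With the default
## for.use=FALSE the name is only reported, not reserved.
next_file <- getHDF5DumpFile()
next_name <- getHDF5DumpName()

## Let the package create the automatic dataset itself.
A <- as(matrix(1:6, nrow=2), "HDF5Array")

## Compare what was peeked with where A actually landed.
c(peeked=next_file, used=path(A))
c(peeked=next_name, used=seed(A)@name)

## List the content of the current dump file.
lsHDF5DumpFile()
```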

lsHDF5DumpFile() is just a convenience wrapper for h5ls(getHDF5DumpFile()).

Value

getHDF5DumpDir returns the absolute path to the directory where HDF5 dump files with automatic names will be created. Only meaningful if the user did NOT specify an HDF5 dump file with setHDF5DumpFile.

getHDF5DumpFile returns the absolute path to the HDF5 file where the next automatic HDF5 dataset will be written.

getHDF5DumpName returns the name of the next automatic HDF5 dataset.

getHDF5DumpCompressionLevel returns the compression level currently used for writing automatic HDF5 datasets to disk.

showHDF5DumpLog returns the dump log in an invisible data frame.

getHDF5DumpChunkDim returns the dimensions of the physical chunks that will be used to write the dataset to disk.
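
To illustrate how the chunk length and chunk shape settings feed into getHDF5DumpChunkDim, here is a small sketch. The dataset dimensions and the reduced chunk length are hypothetical values picked for the example; the exact chunk dimensions returned depend on the shape logic of `makeCappedVolumeBox`, so no particular output is shown.

```r
library(HDF5Array)

## Hypothetical dimensions of a dataset about to be written.
x_dim <- c(5000L, 2000L)

## Chunk dimensions under the current settings.
getHDF5DumpChunkDim(x_dim)

## Lower the maximum chunk length, switch the shape, and ask again.
setHDF5DumpChunkLength(10000L)
setHDF5DumpChunkShape("first-dim-grows-first")
chunkdim <- getHDF5DumpChunkDim(x_dim)
chunkdim

## Number of array elements per chunk, to compare with the maximum
## chunk length currently in effect.
prod(chunkdim)
getHDF5DumpChunkLength()

## Restore the defaults.
setHDF5DumpChunkLength()
setHDF5DumpChunkShape()
```

Since the chunk length is described as a maximum, the number of elements per chunk should not exceed it, and each chunk dimension should be no larger than the corresponding dataset dimension.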

Note

TO DEVELOPERS:

If your application needs to write its own dataset to the HDF5 dump then it should:

1. Get a file/dataset name combination by calling getHDF5DumpFile() and getHDF5DumpName(for.use=TRUE).
2. [OPTIONAL] Call getHDF5DumpChunkDim(dim) to get reasonable chunk dimensions to use for writing the dataset to disk. Or choose your own chunk dimensions.
3. Add an entry to the dump log by calling appendDatasetCreationToHDF5DumpLog. Typically, this should be done right after creating the dataset (e.g. with rhdf5::h5createDataset) and before starting to write the dataset to disk. The values passed to appendDatasetCreationToHDF5DumpLog via the filepath, name, dim, type, chunkdim, and level arguments should be those that were passed to rhdf5::h5createDataset via the file, dataset, dims, storage.mode, chunk, and level arguments, respectively. Note that appendDatasetCreationToHDF5DumpLog uses a lock mechanism, so it is safe to use in the context of parallel execution.

This is actually what the coercion method to HDF5Array does internally.
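
The steps above can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the package's internal code: the matrix `x` is made-up data, the dataset is written in a single `rhdf5::h5write` call for brevity, and the sketch assumes that the file returned by getHDF5DumpFile() already exists as a valid HDF5 file when h5createDataset is called.

```r
library(HDF5Array)
library(rhdf5)

## Made-up data to write to the HDF5 dump.
x <- matrix(runif(600), nrow=150)
x_dim <- dim(x)
x_type <- storage.mode(x)          # "double"

## Step 1: get a file/name combination reserved for our own use.
filepath <- getHDF5DumpFile()
name <- getHDF5DumpName(for.use=TRUE)

## Step 2 (optional): use the chunk dimensions suggested by the current
## dump settings, and the current compression level.
chunkdim <- getHDF5DumpChunkDim(x_dim)
level <- getHDF5DumpCompressionLevel()

## Step 3: create the dataset, then log its creation with the same
## values, before writing any data.
h5createDataset(filepath, name, dims=x_dim, storage.mode=x_type,
                chunk=chunkdim, level=level)
appendDatasetCreationToHDF5DumpLog(filepath, name, x_dim, x_type,
                                   chunkdim, level)

## Write the data, then wrap the on-disk dataset in an HDF5Array object.
h5write(x, filepath, name)
X <- HDF5Array(filepath, name)
X

## The new dataset should now be listed in the dump log.
showHDF5DumpLog()
```

An application that writes large datasets would typically replace the single `h5write` call with block-by-block writes; the logging step stays the same.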

Examples

getHDF5DumpDir()
getHDF5DumpFile()

## Use setHDF5DumpFile() to change the current HDF5 dump file.
## If the specified file exists, then it must be in HDF5 format or
## an error will be raised. If it doesn't exist, then it will be
## created.
#setHDF5DumpFile("path/to/some/HDF5/file")

lsHDF5DumpFile()

a <- array(1:600, c(150, 4))
A <- as(a, "HDF5Array")
lsHDF5DumpFile()
A

b <- array(runif(1200), c(4, 2, 150))
B <- as(b, "HDF5Array")
lsHDF5DumpFile()
B

C <- (log(2 * A + 0.88) - 5)^3 * t(B[ , 1, ])
as(C, "HDF5Array")  # realize C on disk
lsHDF5DumpFile()

## Matrix multiplication is not delayed: the output matrix is realized
## block by block. The current "realization backend" controls where
## realization happens e.g. in memory if set to NULL or in an HDF5 file
## if set to "HDF5Array". See '?realize' in the DelayedArray package for
## more information about "realization backends".
setAutoRealizationBackend("HDF5Array")
m <- matrix(runif(20), nrow=4)
P <- C %*% m
lsHDF5DumpFile()

## See all the HDF5 datasets created in the current session so far:
showHDF5DumpLog()

## Wrap the call in suppressMessages() if you are only interested in the
## data frame version of the dump log:
dump_log <- suppressMessages(showHDF5DumpLog())
dump_log

Bioconductor/HDF5Array documentation built on June 8, 2025, 4:19 a.m.