BigDataStatMeth    R Documentation

BigDataStatMeth: Scalable statistical computing with R, C++, and HDF5

Description

BigDataStatMeth provides statistical and linear algebra operations for matrices stored in HDF5 files. The package is designed for workflows in which matrices may be too large to be held entirely in memory, while still allowing users to work with familiar R functions.

The recommended user-facing interface is based on HDF5Matrix objects and standard R methods. HDF5-backed matrices can be manipulated using calls such as dim(), [, %*%, crossprod(), tcrossprod(), scale(), cor(), svd(), prcomp(), qr(), chol(), and solve().

Main user-facing functionality

  • Core HDF5 matrix handling: hdf5_create_matrix(), hdf5_matrix(), list_datasets(), is_open(), close(), and hdf5_close_all().

  • Subsetting and conversion: [, [<-, as.matrix(), and as.data.frame().

  • Dimension names: rownames(), colnames(), and dimnames().

  • Element-wise arithmetic: +, -, *, and / for HDF5Matrix objects.

  • Matrix algebra: %*%, crossprod(), tcrossprod(), cbind(), and rbind().

  • Aggregations and summaries: colSums(), rowSums(), colMeans(), rowMeans(), colVars(), rowVars(), colSds(), rowSds(), colMins(), rowMins(), colMaxs(), rowMaxs(), mean(), var(), and sd().

  • Statistical transformations: scale(), sweep(), and cor().

  • Matrix decompositions and factorizations: svd(), prcomp(), qr(), chol(), solve(), eigen(), and pseudoinverse().

  • Diagonal, split, reduce, and apply operations: diag(), diag_op(), diag_scale(), split_dataset(), reduce(), and apply_function().

Additional high-level utilities

Most user workflows can be expressed through HDF5Matrix objects and standard R methods. Some functions keep the bd* prefix because they provide additional utilities that do not map directly to a standard R generic, or because they expose workflows available in earlier versions of the package. Examples include utilities for creating HDF5 groups, moving datasets, and writing HDF5-backed dimension names. These functions remain part of the package API and are documented in their corresponding help pages.

Global options and HDF5 resources

Block-wise operations can be configured with hdf5matrix_options(), including options for parallel execution, number of threads, block size, and HDF5 compression. Open HDF5 resources can be closed explicitly with close() for individual objects or hdf5_close_all() for all handles tracked by the package.
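To make the role of the block size more concrete, the following is a minimal sketch in plain, in-memory R. It does not call hdf5matrix_options() and is not the package's internal code. The variable block_size and the loop are assumptions made for illustration: they only show how a matrix can be traversed in row blocks so that one block at a time is needed to compute column means.

```r
# Illustrative only: plain in-memory R, not BigDataStatMeth internals.
set.seed(1)
X <- matrix(rnorm(100 * 20), nrow = 100, ncol = 20)

block_size <- 32L  # hypothetical number of rows processed per block

# First row of each block; the last block may be shorter than block_size.
starts <- seq(1L, nrow(X), by = block_size)

col_totals <- numeric(ncol(X))
for (s in starts) {
  e <- min(s + block_size - 1L, nrow(X))
  block <- X[s:e, , drop = FALSE]          # only this block is needed now
  col_totals <- col_totals + colSums(block)
}

# Column means assembled from the per-block sums.
block_col_means <- col_totals / nrow(X)

# The block-wise result agrees with the direct computation.
stopifnot(isTRUE(all.equal(block_col_means, colMeans(X))))
```

A smaller block size lowers the amount of data held at once but increases the number of reads, which is the kind of trade-off a block-size option is typically meant to control.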

Architecture and developer interfaces

BigDataStatMeth is organized around a standard R interface backed by a C++ computational infrastructure. The user-facing layer is based on HDF5Matrix objects and S3 methods, allowing HDF5-backed matrices to be used with familiar R functions.

Internally, a lightweight R6 layer connects these R methods with the C++ backend. The C++ infrastructure provides classes for managing HDF5 files, groups, and datasets, together with block-wise routines for linear algebra and statistical operations.
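The idea behind a block-wise linear algebra routine can be sketched in plain R. The helper blockwise_crossprod() below is hypothetical and is not part of the package API or its C++ backend. It only illustrates the identity that t(X) %*% X equals the sum of t(B) %*% B over the row blocks B of X, which is what makes it possible to accumulate a cross-product without holding the whole matrix at once.

```r
# Illustrative only: an in-memory sketch, not the package's C++ routine.
set.seed(1)
X <- matrix(rnorm(100 * 20), nrow = 100, ncol = 20)

# Hypothetical helper: accumulate t(X) %*% X one row block at a time.
blockwise_crossprod <- function(X, block_size = 25L) {
  p <- ncol(X)
  acc <- matrix(0, nrow = p, ncol = p)
  for (s in seq(1L, nrow(X), by = block_size)) {
    e <- min(s + block_size - 1L, nrow(X))
    block <- X[s:e, , drop = FALSE]
    acc <- acc + crossprod(block)   # contribution of this row block
  }
  acc
}

XtX_blocks <- blockwise_crossprod(X)

# The accumulated result agrees with the direct cross-product.
stopifnot(isTRUE(all.equal(XtX_blocks, crossprod(X))))
```

In an HDF5-backed setting, each block would be read from the file rather than sliced from an in-memory matrix, while the accumulation step stays the same.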

This design allows developers to implement new scalable methods from Rcpp-based code while reusing the package machinery for HDF5 file management, block iteration, compression handling, and numerical computation.

Getting started

See vignette("BigDataStatMeth") for a practical introduction to HDF5-backed matrices and the main user-facing functionality.

Examples

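# Temporary HDF5 file used for this example
h5file <- tempfile(fileext = ".h5")

# Simulate a small in-memory matrix (100 rows, 20 columns)
set.seed(1)
X <- matrix(rnorm(100 * 20), nrow = 100, ncol = 20)

# Write X to the HDF5 file and get an HDF5-backed matrix object
X_h5 <- hdf5_create_matrix(
  filename = h5file,
  dataset = "data/X",
  data = X,
  overwrite = TRUE
)

# Standard R methods applied to the HDF5-backed matrix
dim(X_h5)
colMeans(X_h5)

# Cross-product t(X) %*% X computed on the HDF5-backed matrix
XtX_h5 <- crossprod(X_h5)
dim(XtX_h5)

# Close individual objects, then any remaining tracked handles
close(X_h5)
close(XtX_h5)
hdf5_close_all(verbose = FALSE)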


BigDataStatMeth documentation built on May 15, 2026, 1:07 a.m.