lazyarray: Create or load a 'lazyarray' instance

lazyarray R Documentation

Create or load a lazyarray instance

Description

Creates or loads a lazyarray that stores data on the hard disk. The data content is loaded on demand.

Usage

lazyarray(
  path,
  dim,
  read_only = FALSE,
  type = c("filearray", "fstarray"),
  storage_format = c("double", "integer", "complex", "character"),
  meta_name = "lazyarray.meta"
)

fstarray(
  path,
  dim,
  read_only = FALSE,
  storage_format = c("double", "integer", "complex", "character"),
  meta_name = "lazyarray.meta"
)

filearray(
  path,
  dim,
  read_only = FALSE,
  storage_format = c("double", "integer"),
  meta_name = "lazyarray.meta"
)

as.lazymatrix(x, ...)

as.lazyarray(x, path, type = "filearray", ...)

Arguments

path

path to a local drive where array data should be stored

dim

integer vector, dimension of array, see dim

read_only

whether created array is read-only

type

the back-end implementation of the array; choices are "filearray" and "fstarray".

storage_format

data type, choices are "double", "integer", "character", and "complex"; see details

meta_name

header file name, default is "lazyarray.meta"

x

An R matrix or array

...

passed into lazyarray

Details

The function lazyarray() can either create or load an array on the hard drive. When path exists as a directory and a valid array instance is stored there, lazyarray will ignore other parameters such as storage_format, type, and sometimes dim (see Section "Array Partitions"). The function will try to load the existing array described by the meta file. When path is missing, or there are no valid array files inside the directory, a new array will be created, and path will be created automatically if it is missing.

There are two back-end implementations for lazyarray(): "filearray" and "fstarray". Use type to choose the implementation that serves your needs. The two types differ, and each has its own strengths and weaknesses; see Section "Array Types" for details.

The argument meta_name specifies the name of the file that stores all the attribute information, such as the total dimension, partition size, file format, and storage format. There can be multiple meta files for the same array object; see Section "Array Partitions" for details.

Value

An R6 lazyarray instance. Its class is either FstArray or FileArray, depending on the type specified. Both inherit from AbstractLazyArray.

Array Types

Type filearray stores data in its binary form "as-is" on the local drive. This format is compatible with the package filematrix. The supported data types are integers and double-precision floating-point numbers.

Type fstarray stores data in the fst format defined by the package fstcore, using the 'ZSTD' compression technique. Unlike filearray, fstarray supports complex numbers and character strings in addition to integers and doubles.

On 'NVMe' solid-state drives, filearray can reach read speeds of up to 3 GB per second, and fstarray up to 1 GB per second.

By default, filearray will be used if the storage format is supported, and fstarray is the back-up option. However, if the array data is structured or ordered, or if storage size is a major concern, fstarray might achieve better performance because it compresses data before writing to the hard drive.

To explicitly create a file array, use the function filearray(). Similarly, use fstarray() to create an fst-based array.
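The remark about structured or ordered data can be illustrated with a small base-R sketch. Note the assumptions: gzip (via memCompress) is used here only as a stand-in for the 'ZSTD' compression mentioned above, and the variable names are made up for this illustration; it does not call lazyarray or fstcore.

```r
# Illustration only: gzip (base R) stands in for ZSTD; this is not fstarray code.
set.seed(1)
n <- 10000
ordered_data <- as.double(rep(1:10, each = n / 10))  # long runs of repeated values
random_data  <- rnorm(n)                             # no exploitable structure

# Serialize each vector to raw bytes (8 bytes per double)
ordered_raw <- writeBin(ordered_data, raw())
random_raw  <- writeBin(random_data, raw())

# Compressed sizes in bytes
ordered_size <- length(memCompress(ordered_raw, type = "gzip"))
random_size  <- length(memCompress(random_raw, type = "gzip"))

c(uncompressed = length(ordered_raw), ordered = ordered_size, random = random_size)
```

The ordered vector compresses to a small fraction of its raw size while the random one barely shrinks, which is why a compressing back-end may need to move far fewer bytes for structured data.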

Array Partitions

A lazyarray partitions data in two ways: file partitions and in-file blocks.

1. File-level Partition:

The number of file partitions equals the size of the last array margin. Given a 100 x 200 x 30 x 4 array, there will be 4 partitions; each partition stores a 100 x 200 x 30 sub-array, or 600,000 elements.

Once an array is created, the length of each partition can no longer change. However, the shape of each partition can be changed, and the number of partitions can grow or shrink. To change these, create a new meta file that specifies the new dimension; this incurs no additional cost. In the previous example, the partition sub-dimension can be 10000 x 60, 2000 x 300, or 1000 x 200 x 3, as long as the total length matches. The number of partitions can change to 3, 5, 100, or any other positive integer. To change the total dimension to 600000 x 100, call lazyarray with the new dimension (see examples). Please make sure that type and meta_name are specified.

2. In-file Blocks:

Within each file, the data are stored in blocks. When reading the data, if any element within a block is used, then the whole block gets read.

For filearray, the block size equals the first margin. For example, a 100 x 200 x 3 file array will have 3 file partitions, each containing 200 blocks of 100 elements.

As for fstarray, the lower bound of the block size can be set by options(lazyarray.fstarray.blocksize=...). By default, this number is 16,384. For a 100 x 200 x 3 array, each partition has only one block, and the block size is 20,000.
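The partition and block arithmetic described in this section can be checked with a few lines of base R. This is only a sketch of the counting rules stated above; the variable names are hypothetical and nothing here calls the lazyarray package itself.

```r
# File-level partitions for a 100 x 200 x 30 x 4 array
dims <- c(100, 200, 30, 4)
n_partitions  <- dims[length(dims)]         # size of the last margin: 4
partition_len <- prod(dims[-length(dims)])  # elements per partition: 600000

# Alternative partition shapes are valid only if their length matches
shapes <- list(c(10000, 60), c(2000, 300), c(1000, 200, 3))
shape_ok <- vapply(shapes, function(s) prod(s) == partition_len, logical(1))

# In-file blocks for a 100 x 200 x 3 filearray: block size = first margin
fdims <- c(100, 200, 3)
block_size <- fdims[1]                                  # 100 elements per block
file_partition_len <- prod(fdims[-length(fdims)])       # 20000 elements per partition
blocks_per_partition <- file_partition_len / block_size # 200 blocks

list(n_partitions = n_partitions, partition_len = partition_len,
     shape_ok = shape_ok, blocks_per_partition = blocks_per_partition)
```

For the same 100 x 200 x 3 array stored as fstarray, the 20,000 elements of one partition exceed the default lower bound of 16,384, which is consistent with each partition holding a single block.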

Indexing and Recommended Dimension Settings

If there is a dimension that defines the unit of analysis, then make it the last margin index. If a margin is rarely indexed, put it in the first margin. This is because indexing along the last margin is the fastest, and indexing along the first margin is the slowest.

If x has dimension 200 x 200 x 200, then x[,,i] is the fastest, followed by x[,i,], and then x[i,,].
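One way to see why the last margin is the cheapest to index is to count how many separate contiguous runs of storage each kind of slice touches. The sketch below uses an ordinary in-memory R array and assumes the on-disk layout follows R's column-major order, with the last margin mapped to partitions as described above; count_runs is a helper written for this illustration, not part of lazyarray.

```r
dims <- c(200, 200, 200)
# idx[a, b, c] holds the linear (column-major) position of that element
idx <- array(seq_len(prod(dims)), dim = dims)

# Number of contiguous runs of positions covered by a slice
count_runs <- function(v) {
  v <- sort(as.vector(v))
  1 + sum(diff(v) != 1)
}

i <- 5
runs <- c(
  last_margin   = count_runs(idx[, , i]),  # one contiguous run
  middle_margin = count_runs(idx[, i, ]),  # 200 runs of 200 elements
  first_margin  = count_runs(idx[i, , ])   # 40000 isolated elements
)
runs
```

Each slice returns the same number of elements (40,000), but the first-margin slice scatters them across every block, so far more data has to be read to serve it.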

Author(s)

Zhengjia Wang

Examples


library(lazyarray)

path <- tempfile()

# ---------------- Case 1: Create new array ------------------
arr <- lazyarray(path, storage_format = 'double', dim = c(2,3,4))
arr[] <- 1:24

# Subset and get the first partition
arr[,,1]

# Partition file path (total 4 partitions)
arr$get_partition_fpath()

# Removing array doesn't clear the data
rm(arr); gc()

# ---------------- Case 2: Load from existing directory ----------------
# Load from existing path, no need to specify other params
arr <- lazyarray(path, read_only = TRUE)

summary(arr, quiet = TRUE)

# ---------------- Case 3: Import from existing data ----------------

# Change dimension to 6 x 20

arr1 <- lazyarray(path, dim = c(6,20), meta_name = "arr_6x20.meta")

arr1[,1:5]

arr1[,1:6] <- rnorm(36)

# arr also changes
arr[,,1]


# ---------------- Case 4: Converting from R arrays ----------------

x <- matrix(1:16, 4)
x <- as.lazymatrix(x, type = 'fstarray', storage_format = "complex")
x[,]  # or x[]




dipterix/lazyarray documentation built on June 30, 2023, 6:30 a.m.