ff: ff classes for representing (large) atomic data

View source: R/ff.R

ffR Documentation

ff classes for representing (large) atomic data

Description

The ff package provides atomic data structures that are stored on disk but behave (almost) as if they were in RAM by mapping only a section (pagesize) into main memory (the effective main memory consumption per ff object). Several access optimization techniques such as Hyrid Index Preprocessing (as.hi, update.ff) and Virtualization (virtual, vt, vw) are implemented to achieve good performance even with large datasets. In addition to the basic access functions, the ff package also provides compatibility functions that facilitate writing code for ff and ram objects (clone, as.ff, as.ram) and very basic support for operating on ff objects (ffapply). While the (possibly packed) raw data is stored on a flat file, meta informations about the atomic data structure such as its dimension, virtual storage mode (vmode), factor level encoding, internal length etc.. are stored as an ordinary R object (external pointer plus attributes) and can be saved in the workspace. The raw flat file data encoding is always in native machine format for optimal performance and provides several packing schemes for different data types such as logical, raw, integer and double (in an extended version support for more tighly packed virtual data types is supported). flatfile data files can be shared among ff objects in the same R process or even from different R processes due to Memory-Mapping, although the caching effects have not been tested extensively.
Please do read and understand the limitations and warnings in LimWarn before you do anything serious with package ff.

Usage

ff( initdata  = NULL
, length      = NULL
, levels      = NULL
, ordered     = NULL
, dim         = NULL
, dimorder    = NULL
, bydim       = NULL
, symmetric   = FALSE
, fixdiag     = NULL
, names       = NULL
, dimnames    = NULL
, ramclass    = NULL
, ramattribs  = NULL
, vmode       = NULL
, update      = NULL
, pattern     = NULL
, filename    = NULL
, overwrite   = FALSE
, readonly    = FALSE
, pagesize    = NULL  # getOption("ffpagesize")
, caching     = NULL  # getOption("ffcaching")
, finalizer   = NULL
, finonexit   = NULL  # getOption("fffinonexit")
, FF_RETURN   = TRUE
, BATCHSIZE   = .Machine$integer.max
, BATCHBYTES  = getOption("ffbatchbytes")
, VERBOSE     = FALSE
)

Arguments

initdata

scalar or vector of the .vimplemented vmodes, recycled if needed, default 0, see also as.vmode and vector.vmode

length

optional vector length of the object (default: derive from 'initdata' or 'dim'), see length.ff

levels

optional character vector of levels if (in this case initdata must be composed of these) (default: derive from initdata)

ordered

indicate whether the levels are ordered (TRUE) or non-ordered factor (FALSE, default)

dim

optional array dim, see dim.ff and array

dimorder

physical layout (default seq_along(dim)), see dimorder and aperm

bydim

dimorder by which to interpret the 'initdata', generalization of the 'byrow' paramter in matrix

symmetric

extended feature: TRUE creates symmetric matrix (default FALSE)

fixdiag

extended feature: non-NULL scalar requires fixed diagonal for symmetric matrix (default NULL is free diagonal)

names

NOT taken from initdata, see names

dimnames

NOT taken from initdata, see dimnames

ramclass

class attribute attached when moving all or parts of this ff into ram, see ramclass

ramattribs

additional attributes attached when moving all or parts of this ff into ram, see ramattribs

vmode

virtual storage mode (default: derive from 'initdata'), see vmode and as.vmode

update

set to FALSE to avoid updating with 'initdata' (default TRUE) (used by ffdf)

pattern

root pattern with or without path for automatic ff filename creation (default NULL translates to "ff"), see also argument 'filename'

filename

ff filename with or without path (default tmpfile with 'pattern' prefix); without path the file is created in getOption("fftempdir"), with path '.' the file is created in getwd. Note that files created in getOption("fftempdir") have default finalizer "delete" while other files have default finalizer "close". See also arguments 'pattern' and 'finalizer' and physical

overwrite

set to TRUE to allow overwriting existing files (default FALSE)

readonly

set to TRUE to forbid writing to existing files

pagesize

pagesize in bytes for the memory mapping (default from getOptions("ffpagesize") initialized by getdefaultpagesize), see also physical

caching

caching scheme for the backend, currently 'mmnoflush' or 'mmeachflush' (flush mmpages at each swap, default from getOptions("ffcaching") initialized with 'mmeachflush'), see also physical

finalizer

name of finalizer function called when ff object is removed (default: ff files created in getOptions("fftempdir") are considered temporary and have default finalizer delete, files created in other locations have default finalizer close); available finalizer generics are "close", "delete" and "deleteIfOpen", available methods are close.ff, delete.ff and deleteIfOpen.ff, see also argument 'finonexit' and finalizer

finonexit

logical scalar determining whether and finalize is also called when R is closed via q, (default TRUE from getOptions("fffinonexit"))

FF_RETURN

logical scalar or ff object to be used. The default TRUE creates a new ff file. FALSE returns a ram object. Handing over an ff object here uses this or stops if not ffsuitable

BATCHSIZE

integer scalar limiting the number of elements to be processed in update.ff when length(initdata)>1, default from .Machine$integer.max

BATCHBYTES

integer scalar limiting the number of bytes to be processed in update.ff when length(initdata)>1, default from getOption("ffbatchbytes"), see also .rambytes

VERBOSE

set to TRUE for verbosing in update.ff when length(initdata)>1, default FALSE

Details

The atomic data is stored in filename as a native encoded raw flat file on disk, OS specific limitations of the file system apply. The number of elements per ff object is limited to the integer indexing, i.e. .Machine$integer.max. Atomic objects created with ff are is.open, a C++ object is ready to access the file via memory-mapping. Currently the C++ backend provides two caching schemes: 'mmnoflush' let the OS decide when to flash memory mapped pages and 'mmeachflush' will flush memory mapped pages at each page swap per ff file. These minimal memory ressources can be released by closeing or deleteing the ff file. ff objects can be saved and loaded across R sessions. If the ff file still exists in the same location, it will be opened automatically at the first attempt to access its data. If the ff object is removed, at the next garbage collection (see gc) the ff object's finalizer is invoked. Raw data files can be made accessible as an ff object by explicitly given the filename and vmode but no size information (length or dim). The ff object will open the file and handle the data with respect to the given vmode. The close finalizer will close the ff file, the delete finalizer will delete the ff file. The default finalizer deleteIfOpen will delete open files and do nothing for closed files. If the default finalizer is used, two actions are needed to protect the ff file against deletion: create the file outside the standard 'fftempdir' and close the ff object before removing it or before quitting R. When R is exited through q, the finalizer will be invoked depending on the 'fffinonexit' option, furthermore the 'fftempdir' is unlinked.

Value

If (!FF_RETURN) then a ram object like those generated by vector, matrix, array but with attributes 'vmode', 'physical' and 'virtual' accessible via vmode, physical and virtual
If (FF_RETURN) an object of class 'ff' which is a a list with two components:

physical

an external pointer of class 'ff_pointer' which carries attributes with copy by reference semantics: changing a physical attribute of a copy changes the original

virtual

an empty list which carries attributes with copy by value semantics: changing a virtual attribute of a copy does not change the original

Physical object component

The 'ff_pointer' carries the following 'physical' or readonly attributes, which are accessible via physical:

vmode see vmode
maxlength see maxlength
pattern see parameter 'pattern'
filename see filename
pagesize see parameter 'pagesize'
caching see parameter 'caching'
finalizer see parameter 'finalizer'
finonexit see parameter 'finonexit'
readonly see is.readonly
class The external pointer needs class 'ff_pointer' to allow method dispatch of finalizers

Virtual object component

The 'virtual' component carries the following attributes (some of which might be NULL):

Length see length.ff
Levels see levels.ff
Names see names.ff
VW see vw.ff
Dim see dim.ff
Dimorder see dimorder
Symmetric see symmetric.ff
Fixdiag see fixdiag.ff
ramclass see ramclass
ramattribs see ramattribs

Class

You should not rely on the internal structure of ff objects or their ram versions. Instead use the accessor functions like vmode, physical and virtual. Still it would be wise to avoid attributes AND classes 'vmode', 'physical' and 'virtual' in any other packages. Note that the 'ff' object's class attribute also has copy-by-value semantics ('virtual'). For the 'ff' object the following class attritibutes are known:

vector c("ff_vector","ff")
matrix c("ff_matrix","ff_array","ff")
array c("ff_array","ff")
symmetric matrix c("ff_symm","ff")
distance matrix c("ff_dist","ff_symm","ff")
reserved for future use c("ff_mixed","ff")

Methods

The following methods and functions are available for ff objects:

Type Name Assign Comment
Basic functions
function ff constructor for ff and ram objects
generic update updates one ff object with the content of another
generic clone clones an ff object optionally changing some of its features
method print print ff
method str ff object structure
Class test and coercion
function is.ff check if inherits from ff
generic as.ff coerce to ff, if not yet
generic as.ram coerce to ram retaining some of the ff information
generic as.bit coerce to bit
Virtual storage mode
generic vmode <- get and set virtual mode (setting only for ram, not for ff objects)
generic as.vmode coerce to vmode (only for ram, not for ff objects)
Physical attributes
function physical <- set and get physical attributes
generic filename <- get and set filename
generic pattern <- get pattern and set filename path and prefix via pattern
generic maxlength get maxlength
generic is.sorted <- set and get if is marked as sorted
generic na.count <- set and get NA count, if set to non-NA only swap methods can change and na.count is maintained automatically
generic is.readonly get if is readonly
Virtual attributes
function virtual <- set and get virtual attributes
method length <- set and get length
method dim <- set and get dim
generic dimorder <- set and get the order of dimension interpretation
generic vt virtually transpose ff_array
method t create transposed clone of ff_array
generic vw <- set and get virtual windows
method names <- set and get names
method dimnames <- set and get dimnames
generic symmetric get if is symmetric
generic fixdiag <- set and get fixed diagonal of symmetric matrix
method levels <- levels of factor
generic recodeLevels recode a factor to different levels
generic sortLevels sort the levels and recoce a factor
method is.factor if is factor
method is.ordered if is ordered (factor)
generic ramclass get ramclass
generic ramattribs get ramattribs
Access functions
function get.ff get single ff element (currently [[ is a shortcut)
function set.ff set single ff element (currently [[<- is a shortcut)
function getset.ff set single ff element and get old value in one access operation
function read.ff get vector of contiguous elements
function write.ff set vector of contiguous elements
function readwrite.ff set vector of contiguous elements and get old values in one access operation
method [ get vector of indexed elements, uses HIP, see hi
method [<- set vector of indexed elements, uses HIP, see hi
generic swap set vector of indexed elements and get old values in one access operation
generic add (almost) unifies '+=' operation for ff and ram objects
generic bigsample sample from ff object
Opening/Closing/Deleting
generic is.open check if ff is open
method open open ff object (is done automatically on access)
method close close ff object (releases C++ memory and protects against file deletion if deleteIfOpen) finalizer is used
generic delete deletes ff file (unconditionally)
generic deleteIfOpen deletes ff file if ff object is open (finalization method)
generic finalizer <- get and set finalizer
generic finalize force finalization
Other
function geterror.ff get error code
function geterrstr.ff get error message

ff options

Through options or getOption one can change and query global features of the ff package:

option description default
fftempdir default directory for creating ff files tempdir
fffinalizer name of default finalizer deleteIfOpen
fffinonexit default for invoking finalizer on exit of R TRUE
ffpagesize default pagesize getdefaultpagesize
ffcaching caching scheme for the C++ backend 'mmnoflush'
ffdrop default for the drop parameter in the ff subscript methods TRUE
ffbatchbytes default for the byte limit in batched/chunked processing 16MB

OS specific

The following table gives an overview of file size limits for common file systems (see https://en.wikipedia.org/wiki/Comparison_of_file_systems for further details):

File System File size limit
FAT16 2GB
FAT32 4GB
NTFS 16GB
ext2/3/4 16GB to 2TB
ReiserFS 4GB (up to version 3.4) / 8TB (from version 3.5)
XFS 8EB
JFS 4PB
HFS 2GB
HFS Plus 16GB
USF1 4GB to 256TB
USF2 512GB to 32PB
UDF 16EB

Credits

Package Version 1.0

Daniel Adler dadler@uni-goettingen.de
R package design, C++ generic file vectors, Memory-Mapping, 64-bit Multi-Indexing adapter and Documentation, Platform ports
Oleg Nenadic onenadi@uni-goettingen.de
Index sequence packing, Documentation
Walter Zucchini wzucchi@uni-goettingen.de
Array Indexing, Sampling, Documentation
Christian Gläser christian_glaeser@gmx.de
Wrapper for biglm package

Package Version 2.0

Jens Oehlschlägel Jens.Oehlschlaegel@truecluster.com
R package redesign; Hybrid Index Preprocessing; transparent object creation and finalization; vmode design; virtualization and hybrid copying; arrays with dimorder and bydim; symmetric matrices; factors and POSIXct; virtual windows and transpose; new generics update, clone, swap, add, as.ff and as.ram; ffapply and collapsing functions. R-coding, C-coding and Rd-documentation.
Daniel Adler dadler@uni-goettingen.de
C++ generic file vectors, vmode implementation and low-level bit-packing/unpacking, arithmetic operations and NA handling, Memory-Mapping and backend caching. C++ coding and platform ports. R-code extensions for opening existing flat files readonly and shared.

Licence

Package under GPL-2, included C++ code released by Daniel Adler under the less restrictive ISCL

Note

Note that the standard finalizers are generic functions, their dispatch to the 'ff_pointer' method happens at finalization time, their 'ff' methods exist for direct calling.

See Also

vector, matrix, array, as.ff, as.ram

Examples

  message("make sure you understand the following ff options 
    before you start using the ff package!!")
  oldoptions <- options(fffinalizer="deleteIfOpen", fffinonexit="TRUE", fftempdir=tempdir())
  message("an integer vector")
  ff(1:12)                  
  message("a double vector of length 12")
  ff(0, 12)
  message("a 2-bit logical vector of length 12 (vmode='boolean' has 1 bit)")
  ff(vmode="logical", length=12)
  message("an integer matrix 3x4 (standard colwise physical layout)")
  ff(1:12, dim=c(3,4))
  message("an integer matrix 3x4 (rowwise physical layout, but filled in standard colwise order)")
  ff(1:12, dim=c(3,4), dimorder=c(2,1))
  message("an integer matrix 3x4 (standard colwise physical layout, but filled in rowwise order
aka matrix(, byrow=TRUE))")
  ff(1:12, dim=c(3,4), bydim=c(2,1))
  gc()
  options(oldoptions)

  if (ffxtensions()){
     message("a 26-dimensional boolean array using 1-bit representation
      (file size 8 MB compared to 256 MB int in ram)")
     a <- ff(vmode="boolean", dim=rep(2, 26))
     dimnames(a) <- dummy.dimnames(a)
     rm(a); gc()
  }

  ## Not run: 

     message("This 2GB biglm example can take long, you might want to change
       the size in order to define a size appropriate for your computer")
     require(biglm)

     b <- 1000
     n <- 100000
     k <- 3
     memory.size(max = TRUE)
     system.time(
     x <- ff(vmode="double", dim=c(b*n,k), dimnames=list(NULL, LETTERS[1:k]))
     )
     memory.size(max = TRUE)
     system.time(
     ffrowapply({
        l <- i2 - i1 + 1
        z <- rnorm(l)
        x[i1:i2,] <- z + matrix(rnorm(l*k), l, k)
     }, X=x, VERBOSE=TRUE, BATCHSIZE=n)
     )
     memory.size(max = TRUE)

     form <- A ~ B + C
     first <- TRUE
     system.time(
     ffrowapply({
        if (first){
          first <- FALSE
          fit <- biglm(form, as.data.frame(x[i1:i2,,drop=FALSE], stringsAsFactors = TRUE))
        }else
          fit <- update(fit, as.data.frame(x[i1:i2,,drop=FALSE], stringsAsFactors = TRUE))
     }, X=x, VERBOSE=TRUE, BATCHSIZE=n)
     )
     memory.size(max = TRUE)
     first
     fit
     summary(fit)
     rm(x); gc()
  
## End(Not run)

ff documentation built on Sept. 30, 2024, 9:38 a.m.

Related to ff in ff...