The core “big.matrix” operations.

Share:

Description

Create a big.matrix (or check to see if an object is a big.matrix, or create a big.matrix from a matrix, and so on). The big.matrix may be file-backed.

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
big.matrix(nrow, ncol, type = options()$bigmemory.default.type,
  init = NULL, dimnames = NULL, separated = FALSE,
  backingfile = NULL, backingpath = NULL, descriptorfile = NULL,
  binarydescriptor=FALSE, shared = TRUE)
filebacked.big.matrix(nrow, ncol,
  type = options()$bigmemory.default.type, init = NULL,
  dimnames = NULL, separated = FALSE, backingfile = NULL,
  backingpath = NULL, descriptorfile = NULL, binarydescriptor=FALSE)
as.big.matrix(x, type = NULL, separated = FALSE,
  backingfile = NULL, backingpath = NULL, descriptorfile = NULL,
  binarydescriptor=FALSE, shared=TRUE)
is.big.matrix(x)
is.separated(x)
is.filebacked(x)
is.shared(x)
is.readonly(x)
is.nil(address)

Arguments

x

a matrix, vector, or data.frame for as.big.matrix; if a vector, a one-column
big.matrix is created by as.big.matrix; if a data.frame, see details. For the is.* functions, x is likely a big.matrix.

nrow

number of rows.

ncol

number of columns.

type

the type of the atomic element (options()$bigmemory.default.type by default – "double" – but can be changed by the user to "integer", "short", or "char").

init

a scalar value for initializing the matrix (NULL by default to avoid unnecessary time spent doing the initializing).

dimnames

a list of the row and column names; use with caution for large objects.

separated

use separated column organization of the data; see details.

backingfile

the root name for the file(s) for the cache of x.

backingpath

the path to the directory containing the file backing cache.

descriptorfile

the name of the file to hold the backingfile description, for subsequent use with attach.big.matrix; if NULL, the backingfile is used as the root part of the descriptor file name. The descriptor file is placed in the same directory as the backing files.

binarydescriptor

the flag to specify if the binary RDS format should be used for the backingfile description, for subsequent use with attach.big.matrix; if NULL of FALSE, the dput() file format is used.

shared

TRUE by default, and always TRUE if the big.matrix is file-backed. For a non-filebacked big.matrix, shared=FALSE uses non-shared memory, which can be more stable for large (say, >50% of RAM) objects. Shared memory allocation can sometimes fail in such cases due to exhausted shared-memory resources in the system.

address

an externalptr, so is.nil(x@address) might be a sensible thing to want to check, but it's pretty obscure.

Details

A big.matrix consists of an object in R that does nothing more than point to the data structure implemented in C++. The object acts much like a traditional R matrix, but helps protect the user from many inadvertant memory-consuming pitfalls of traditional R matrices and data frames.

There are two big.matrix types which manage data in different ways. A standard, shared big.matrix is constrained to available RAM, and may be shared across separate R processes. A file-backed big.matrix may exceed available RAM by using hard drive space, and may also be shared across processes. The atomic types of these matrices may be double, integer, short, or char (8, 4, 2, and 1 bytes, respectively).

If x is a big.matrix, then x[1:5,] is returned as an R matrix containing the first five rows of x. If x is of type double, then the result will be numeric; otherwise, the result will be an integer R matrix. The expression x alone will display information about the R object (e.g. the external pointer) rather than evaluating the matrix itself (the user should try x[,] with extreme caution, recognizing that a huge R matrix will be created).

If x has a huge number of rows and/or columns, then the use of rownames and/or colnames will be extremely memory-intensive and should be avoided. If x has a huge number of columns and separated=TRUE is used (this isn't typically recommended), the user might want to store the transpose as there is overhead of a pointer for each column in the matrix. If separated is TRUE, then the memory is allocated into separate vectors for each column. Use this option with caution if you have a large number of columns, as shared-memory segments are limited by OS and hardware combinations. If separated is FALSE, the matrix is stored in traditional column-major format. The function is.separated() returns the separation type of the big.matrix.

When a big.matrix, x, is passed as an argument to a function, it is essentially providing call-by-reference rather than call-by-value behavior. If the function modifies any of the values of x, the changes are not limited in scope to a local copy within the function. This introduces the possibility of side-effects, in contrast to standard R behavior.

A file-backed big.matrix may exceed available RAM in size by using a file cache (or possibly multiple file caches, if separated=TRUE). This can incur a substantial performance penalty for such large matrices, but less of a penalty than most other approaches for handling such large objects. A side-effect of creating a file-backed object is not only the file-backing(s), but a descriptor file (in the same directory) that is needed for subsequent attachments (see attach.big.matrix).

Note that we do not allow setting or changing the dimnames attributes by default; such changes would not be reflected in the descriptor objects or in shared memory. To override this, set
options(bigmemory.allow.dimnames=TRUE).

It should also be noted that a user can create an “anonymous” file-backed big.matrix by specifying "" as the filebacking argument. In this case, the backing resides in the temporary directory and a descriptor file is not created. These should be used with caution since even anonymous backings use disk space which could eventually fill the hard drive. Anonymous backings are removed either manually, by a user, or automatically, when the operating system deems it appropriate.

Finally, note that as.big.matrix can coerce data frames. It does this by making any character columns into factors, and then making all factors numeric before forming the big.matrix. Level labels are not preserved and must be managed by the user if desired.

Value

A big.matrix is returned (for big.matrix and filebacked.big.matrix, and
as.big.matrix), and TRUE or FALSE for is.big.matrix and the other functions.

Author(s)

John W. Emerson and Michael J. Kane <bigmemoryauthors@gmail.com>

References

The Bigmemory Project: http://www.bigmemory.org/.

See Also

bigmemory, and perhaps the class documentation of big.matrix; attach.big.matrix and describe. Sister packages biganalytics, bigtabulate, synchronicity, and bigalgebra provide advanced functionality.

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
x <- big.matrix(10, 2, type='integer', init=-5)
options(bigmemory.allow.dimnames=TRUE)
colnames(x) <- c("alpha", "beta")
is.big.matrix(x)
dim(x)
colnames(x)
rownames(x)
x[,]
x[1:8,1] <- 11:18
colnames(x) <- NULL
x[,]

x <- as.big.matrix(matrix(-5, 10, 2))
colnames(x) <- c("alpha", "beta")
is.big.matrix(x)
dim(x)
colnames(x)
rownames(x)
x[1:8,1] <- 11:18
x[,]

# The following shared memory example is quite silly, as you wouldn't
# likely do this in a single R session.  But if zdescription were
# passed to another R session via SNOW, foreach, or even by a
# simple file read/write, then the attach.big.matrix() within the
# second R process would give access to the same object in memory.
# Please see the package vignette for real examples.

z <- big.matrix(3, 3, type='integer', init=3)
z[,]
dim(z)
z[1,1] <- 2
z[,]
zdescription <- describe(z)
zdescription
y <- attach.big.matrix(zdescription)
y[,]
y
z
y[1,1] <- -100
y[,]
z[,]

# A short filebacked example, showing the creation of associated files:
files <- dir()
files[grep("example.bin", files)]
z <- filebacked.big.matrix(3, 3, type='integer', init=123,
                           backingfile="example.bin",
                           descriptorfile="example.desc",
                           dimnames=list(c('a','b','c'), c('d', 'e', 'f')))
z[,]
files <- dir()
files[grep("example.bin", files)]
zz <- attach.big.matrix("example.desc")
zz[,]
zz[1,1] <- 0
zzz <- attach.big.matrix(describe(z))
zzz[,]

is.nil(z@address)