library(ldat)
This R package contains functions and methods for working with vectors and data sets that are too large to fit into memory. The data is kept (partially) on disk using memory mapping.
The package extends the basic functionality provided by the
lvec
package, which implements
integer, logical, and numeric memory mapped vectors. It
implements additional functionality to make working with these objects easier. It
implements various base R methods, such a mean
and table
, for lvec
objects. Furthermore, it implements data.frame-like objects (ldat
objects)
that can be used to work with data sets too large to fit into memory.
The following functions and methods are implemented for lvec
objects:
[]
and []<-
. Indexing returns an lvec object."+"
, "-"
, "*"
, "/"
, "^"
, "%%"
, "%/%"
"&"
, "|"
, "!"
, "=="
, "!="
, "<"
, "<="
, ">="
, ">"
. These
return an lvec object.all
, any
, prod
, sum
, mean
, max
, min
,
range
, quantile
, median
. These all return an R-vector.abs
, sign
, sqrt
, floor
, ceiling
, trunc
,
round
, signif
exp
, log
, expm1
, log1p
, cos
, sin
, tan
,
cospi
, sinpi
, tanpi
, acos
, asin
, atan
, cosh
, sinh
, tanh
,
acosh
, asinh
, atanh
, lgamma
, gamma
, digamma
, trigamma
, cumsum
,
cumprod
, cummax
, cummin
. These all return an lvec.table
: cross tables of lvec objects.duplicated
: check for duplicated values in an lvec object.unique
: select unique values from an lvec object.which
: return the indices of true elements.match
: lookup elements in a other vectoris.na
: logical vector with missing values.These methods make working with lvec
objects almost the same as working with
regular R vectors. Some examples:
Create an vector with random numbers (in this case the vector is first
generated in memory, for long vector the generate
function can be used) and
set some values to missing:
x <- as_lvec(rnorm(1E6)) x[c(1,5)] <- NA
Calculate the mean
mean(x, na.rm = TRUE)
Truncate values to the range -1--1:
x[abs(x) > 1 & !is.na(x)] <- 1
Sort x
:
sort(x)
Attributes of vectors are preserved, which means that one can work with
factor
and Date
objects:
y <- as_lvec(factor(sample(letters, 1000, replace = TRUE))) print(y) table(y)
ldat
objects can be used to work with data sets. A ldat
objects mimics as
close as possible the functionality of a data.frame
. Under the hood it is a
list with lvec
objects.
Create a new ldat
dta <- ldat(a = 1:10, b = rnorm(10)) print(dta)
Create one from an existing dataset
dta <- as_ldat(iris)
An ldat
object can be indexed as regular data.frame (one difference is that
rows and or columns will not be dropped, e.g. drop = FALSE
). It will return a new
ldat
object.
dta[1:3, ] dta[dta$Sepal.Width > 4, ] dta[1:5, 1:2] dta[1:5, "Sepal.Width"]
Converting to a data.frame:
as.data.frame(dta[1:4, ])
lvec
and ldat
objectsAlthough some basic functionality has been implemented for ldat
and lvec
objects (see above), in practice, one will of course want to do stuff that isn't
implemented. One solution is to copy the data to a regular R-object, perform the
computation in R and if necessary copy the result back to an lvec
or ldat
object. This could for example be an option when a complete dataset doesn't fit
into memory, but the few columns needed for a computation do. An lvec
object
can be converted to an R-object using as_rvec
and an ldat
object using
as.data.frame
.
dta <- as_ldat(iris) cor(as_rvec(dta$Sepal.Length), as_rvec(dta$Sepal.Width))
However, in practice, the data doesn't always fit into memory, or one can not
assume it does. In that case chunkwise, or blockwis, processing can often be
used. Many of the functions mentioned above have been implemented using
chunkwise processing. With chunkwise processing one reads a chunk of data into
memory, performs some computation on that chunk before copying the next chunk
into memory. The ldat
and lvec
packages contain some helper functions for
that.
One use case is calculating a new value for each element in a vector. The code
below demontrates how this can be done using the helper functions chunk
and
slice_range
. In this case the month name is calculated for each element in
a Date
vector.
x <- as_lvec(seq.Date(as.Date("1900-01-01"), as.Date("2050-12-31"), by = "days")) m <- lvec(length(x), "character", strlen = 3) chunks <- chunk(x) for (c in chunks) { d <- slice_range(x, range = c, as_r = TRUE) m[range = c] = months(d, abbreviate = TRUE) }
Since this is a common operation, ldat
also contains a special helper function
for this use case, elementwise
, which applies a function to each element, in
this case months
m <- elementwise(x, months, abbreviate = TRUE)
Another common pattern (of which the pattern above is a special case) is to perform a calculation for each chunk using the result from the previous chunk, and finally calculate the end result.
# Generate an lvec with random normal values y <- generate(2E6, rnorm) # Calculate men of x state <- c(n = 0, sum = 0) chunks <- chunk(y) for (c in chunks) { d <- slice_range(y, range = c, as_r = TRUE) state[1] <- state[1] + length(d) state[2] <- state[2] + sum(d) } state[2]/state[1]
For this pattern there is also the helper function chunkwise
. This function
accepts an lvec
or ldat
object and three functions: an init function which
initialises the state, an update function which updates the state using a new
chunk of data, and a final function calculating the end result from the state.
The calculation of the mean can then be written as:
chunkwise(y, init = function(x) c(0,0), update = function(state, x) c(state[1] + length(x), state[2] + sum(x)), final = function(state) state[2]/state[1])
When, looking at ldat::mean.lvec
, you'll see that, except for some stuff with
na.rm
, this is also how that function is written.
Another function that is usefull when chunkswise processing lvec
or ldat
objects, is the append
function, which appends one object to another (
as c
for vectors, and rbind
for data.frames). For ldat
and lvec
objects
this method has a clone
argument. When set to FALSE
(for safety the default
is TRUE
), the second vector is appended to the first, modifying the first
argument. The advantage of that is that the first vector does not need to be
copied. Append can be used to create a result for which the final size is not
known at the beginning (if the size is known it is more efficient the initialise
the result to the correct size directly at the beginning).
In the code below x
is copied and therefore not modified:
x <- as_lvec(1:10) z <- append(x, 100:1009) x
In the code below the new data is appended to x
and x
is modified. x
and
`z now point to the same vector:
x <- as_lvec(1:10) z <- append(x, 100:1009, clone = FALSE) x
generate
: generate an lvec object filled with (random) values. partial_sort
: ensure that an lvec is sorted in such a way, that values
before a given element number are smaller than that element and values after
that element are larger. partial_order
: as above, but returns the order, e.g. the indexing vector
that would result in a partiallr sorted vector. Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.