Extended Tabular Operations for Both matrix and big.matrix Objects
This package extends the bigmemory package, but the
functions may also be used with traditional R
matrix and data.frame objects. The function bigtabulate is
exposed, but we expect most users will prefer the higher-level functions
bigtable, bigtsummary, and bigsplit. Each of these
functions provides functionality based on a specified conditional
structure. In other words, for every cell of a (possibly multidimensional)
contingency table, they provide (or tabulate) some useful conditional
behavior (or statistic(s)) of interest. At the most basic level, this
provides an extremely fast and memory-efficient alternative to
table for matrices and data frames.
bigtabulate(x, ccols, breaks = vector("list", length = length(ccols)),
  table = TRUE, useNA = "no", summary.cols = NULL,
  summary.na.rm = FALSE, splitcol = NULL, splitret = "list")

bigsplit(x, ccols, breaks = vector("list", length = length(ccols)),
  useNA = "no", splitcol = NA, splitret = "list")

bigtable(x, ccols, breaks = vector("list", length = length(ccols)),
  useNA = "no")

bigtsummary(x, ccols, breaks = vector("list", length = length(ccols)),
  useNA = "no", cols, na.rm = FALSE)
a vector of column indices or names specifying which columns should be used for conditioning (e.g. for building a contingency table or structure for tabulation).
a vector or list of
whether to include extra '
column(s) for which table summaries will be calculated.
an obvious option for summaries.
This package concentrates on conditional structures and calculations.
The functions are juiced-up versions of base R functions such as table, tapply, and split;
they work on both regular R matrices and data frames, but are specialized
for use with bigmemory and (for more advanced usage) foreach.
They are particularly fast and memory-efficient. We have found that
bigsplit followed by an apply-style pass over the resulting subsets
can be particularly effective when the subsets produced by the split
are of reasonable size. For intensive calculations, subsequent use of
foreach can be helpful (think: parallel apply-like behavior).
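The split-then-apply pattern described above can be sketched in base R. This is only an illustration under stated assumptions: base split() and sapply() stand in for bigsplit and the apply step, and the names row_groups and group_means are made up for the example.

```r
# Base R stand-in for the split-then-apply pattern. bigsplit is NOT
# called here, so this runs without bigmemory or bigtabulate installed.
data(iris)

# Split the row indices by two conditioning columns: rounded column 2
# and column 5 (Species). drop = TRUE discards empty cells.
row_groups <- split(seq_len(nrow(iris)),
                    list(round(iris[, 2]), iris[, 5]),
                    drop = TRUE)

# Apply a summary to each subset: the mean of column 1 within each cell.
group_means <- sapply(row_groups, function(rows) mean(iris[rows, 1]))
group_means
```

Because only row indices are split, each subset stays small until it is actually used, which is the reason the approach works well when the subsets are of reasonable size.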
When x is a
matrix or a
data.frame, some additional
work may be required. For example, a character column of a
data.frame will be converted to a
factor and then coerced to numeric
values (the factor level numbers).
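The character-to-factor-to-numeric conversion can be seen with base R alone. The sketch below uses a made-up data.frame (df, grp) and does not show the package's internal code, only the kind of coding that results.

```r
# A small data.frame with a character column (illustrative data).
df <- data.frame(id = 1:4, grp = c("b", "a", "c", "a"),
                 stringsAsFactors = FALSE)

# Character -> factor: the levels are the sorted unique values.
f <- factor(df$grp)
levels(f)

# Factor -> numeric: each value is replaced by its level number.
codes <- as.integer(f)
codes
```

The numeric codes carry no labels, so keep the factor levels around if the original category names are needed when reading the results.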
The conditional structure is specified via ccols and breaks.
This differs from the design of the base R functions but is at the root
of the gains in speed and memory-efficiency. The
breaks argument may seem
distracting, as most users will simply condition on categorical-like columns.
However, it provides the flexibility to “bin” “continuous”
column(s) much like a histogram. This type of option
can be particularly valuable with massive
data sets.
A word of caution: if a “continuous” variable is not “binned”, it will be treated like a factor and the resulting conditional structure will be large (perhaps immensely so). The function uses left-closed intervals [a,b) for the "binning" behavior, when specified, except in the right-most bin, where the interval is entirely closed.
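The interval convention described above (left-closed bins, with the right-most bin closed on both ends) matches what base R's findInterval() does with rightmost.closed = TRUE. The sketch below uses that function as a stand-in and is not the package's own binning code; the edges and values are made up.

```r
# Bin edges for three bins: [0, 1), [1, 2), and [2, 3].
edges <- c(0, 1, 2, 3)
x <- c(0, 0.5, 1, 1.99, 2, 3)

# findInterval() uses left-closed bins; rightmost.closed = TRUE makes
# the last bin include its upper edge, so x = 3 falls in bin 3.
bin <- findInterval(x, edges, rightmost.closed = TRUE)
bin

# Without binning, every distinct value would become its own cell.
length(unique(x))
```

Note that a value on an interior edge (such as 1 or 2) goes to the bin on its right, while the maximum edge stays in the last bin.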
bigsplit is somewhat more general than split.
The default behavior (splitcol = NA)
returns a split of
1:nrow(x) as a list
based on the specified conditional structure. However, it may also
return a vector of cell (or category) numbers. And of course it may
conduct a split of the column specified by splitcol.
array-like object(s), each similar to what is returned by
tapply and the associated R functions.
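The three result shapes mentioned for bigsplit (a list of row indices, a vector of cell numbers, and a split of one column) have simple base R analogues. The sketch below only illustrates those shapes with base functions. How bigsplit numbers its cells, and which argument values select each form, are not shown here and should be checked against the package documentation.

```r
data(iris)
grp <- iris[, 5]   # conditioning column (Species, a factor)

# (1) A list of row indices, one element per cell.
idx_list <- split(seq_len(nrow(iris)), grp)

# (2) A vector of cell numbers, one entry per row.
cell_id <- as.integer(grp)

# (3) A split of one column's values (here column 1).
val_list <- split(iris[, 1], grp)

# Forms (1) and (3) carry the same information: indexing column 1
# with the row indices reproduces the split values.
identical(val_list$setosa, iris[idx_list$setosa, 1])
```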
library(bigtabulate)
data(iris)

# First, break up column 2 into 5 groups, and leave column 5 as a
# factor (which it is).  Note that iris is a data.frame, which is
# fine.  A matrix would also be fine.  A big.matrix would also be fine!
bigtable(iris, ccols=c(2, 5), breaks=list(5, NA))

iris[,2] <- round(iris[,2]) # So columns 2 and 5 will be factor-like
                            # for convenience in these examples, below:

ans1 <- bigtable(iris, c(2, 5))
ans1

# Same answer, but with nice factor labels from table(), because
# table() handles factors.  bigtable() uses the numeric factor
# levels only.
table(iris[,2], iris[,5])

# Here, our formulation is simpler than split's, and is faster and
# more memory-efficient:
ans2 <- bigsplit(iris, c(2, 5), splitcol=1)
ans2[1:3]
split(iris[,1], list(col2=factor(iris[,2]), col5=iris[,5]))[1:3]