This package extends the bigmemory package, but the
functions may also be used with traditional **R** `matrix` and
`data.frame` objects. The function `bigtabulate` is
exposed, but we expect most users will prefer the higher-level functions
`bigtable`, `bigtsummary`, and `bigsplit`. Each of these
functions provides functionality based on a specified conditional
structure. In other words, for every cell of a (possibly multidimensional)
contingency table, they provide (or tabulate) some useful conditional
behavior (or statistic(s)) of interest. At the most basic level, this
provides an extremely fast and memory-efficient alternative to
`table` for matrices and data frames.

```
bigtabulate(x, ccols, breaks = vector("list", length = length(ccols)),
            table = TRUE, useNA = "no", summary.cols = NULL,
            summary.na.rm = FALSE, splitcol = NULL, splitret = "list")
bigsplit(x, ccols, breaks = vector("list", length = length(ccols)),
         useNA = "no", splitcol = NA, splitret = "list")
bigtable(x, ccols, breaks = vector("list", length = length(ccols)),
         useNA = "no")
bigtsummary(x, ccols, breaks = vector("list", length = length(ccols)),
            useNA = "no", cols, na.rm = FALSE)
```

| Argument | Description |
| --- | --- |
| `x` | a |
| `ccols` | a vector of column indices or names specifying which columns should be used for conditioning (e.g. for building a contingency table or structure for tabulation). |
| `breaks` | a vector or list of |
| `table` | if |
| `useNA` | whether to include extra ' |
| `summary.cols` | column(s) for which table summaries will be calculated. |
| `summary.na.rm` | if |
| `splitcol` | if |
| `splitret` | if |
| `cols` | with |
| `na.rm` | an obvious option for summaries. |

This package concentrates on conditional structures and calculations,
much like `table`, `tapply`, and `split`.
The functions are juiced-up versions of the base **R** functions;
they work on both regular **R** matrices and data frames, but are specialized
for use with bigmemory and (for more advanced usage) foreach.
They are particularly fast and memory-efficient. We have found that
`bigsplit` followed by `lapply` or `sapply`
can be particularly effective when the subsets produced by the split
are of reasonable size. For intensive calculations, subsequent use of
`foreach` can be helpful (think: parallel apply-like behavior).
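The split-then-apply pattern described above can be sketched as follows. This is only an illustration under stated assumptions: the matrix `x`, the `Species` code column, and the names `idx` and `group_means` are made up for the example, and the base **R** `split` call stands in for `bigsplit` so the sketch runs without the package. The `bigsplit` call is guarded and only runs if bigtabulate happens to be installed.

```r
# Build a plain numeric matrix from iris; the species factor is stored
# as its integer codes (1, 2, 3) in an extra column.
x <- cbind(as.matrix(iris[, 1:4]), Species = as.integer(iris$Species))

# Base R stand-in: split the row indices by the conditioning column ...
idx <- split(seq_len(nrow(x)), x[, "Species"])

# ... then apply a summary to each (reasonably small) subset.
group_means <- sapply(idx, function(i) mean(x[i, "Sepal.Length"]))
print(group_means)

# The same pattern with bigsplit(), assuming bigtabulate is available.
# With the default splitcol = NA it returns a split of the row indices.
if (requireNamespace("bigtabulate", quietly = TRUE)) {
  idx_big <- bigtabulate::bigsplit(x, ccols = 5)
  print(sapply(idx_big, function(i) mean(x[i, "Sepal.Length"])))
}
```

Because only row indices are split, each subset is pulled from `x` one group at a time rather than copying all groups up front.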

When `x` is a `matrix` or a `data.frame`, some additional
work may be required. For example, a character column of a `data.frame`
will be converted to a `factor` and then coerced to numeric
values (factor level numberings).
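As a rough illustration of that coercion, the base **R** sketch below converts a character column to a `factor` and then to its level numbers. It is not the package's internal code; the data frame `df` and the column name `grp` are hypothetical.

```r
# A small data frame with a character column (names are illustrative only).
df <- data.frame(grp = c("b", "a", "c", "a"), stringsAsFactors = FALSE)

# Character -> factor: levels are the sorted unique values.
f <- factor(df$grp)
print(levels(f))

# Factor -> numeric: each value is replaced by its level number.
codes <- as.integer(f)
print(codes)
```

The practical consequence is that results are indexed by level number, so keeping `levels(f)` around is a simple way to map the numbers back to the original labels.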

The conditional structure is specified via `ccols` and `breaks`.
This differs from the design of the base **R** functions but is at the root
of the gains in speed and memory-efficiency. The `breaks` argument may seem
distracting, as most users will simply condition on categorical-like columns.
However, it provides the flexibility to “bin” “continuous”
column(s), much like a histogram. See `binit` for
another example
of this type of option, which can be particularly valuable with massive
data sets.

A word of caution: if a “continuous” variable is not “binned”, it will be treated like a factor and the resulting conditional structure will be large (perhaps immensely so). The function uses left-closed intervals [a,b) for the “binning” behavior, when specified, except in the right-most bin, where the interval is entirely closed.
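The interval convention can be illustrated with a small base **R** sketch. It assumes equal-width bins spanning the range of the data, which is an assumption made for this example and not a statement about how the package computes its bin edges; the helper `bin_left_closed` is hypothetical.

```r
# Hypothetical helper: assign each value to one of nbins equal-width bins,
# using left-closed intervals [a, b) except for the last bin, which is [a, b].
bin_left_closed <- function(v, nbins) {
  edges <- seq(min(v), max(v), length.out = nbins + 1)
  # rightmost.closed = TRUE puts the maximum value into the last bin.
  findInterval(v, edges, rightmost.closed = TRUE)
}

v <- c(0, 1, 2, 2.5, 4, 5, 10)
bins <- bin_left_closed(v, 5)   # bin edges are 0, 2, 4, 6, 8, 10
print(bins)

# Counts per bin, including empty bins.
print(table(factor(bins, levels = 1:5)))

# Without binning, every distinct value becomes its own category.
print(length(unique(v)))
```

Here the value 2 falls in the second bin `[2, 4)` and not the first, while the maximum, 10, is kept in the last bin `[8, 10]`.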

Finally, `bigsplit` is somewhat more general than `split`.
The default behavior (`splitcol=NA`)
returns a split of `1:nrow(x)` as a list
based on the specified conditional structure. However, it may also
return a vector of cell (or category) numbers. And of course it may
conduct a split of `x[,splitcol]`.
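The three kinds of result can be mimicked in base **R** to make the distinction concrete. This is a sketch of the idea only: the data frame and all variable names are hypothetical, `interaction` stands in for the package's conditional structure, and the cell numbering that `bigsplit` itself produces may differ.

```r
# Toy data: columns a and b define the conditional structure, y is the payload.
x <- data.frame(a = c(1, 2, 1, 2, 1),
                b = c(10, 10, 20, 20, 10),
                y = c(0.5, 1.5, 2.5, 3.5, 4.5))

# One cell per observed combination of a and b.
cell <- interaction(x$a, x$b, drop = TRUE)

# (1) A split of the row indices, analogous to the default splitcol = NA.
rows_by_cell <- split(seq_len(nrow(x)), cell)
print(rows_by_cell)

# (2) A vector giving the cell (category) number of each row.
cell_numbers <- as.integer(cell)
print(cell_numbers)

# (3) A split of one column's values, analogous to splitting x[, splitcol].
y_by_cell <- split(x$y, cell)
print(y_by_cell)
```

Forms (1) and (2) carry the same information in different shapes; the index-based forms are useful when several columns must later be summarised with the same grouping.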

array-like object(s), each similar to what is returned by
`tapply` and the associated **R** functions.

```
library(bigtabulate)
data(iris)
# First, break up column 2 into 5 groups, and leave column 5 as a
# factor (which it is). Note that iris is a data.frame, which is
# fine. A matrix would also be fine. A big.matrix would also be fine!
bigtable(iris, ccols=c(2, 5), breaks=list(5, NA))
# Round column 2 so that columns 2 and 5 will both be factor-like,
# for convenience in the examples below:
iris[,2] <- round(iris[,2])
ans1 <- bigtable(iris, c(2, 5))
ans1
# Same answer, but with nice factor labels from table(), because
# table() handles factors. bigtable() uses the numeric factor
# levels only.
table(iris[,2], iris[,5])
# Here, our formulation is simpler than split's, and is faster and
# more memory-efficient:
ans2 <- bigsplit(iris, c(2, 5), splitcol=1)
ans2[1:3]
split(iris[,1], list(col2=factor(iris[,2]), col5=iris[,5]))[1:3]
```
