

R package {bigdfr}

An R package for working with data frames stored on disk.

LIST OF FUNCTIONS

Example

# devtools::install_github("privefl/bigdfr")
library(bigdfr)

# Create a temporary file of ~349 MB (just as an example)
csv <- bigreadr::fwrite2(iris[rep(seq_len(nrow(iris)), 1e5), ], 
                         tempfile(fileext = ".csv"))
format(file.size(csv), big.mark = ",")

# Read the CSV file into the FDF format
(X <- FDF_read(csv))
head(X)
file.size(X$backingfile)
X$types

# Standard {dplyr} operations
X2 <- X %>% 
  filter(Species == "virginica", Sepal.Length < 5) %>%
  mutate(Sepal.Length = Sepal.Length + 1) %>%
  arrange(desc(Sepal.Length))

# Export as tibble (fully in memory, e.g. after sufficient filtering)
as_tibble(X2)

# Another way to get a tibble is to use summarize()
X %>%
  group_by(Species) %>%
  summarize(min_length = min(Sepal.Length))

How does it work?

I use a binary file on disk to store variables. Operations like mutate grow the file to add new columns. Operations like subset, filter and arrange just use indices to access a subset of the file. Data are actually read into memory when (and only when) some columns are needed for a computation.
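
As a rough illustration of this idea (not the actual {bigdfr} implementation), here is a minimal base-R sketch. It assumes that every column is numeric and stored as 8-byte doubles, one column after another, in a single file. The helpers toy_fdf(), toy_pull(), toy_filter(), toy_arrange() and toy_mutate() are hypothetical names made up for this example.

```r
# Toy sketch only: NOT how {bigdfr} is implemented.
# Assumption: every column is numeric, stored as 8-byte doubles, column after column.
# All toy_*() helpers are hypothetical names made up for this example.

# Write a data frame to a binary file and keep a vector of row indices (the "view")
toy_fdf <- function(df, path = tempfile(fileext = ".bin")) {
  con <- file(path, "wb")
  on.exit(close(con))
  for (col in df) writeBin(as.double(col), con)
  list(path = path, n = nrow(df), cols = names(df), ind = seq_len(nrow(df)))
}

# Data are read only when a column is needed: seek to it, read it, subset by indices
toy_pull <- function(x, name) {
  j <- match(name, x$cols)
  con <- file(x$path, "rb")
  on.exit(close(con))
  seek(con, where = (j - 1) * x$n * 8)
  readBin(con, what = "double", n = x$n)[x$ind]
}

# Filtering and sorting only update the indices; the file is left untouched
toy_filter <- function(x, keep) {
  x$ind <- x$ind[keep]
  x
}
toy_arrange <- function(x, name) {
  x$ind <- x$ind[order(toy_pull(x, name))]
  x
}

# Adding a column appends n doubles to the end of the file
toy_mutate <- function(x, name, values) {
  full <- rep(NA_real_, x$n)   # rows outside the current view stay NA
  full[x$ind] <- values
  con <- file(x$path, "ab")
  on.exit(close(con))
  writeBin(full, con)
  x$cols <- c(x$cols, name)
  x
}

x     <- toy_fdf(data.frame(a = c(3, 1, 2), b = c(10, 20, 30)))
size0 <- file.size(x$path)                          # 2 columns * 3 rows * 8 bytes
x2    <- toy_filter(x, toy_pull(x, "a") > 1)        # keeps rows 1 and 3
x2    <- toy_arrange(x2, "a")                       # view is now rows 3, 1
x3    <- toy_mutate(x2, "c", toy_pull(x2, "b") + 1) # file grows by 3 * 8 bytes
toy_pull(x3, "c")
```

In this sketch, filtering and sorting cost only an integer vector, and the file is touched again only when a column is pulled or a new one is appended.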

Differences from {dplyr}

TODO

  1. optimize when possible
  2. need faster n() (quo_modif)
  3. printing similar to tibbles
  4. rethink fill/mutate?
  5. parallelize some lapply with {future}?
  6. user-defined summarize on all groups at once?
  7. implement fresh backingfile? (when subview is too small -> just use as_tibble()?)
  8. ...


privefl/bigdfr documentation built on Aug. 29, 2018, 9:58 a.m.