
R package {bigdfr}

R package to operate with data frames stored on disk

# devtools::install_github("privefl/bigdfr")
library(bigdfr)

# Create a temporary file of ~349 MB (just as an example)
csv <- bigreadr::fwrite2(iris[rep(seq_len(nrow(iris)), 1e5), ], 
                         tempfile(fileext = ".csv"))
format(file.size(csv), big.mark = ",")

# Read the csv file in FDF format
(X <- FDF_read(csv))
head(X)
file.size(X$backingfile)
X$types

# Standard {dplyr} operations
X2 <- X %>% 
  filter(Species == "virginica", Sepal.Length < 5) %>%
  mutate(Sepal.Length = Sepal.Length + 1) %>%
  arrange(desc(Sepal.Length))

# Export as tibble (fully in memory, e.g. after sufficient filtering)
as_tibble(X2)

# Another way to get a tibble is to use summarize()
X %>%
  group_by(Species) %>%
  summarize(min_length = min(Sepal.Length))

I use a binary file on disk to store variables. Operations like mutate grow the file to add new columns. Operations like subset, filter and arrange just use indices to access a subset of the file. Data are read into memory only when some columns are actually needed for a computation.
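The storage idea described above can be sketched in base R. This is only an illustration under simplifying assumptions, not the {bigdfr} implementation: the column-major layout of 8-byte doubles, the `read_col()` helper, and the `rows` index vector are all hypothetical names and choices made for this example.

```r
# Hypothetical sketch: a column-major binary file of doubles plus a row-index
# vector. None of these helpers are part of {bigdfr}.
n <- nrow(iris)
backing <- tempfile(fileext = ".bin")

# Store two numeric columns one after the other (column-major layout).
con <- file(backing, open = "wb")
writeBin(iris$Sepal.Length, con)
writeBin(iris$Petal.Length, con)
close(con)

# Read column j from disk only when it is needed. For simplicity this reads
# the whole column and then keeps the requested rows; a real implementation
# could read only those rows.
read_col <- function(path, j, n, rows = seq_len(n)) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  seek(con, where = (j - 1) * n * 8)  # 8 bytes per double
  readBin(con, what = "double", n = n)[rows]
}

# "filter": nothing is copied on disk, only the index vector shrinks.
rows <- which(read_col(backing, 1, n) < 5)

# "arrange": reorder the same index vector.
rows <- rows[order(read_col(backing, 2, n, rows), decreasing = TRUE)]

# "mutate": append a new column at the end of the file, which grows it.
size_before <- file.size(backing)
con <- file(backing, open = "ab")
writeBin(iris$Sepal.Length + 1, con)
close(con)
size_after <- file.size(backing)

# Data reach memory only here, and only for the selected rows.
new_col <- read_col(backing, 3, n, rows)
```

In this sketch, filtering and sorting never rewrite the file; only the appended column changes its size, by `n * 8` bytes.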

In group_by, variables are passed the same way as in select. If you want to use temporary variables, use mutate.
You can summarize data with a function that returns a value of length > 1 (you'll get a list-column).
When adding columns to an FDF (e.g. with mutate), these columns always go last even if they existed before. This means that you can do FDF(iris) %>% mutate(Sepal.Width = Sepal.Width + 10) %>% pull() to get the newly created "Sepal.Width" variable.
filter drops empty groups.
You can't store list-columns in an FDF.

optimize when possible
need faster n() (quo_modif)
similar printing as tibbles
rethink fill/mutate?
parallelize some lapply with {future}?
user-defined summarize on all groups at once?
implement fresh backingfile? (when subview is too small -> just use as_tibble()?)
...