In hope-data-science/tidyfst: Tidy Verbs for Fast Data Manipulation

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)

The fst package for R provides a fast, easy and flexible way to serialize data frames. Its notable features include fast reading and writing of R data frames, strong file compression, and the ability to inspect a data frame on disk without reading it into memory. Building on these features, tidyfst offers a workflow for manipulating data more efficiently. The core idea is that we rarely need the whole data set at once: we need only the parts relevant to the task, which we then aggregate into the summary that answers our question.

tidyfst provides the following functions to support this workflow:

parse_fst: Get information about the data.frame without reading it
slice_fst: Select the target rows by number
select_fst: Select the target columns for the task
filter_fst: Select rows by condition
import_fst: Read a fst file like fst::read_fst, but always return a data.table
export_fst: Write a fst file like fst::write_fst, but always use the largest compression factor (which yields the smallest file)
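
As a minimal sketch of the last two helpers, the round trip below writes a built-in data.frame and reads it back. The temporary path and the variable names are chosen only for illustration and are not part of the original text.

```r
library(tidyfst)

# Hypothetical temporary path used only for this illustration
path <- tempfile(fileext = ".fst")

# Write a built-in data.frame to disk as a fst file
export_fst(mtcars, path)

# Read it back; the result is a data.table rather than a plain data.frame
cars_dt <- import_fst(path)
class(cars_dt)

# Clean up the temporary file
unlink(path)
```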

In such a workflow, you never need to read the whole data.frame into RAM: you select the target data, process it immediately, and get the results in one pass. You do not even have to read the data to learn the structure of the data.frame, because parse_fst (a wrapper for fst::fst from the fst package) reports it. Let's give it a try.

library(tidyfst)

# Generate some random data frame with 10 million rows and various column types
nr_of_rows <- 1e7

df <- data.frame(
    Logical = sample(c(TRUE, FALSE, NA), prob = c(0.85, 0.1, 0.05), nr_of_rows, replace = TRUE),
    Integer = sample(1L:100L, nr_of_rows, replace = TRUE),
    Real = sample(sample(1:10000, 20) / 100, nr_of_rows, replace = TRUE),
    Factor = as.factor(sample(labels(UScitiesD), nr_of_rows, replace = TRUE))
  )

# Write the fst file; make sure no file with the same name already exists in the working directory
export_fst(df,"fst_test.fst")

# remove all variables in the environment
rm(list = ls())

Now we want to see what the data frame on disk contains, without loading it.

parse_fst("fst_test.fst") -> ft
ft
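
The parsed object can also be queried for its structure directly. The sketch below uses the built-in iris data and a temporary file (both chosen here for illustration) and assumes the `names()` and `dim()` methods that fst provides for its tables, together with tidyfst's `summary_fst()`.

```r
library(tidyfst)

# Hypothetical temporary file holding a small built-in data set
path <- tempfile(fileext = ".fst")
export_fst(iris, path)

# Parse the file; the column data is not loaded into memory at this point
ft <- parse_fst(path)

names(ft)        # column names
dim(ft)          # number of rows and columns
summary_fst(ft)  # per-column summary of the table

# Clean up the temporary file
unlink(path)
```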

To count how often each level occurs in the Factor column, use:

ft %>% 
  select_fst(Factor) %>% 
  count_dt(Factor) -> factor_info

factor_info

To calculate the mean of Integer for each level of Factor, use:

ft %>% 
  select_fst(Integer,Factor) %>% 
  summarise_dt(avg = mean(Integer),by = Factor) -> avg_info

avg_info
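
slice_fst and filter_fst, listed among the functions above, follow the same pattern for rows. The sketch below is one way they might be used; it relies on the built-in iris data and a temporary file so that it runs on its own, and the object names are illustrative.

```r
library(tidyfst)

# Hypothetical temporary file holding a small built-in data set
path <- tempfile(fileext = ".fst")
export_fst(iris, path)
ft <- parse_fst(path)

# Select rows 1 to 3 by position
first_rows <- ft %>% slice_fst(1:3)

# Select the rows that satisfy both conditions
big_virginica <- ft %>%
  filter_fst(Sepal.Length > 6, Species == "virginica")

first_rows
big_virginica

# Clean up the temporary file
unlink(path)
```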

In this workflow, we select, filter, or slice only the data we need and get the results directly from the pipeline. We therefore read the minimum necessary data into RAM, release it afterwards, and keep only the results we want. This can save memory in many exploratory analyses of large data sets. Finally, let's delete the output file:

# delete the output file
unlink("fst_test.fst")

Since version 0.9.3, tidyfst also provides the function as_fst(), which turns any data.frame into a fst table and saves the data in a temporary file. This means that an object might never have to be kept in RAM (as long as it is a data.frame). A small example:

iris %>% as_fst() -> iris_fst
mtcars %>% as_fst() -> mtcars_fst

iris_fst
mtcars_fst

So when you have generated a fairly large data.frame and do not want it to occupy memory, convert it with as_fst and read it back only when needed.
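
A table created this way can be queried with the same verbs as a parsed file. The following is a small sketch under that assumption; the summary column name `avg_petal` and the object names are made up for the example.

```r
library(tidyfst)

# Store a built-in data.frame in a temporary fst file and keep a reference to it
iris_fst <- as_fst(iris)

# Read back only the two columns this step needs, then aggregate them
petal_info <- iris_fst %>%
  select_fst(Species, Petal.Length) %>%
  summarise_dt(avg_petal = mean(Petal.Length), by = Species)

petal_info
```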

hope-data-science/tidyfst documentation built on Sept. 23, 2024, 8:05 p.m.
