knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = FALSE )
The fst package for R provides a fast, easy and flexible way to serialize data frames. It has very amazing features, such as fast read and write of R data frames, super file compression and parse data frames without reading it. Considering all these features, now tidyfst could provide a new workflow to manipulate data more efficiently. The core idea is: We never need the whole data all at once, we only need the things we want and aggregate them to get the summary to provide target information.
tidyfst have provided the following functions to facilitate the workflow:
fst::read_fst
but always return a data.tablefst::write_fst
but always use largest compress factor (which yields smallest file) In such a workflow, you never need to read the whole data.frame into your RAM, you just select the target data, process them instantly and get the results all at once. You do not have to read the data to know the structure of data.frame, because we have parse_fst
(a wrapper for fst
in fst package). Now let's give it a try.
library(tidyfst) # Generate some random data frame with 10 million rows and various column types nr_of_rows <- 1e7 df <- data.frame( Logical = sample(c(TRUE, FALSE, NA), prob = c(0.85, 0.1, 0.05), nr_of_rows, replace = TRUE), Integer = sample(1L:100L, nr_of_rows, replace = TRUE), Real = sample(sample(1:10000, 20) / 100, nr_of_rows, replace = TRUE), Factor = as.factor(sample(labels(UScitiesD), nr_of_rows, replace = TRUE)) ) # write the fst file, make sure you do not have the file with same name in the directory export_fst(df,"fst_test.fst") # remove all variables in the environment rm(list = ls())
Now, we want to know the information in the data frame.
parse_fst("fst_test.fst") -> ft ft
If we want to get the information in the Factor
column, use:
ft %>% select_fst(Factor) %>% count_dt(Factor) -> factor_info factor_info
If we want to calculate the mean of Integer
by the group of Factor
, use:
ft %>% select_fst(Integer,Factor) %>% summarise_dt(avg = mean(Integer),by = Factor) -> avg_info avg_info
In this workflow, we only select/filter/slice the data we need, and get the results directly from the pipeline. Therefore, we read the minimum needed data into RAM and release it and save only the results we want. This workflow could save memory for many exploratory big data analysis. Last, let's delete the output file:
# delete the output file unlink("fst_test.fst")
After (>=) version 0.9.3, tidyfst has also added a function as_fst()
, which can turn any data.frame into a fst table and saved the data in the temporary file. This means that we might never have to save the object in the RAM ever (as long as it is a data.frame)! A small example:
iris %>% as_fst() -> iris_fst mtcars %>% as_fst() -> mtcars_fst iris_fst mtcars_fst
So when you have generated a pretty large data.frame and do not want it to consume the cache in your computer, just save it and read it when needed using as_fst
.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.