fread_chunked: Read files in chunks with 'data.table::fread()'
In al-obrien/farrago: A random collection of helpful code baubles

fread_chunked

R Documentation

Read files in chunks with `data.table::fread()`

Description

fread_chunked is a helper function that wraps fread to allow chunk-wise operations on data as it loads. On its own, fread loads delimited files extremely fast; however, it offers few easy-to-use facilities for operating on the data while it streams into R.

Usage

fread_chunked(
  file_location,
  filter_col,
  filter_v,
  chunk_function = NULL,
  chunk_size = 1000000L,
  ...
)

Arguments

`file_location`	Location of target file to load (any file compatible with `fread`).
`filter_col`	Column on which to perform the filtering operation.
`filter_v`	Vector of values to filter on (matched categorically by default via the `%in%` operator).
`chunk_function`	A custom function to apply to each chunk instead of the default behaviour of filtering on a single column.
`chunk_size`	Size of each chunk on which to perform operations (default: `1000000L`).
`...`	Additional parameters to pass to `fread`.

Details

By default, this function filters data based on a provided column ID and filtering vector. However, a custom function can also be provided to perform more flexible operations on each chunk. The common use case is working with extremely large data, where the entire dataset would not fit into the available memory. When a dataset contains much more information than a particular analysis needs, chunk-wise filtering reduces the loaded data to the rows that match the filtering criteria, ideally without hitting RAM limits.
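The general idea can be sketched with the `skip` and `nrows` arguments of `data.table::fread()`. This is only an illustration under stated assumptions, not the actual implementation of fread_chunked: the helper name `read_filtered_in_chunks`, the temporary file, and the tiny chunk size are all invented for the example.

```r
library(data.table)

# Hypothetical helper for illustration only; not part of farrago.
read_filtered_in_chunks <- function(path, filter_col, filter_v,
                                    chunk_size = 1000000L) {
  # Read just the header so the column names can be reapplied to each chunk
  col_names <- names(fread(path, nrows = 0L))

  kept <- list()
  rows_read <- 0L

  repeat {
    # Skip the header line plus every row already read. Reading past the
    # end of the file is expected to fail or to return no rows; both
    # outcomes are treated as "done".
    chunk <- tryCatch(
      fread(path, skip = rows_read + 1L, nrows = chunk_size,
            header = FALSE, col.names = col_names),
      error = function(e) NULL
    )
    if (is.null(chunk) || nrow(chunk) == 0L) break

    # Keep only the matching rows so the rest of the chunk can be freed
    kept[[length(kept) + 1L]] <- chunk[chunk[[filter_col]] %in% filter_v]

    rows_read <- rows_read + nrow(chunk)
    if (nrow(chunk) < chunk_size) break  # a short chunk means end of file
  }

  rbindlist(kept)
}

# Small demonstration on a temporary file (assumed data)
tmp <- tempfile(fileext = ".csv")
fwrite(data.table(recordID = 1:6, value = letters[1:6]), tmp)
result <- read_filtered_in_chunks(tmp, "recordID", c(2, 5), chunk_size = 4L)
print(result)
unlink(tmp)
```

In this sketch only the filtered rows of each chunk are retained, so peak memory is roughly one chunk plus the accumulated matches. Column types are guessed independently for each chunk, which a more careful implementation would likely pin down up front.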

There are several options for chunked reading in R. In addition to this function, you could also explore the `chunked` package and `readr::read_csv_chunked()`. However, at some point, it may be more suitable to simply store the data in a database for more efficient operations outside of R.
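For comparison, a similar filter could be written with `readr::read_csv_chunked()` and a `DataFrameCallback`, which applies a function to each chunk and row-binds the results. This is a minimal sketch: the temporary file, the column names, and the `keep_ids` function are assumptions made for the example.

```r
library(readr)

# Assumed example data written to a temporary file
tmp <- tempfile(fileext = ".csv")
write_csv(data.frame(recordID = 1:6, value = letters[1:6]), tmp)

ids_of_interest <- c(2, 5)

# Callback applied to every chunk; `pos` is the row position of the chunk
keep_ids <- function(chunk, pos) {
  chunk[chunk$recordID %in% ids_of_interest, ]
}

filtered <- read_csv_chunked(
  tmp,
  callback = DataFrameCallback$new(keep_ids),
  chunk_size = 4,
  col_types = cols(recordID = col_integer(), value = col_character())
)
print(filtered)
unlink(tmp)
```

Declaring `col_types` explicitly keeps the column types identical across chunks and avoids the type-guessing message.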

Value

A `data.table` (passed through the chunk-wise function)

Examples

## Not run: 
file_of_interest <- '/path/to/file/myfile.csv'

# Filter based upon an ID column or similar
ids_of_interest <- c(1, 2, 3)
chunk_loaded_file <- fread_chunked(file_of_interest, filter_col = recordID, filter_v = ids_of_interest)

# Example of a custom function
# ... perform a chunked load and filter if the ID is in any of several columns
custom_chunk_f <- function(chunk) {
  # ids_of_interest is looked up in the enclosing (global) environment
  chunk[chunk[, Reduce(`|`, lapply(.SD, `%in%`, ids_of_interest)),
              .SDcols = c('recordID1', 'recordID2', 'recordID3', 'recordID4', 'recordID5')]]
}
chunk_loaded_file <- fread_chunked(file_of_interest, filter_v = ids_of_interest, chunk_function = custom_chunk_f)

## End(Not run)

al-obrien/farrago documentation built on April 14, 2023, 6:20 p.m.