read_chunked: Read files in chunks with 'iotools::chunk.apply()'

View source: R/transfer.R

read_chunked    R Documentation

Read files in chunks with iotools::chunk.apply()

Description

read_chunked is a helper function that wraps around chunk.apply to allow chunk-wise operations on data as it loads. It is very similar to fread_chunked, but can perform much faster because chunks can be processed in parallel if desired. The drawback is that it can be a bit trickier to set up, as it requires column types to be assigned.

Usage

read_chunked(
  file_location,
  filter_col,
  filter_v,
  col_types,
  chunk_function = NULL,
  chunk_size = 1000000L,
  rbind_method = rbind,
  sep = ",",
  header_check = TRUE,
  parallel = 1,
  ...
)

Arguments

file_location

Location of the target file to load (any file compatible with dstrsplit and the provided sep parameter).

filter_col

Target column on which to perform the filtering operation.

filter_v

Vector of values to filter on (treated as categorical by default, via the %in% operator).

col_types

A vector specifying the type of every column in the file of interest; this is an iotools requirement (see the sketch after this argument list).

chunk_function

A custom function to apply to each chunk instead of the default behaviour of filtering on a single column.

chunk_size

Size of each chunk on which operations are performed (default: 1e6L).

rbind_method

Function used to append chunks together (default: rbind; other binding functions can be used).

sep

The delimiter type in the file of interest (default: ',').

header_check

Boolean value determining whether a header row should be detected (based upon column names) and dropped before processing.

parallel

Number of processes to use when loading (default: 1).

...

Other parameters passed to dstrsplit.
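
To illustrate the col_types requirement, a minimal sketch of building the vector for a hypothetical 10-column file is shown below (the column counts and labels are placeholders; a named vector may also be accepted by dstrsplit, so check the iotools documentation for your version):

file_coltypes <- c(rep('integer', 5),     # first 5 columns hold integer IDs/counts
                   rep('character', 5))   # remaining 5 columns hold text
names(file_coltypes) <- paste0('col', seq_along(file_coltypes)) # optional labels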

Details

By default, this function filters data based upon a provided column ID and filtering vector. However, a custom function can also be provided so that more flexible operations can be performed on each chunk. The common use-case is working with extremely large data, where the entire dataset would never fit into the available memory. When a dataset contains much more information than is needed for a particular analysis, chunk-wise filtering reduces the loaded data to the required filtering criteria and, ideally, avoids hitting RAM limits. Providing the correct type for each column is important; assuming they are all character may lead to errors and column length mismatches (causing the load to fail, e.g. too many input columns).
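
For intuition, the pattern that read_chunked wraps can be sketched directly with iotools as follows (a minimal sketch, not the package's exact internals; the file path, column types, and ID values are placeholders, and the file is assumed to have no header row):

library(iotools)

# One type per column, in file order (first column assumed to hold the record ID)
col_types <- c('integer', 'numeric', 'character')
ids <- c(1, 2, 3)

result <- chunk.apply(
  '/path/to/file/myfile.csv',
  function(raw_chunk) {
    d <- dstrsplit(raw_chunk, col_types = col_types, sep = ',') # parse the chunk with known types
    d[d[[1]] %in% ids, ]                                        # keep rows whose first column matches
  },
  CH.MERGE = rbind, # combine the filtered chunks row-wise
  parallel = 1      # increase to process chunks in parallel
)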

There are several options for chunk-wise reading in R. In addition to this function, you could also explore the chunked package and readr::read_csv_chunked(). At some point, however, it may be more suitable to store the data in a database and perform these operations outside of R.
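
For comparison, a roughly equivalent chunk-wise filter with readr could be sketched as follows (a minimal sketch; the file path, column name, and ID values are placeholders):

library(readr)

keep_ids <- c(1, 2, 3)

result <- read_csv_chunked(
  '/path/to/file/myfile.csv',
  DataFrameCallback$new(function(chunk, pos) {
    chunk[chunk$recordID %in% keep_ids, ] # filter each chunk as it is read
  }),
  chunk_size = 100000 # rows per chunk
)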

Value

Dataframe (passed through the chunk-wise function)

See Also

fread_chunked

Examples

## Not run: 
file_of_interest <- '/path/to/file/myfile.csv'

# Predetermined column types of a large, complex file (may require manual review!)
file_coltypes <- c(rep('integer', 5),     # First 5 columns are integer
                   rep('character', 5))   # Remaining columns assumed character for this example

# Filter based upon an ID column or similar
ids_of_interest <- c(1, 2, 3)
chunk_loaded_file <- read_chunked(file_of_interest, filter_col = 'recordID',
                                  filter_v = ids_of_interest, col_types = file_coltypes)

# Example of a custom provided function
# ... perform a chunked load and filter if the ID is in any of several columns
custom_chunk_f <- function(chunk) {
  data.table::setDT(chunk) # Set as data.table for speed inside the custom function
  chunk[chunk[, Reduce(`|`, lapply(.SD, `%in%`, ids_of_interest)),
              .SDcols = c('recordID1', 'recordID2', 'recordID3', 'recordID4', 'recordID5')]]
}
chunk_loaded_file <- read_chunked(file_of_interest, filter_v = ids_of_interest, col_types = file_coltypes,
                                  chunk_function = custom_chunk_f, rbind_method = rbind, parallel = 2)

## End(Not run)
