read_chunked: Read files in chunks with 'iotools::chunk.apply()'

View source: R/transfer.R

read_chunked    R Documentation

Read files in chunks with iotools::chunk.apply()

Description

read_chunked is a helper function that wraps around chunk.apply to allow chunk-wise operations on data as it loads. It is very similar to fread_chunked, but can perform much faster because chunks can be processed in parallel if desired. The drawback is that it can be a bit trickier to set up, as it requires column types to be assigned.

Usage

read_chunked(
  file_location,
  filter_col,
  filter_v,
  col_types,
  chunk_function = NULL,
  chunk_size = 1000000L,
  rbind_method = rbind,
  sep = ",",
  header_check = TRUE,
  parallel = 1,
  ...
)

Arguments

file_location

Location of the target file to load (any file compatible with dstrsplit and the provided sep parameter).

filter_col

Target column on which to perform the filtering operation.

filter_v

Vector of values to filter on (treated as categorical by default, via the %in% operator).

col_types

A vector specifying the type of every column in the file of interest; this is an iotools requirement (see the sketch after this argument list).

chunk_function

A custom function to apply to each chunk instead of the default behaviour of filtering on a single column.

chunk_size

Size of each chunk on which operations are performed (default: 1e6L).

rbind_method

Function used to append chunks together (default: rbind; other binding functions can be used).

sep

The delimiter type in the file of interest (default: ',').

header_check

Boolean value determining whether a header row should be detected (based upon column names) and dropped before processing.

parallel

Number of processes to use when loading (default: 1).

...

Other parameters passed to dstrsplit.
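
To illustrate the col_types requirement, a minimal sketch of building the vector for a hypothetical 10-column file is shown below (the column counts and labels are placeholders; a named vector may also be accepted by dstrsplit, so check the iotools documentation for your version):

file_coltypes <- c(rep('integer', 5),     # first 5 columns hold integer IDs/counts
                   rep('character', 5))   # remaining 5 columns hold text
names(file_coltypes) <- paste0('col', seq_along(file_coltypes)) # optional labels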

Details

By default, this function filters data based upon a provided column ID and filtering vector. However, a custom function can also be provided so that more flexible operations can be performed on each chunk. The common use-case is working with extremely large data, where the entire dataset would never fit into the available memory. When a dataset contains much more information than is needed for a particular analysis, chunk-wise filtering reduces the loaded data to the required filtering criteria and, ideally, avoids hitting RAM limits. Providing the correct type for each column is important; assuming they are all character may lead to errors and column length mismatches (causing the load to fail, e.g. too many input columns).
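
For intuition, the pattern that read_chunked wraps can be sketched directly with iotools as follows (a minimal sketch, not the package's exact internals; the file path, column types, and ID values are placeholders, and the file is assumed to have no header row):

library(iotools)

# One type per column, in file order (first column assumed to hold the record ID)
col_types <- c('integer', 'numeric', 'character')
ids <- c(1, 2, 3)

result <- chunk.apply(
  '/path/to/file/myfile.csv',
  function(raw_chunk) {
    d <- dstrsplit(raw_chunk, col_types = col_types, sep = ',') # parse the chunk with known types
    d[d[[1]] %in% ids, ]                                        # keep rows whose first column matches
  },
  CH.MERGE = rbind, # combine the filtered chunks row-wise
  parallel = 1      # increase to process chunks in parallel
)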

There are several options for chunk-wise reading in R. In addition to this function, you could also explore the chunked package and readr::read_csv_chunked(). At some point, however, it may be more suitable to store the data in a database and perform these operations outside of R.
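
For comparison, a roughly equivalent chunk-wise filter with readr could be sketched as follows (a minimal sketch; the file path, column name, and ID values are placeholders):

library(readr)

keep_ids <- c(1, 2, 3)

result <- read_csv_chunked(
  '/path/to/file/myfile.csv',
  DataFrameCallback$new(function(chunk, pos) {
    chunk[chunk$recordID %in% keep_ids, ] # filter each chunk as it is read
  }),
  chunk_size = 100000 # rows per chunk
)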

Value

Dataframe (passed through the chunk-wise function)

See Also

fread_chunked

Examples

## Not run: 
file_of_interest <- '/path/to/file/myfile.csv'

# Predetermined column types of a large, complex file (may require manual review!)
file_coltypes <- c(rep('integer', 5),     # First 5 columns are integer
                   rep('character', 5))   # Remaining columns assumed character for this example

# Filter based upon an ID column or similar
ids_of_interest <- c(1, 2, 3)
chunk_loaded_file <- read_chunked(file_of_interest, filter_col = 'recordID',
                                  filter_v = ids_of_interest, col_types = file_coltypes)

# Example of a custom provided function
# ... perform a chunked load and filter if the ID is in any of several columns
custom_chunk_f <- function(chunk) {
  data.table::setDT(chunk) # Set as data.table for speed inside the custom function
  chunk[chunk[, Reduce(`|`, lapply(.SD, `%in%`, ids_of_interest)),
              .SDcols = c('recordID1', 'recordID2', 'recordID3', 'recordID4', 'recordID5')]]
}
chunk_loaded_file <- read_chunked(file_of_interest, filter_v = ids_of_interest, col_types = file_coltypes,
                                  chunk_function = custom_chunk_f, rbind_method = rbind, parallel = 2)

## End(Not run)
