read_chunked    R Documentation

Description

read_chunked is a helper function that wraps around iotools::chunk.apply() to allow chunk-wise operations on data as it loads. It is very similar to fread_chunked, but can perform much faster because the chunks can be processed in parallel if desired. The drawback is that it can be a bit tricky to set up, as it requires column type assignment.
Usage

read_chunked(
  file_location,
  filter_col,
  filter_v,
  col_types,
  chunk_function = NULL,
  chunk_size = 1000000L,
  rbind_method = rbind,
  sep = ",",
  header_check = TRUE,
  parallel = 1,
  ...
)
Arguments

file_location
    Location of the target file to load (any file compatible with iotools::chunk.apply()).

filter_col
    Target column on which to perform the filtering operation.

filter_v
    Vector of values to filter on (categorical matching by default, via %in%).

col_types
    A vector of values that specifies all of the column types in the file of interest; this is an important requirement for the load to succeed (see Details).

chunk_function
    A custom function to apply to each chunk instead of the default behaviour of filtering on a single column (a minimal sketch follows this list).

chunk_size
    Size of each chunk on which to perform operations (default: 1e6L).

rbind_method
    Function used to append the processed chunks (default: rbind).

sep
    The delimiter used in the file of interest (default: ',').

header_check
    Logical; whether a header row should be detected (based upon the column names) and dropped before the chunk functions are applied (default: TRUE).

parallel
    Number of processes to use when loading (default: 1).

...
    Other parameters passed to iotools::chunk.apply().
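The chunk_function contract is simply one parsed chunk in, kept rows out. A minimal sketch follows; the year column and its cut-off are hypothetical, purely for illustration:

## Hypothetical sketch: a chunk_function receives one parsed chunk as a
## data.frame and returns the (possibly empty) subset of rows to keep.
keep_recent <- function(chunk) {
  chunk[chunk$year >= 2020, , drop = FALSE]  # 'year' is an assumed column name
}
## Passed via read_chunked(..., chunk_function = keep_recent); the returned
## pieces are then appended with rbind_method (default: rbind).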
Details

By default, this function filters data based upon a provided column ID and filtering vector. However, a custom function can also be provided so that more flexible operations can be performed on each chunk. The common use case is working with extremely large data, where the entire dataset would never fit into the available computer memory. When a dataset contains much more information than is needed for a particular analysis, chunk-wise filtering ensures the loaded data is reduced to the required filtering criteria without, hopefully, hitting RAM limits. Providing the type of every column is important: assuming they are all character may lead to errors and column length mismatches that cause the load to fail (e.g. 'too many input columns').
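As a hedged illustration of drafting col_types for a wide file, one approach is to let read.csv() guess from a small preview and then correct the result by hand (the path below is a placeholder):

## Draft the col_types vector from a 1000-row preview, then review manually.
preview <- utils::read.csv('/path/to/file/myfile.csv', nrows = 1000)
file_coltypes <- vapply(preview, class, character(1))
file_coltypes   # inspect and fix any mis-guessed columns before the full load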
There are several options for chunked reading in R. In addition to this function, you could also explore the chunked package and readr::read_csv_chunked(). At some point, however, it may be more suitable to simply store the data in a database for more efficient operations outside of R.
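For comparison, here is a sketch of the same single-column filter written against readr's chunked reader (single-process; the recordID column name matches the Examples below):

## Equivalent filter using readr's documented chunked API (no parallelism).
library(readr)
ids_of_interest <- c(1, 2, 3)
keep_ids <- DataFrameCallback$new(function(chunk, pos) {
  chunk[chunk$recordID %in% ids_of_interest, ]
})
subset_df <- read_csv_chunked('/path/to/file/myfile.csv',
                              callback = keep_ids, chunk_size = 100000)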
Value

A data.frame, with each chunk having been passed through the chunk-wise function.
See Also

fread_chunked
Examples

## Not run:
file_of_interest <- '/path/to/file/myfile.csv'
# Predetermined column types of a large, complex file (may require manual review!)
file_coltypes <- c(rep('integer', 5),     # first 5 columns are integer record IDs
                   rep('character', 10))  # remaining columns assumed character for illustration
# Filter based upon an ID column or similar
ids_of_interest <- c(1, 2, 3)
chunk_loaded_file <- read_chunked(file_of_interest, filter_col = 'recordID',
                                  filter_v = ids_of_interest, col_types = file_coltypes)
# Example of a custom provided function:
# perform a chunked load and keep rows where the ID appears in any of several columns
custom_chunk_f <- function(chunk) {
  data.table::setDT(chunk)  # convert to data.table for speed inside the custom function
  # logical row index: TRUE where any of the five recordID columns matches filter_v
  # (filter_v is supplied through the read_chunked() call below)
  chunk[chunk[, Reduce(`|`, lapply(.SD, `%in%`, filter_v)),
              .SDcols = c('recordID1', 'recordID2', 'recordID3', 'recordID4', 'recordID5')]]
}
chunk_loaded_file <- read_chunked(file_of_interest, filter_v = ids_of_interest,
                                  col_types = file_coltypes, chunk_function = custom_chunk_f,
                                  rbind_method = rbind, parallel = 2)
## End(Not run)