fileBlockApply | R Documentation |
Applies a transforming or filtering function to blocks of lines from a file or connection, processing each block as a character vector. Sequentially reads blocks of chunkSize lines from the file or connection using readLines, calls the function FUN on each block, and returns a vector or list of the combined results. This allows moderately efficient reading and processing of large files, potentially larger than would fit in memory. With the default setting of chunkSize = -1L, the whole file will be read in as a single block.
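The read, apply, and combine loop described above can be sketched in base R. This is only an illustration of the idea under simplifying assumptions, not the package's implementation: `blockApplySketch` is a hypothetical name, it accepts only a file path, and it leaves out the filter and memory-growth options.

```r
# Hypothetical helper: a minimal sketch of chunked reading,
# not the real fileBlockApply.
blockApplySketch <- function(path, FUN, ..., chunkSize = -1L) {
  con <- file(path, open = "rt")
  on.exit(close(con))
  results <- list()
  repeat {
    # Read at most chunkSize lines; a negative value reads to the end
    block <- readLines(con, n = chunkSize)
    if (length(block) == 0L) break
    results[[length(results) + 1L]] <- FUN(block, ...)
    if (chunkSize < 0L) break  # the whole file was read as one block
  }
  # Flatten one level so the chunking is hidden from the caller
  unlist(results, recursive = FALSE)
}

tmp <- tempfile()
writeLines(c("One line", "Two lines.", "", "Four"), tmp)
res <- blockApplySketch(tmp, nchar, chunkSize = 3L)
print(res)
unlink(tmp)
```

Growing a list one element at a time, as done here, is the simplest approach; the growStart, growX, and growAdd parameters suggest the real function preallocates space instead.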
fileBlockApply(
con,
FUN,
...,
chunkSize = -1L,
unlist = TRUE,
filter = FALSE,
growStart = 1e+05,
growX = 2,
growAdd = 0,
.warn = TRUE,
.skipNul = FALSE,
.encoding = "unknown"
)
con |
The path to a text file to read, or a connection. Open connections are read from their current positions. If not already open, a connection will be opened in "rt" mode and then closed again before the call to this function returns. |
FUN |
The function called on each file block (vector of file lines). This function should take a character vector of lines as its first parameter, or as the first unsupplied named parameter (see "grep" example). Only some functions will work well; see Kinds of functions. |
... |
Additional arguments to pass to FUN. |
chunkSize |
The maximum number of lines to read from a file in one
chunk. |
unlist |
Internally, the results from each block are saved separately in a list element. By default these are flattened before returning to hide the file chunking. To preserve this split by block, set unlist = FALSE. |
filter |
By default the results from the function call are returned. To use a function to select lines from the file, set this to TRUE. |
growStart |
Tuning parameter for internal memory management. The initial number of blocks allowed before reallocating space. By default this is 100,000 (1e+05). |
growX |
Tuning parameter for internal memory management. If run out of
space to store block results, increase total space by |
growAdd |
Tuning parameter for internal memory management. If run out of
space to store block results, increase total space by |
.warn |
Passed through to readLines as the warn parameter. |
.skipNul |
Passed through to readLines as the skipNul parameter. |
.encoding |
Passed through to readLines as the encoding parameter. |
As this takes connections as well as file names, it can read from compressed files, URLs, etc. As it can read in chunks, it can process files or text streams larger than would fit in memory. Note that if the size of chunks is too small this may be inefficient in both time and memory.
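To illustrate why accepting connections matters, the following base R sketch reads a gzip-compressed file in chunks of four lines and keeps only a running total in memory. It uses readLines directly rather than fileBlockApply, and the file contents are made up for the example.

```r
# Write a small gzip-compressed text file (illustrative data only)
tmp <- tempfile(fileext = ".gz")
gzOut <- gzfile(tmp, open = "wt")
writeLines(as.character(1:10), gzOut)
close(gzOut)

# Read it back through a connection, four lines at a time
gzIn <- gzfile(tmp, open = "rt")
total <- 0
repeat {
  block <- readLines(gzIn, n = 4L)
  if (length(block) == 0L) break
  # Only the current block and the running total are held in memory
  total <- total + sum(as.integer(block))
}
close(gzIn)
unlink(tmp)
print(total)
```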
The result of applying FUN to each chunk of lines, as returned by readLines, from the file or connection. The chunking is usually invisible in the result, but the internal process is to assign the results from each block to successive list elements, and then unlist one level before returning.
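What "unlist one level" means can be shown with base R's unlist. Whether the function uses exactly `unlist(x, recursive = FALSE)` internally is an assumption; the sketch only shows the effect of flattening a per-block list by a single level.

```r
# Per-block results that are atomic vectors flatten into one vector
perBlock <- list(c(8L, 10L), c(0L, 4L))
flat <- unlist(perBlock, recursive = FALSE)
print(flat)

# Per-block results that are lists lose only the outer (block) level;
# the inner elements, such as 1:2 and 3:4, are kept whole
nested <- list(list("a", 1:2), list("b", 3:4))
oneLevel <- unlist(nested, recursive = FALSE)
str(oneLevel)
```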
The file is read with readLines; the "."-prefixed parameters are passed through to readLines and allow controlling how embedded NUL characters are treated and whether the blocks passed to FUN are annotated with encoding information.
Any of LF, CRLF, or CR are recognized as line separators. The elements of the character vector chunks passed to FUN do not contain these separators.
If a file has no terminal EOL marker, the last line will not be included if the file/connection was non-blocking and text-mode. Otherwise it will be included, but as this may signal truncated or incomplete data, a warning is generated.
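The separator and missing-EOL behavior comes from readLines itself, so it can be checked in base R without fileBlockApply. The sketch below writes raw bytes mixing all three separators and omitting the final EOL marker; the file is read through an ordinary (blocking) file path.

```r
tmp <- tempfile()
# "a" CRLF "b" LF "c" CR "d", with no EOL marker after the last line
writeBin(charToRaw("a\r\nb\nc\rd"), tmp)

# All three separators split lines and are stripped from the result;
# warn = FALSE silences the incomplete-final-line warning
lines <- readLines(tmp, warn = FALSE)
print(lines)

# With the default warn = TRUE the missing final EOL raises a warning
warned <- tryCatch({ readLines(tmp); FALSE }, warning = function(w) TRUE)
print(warned)
unlink(tmp)
```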
Embedded NUL characters will by default end their lines, but trigger a warning that this is happening. No text between the NUL and the next EOL marker (LF, CRLF, or CR) will be included in the line as returned. Turn off NUL warnings if you expect NUL characters and want this truncation (set .warn = FALSE). To ignore NUL characters and return all text in the line, dropping the NUL characters, set .skipNul = TRUE. No warning is then generated.
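The two NUL-handling modes can be seen with readLines directly, using its warn and skipNul arguments (the counterparts of .warn and .skipNul here). The bytes written below are made up for the example.

```r
tmp <- tempfile()
# Bytes for "ab", an embedded NUL, "cd", then LF
writeBin(as.raw(c(0x61, 0x62, 0x00, 0x63, 0x64, 0x0a)), tmp)

# Default handling: the line ends at the NUL (warning suppressed here)
truncated <- readLines(tmp, warn = FALSE)
print(truncated)

# skipNul = TRUE drops the NUL and keeps the rest of the line
kept <- readLines(tmp, skipNul = TRUE)
print(kept)
unlink(tmp)
```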
Only functions that preserve elements work simply. This means the function outputs results where each element of the output is based on only one element of the input. For example, out <- nchar(in) is a 1-1 vector transform of in, so it just works. out <- intersect(in, setVec) also works, as on an element by element basis, it either outputs or does not output an element from in. grep works similarly with value = TRUE. Functions that return indices may not be useful unless filter = TRUE is set, as these indices will be relative to the chunk boundaries. If you want to use other kinds of functions, you can always set unlist = FALSE and manually combine the results yourself.
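The point about chunk-relative indices, and one way of combining per-block results by hand, can be sketched in base R. The blocks are simulated with split rather than read from a file, and the data and variable names are illustrative only.

```r
lines <- c("alpha", "line one", "beta", "gamma", "line two", "delta")
# Simulate blocks of up to 4 lines, as chunkSize = 4 would produce
chunks <- split(lines, ceiling(seq_along(lines) / 4))

# grep without value = TRUE returns positions within each block,
# not positions within the whole file
relIdx <- lapply(chunks, function(x) grep("line", x))
print(unname(relIdx))

# Manual combination: add each block's starting offset to its indices
offsets <- cumsum(c(0, head(lengths(chunks), -1)))
absIdx <- unlist(Map(`+`, relIdx, offsets), use.names = FALSE)
print(absIdx)
```

Here the second match is at position 1 of its block but position 5 of the input, which is why raw index results are misleading once flattened.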
Note: Applying your function to blocks with this function will likely be faster than applying it one line at a time with fileLineApply.
### Tiny file to parse (auto-deleted on exit)
content <- c( "One line", "Two lines.", "", "Four" )
fileName <- tempfile()
writeLines(content, fileName)
### Transformation function
fileBlockApply( fileName, "nchar" )
#> [1] 8 10 0 4
### Selecting functions
fileBlockApply( fileName, "intersect", c("", "One line") )
#> [1] "One line" ""
fileBlockApply( fileName, "grep", pattern="line", value=TRUE )
#> [1] "One line" "Two lines."
### Keeping the block structure
# (chunkSize is unreasonably tiny to show block boundary)
fileBlockApply( fileName, "nchar", chunkSize=2, unlist=FALSE )
#> [[1]]
#> [1] 8 10
#>
#> [[2]]
#> [1] 0 4
### Explicit function definition
# Faking a file connection
values <- as.character(c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
con <- textConnection( values )
# Filtering manually
fileBlockApply(con, function (x) { y <- as.numeric(x); y[y > 5] }, chunkSize= 4)
#> [1] 6 7 8 9 10
# Rewind fake connection
con <- textConnection( values )
fileBlockApply( con, function (x) { y <- as.numeric(x); y[y > 5] },
chunkSize= 4, unlist= FALSE )
#> [[1]]
#> numeric(0)
#>
#> [[2]]
#> [1] 6 7 8
#>
#> [[3]]
#> [1] 9 10
### Manually merging block results (counts lines)
con <- textConnection( values )
blockLengths <- fileBlockApply( con, "length", chunkSize= 4)
totalLength <- sum(blockLengths)
con <- textConnection( values )
sum( fileBlockApply( con, function (x) sum(as.integer(x)), chunkSize= 4))
#> [1] 55
### Filtering
# A logical returning function
# (connection was consumed above, so rewind it first)
con <- textConnection( values )
fileBlockApply(con, function (x) { as.numeric(x) > 5 }, chunkSize= 4)
#> [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
# Filter returns original lines:
con <- textConnection( values )
fileBlockApply(con, function (x) { as.numeric(x) > 5 }, chunkSize= 4, filter= TRUE)
#> [1] "6" "7" "8" "9" "10"
# An index returning function (relative to block start)
con <- textConnection( values )
fileBlockApply( con, function (x) { length(x) }, chunkSize= 3)
#> [1] 3 3 3 1
# Filter using index, returns last line of each block,
# including uneven block at end
con <- textConnection( values )
fileBlockApply( con, function (x) { length(x) }, chunkSize= 3, filter=TRUE)
#> [1] "3" "6" "9" "10"