fileBlockApply: Apply a function by block to a file or connection

View source: R/fileApply.R

fileBlockApply    R Documentation

Description

Applies a transforming or filtering function to blocks of lines from a file or connection, processing each block as a character vector. Sequentially reads blocks of chunkSize lines from the file or connection using readLines, calls the function FUN on each block, and returns a vector or list of the combined results. This allows moderately efficient reading and processing of large files, potentially larger than would fit in memory. With the default setting of chunkSize = -1L, the whole file will be read in as a single block.

Usage

fileBlockApply(
  con,
  FUN,
  ...,
  chunkSize = -1L,
  unlist = TRUE,
  filter = FALSE,
  growStart = 1e+05,
  growX = 2,
  growAdd = 0,
  .warn = TRUE,
  .skipNul = FALSE,
  .encoding = "unknown"
)

Arguments

con

The path to a text file to read, or a connection (including URLs). Compression is auto-detected, so single files compressed with gzip, bzip2, zip, or xz/lzma (or connections to them) can be read directly. Tarred files and zipped directories are not supported.

Open connections are read from their current positions. If not already open, a connection will be opened in "rt" mode and then closed again before the call to this function returns.

FUN

The function called on each file block (a character vector of file lines). This function should take a character vector as its first parameter, or as the first unsupplied named parameter (see the "grep" example). Only some kinds of functions work well; see "Kinds of functions", below.

...

Additional arguments to pass to the FUN.

chunkSize

The maximum number of lines to read from the file in one chunk. FUN is called on each chunk separately and the results are then combined. All lines in the last chunk are read, even though it will usually contain fewer than chunkSize lines. On rare occasions the last line may not be read, or may trigger a warning; see "File reading", below.

unlist

Internally, the results from each block are saved as separate list elements. By default these are flattened before returning, which hides the file chunking. To preserve the split by block, set unlist = FALSE.

filter

By default the results from the function call are returned. To instead use FUN to select lines from the file, set this to TRUE. FUN must then return a vector of numeric (index) or logical values. Note that it may be faster to select lines directly with value-returning functions such as grep(value = TRUE).

growStart

Tuning parameter for internal memory management. The initial number of blocks allowed before reallocating space. By default this is 100,000 (1e+05).

growX

Tuning parameter for internal memory management. When the space for storing block results runs out, total space is increased by growX * current + growAdd. This is repeated each time space runs out, allocating successively larger blocks of space. By default this is 2.

growAdd

Tuning parameter for internal memory management. When the space for storing block results runs out, total space is increased by growX * current + growAdd. This is repeated each time space runs out, allocating successively larger blocks of space. By default this is 0.

.warn

Passed through to readLines as the warn parameter. By default every embedded NUL triggers a warning. Set this to FALSE to suppress the warning. Note that no warning is generated, regardless of this flag, when skipping NUL characters (i.e. when .skipNul = TRUE).

.skipNul

Passed through to readLines as the skipNul parameter. By default every embedded NUL terminates a line (and also triggers a warning when .warn = TRUE, the default). Setting this to TRUE keeps reading past the NUL up to EOL, and does not warn.

.encoding

Passed through to readLines as the encoding parameter.
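
The growStart, growX, and growAdd parameters described above suggest a preallocate-and-grow strategy for the list that holds block results. The following is only a base R sketch of that idea, not the package's actual code: it assumes that growX * current + growAdd gives the new total capacity, and the tiny values and the names results and capacities are chosen purely for illustration.

```r
# Illustrative (hypothetical) settings, far smaller than the real defaults
growStart <- 2
growX     <- 2
growAdd   <- 1

results    <- vector("list", growStart)  # preallocated storage for block results
capacities <- length(results)            # record each capacity, for inspection
used       <- 0

for (i in 1:6) {
  if (i > length(results)) {
    # Out of space: assumed growth rule, new capacity = growX * current + growAdd
    newSize <- growX * length(results) + growAdd
    length(results) <- newSize           # pads the list with NULL elements
    capacities <- c(capacities, newSize)
  }
  results[[i]] <- i * 10                 # stand-in for one block's result
  used <- i
}

results <- results[seq_len(used)]        # drop the unused, preallocated slots
capacities
```

Under these assumptions the capacity grows from 2 to 5 to 11 while six results are stored, so only two reallocations are needed.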

Details

As this takes connections as well as file names, it can read from compressed files, URLs, etc. As it can read in chunks, it can process files or text streams larger than would fit in memory. Note that if the size of chunks is too small this may be inefficient in both time and memory.
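
To make the chunked reading concrete, here is a base R sketch of reading a gzip-compressed file a few lines at a time. It does not call fileBlockApply; it only illustrates the underlying readLines pattern, relying on file() detecting gzip compression when opened in "rt" mode. The names path and blockSizes are invented for this example.

```r
# Write ten lines to a temporary gzip-compressed file
path <- tempfile(fileext = ".gz")
gz <- gzfile(path, "wt")
writeLines(as.character(1:10), gz)
close(gz)

# Read it back in chunks of at most chunkSize lines
con <- file(path, "rt")       # compression is detected automatically
chunkSize  <- 4
blockSizes <- integer(0)
repeat {
  block <- readLines(con, n = chunkSize)
  if (length(block) == 0) break            # nothing left to read
  blockSizes <- c(blockSizes, length(block))
}
close(con)
unlink(path)

blockSizes
```

Only one block of lines is held in memory at a time, which is what makes files larger than memory tractable; the final block is simply shorter than chunkSize.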

Value

The combined results of applying FUN to each chunk of lines read from the file or connection by readLines. The chunking is normally invisible in the returned value: internally, the results from each block are assigned to successive list elements, and this list is unlisted one level before returning. If unlist = FALSE, the list of per-block results is returned instead.
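
The combining step can be imitated in base R. This sketch assumes the per-block results are held in a plain list and flattened with unlist(recursive = FALSE); whether the package uses exactly this call is an assumption, and the names blocks, perBlock, and flat are illustrative.

```r
content   <- c("One line", "Two lines.", "", "Four")
chunkSize <- 2

# Split the lines into blocks of at most chunkSize elements
blocks <- split(content, ceiling(seq_along(content) / chunkSize))

# One list element per block, like the result with unlist = FALSE
perBlock <- unname(lapply(blocks, nchar))

# Flatten one level, like the default result with unlist = TRUE
flat <- unlist(perBlock, recursive = FALSE)
flat
```

Here perBlock holds two integer vectors (8 10 and 0 4), while flat is the single vector 8 10 0 4, the same values nchar would give on the whole file at once.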

File reading

The file is read with readLines. The parameters prefixed with "." are passed through to readLines and control how embedded NUL characters are treated and whether the blocks passed to FUN are annotated with encoding information.

Any of LF, CRLF, or CR are recognized as line separators. The elements of the character vector chunks passed to FUN do not contain these separators.

If a file has no terminal EOL marker, the last line will not be included if the file/connection was non-blocking and text-mode. Otherwise it will be included, but as this may signal truncated or incomplete data, a warning is generated.

Embedded NUL characters will by default end their lines, but trigger a warning that this is happening. No text between the NUL and the next EOL marker (LF, CRLF, or CR) will be included in the line as returned. Turn off NUL warnings if you expect NUL characters and want this truncation (set .warn= FALSE). To ignore NUL characters and return all text in the line, dropping the NUL characters, set .skipNul= TRUE. No warning is then generated.
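
The NUL handling described above is that of readLines itself, so it can be checked directly in base R. This small sketch writes a file whose first line contains an embedded NUL and reads it back both ways; the variable names are illustrative only.

```r
# Bytes for: a NUL b LF c LF
f <- tempfile()
writeBin(as.raw(c(0x61, 0x00, 0x62, 0x0a, 0x63, 0x0a)), f)

# The NUL ends the first line; warn = FALSE silences the warning
truncated <- readLines(f, warn = FALSE)

# skipNul = TRUE drops the NUL and keeps the rest of the line, without warning
kept <- readLines(f, skipNul = TRUE)

unlink(f)
truncated
kept
```

The first read yields "a" and "c", losing the "b" after the NUL, while the second yields "ab" and "c".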

Kinds of functions

Only functions that preserve elements work simply. This means each element of the function's output is based on only one element of its input. For example, out <- nchar(x) is a 1-1 vector transform of x, so it just works. out <- intersect(x, setVec) also works because, element by element, it either outputs or does not output an element of x. grep works similarly with value = TRUE. Functions that return indices may not be useful unless filter = TRUE is set, as these indices are relative to the start of each chunk rather than to the start of the file. If you want to use other kinds of functions, you can always set unlist = FALSE and combine the results manually.
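
The point about indices can be seen without any file at all. The sketch below splits a character vector into blocks in base R, the way a chunked read would, and applies grep to each block; it is an illustration under that assumption, and names such as txt, localIdx, and globalIdx are made up for the example.

```r
txt <- c("alpha line", "beta", "gamma line", "delta", "epsilon line")
chunkSize <- 2
blocks <- split(txt, ceiling(seq_along(txt) / chunkSize))

# Indices are relative to each block, so every match reports position 1
localIdx <- lapply(blocks, grep, pattern = "line")
unlist(localIdx, use.names = FALSE)

# Manual fix: add each block's starting offset to recover file positions
offsets   <- (seq_along(blocks) - 1) * chunkSize
globalIdx <- unlist(Map(`+`, localIdx, offsets), use.names = FALSE)

# Alternative: use the block-relative indices to select lines per block
selected <- unlist(Map(function(b, i) b[i], blocks, localIdx),
                   use.names = FALSE)
```

The flattened local indices are 1 1 1, which say nothing about file position, whereas globalIdx is 1 3 5. Selecting within each block, which is conceptually what filter = TRUE is described as doing, recovers the three matching lines.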

Note: Applying your function to blocks with this function will likely be faster than applying it one line at a time with fileLineApply.

Examples

### Tiny file to parse (auto-deleted on exit)
content <- c( "One line", "Two lines.", "", "Four" )
fileName <- tempfile()
writeLines(content, fileName)

### Transformation function
fileBlockApply( fileName, "nchar" )
#> [1]  8 10  0  4

### Selecting functions
fileBlockApply( fileName, "intersect", c("", "One line") )
#> [1] "One line" ""

fileBlockApply( fileName, "grep", pattern="line", value=TRUE )
#> [1] "One line"   "Two lines."

### Keeping the block structure
# (chunkSize is unreasonably tiny to show block boundary)
fileBlockApply( fileName, "nchar", chunkSize=2, unlist=FALSE )
#> [[1]]
#> [1]  8 10
#>
#> [[2]]
#> [1] 0 4

### Explicit function definition
# Faking a file connection
values <- as.character(c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
con <- textConnection( values )

# Filtering manually
fileBlockApply(con, function (x) { y <- as.numeric(x); y[y > 5] }, chunkSize= 4)
#> [1]  6  7  8  9 10

# Rewind fake connection
con <- textConnection( values )

fileBlockApply( con, function (x) { y <- as.numeric(x); y[y > 5] },
                chunkSize= 4, unlist= FALSE )
#> [[1]]
#> numeric(0)
#>
#> [[2]]
#> [1] 6 7 8
#>
#> [[3]]
#> [1]  9 10

### Manually merging block results (counts lines)
con <- textConnection( values )
blockLengths <- fileBlockApply( con, "length", chunkSize= 4)
totalLength <- sum(blockLengths)

con <- textConnection( values )
sum( fileBlockApply( con, function (x) sum(as.integer(x)), chunkSize= 4))
#> [1] 55

### Filtering
# A logical returning function
# (rewind first: the previous call read the connection to its end)
con <- textConnection( values )
fileBlockApply(con, function (x) { as.numeric(x) > 5 }, chunkSize= 4)
#> [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

# Filter returns original lines:
con <- textConnection( values )
fileBlockApply(con, function (x) { as.numeric(x) > 5 }, chunkSize= 4, filter= TRUE)
#> [1] "6"  "7"  "8"  "9"  "10"

# An index returning function (relative to block start)
con <- textConnection( values )
fileBlockApply( con, function (x) { length(x) }, chunkSize= 3)
#> [1] 3 3 3 1

# Filter using index, returns last line of each block,
# including uneven block at end
con <- textConnection( values )
fileBlockApply( con, function (x) { length(x) }, chunkSize= 3, filter=TRUE)
#> [1] "3"  "6"  "9"  "10"


jefferys/JefferysRUtils documentation built on Jan. 12, 2024, 9:18 p.m.