fileLineApply: Apply a function to a file or connection

View source: R/fileApply.R

fileLineApply R Documentation

Apply a function to a file or connection

Description

Applies a function to each line of a file or connection, treating each line as one element of a character vector. Blocks of chunkSize lines are read sequentially with readLines, FUN is applied to each block with sapply (by default with USE.NAMES = FALSE), and the combined result from all blocks is returned. This allows moderately efficient reading and processing of large files, potentially larger than would fit in memory. With the default setting of chunkSize = -1L, the whole file is read in as a single block.
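The read-and-apply loop described above can be sketched in base R. This is only an illustration of the idea, not the package's implementation: chunkedLineApply is a hypothetical helper, and it omits the filtering, growth tuning, and readLines pass-through parameters.

```r
# Hypothetical helper (not part of the package): a bare-bones chunked line apply.
chunkedLineApply <- function(path, FUN, ..., chunkSize = -1L) {
  con <- file(path, open = "rt")
  on.exit(close(con))
  results <- list()
  repeat {
    block <- readLines(con, n = chunkSize)   # next block of lines
    if (length(block) == 0L) break           # nothing left to read
    # Apply FUN to each line of this block and store the block result
    results[[length(results) + 1L]] <- sapply(block, FUN, ..., USE.NAMES = FALSE)
  }
  unlist(results, recursive = FALSE)         # flatten one level to hide the chunking
}

demoFile <- tempfile()
writeLines(c("One line", "Two lines.", "", "Four"), demoFile)
charCounts <- chunkedLineApply(demoFile, nchar, chunkSize = 3L)
print(charCounts)
#> [1]  8 10  0  4
```

With chunkSize = 3L the four-line file is read as two blocks (three lines, then one), yet the flattened result looks the same as a single-pass read.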

Usage

fileLineApply(
  con,
  FUN,
  ...,
  chunkSize = -1L,
  unlist = TRUE,
  filter = FALSE,
  growStart = 1e+05,
  growX = 2,
  growAdd = 0,
  .warn = TRUE,
  .skipNul = FALSE,
  .encoding = "unknown",
  .simplify = TRUE,
  .USE.NAMES = FALSE
)

Arguments

con

The path to a text file to read, or a connection (including URLs). Compression is auto-detected, so single gzip, bzip2, zip, and xz/lzma files, or connections to them, can be read directly (tarred files and zipped directories are not supported).

Open connections are read from their current positions. If not already open, a connection will be opened in "rt" mode and then closed again before the call to this function returns.
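To illustrate what "read from their current positions" means, here is a small base R sketch that does not call fileLineApply at all. The file contents are made up for the demonstration. The second half shows base R's own auto-detection of a gzip file when reading by path; whether the package relies on the same mechanism is an assumption.

```r
demoFile <- tempfile()
writeLines(c("header", "row 1", "row 2"), demoFile)

# An already open connection remembers how far it has been read.
con <- file(demoFile, open = "rt")
headerLine <- readLines(con, n = 1L)   # consumes the first line
remaining  <- readLines(con)           # continues from the current position
close(con)

# A gzip-compressed copy is read transparently when opened for reading.
gzPath <- tempfile(fileext = ".gz")
gzCon <- gzfile(gzPath, open = "wt")
writeLines(c("header", "row 1", "row 2"), gzCon)
close(gzCon)
fromGz <- readLines(gzPath)
```

Passing a connection that has already had its header line consumed is therefore one way to skip a header before applying FUN to the remaining lines.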

FUN

The function called on each file line (character vector element). This function should take a one-element character vector as its first parameter, or as the first unsupplied named parameter. If the function can process a whole character vector at once, it is probably better to use fileBlockApply.

...

Additional arguments to pass to the FUN.

chunkSize

The maximum number of lines to read from a file in one chunk. FUN is applied with sapply to each chunk separately and then the results are combined. All lines in the last chunk will be read, even though there are probably fewer than chunkSize lines left. On rare occasions the last line may not be read; this triggers a warning (see "File reading", below).

unlist

Internally, the results from each block are saved separately in a list element. By default these are flattened before returning to hide the file chunking. To preserve this list by block, set unlist = FALSE.

filter

By default the function result is returned. If the function returns a logical value, it can instead be used to select lines by setting this parameter to TRUE; the matching lines are then returned rather than the logical values. When filtering, the parameters .simplify and .USE.NAMES are ignored. Note: unlike fileBlockApply, an integer-returning function is not allowed.
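The effect of filter = TRUE can be mimicked in base R on an in-memory vector. This is a sketch of the idea only; fileLines and isSingleWord are names invented for the illustration.

```r
fileLines <- c("One line", "Two lines.", "", "Four")

# A logical-returning function applied to one line at a time
isSingleWord <- function(x) lengths(strsplit(x, "\\s+")) == 1

keep <- sapply(fileLines, isSingleWord, USE.NAMES = FALSE)
selected <- fileLines[keep]   # the lines themselves, not the logical values
print(selected)
#> [1] "Four"
```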

growStart

Tuning parameter for internal memory management. The initial number of blocks allowed before reallocating space. By default this is 100,000 (1e+05).

growX

Tuning parameter for internal memory management. If the space for storing block results runs out, total space is increased by growX * current + growAdd. This is done each time space runs out, allocating successively larger blocks of space. By default this is 2.

growAdd

Tuning parameter for internal memory management. If the space for storing block results runs out, total space is increased by growX * current + growAdd. This is done each time space runs out, allocating successively larger blocks of space. By default this is 0.
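The growth rule shared by growStart, growX, and growAdd can be pictured with a preallocated list that is extended on demand. The sketch below is hypothetical and assumes that growX * current + growAdd is the new total capacity; the package's internal bookkeeping may differ. growStart is set to a tiny value so that growth is visible.

```r
growStart <- 4; growX <- 2; growAdd <- 0
store <- vector("list", growStart)   # preallocated slots for block results
used <- 0L
capacities <- length(store)          # record each capacity for inspection

for (blockResult in 1:10) {
  if (used == length(store)) {
    # Out of space: assumed reading of the rule, new total = growX * current + growAdd
    length(store) <- growX * length(store) + growAdd   # pads with NULL
    capacities <- c(capacities, length(store))
  }
  used <- used + 1L
  store[[used]] <- blockResult
}
store <- store[seq_len(used)]        # drop the unused slots
print(capacities)
#> [1]  4  8 16
```

Geometric growth (growX > 1) keeps the number of reallocations small even when the number of blocks is not known in advance.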

.warn

Passed through to readLines as the warn parameter. By default every embedded NUL triggers a warning. Set this to FALSE to suppress the warning. Note that no warning is generated, regardless of this flag, when skipping NUL characters (i.e. when .skipNul = TRUE).

.skipNul

Passed through to readLines as the skipNul parameter. By default every embedded NUL terminates a line (and will then also trigger a warning by default, when .warn = TRUE). Setting this to TRUE keeps reading past the NUL up to EOL, and does not warn.

.encoding

Passed through to readLines as the encoding parameter.

.simplify

Passed through to sapply as the simplify parameter. Set FALSE to avoid flattening the results for a block into a vector or array.

.USE.NAMES

Passed through to sapply as the USE.NAMES parameter. Set TRUE to keep the source line from the file as the name of the returned value.
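Since .simplify and .USE.NAMES are handed on to sapply, their effect can be previewed with sapply alone on a small character vector standing in for one block of lines. The variable names are only for illustration.

```r
block <- c("One line", "Four")

named  <- sapply(block, nchar, USE.NAMES = TRUE)    # source lines become names
plain  <- sapply(block, nchar, USE.NAMES = FALSE)   # bare integer vector
asList <- sapply(block, nchar, simplify = FALSE, USE.NAMES = FALSE)  # stays a list

print(named)
#> One line     Four 
#>        8        4 
```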

Details

As this takes connections as well as file names, it can read from compressed files, URLs, etc. As it can read in chunks, it can process files or text streams larger than would fit in memory. Note that if the size of chunks is too small this may be inefficient in both time and memory.

Value

The result of applying FUN to each line in the file or connection, where the lines are as returned by readLines. The chunking is usually invisible: internally, the results from each block are assigned to successive list elements, and this list is then unlisted one level before returning. You can always unlist the result to get a vector, depending on what FUN returns.

File reading

The file is read with readLines. The .warn, .skipNul, and .encoding parameters are passed through to readLines and control how embedded NUL characters are treated and whether the blocks passed to FUN are annotated with encoding information.

Any of LF, CRLF, or CR are recognized as line separators. The elements of the character vectors (file lines) passed to FUN do not contain these separators.

If a file has no terminal EOL marker, the last line will not be included if the file/connection was non-blocking and text-mode. Otherwise it will be included, but as this may signal truncated or incomplete data, a warning is generated.

Embedded NUL characters will by default end their lines, but trigger a warning that this is happening. No text between the NUL and the next EOL marker (LF, CRLF, or CR) will be included in the line as returned. Turn off NUL warnings if you expect NUL characters and want this truncation (set .warn= FALSE). To ignore NUL characters and return all text in the line, dropping the NUL characters, set .skipNul= TRUE. No warning is then generated.
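The readLines behaviour described in this section can be checked directly in base R. The sketch below writes two tiny files byte by byte, one with an embedded NUL and one without a terminal EOL marker; the file names and contents are invented for the demonstration, and warnings are suppressed only to keep the output quiet.

```r
# A line with an embedded NUL between "ab" and "cd"
nulFile <- tempfile()
writeBin(c(charToRaw("ab"), as.raw(0), charToRaw("cd\n")), nulFile)

truncated <- suppressWarnings(readLines(nulFile))   # default: line ends at the NUL
complete  <- readLines(nulFile, skipNul = TRUE)     # NUL dropped, rest of line kept

# A file whose last line has no terminal EOL marker
noEolFile <- tempfile()
writeBin(charToRaw("first\nlast"), noEolFile)
gotWarning <- tryCatch({ readLines(noEolFile); FALSE },
                       warning = function(w) TRUE)  # incomplete final line warns
withLast <- suppressWarnings(readLines(noEolFile))  # but the line is still returned
```

Here truncated is "ab" and complete is "abcd", while the unterminated last line is included in withLast because a file path is read through a blocking connection.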

Kinds of functions

If a function takes a vector of characters as its input, it might be more efficient to use fileBlockApply. This is especially true for filtering functions, like grep. Functions that return indices or logical vectors can be used with fileFilterApply, which will return the elements instead of an index or logical vector.
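The distinction between per-line and per-block functions is easy to see on an in-memory vector. The base R sketch below does not use the package; fileLines stands in for one block of lines.

```r
fileLines <- c("One line", "Two lines.", "", "Four")

# Per-line application: one call to toupper per element
perLine  <- sapply(fileLines, toupper, USE.NAMES = FALSE)
# Vectorized application: a single call handles the whole block
perBlock <- toupper(fileLines)
sameResult <- identical(perLine, perBlock)

# An index-returning filter such as grep works on the block as a whole
hits <- grep("line", fileLines)
matched <- fileLines[hits]
```

Both routes give the same answer for toupper, so the vectorized call avoids the per-element overhead; grep returns the indices 1 and 2, which select the first two lines.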

Examples

### Tiny file to parse (auto-deleted on exit)
content <- c( "One line", "Two lines.", "", "Four" )
fileName <- tempfile()
writeLines(content, fileName)

### Transformation function
fileLineApply( fileName, "toupper" )
#> [[1]]
#> [1] "ONE LINE"
#>
#> [[2]]
#> [1] "TWO LINES."
#>
#> [[3]]
#> [1] ""
#>
#> [[4]]
#> [1] "FOUR"

unlist(fileLineApply( fileName, "toupper" ))
#> [1] "ONE LINE"   "TWO LINES." ""           "FOUR"

### Selecting functions
fileLineApply( fileName, "intersect", c("", "One line") )
#> [[1]]
#> [1] "One line"
#>
#> [[2]]
#> character(0)
#>
#> [[3]]
#> [1] ""
#>
#> [[4]]
#> character(0)

unlist(fileLineApply( fileName, "intersect", c("", "One line") ))
#> [1] "One line" ""

### Keeping the block structure
# (chunkSize is unreasonably tiny to show block boundary)
fileLineApply( fileName, "nchar", chunkSize=3, unlist=FALSE )
#> [[1]]
#> [[1]][[1]]
#> [1] 8
#>
#> [[1]][[2]]
#> [1] 10
#>
#> [[1]][[3]]
#> [1] 0
#>
#>
#> [[2]]
#> [[2]][[1]]
#> [1] 4

# Only need to unlist once, as recursive=TRUE is the default.
unlist(fileLineApply( fileName, "nchar", chunkSize=3, unlist=FALSE ))
#> [1]  8 10  0  4

### Explicit function definition - word count by line
# Manually for a normal file (sapply simplifies)
lengths( strsplit(readLines(fileName), "\\s+"))
#> [1] 2 2 0 1

# Using file apply (with tiny chunkSize)
fileLineApply( fileName, function (x) { lengths(strsplit(x, "\\s+")) }, chunkSize= 2)
#> [1] 2 2 0 1

# A logical returning function
fileLineApply( fileName, function (x) { lengths(strsplit(x, "\\s+")) == 1 },
               chunkSize= 2 )
#> [1] FALSE FALSE FALSE  TRUE

# Filtering on a logical returning function
fileLineApply( fileName, function (x) { lengths(strsplit(x, "\\s+")) == 1 },
               chunkSize= 2, filter=TRUE )
#> [1] "Four"


jefferys/JefferysRUtils documentation built on Jan. 12, 2024, 9:18 p.m.