fileLineApply | R Documentation |
Applies a function to each line of a file or connection, treating each line as an element of a character vector. Sequentially reads blocks of chunkSize lines from the file or connection using readLines, uses sapply(USE.NAMES=FALSE) to apply FUN across each block, and returns the combined result from all blocks. This implementation allows moderately efficient reading and processing of large files, potentially larger than would fit in memory. With the default setting of chunkSize = -1L, the whole file is read in as a single block.
fileLineApply(
con,
FUN,
...,
chunkSize = -1L,
unlist = TRUE,
filter = FALSE,
growStart = 1e+05,
growX = 2,
growAdd = 0,
.warn = TRUE,
.skipNul = FALSE,
.encoding = "unknown",
.simplify = TRUE,
.USE.NAMES = FALSE
)
con |
The path to a text file to read, or a connection. Open connections are read from their current positions. If not already open, a connection will be opened in "rt" mode and then closed again before the call to this function returns. |
FUN |
The function called on each file line (character vector element). This function should take a one-element character vector as its first parameter, or as the first unsupplied named parameter. If the function takes a vector of characters, it is probably better to use
|
... |
Additional arguments to pass to the |
chunkSize |
The maximum number of lines to read from a file in one
chunk. |
unlist |
Internally, the results from each block are saved separately in a list element. By default these are flattened before returning to hide the file chunking. To preserve this list by block, set |
filter |
By default the function result is returned. If a function
returns a logical value, this can be used to select lines by setting this
parameter |
growStart |
Tuning parameter for internal memory management. The initial number of blocks allowed before reallocating space. By default this is 100,000 (1e+05). |
growX |
Tuning parameter for internal memory management. If space to store block results runs out, increase total space by |
growAdd |
Tuning parameter for internal memory management. If space to store block results runs out, increase total space by |
.warn |
Passed through to readLines as the |
.skipNul |
Passed through to readLines as the |
.encoding |
Passed through to readLines as the |
.simplify |
Passed through to sapply as the |
.USE.NAMES |
Passed through to sapply as the |
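As a rough illustration of the note for con that open connections are read from their current positions, the base R sketch below consumes a header line from an open connection and then reads the remainder. It uses only readLines and does not call fileLineApply; the file contents and variable names are made up for the example.

```r
# Hypothetical file: one header line followed by two data lines.
path <- tempfile()
writeLines(c("header", "first", "second"), path)

con <- file(path, open = "rt")     # open explicitly so the read position persists
header <- readLines(con, n = 1L)   # consumes only the first line
rest <- readLines(con)             # continues from the current position
close(con)                         # the caller opened it, so the caller closes it

unlink(path)
```

Passing an already positioned connection in this way is one means of skipping leading lines before the per-line function is applied.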
As this takes connections as well as file names, it can read from compressed files, URLs, etc. As it can read in chunks, it can process files or text streams larger than would fit in memory. Note that if the size of chunks is too small this may be inefficient in both time and memory.
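The chunked reading described above can be sketched in base R as follows. This is a simplified stand-in, not the package's implementation: chunkedLineApply is a hypothetical helper, and it omits the filter, memory-growth, and pass-through options.

```r
# Hypothetical helper, not part of any package: apply FUN to each line while
# reading at most chunkSize lines at a time from a file path.
chunkedLineApply <- function(path, FUN, chunkSize = 2L) {
  con <- file(path, open = "rt")
  on.exit(close(con))
  blocks <- list()
  repeat {
    lines <- readLines(con, n = chunkSize)
    if (length(lines) == 0L) break          # nothing left to read
    blocks[[length(blocks) + 1L]] <- sapply(lines, FUN, USE.NAMES = FALSE)
  }
  unlist(blocks)                            # hide the block boundaries
}

path <- tempfile()
writeLines(c("One line", "Two lines.", "", "Four"), path)
counts <- chunkedLineApply(path, nchar, chunkSize = 3L)
unlink(path)
```

Only one block of lines is held in memory at a time, which is why very small chunks trade memory savings for many more calls to readLines and sapply.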
The result of applying FUN to each line in the file or connection, as returned by readLines. The chunking is probably invisible in the result, but internally the results from each block are assigned to successive list elements, which are then unlisted one level before returning. You can always unlist the result to get a vector, depending on what FUN returns.
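To make "unlists one level" concrete, the sketch below uses a hand-built list standing in for per-block results; the values are hypothetical and no file is read. It contrasts dropping only the block level with flattening completely.

```r
# Hypothetical per-block results, as if FUN returned a two-element vector
# for each line and the file was read in two blocks.
blocks <- list(
  list(c("a", "b"), c("c", "d")),   # block 1: two lines
  list(c("e", "f"))                 # block 2: one line
)

byLine <- unlist(blocks, recursive = FALSE)   # drop only the block level
flat <- unlist(blocks)                        # flatten everything to one vector
```

Here byLine keeps one list element per line, while flat loses the per-line grouping.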
The file is read with readLines; the .warn, .skipNul, and .encoding parameters are passed through to readLines and allow controlling how embedded NUL characters are treated and whether the blocks passed to FUN are annotated with encoding information.
Any of LF, CRLF, or CR is recognized as a line separator. The elements of the character vectors (file lines) passed to FUN do not contain these separators.
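A small base R check of the separator handling, assuming a plain file read by path: the bytes are written raw so that the mixed line endings are stored exactly as given, then read back with readLines.

```r
path <- tempfile()
# One line each ending in LF, CRLF, and CR, then a final LF-terminated line.
writeBin(charToRaw("a\nb\r\nc\rd\n"), path)
lines <- readLines(path)
unlink(path)
```

Each element of lines holds the text only, with the separator removed.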
If a file has no terminal EOL marker, the last line will not be included if the file/connection was non-blocking and text-mode. Otherwise it will be included, but as this may signal truncated or incomplete data, a warning is generated.
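The warning for a missing terminal EOL marker can be observed directly with readLines. The sketch below, which uses made-up file contents, records whether a warning was raised and then muffles it so that the lines are still returned.

```r
path <- tempfile()
writeBin(charToRaw("complete\npartial"), path)   # no EOL after "partial"

warned <- FALSE
lines <- withCallingHandlers(
  readLines(path),
  warning = function(w) {
    warned <<- TRUE                 # record that readLines complained
    invokeRestart("muffleWarning")  # continue and return the lines anyway
  }
)
unlink(path)
```

Because the file is read through an ordinary blocking connection here, the unterminated last line is kept.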
Embedded NUL characters will by default end their lines, but trigger a warning that this is happening. No text between the NUL and the next EOL marker (LF, CRLF, or CR) will be included in the line as returned. Turn off NUL warnings if you expect NUL characters and want this truncation (set .warn=FALSE). To ignore NUL characters and return all text in the line, dropping the NUL characters, set .skipNul=TRUE. No warning is then generated.
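The two NUL behaviours can be seen with readLines itself, whose warn and skipNul arguments correspond to .warn and .skipNul. The file below is a hypothetical one-line example written as raw bytes.

```r
path <- tempfile()
# Bytes for "ab", a NUL, "cd", then LF.
writeBin(as.raw(c(0x61, 0x62, 0x00, 0x63, 0x64, 0x0a)), path)

truncated <- readLines(path, warn = FALSE)     # NUL ends the line, no warning
complete <- readLines(path, skipNul = TRUE)    # NUL dropped, rest of line kept
unlink(path)
```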
If a function takes a vector of characters as its input, it might be more efficient to use fileBlockApply. This is especially true for filtering functions, like grep. Functions that return indices or logical vectors can be used with fileFilterApply, which will return the elements instead of an index or logical vector.
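The difference between per-line and per-block application can be sketched in base R without calling fileBlockApply or fileFilterApply. The in-memory lines vector below stands in for one block read from a file.

```r
lines <- c("One line", "Two lines.", "", "Four")

# Per-line application: the function is called once for each element.
perLine <- sapply(lines, function(x) grepl("line", x), USE.NAMES = FALSE)

# Block application: a single vectorised call over the whole character vector.
perBlock <- grepl("line", lines)

# Selecting the matching lines themselves rather than a logical vector.
selected <- lines[perBlock]
```

Both approaches give the same logical vector; the vectorised call simply avoids one function call per line.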
### Tiny file to parse (auto-deleted on exit)
content <- c( "One line", "Two lines.", "", "Four" )
fileName <- tempfile()
writeLines(content, fileName)
### Transformation function
fileLineApply( fileName, "toupper" )
#> [[1]]
#> [1] "ONE LINE"
#>
#> [[2]]
#> [1] "TWO LINES."
#>
#> [[3]]
#> [1] ""
#>
#> [[4]]
#> [1] "FOUR"
unlist(fileLineApply( fileName, "toupper" ))
#> [1] "ONE LINE" "TWO LINES." "" "FOUR"
### Selecting functions
fileLineApply( fileName, "intersect", c("", "One line") )
#> [[1]]
#> [1] "One line"
#>
#> [[2]]
#> character(0)
#>
#> [[3]]
#> [1] ""
#>
#> [[4]]
#> character(0)
unlist(fileLineApply( fileName, "intersect", c("", "One line") ))
#> [1] "One line" ""
### Keeping the block structure
# (chunkSize is unreasonably tiny to show block boundary)
fileLineApply( fileName, "nchar", chunkSize=3, unlist=FALSE )
#> [[1]]
#> [[1]][[1]]
#> [1] 8
#>
#> [[1]][[2]]
#> [1] 10
#>
#> [[1]][[3]]
#> [1] 0
#>
#>
#> [[2]]
#> [[2]][[1]]
#> [1] 4
# Only need to unlist once, as recursive=TRUE is the default.
unlist(fileLineApply( fileName, "nchar", chunkSize=3, unlist=FALSE ))
#> [1] 8 10 0 4
### Explicit function definition - word count by line
# Manually for a normal file (sapply simplifies)
lengths( strsplit(readLines(fileName), "\\s+"))
#> [1] 2 2 0 1
# Using file apply (with tiny chunkSize)
fileLineApply( fileName, function (x) { lengths(strsplit(x, "\\s+")) }, chunkSize= 2)
#> [1] 2 2 0 1
# A logical returning function
fileLineApply( fileName, function (x) { lengths(strsplit(x, "\\s+")) == 1 },
chunkSize= 2 )
#> [1] FALSE FALSE FALSE TRUE
# Filtering on a logical returning function
fileLineApply( fileName, function (x) { lengths(strsplit(x, "\\s+")) == 1 },
chunkSize= 2, filter=TRUE )
#> [1] "Four"