readTextFileByChunk: Experimental sequential text reader helper function
In datadr: Divide and Recombine for Large, Complex Data


Experimental helper function for reading text data sequentially from a file on disk and adding it to an output connection using `addData`.


readTextFileByChunk(input, output, overwrite = FALSE, linesPerBlock = 10000,
  fn = NULL, header = TRUE, skip = 0, recordEndRegex = NULL,
  cl = NULL)

`input`	the path to an input text file
`output`	an output connection, such as one created with `localDiskConn` or `hdfsConn`
`overwrite`	logical; should an existing output location be overwritten? (can also be set to `overwrite = "backup"` to move the existing output to _bak)
`linesPerBlock`	how many lines at a time to read
`fn`	function to be applied to each chunk of lines (see details)
`header`	logical; does the file have a header line?
`skip`	number of lines to skip before reading
`recordEndRegex`	an optional regular expression that finds lines in the text file that indicate the end of a record (for multi-line records)
`cl`	a "cluster" object to be used for parallel processing, created using `makeCluster`
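
As a rough illustration of what a `recordEndRegex` pattern is meant to match, the base R sketch below marks the lines that close a multi-line record and groups the lines accordingly. The sample lines, the `"^END$"` pattern, and the grouping logic are assumptions made for illustration only; this is not the package's internal implementation.

```r
# Hypothetical input: two records, each terminated by a line reading "END"
lines <- c("id: 1", "name: a", "END", "id: 2", "name: b", "END")

# A pattern of the kind that could be passed as recordEndRegex (assumed format)
recordEndRegex <- "^END$"

# TRUE for every line that closes a record
isEnd <- grepl(recordEndRegex, lines)

# Assign a record number to each line: the counter advances on the line
# that follows a record-ending line
recordId <- cumsum(c(0, head(isEnd, -1))) + 1

# One character vector per record
records <- split(lines, recordId)
```

A pattern that matches exactly one line per record, such as an anchored terminator, keeps the grouping unambiguous.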

The function `fn` should take one argument, a character vector in which each element is one line of the file. `fn` may also take two arguments, in which case the second argument is the header line from the file (some parsing methods need to know the header).
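
A minimal sketch of the one-argument form, using only base R: the hypothetical `parseChunk` below receives a character vector of lines and returns a data frame. The function name and the sample lines are made up for illustration, and the function is called directly here rather than through `readTextFileByChunk`.

```r
# Hypothetical one-argument parser: x is a character vector, one line per element
parseChunk <- function(x) {
  # join the lines back into a single text block and parse it as header-less CSV
  read.csv(text = paste(x, collapse = "\n"), header = FALSE,
    stringsAsFactors = FALSE)
}

# Stand-in for one chunk of lines as it would be handed to fn
chunkLines <- c("5.1,3.5,1.4,0.2,setosa", "4.9,3.0,1.4,0.2,setosa")
chunk <- parseChunk(chunkLines)
```

Because no header is passed in, the columns get the default names `V1` to `V5`; the two-argument form is the one to use when the column names should come from the file.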

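library(datadr)

# write iris to a temporary CSV file to use as input
csvFile <- file.path(tempdir(), "iris.csv")
write.csv(iris, file = csvFile, row.names = FALSE, quote = FALSE)
# local disk connection that will receive the parsed chunks
myoutput <- localDiskConn(file.path(tempdir(), "irisText"), autoYes = TRUE)
# read 10 lines at a time; fn turns each chunk of lines into a data frame
a <- readTextFileByChunk(csvFile,
  output = myoutput, linesPerBlock = 10,
  fn = function(x, header) {
    # recover the column names from the header line
    colNames <- strsplit(header, ",")[[1]]
    read.csv(text = paste(x, collapse = "\n"), col.names = colNames, header = FALSE)
  })
# inspect the first key-value pair
a[[1]]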

* Saving connection attributes
Processing chunk  1 
Processing chunk  2 
Processing chunk  3 
Processing chunk  4 
Processing chunk  5 
Processing chunk  6 
Processing chunk  7 
Processing chunk  8 
Processing chunk  9 
Processing chunk  10 
Processing chunk  11 
Processing chunk  12 
Processing chunk  13 
Processing chunk  14 
Processing chunk  15 
Processing chunk  16 
$key
[1] 6

$value
  Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1          7.0         3.2          4.7         1.4 versicolor
2          6.4         3.2          4.5         1.5 versicolor
3          6.9         3.1          4.9         1.5 versicolor
4          5.5         2.3          4.0         1.3 versicolor
5          6.5         2.8          4.6         1.5 versicolor
...