readTextFileByChunk: Experimental sequential text reader helper function


Description

Experimental helper function that reads text data sequentially from a file on disk and adds it to an output connection using addData.

Usage

readTextFileByChunk(input, output, overwrite = FALSE, linesPerBlock = 10000,
  fn = NULL, header = TRUE, skip = 0, recordEndRegex = NULL,
  cl = NULL)

Arguments

input

the path to an input text file

output

an output connection, such as those created with localDiskConn or hdfsConn

overwrite

logical; should an existing output location be overwritten? (you can also specify overwrite = "backup" to move the existing output to _bak)

linesPerBlock

number of lines to read at a time

fn

function to be applied to each chunk of lines (see details)

header

logical; does the file have a header line?

skip

number of lines to skip before reading

recordEndRegex

an optional regular expression that finds lines in the text file that indicate the end of a record (for multi-line records)

cl

a "cluster" object to be used for parallel processing, created using makeCluster
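To make the idea behind recordEndRegex concrete, the following base R sketch groups lines into multi-line records by matching an end-of-record pattern. It only illustrates the concept and is not the package's internal logic; the sample lines, the pattern "^//$", and the variable names are all made up for illustration.

```r
# Made-up lines in which every record ends with a line containing only "//".
lines <- c("name: a", "value: 1", "//", "name: b", "value: 2", "//")

# Hypothetical pattern of the kind one might pass as recordEndRegex.
endRegex <- "^//$"

# TRUE for each line that closes a record.
isEnd <- grepl(endRegex, lines)

# A new record starts on the line after each end marker, so shift the
# markers by one position and take a running count.
recordId <- cumsum(c(0L, head(isEnd, -1))) + 1L

# One character vector per record, end marker included.
records <- split(lines, recordId)
length(records)
```

With a pattern like this, a chunk boundary can be placed after an end-of-record line rather than in the middle of a record.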

Details

The function fn should take one argument: a vector of strings, each element of which is a line in the file. fn may instead take two arguments, in which case the second argument is the header line from the file (some parsing methods need to know the header).
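The two accepted signatures for fn can be sketched in base R without a connection. The function names parseLines and parseLinesWithHeader, and the sample chunk, are hypothetical; the sketch assumes the header arrives as a single comma-separated string, as it does in the example below.

```r
# One-argument form: receives only the character vector of lines.
parseLines <- function(x) {
  read.csv(text = paste(x, collapse = "\n"), header = FALSE)
}

# Two-argument form: also receives the header line as a single string.
parseLinesWithHeader <- function(x, header) {
  colNames <- strsplit(header, ",")[[1]]
  read.csv(text = paste(x, collapse = "\n"), header = FALSE,
    col.names = colNames)
}

# Simulate one chunk of lines and a header line (made-up data).
header <- "id,score"
chunk <- c("1,0.5", "2,0.75", "3,1.0")

res1 <- parseLines(chunk)                    # columns get default names
res2 <- parseLinesWithHeader(chunk, header)  # columns named from the header
names(res2)
```

The two-argument form is useful when column names matter, because only the first chunk of a file would otherwise contain them.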

Examples

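library(datadr)

# write iris to a temporary CSV file to use as input
csvFile <- file.path(tempdir(), "iris.csv")
write.csv(iris, file = csvFile, row.names = FALSE, quote = FALSE)
# create a local disk connection to receive the parsed chunks
myoutput <- localDiskConn(file.path(tempdir(), "irisText"), autoYes = TRUE)
# read 10 lines at a time, parsing each chunk into a data frame
a <- readTextFileByChunk(csvFile,
  output = myoutput, linesPerBlock = 10,
  fn = function(x, header) {
    colNames <- strsplit(header, ",")[[1]]
    # text = avoids leaving an unclosed textConnection behind
    read.csv(text = paste(x, collapse = "\n"), col.names = colNames, header = FALSE)
  })
# inspect the first key-value pair
a[[1]]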

Example output

* Saving connection attributes
Processing chunk  1 
Processing chunk  2 
Processing chunk  3 
Processing chunk  4 
Processing chunk  5 
Processing chunk  6 
Processing chunk  7 
Processing chunk  8 
Processing chunk  9 
Processing chunk  10 
Processing chunk  11 
Processing chunk  12 
Processing chunk  13 
Processing chunk  14 
Processing chunk  15 
Processing chunk  16 
$key
[1] 6

$value
  Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1          7.0         3.2          4.7         1.4 versicolor
2          6.4         3.2          4.5         1.5 versicolor
3          6.9         3.1          4.9         1.5 versicolor
4          5.5         2.3          4.0         1.3 versicolor
5          6.5         2.8          4.6         1.5 versicolor
...

datadr documentation built on May 1, 2019, 8:06 p.m.