drRead.table: Data Input

Description

Reads a text file in table format and creates a distributed data frame from it, with cases corresponding to lines and variables to fields in the file.
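
As a quick illustration (a minimal sketch mirroring the Examples section at the bottom of this page), a comma-separated file on local disk can be read into a ddf as follows; the snippets under some of the arguments below reuse csvFile from this sketch:

  csvFile <- file.path(tempdir(), "iris.csv")
  write.csv(iris, file = csvFile, row.names = FALSE, quote = FALSE)
  irisDdf <- drRead.csv(csvFile,
    output = localDiskConn(file.path(tempdir(), "irisDdf"), autoYes = TRUE),
    rowsPerBlock = 10)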

Usage

## S3 method for class 'table'
drRead(file, header = FALSE, sep = "", quote = "\"'", dec = ".",
  skip = 0, fill = !blank.lines.skip, blank.lines.skip = TRUE, comment.char = "#",
  allowEscapes = FALSE, encoding = "unknown", autoColClasses = TRUE,
  rowsPerBlock = 50000, postTransFn = identity, output = NULL, overwrite = FALSE,
  params = NULL, packages = NULL, control = NULL, ...)
## S3 method for class 'csv'
drRead(file, header = TRUE, sep = ",",
  quote = "\"", dec = ".", fill = TRUE, comment.char = "", ...)
## S3 method for class 'csv2'
drRead(file, header = TRUE, sep = ";",
  quote = "\"", dec = ",", fill = TRUE, comment.char = "", ...)
## S3 method for class 'delim'
drRead(file, header = TRUE, sep = "\t",
  quote = "\"", dec = ".", fill = TRUE, comment.char = "", ...)
## S3 method for class 'delim2'
drRead(file, header = TRUE, sep = "\t",
  quote = "\"", dec = ",", fill = TRUE, comment.char = "", ...)

Arguments

file

input text file - can either be character string pointing to a file on local disk, or an hdfsConn object pointing to a text file on HDFS (see output argument below)

header

this and the other parameters below are passed to read.table for each chunk being processed - see read.table for more information. Most have defaults, or appropriate defaults are set through the format-specific functions such as drRead.csv and drRead.delim.

sep

see read.table for more info

quote

see read.table for more info

dec

see read.table for more info

skip

see read.table for more info

fill

see read.table for more info

blank.lines.skip

see read.table for more info

comment.char

see read.table for more info

allowEscapes

see read.table for more info

encoding

see read.table for more info

autoColClasses

should column classes be determined automatically by reading in a sample? This can sometimes be problematic because of the strange ways read.table handles quotes, but keeping the default of TRUE is advantageous for speed.
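
For example, if automatic detection is turned off, column classes can be supplied explicitly; the sketch below assumes colClasses is passed through ... to read.table and uses csvFile from the sketch in the Description:

  # a hedged sketch; colClasses is assumed to reach read.table via ...
  d <- drRead.csv(csvFile, autoColClasses = FALSE,
    colClasses = c(Sepal.Length = "numeric", Sepal.Width = "numeric",
      Petal.Length = "numeric", Petal.Width = "numeric", Species = "character"),
    output = localDiskConn(file.path(tempdir(), "irisManual"), autoYes = TRUE))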

rowsPerBlock

how many rows of the input file should make up a block (key-value pair) of output?

postTransFn

a function to be applied after a block is read in, to provide any additional processing before the block is stored
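
For example, a minimal sketch of a post-transformation that adds a derived column to each block as it is read (built on the iris file from the Description; the column name ratio is hypothetical):

  addRatio <- function(x) {
    x$ratio <- x$Sepal.Length / x$Sepal.Width  # derived column added per block
    x
  }
  d <- drRead.csv(csvFile, postTransFn = addRatio,
    output = localDiskConn(file.path(tempdir(), "irisRatio"), autoYes = TRUE))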

output

a "kvConnection" object indicating where the output data should reside. Must be a localDiskConn object if input is a text file on local disk, or a hdfsConn object if input is a text file on HDFS.

overwrite

logical; should an existing output location be overwritten? (you can also specify overwrite = "backup" to move the existing output to _bak)

params

a named list of objects external to the input data that are needed in postTransFn
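
For example, a sketch of supplying an external lookup table needed inside postTransFn (the objects speciesCodes and addCode are hypothetical, built on the iris file from the Description):

  speciesCodes <- data.frame(Species = unique(iris$Species), code = 1:3)
  addCode <- function(x) merge(x, speciesCodes, by = "Species")  # uses the external object
  d <- drRead.csv(csvFile, postTransFn = addCode,
    params = list(speciesCodes = speciesCodes),
    output = localDiskConn(file.path(tempdir(), "irisCoded"), autoYes = TRUE))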

packages

a vector of R package names that contain functions used in postTransFn (most are taken care of automatically, so this rarely needs to be specified)

control

parameters specifying how the backend should handle things (most likely parameters to rhwatch in RHIPE) - see rhipeControl and localDiskControl

...

see read.table for more info

Value

an object of class "ddf"

Note

For local disk, the file is actually read in sequentially instead of in parallel. This is because of possible performance issues when trying to read from the same disk in parallel.

Note that if skip is positive and/or header is TRUE, these lines are read in first, since they occur only once in the data; each block is then checked for them and they are removed if they appear.

Also note that if you supply "factor" column classes, they will be converted to character.

Author(s)

Ryan Hafen

Examples

## Not run:
  csvFile <- file.path(tempdir(), "iris.csv")
  write.csv(iris, file = csvFile, row.names = FALSE, quote = FALSE)
  irisTextConn <- localDiskConn(file.path(tempdir(), "irisText2"), autoYes = TRUE)
  a <- drRead.csv(csvFile, output = irisTextConn, rowsPerBlock = 10)
## End(Not run)
