ReadCSV: Read a text file.


Description

ReadCSV reads a dataset from a plain text file.

Usage

ReadCSV(files, attributes, header = FALSE, skip = 0, nrows = -1,
  sep = ",", simple = FALSE, quote = "\"", escape = "\\",
  trim.cr = FALSE, nullable = FALSE, MoreArgs = list(), chunk = NULL)

Arguments

files

A character vector of the files to be read. Paths can be either absolute or relative to the current working directory of the R session. Each file should be a CSV with the same format. Only the first file is used to determine column names, types, etc.

attributes

The specification of the columns being read from the files. This should either be the name of a relation or a call to c.

In the former case, the name is specified in the same format as it is for Load, meaning that it can be a name or character string. The inputs are taken to be the same as those of that relation, both names and types.

Otherwise, the attributes are specified as the arguments of the call to c. Each argument should be of the form name = type, where name is the desired name for the corresponding column and type is a literal type object. The columns are specified in order and skipping is allowed. See ‘details’ for more information.

However, if attributes is a length-one character vector, both of these formats are superseded: the single element is taken to be the name of the relation to use if and only if it names a valid relation.

In the case that the file has more columns than are specified here, extra columns at the end are simply skipped.

header

Whether the files contain the name of the columns on the first line. If so, only the header in the first file is used.

skip

The number of lines to skip at the beginning of each file.

nrows

The maximum number of lines to read. Negative and other invalid values are ignored.

sep

The field delimiter, given as a length-one character vector. This single element should be either the word "TAB" or a single ASCII character, escape characters included. For example, "\t", " ", and "\\" all work. The only exception is "\n", which cannot be used as a delimiter because it already marks the end of each line.
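The rules above can be sketched as a small validation helper in base R. This is only an illustration: normalize_sep is a hypothetical name, not part of gtBase, and the checks actually performed by ReadCSV may differ.

```r
# Hypothetical helper (not part of gtBase): validate a field delimiter
# according to the rules described for `sep`.
normalize_sep <- function(sep) {
  stopifnot(is.character(sep), length(sep) == 1L, !is.na(sep))

  # The word "TAB" is accepted as an alias for the tab character.
  if (identical(sep, "TAB")) {
    return("\t")
  }

  # Otherwise the delimiter must be exactly one ASCII character.
  code <- utf8ToInt(sep)
  if (length(code) != 1L || code > 127L) {
    stop("sep must be \"TAB\" or a single ASCII character")
  }

  # The line feed ends each line, so it cannot also separate fields.
  if (identical(sep, "\n")) {
    stop("sep cannot be the newline character")
  }

  sep
}

tab_sep <- normalize_sep("TAB")        # the tab character
backslash_sep <- normalize_sep("\\")   # a single backslash, returned unchanged
```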

simple

Whether quotes are allowed. If not, then the fields are split whenever the delimiter is seen, regardless of whether it is inside a quoted string. Using a simple algorithm is significantly faster than not and is highly recommended whenever possible.
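The difference between the two splitting strategies can be illustrated with base R alone. The sketch below does not call the Grokit CSV reader; it uses strsplit and scan as stand-ins for the simple and quote-aware algorithms, and the sample line is made up.

```r
# A line whose second field contains the delimiter inside quotes.
line <- 'id,"Smith, John",42'

# Simple splitting: every delimiter ends a field, even inside quotes.
simple_fields <- strsplit(line, ",", fixed = TRUE)[[1]]

# Quote-aware splitting: the quoted comma stays inside its field.
quoted_fields <- scan(text = line, what = character(), sep = ",",
                      quote = "\"", quiet = TRUE)

length(simple_fields)  # 4
length(quoted_fields)  # 3
```

If the data never contains the delimiter inside a quoted field, both strategies yield the same fields, which is when simple = TRUE is safe to use.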

quote

The character used to quote strings, given as a length-one character vector whose single element should be a single ASCII character. Using different opening and closing characters to quote a string, such as "(" and ")", is not supported.

escape

The character used to escape the quote character, given as a length-one character vector whose single element should be a single ASCII character.

trim.cr

Whether to check for and remove carriage return (CR) characters. On Windows machines, lines typically end with a carriage return before the line feed (LF) character, a.k.a. the newline character. Setting this to TRUE ensures that CRs are not included in the last field. Only LF and CR+LF line endings are currently supported.
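The effect of a stray CR on the last field can be reproduced in base R. This is a sketch of the problem that trim.cr addresses, not of how ReadCSV implements the fix; the sample line is made up.

```r
# A line as it appears when a CR+LF file is split on LF only.
line <- "1,alice,3.5\r"

# Without trimming, the CR stays attached to the last field.
fields_raw <- strsplit(line, ",", fixed = TRUE)[[1]]

# Removing a trailing CR before splitting gives a clean last field.
fields_trimmed <- strsplit(sub("\r$", "", line), ",", fixed = TRUE)[[1]]

fields_raw[3]      # "3.5\r"
fields_trimmed[3]  # "3.5"
```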

nullable

An object used to specify the strings for each column that are to be interpreted as ‘NULL’ values, somewhat analogous to the na.strings argument of read.table. It differs in that each attribute can have at most one null string and different attributes can have different null strings, where a null string is a string that is interpreted as a NULL value.

The null string for each attribute can either be a length-one character, whose only element is taken to be the null string,

OR

a length-one logical. TRUE is the same as using the string "NULL" and FALSE denotes that no attributes are nullable.

nullable can either be given as one of the two above formats, in which case the same value is used for every attribute, or as a list in order to use different null strings for different attributes.

If given as a list, the elements are interpreted in the following order:

1) If named, the name is taken to be the attribute. The value is interpreted as described above.

2) If a list, the element labelled ‘attr’ should be a length-one character giving the attribute name. If an element labelled ‘null’ exists, it is interpreted as the null string; if not, then the null string is taken to be "NULL".

3) If a length-one character, then the single element is treated as the attribute name and the null string is taken to be "NULL".

Any other format results in an error.

Currently, only format (1) of the list is supported.
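The rules for nullable can be sketched as a function that maps each attribute to its null string. This is a hedged illustration in base R: resolve_null_string and resolve_nullable are hypothetical names, only the scalar forms and list format (1) are handled, and treating attributes missing from the list as not nullable is an assumption rather than documented behaviour.

```r
# Hypothetical helper: turn one nullable value into a null string.
# NA_character_ is used here to mean "not nullable".
resolve_null_string <- function(value) {
  if (is.character(value) && length(value) == 1L) {
    return(value)
  }
  if (is.logical(value) && length(value) == 1L && !is.na(value)) {
    return(if (value) "NULL" else NA_character_)
  }
  stop("nullable values must be a length-one character or logical")
}

# Hypothetical helper: resolve `nullable` for a vector of attribute names.
resolve_nullable <- function(nullable, attrs) {
  result <- setNames(rep(NA_character_, length(attrs)), attrs)

  # Scalar form: the same value applies to every attribute.
  if (!is.list(nullable)) {
    result[] <- resolve_null_string(nullable)
    return(result)
  }

  # List format (1): every element is named after its attribute.
  keys <- names(nullable)
  if (is.null(keys) || any(!nzchar(keys))) {
    stop("only named list elements are supported")
  }
  unknown <- setdiff(keys, attrs)
  if (length(unknown) > 0L) {
    stop("unknown attributes: ", paste(unknown, collapse = ", "))
  }
  for (key in keys) {
    result[[key]] <- resolve_null_string(nullable[[key]])
  }
  result
}

per_column <- resolve_nullable(list(age = "NA", city = TRUE),
                               c("name", "age", "city"))
all_columns <- resolve_nullable(TRUE, c("a", "b"))
```

Here per_column marks name as not nullable, uses "NA" for age, and uses "NULL" for city, while all_columns uses "NULL" for both attributes.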

MoreArgs

A list of additional arguments to pass to an inner call to read.table. This call is used to read in a partial table, which is used for various checks as well as for determining column names and types. Only the first file given in files is read. Each argument must be named, to avoid confusion.

Any argument taken by both ReadCSV and read.table cannot be passed in this manner; the value passed to ReadCSV is used instead. The exception to this is nrows: only a single row is ever read by read.table.

Furthermore, arguments may be changed to fully mimic the behavior of the Grokit CSV Reader, such as to accommodate simple = TRUE.

chunk

The chunk size, to be passed to Input.

Details

This section deals with the specification of attribute names and types, which is considerably more complicated than that of read.table. The description of attributes should be read before continuing.

When given as a call, the specification is quoted and broken apart before being processed on a per-element basis. Unlike read.table, column names and types can be specified for some columns and left blank for others, in which case automatic processing takes over as it does in read.table. In order to skip either a column name or type, simply omit the corresponding label in the call, using a completely empty argument when skipping both for a column.

For example, if you want to omit the name of the second column, the type of the third column, and both for the fourth column, c(a = b, c, d=, ) would be appropriate. In this example, the first column has name “a” and type b. The second column has type c and is given a generated name, such as “V1”. The third column has name “d” and its type is deduced from the file. The fourth column gets both a generated name and a deduced type.

In the case that you want a single column without specifying either the name or the type, simply use c(). Normally, converting this to a list structure based on its AST results in no arguments. However, it is understood that reading in a CSV with zero columns is nonsensical and so this special functionality is used, as there is no other way to specify such a call.
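How such a call can be broken apart is sketched below in base R. This is not the gtBase implementation: parse_attributes is a hypothetical name, the warning for functions other than c is omitted, and NA is used here to mark a name or type that was left out.

```r
# Hypothetical helper: capture an unevaluated call such as
# c(a = b, c, d = , ) and list the name and type given for each column.
parse_attributes <- function(spec) {
  spec <- substitute(spec)
  stopifnot(is.call(spec))
  args <- as.list(spec)[-1]  # drop the function being called

  # c() has no arguments in its AST; treat it as one unspecified column.
  if (length(args) == 0L) {
    return(data.frame(name = NA_character_, type = NA_character_,
                      stringsAsFactors = FALSE))
  }

  # Unlabelled arguments have an empty name.
  arg_names <- names(args)
  if (is.null(arg_names)) {
    arg_names <- rep("", length(args))
  }
  arg_names[!nzchar(arg_names)] <- NA_character_

  # An omitted value is stored as the empty symbol, quote(expr = ).
  types <- vapply(seq_along(args), function(i) {
    if (identical(args[[i]], quote(expr = ))) {
      NA_character_
    } else {
      paste(deparse(args[[i]]), collapse = "")
    }
  }, character(1))

  data.frame(name = arg_names, type = types, stringsAsFactors = FALSE)
}

columns <- parse_attributes(c(a = b, c, d = , ))
single <- parse_attributes(c())
```

In columns, the four rows hold the names a, NA, d, NA and the types b, c, NA, NA, matching the example above. single has one row with both entries left as NA.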

In truth, the function being called does not have to be c. A warning is thrown but the exact function being called is otherwise ignored. This allows for accidentally using list or similar mistakes.

The default names for columns are the same as they are in read.table. If the file header is read, then the corresponding name is used. Otherwise, the name is “V” followed by the column index, starting at zero.

Types are determined by an inner call to read.table. However, strings are never assumed to be factors. See the MoreArgs argument for more details.
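The kind of preview described here can be reproduced directly with read.table. The sketch below is only an approximation of the inner call, which is not shown in this page: the file contents are made up, and passing nrows = 1 and stringsAsFactors = FALSE is an assumption based on the descriptions of MoreArgs and of type detection.

```r
# Write a small sample file to a temporary location.
path <- tempfile(fileext = ".csv")
writeLines(c("id,name,score", "1,alice,3.5", "2,bob,4.0"), path)

# Read a single row, as a preview used to work out names and types.
preview <- read.table(path, header = TRUE, sep = ",", nrows = 1,
                      stringsAsFactors = FALSE)

names(preview)                           # "id" "name" "score"
vapply(preview, class, character(1))     # integer, character, numeric

unlink(path)
```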


tera-insights/gtBase documentation built on May 31, 2019, 8:35 a.m.