A table is broken into records (normally by newlines), and records are broken into fields (normally by a delimiter). But you can't first parse into lines and then into fields, because to find the end of a record you need to understand escaping and quoting. This makes the readr C++ interface fundamentally field based.
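A minimal sketch (not readr code) of why records can't be found by splitting on newlines first: a newline inside a quoted field is data, not a record boundary, so a quote-unaware scan miscounts records.

```cpp
#include <cassert>
#include <string>

// Count record boundaries naively vs. with quote awareness.
// Illustrative sketch only; not readr's implementation.
int countRecordsNaive(const std::string& csv) {
  int n = 0;
  for (char c : csv) {
    if (c == '\n') ++n;  // treats every newline as a record end
  }
  return n;
}

int countRecordsQuoteAware(const std::string& csv) {
  int n = 0;
  bool inQuotes = false;
  for (char c : csv) {
    if (c == '"') inQuotes = !inQuotes;        // toggle quoted state
    else if (c == '\n' && !inQuotes) ++n;      // only unquoted newlines end records
  }
  return n;
}
```

Given the single record `a,"x⏎y"⏎`, the naive count reports two records while the quote-aware count correctly reports one.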
Data is fundamentally rectangular, and usually stored in record-major format (the input may be ragged, but the output will always be a rectangular dataset).
We always know the number (and type) of fields; we don't always know the number of rows. Field types are known in advance (either guessed from the first 100 rows, or supplied by the user). For large files, the cost of determining types is negligible compared to parsing; small files are so quick anyway that a little additional overhead doesn't matter.
The key to good performance is to avoid copying and allocation. The C++ API is designed so that we never need to copy data, except at the last possible moment, to unescape strings.
There are four main components:
- A **source** provides an iterator-based interface to a file, string, or raw vector. In C++, these are `SourceFile`, `SourceString`, `SourceRaw`, etc. Each source has a corresponding R representation, usually generated by `datasource()`. `Source::create()` instantiates the appropriate class from the R representation.
- A **token** is an iterator that points to a single value in a source. A token also contains metadata about the location of the value (e.g. the row and column, needed for informative error messages) and, optionally, an unescape method. The unescape method is used if the tokenizer detects escapes (and hence the memory allocated by the source can't be used directly).
- A **tokenizer** converts a stream of characters from a source into a stream of tokens. Tokenizers are typically written in DFA style. This is a bit more verbose than ad hoc parsing, but it makes it much easier to verify correctness.
- **Field collectors** take a stream of tokens, parsing each token and storing it in an R vector. There is one collector for each column type: `CollectorLogical`, `CollectorInteger`, `CollectorDouble`, etc. On the R side, these are represented by `col_logical()`, `col_integer()`, `col_double()`, etc. `Collector::create()` dynamically creates a Collector subclass from an R list.
Each component is described in more detail below.
There are three main sources (`Source.h`):

- `SourceFile.h`
- `SourceString.h`
- `SourceRaw.h`

Sources abstract away the underlying data storage to provide an iterator-based interface (`.begin()` and `.end()`).
Currently, connections are supported by saving to a file. Eventually, we'll need to fully support streaming connections by implementing a stream-based parsing interface.
A token (`Token.h`) is one of:

- Empty
- Missing
- A string, represented by two iterators into the underlying source. If the string is escaped, the token also contains a pointer to an unescaping function.
- EOF, used to indicate that parsing is complete.

Tokens also store their position (the row and column of the field) for informative error messages.
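The string case is the key to the zero-copy design. A sketch with hypothetical names: a string token is just a pair of pointers into the source buffer plus its position, so no characters are copied until the value is actually materialised:

```cpp
#include <cassert>
#include <string>

// Illustrative token: two iterators into the source plus row/column
// metadata. The characters are only copied when str() is called
// (readr additionally defers unescaping to this point).
struct Token {
  const char* begin;
  const char* end;
  int row;
  int col;
  std::string str() const { return std::string(begin, end); }
};
```

A token for the second field of `"a,bc,def"` would hold pointers to offsets 2 and 4 of the source buffer, allocating nothing until `str()` is called.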
The tokenizer (`Tokenizer.h`) turns a source (a stream of characters) into a stream of tokens. To use a tokenizer:
```cpp
// Create the C++ object from the R spec
TokenizerPtr tokenizer = Tokenizer::create(tokenizerSpec);

// Initialise it with a source
tokenizer->tokenize(source->begin(), source->end());

// Call nextToken until there are no tokens left
for (Token t = tokenizer->nextToken(); t.type() != TOKEN_EOF; t = tokenizer->nextToken());
```
The most important tokenizers are:

- `TokenizerDelim` for parsing general delimited files.
- `TokenizerFixedWidth` for parsing fixed-width files.
Tokenizers also identify missing values and manage encoding (mediated through the token).
```r
DiagrammeR::mermaid('
graph LR
  field -->|...| field
  field -->|,| delim
  delim -->|...| field
  delim -->|,| delim
  delim -->|"| string
  style delim fill:lightgreen
  string -->|"| quote
  string -->|...| string
  quote -->|"| string
  quote -->|,| delim
')
```
(`,` is shorthand for any delimiter, newline, or EOF.)
It is designed to support the most common style of csv file, where quotes in strings are escaped by doubling. In other words, to create a field containing a single double quote, you use `""""`.
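To make the DFA style concrete, here is a toy tokenizer (not readr's `TokenizerDelim`) that handles commas, quoted fields, and double-escaped quotes, loosely following the states in the diagram above:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Minimal DFA-style tokenizer for comma-delimited input with
// double-quote quoting. Illustrative only; the real TokenizerDelim
// also handles escapes, encodings, missing values, and newlines.
enum class State { Field, String, QuoteSeen };

std::vector<std::string> tokenize(const std::string& in) {
  std::vector<std::string> tokens;
  std::string cur;
  State state = State::Field;
  for (char c : in) {
    switch (state) {
      case State::Field:                       // unquoted field body
        if (c == ',') { tokens.push_back(cur); cur.clear(); }
        else if (c == '"') state = State::String;
        else cur += c;
        break;
      case State::String:                      // inside a quoted field
        if (c == '"') state = State::QuoteSeen;
        else cur += c;
        break;
      case State::QuoteSeen:                   // "" is an escaped quote
        if (c == '"') { cur += '"'; state = State::String; }
        else if (c == ',') { tokens.push_back(cur); cur.clear(); state = State::Field; }
        break;
    }
  }
  tokens.push_back(cur);                       // final field at EOF
  return tokens;
}
```

Spelling the transitions out state by state is more verbose than an ad hoc parser, but each `(state, character)` pair can be checked against the diagram directly.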
. In the future, we'll add other tokenizers to support more esoteric formats.
```r
DiagrammeR::mermaid('
graph LR
  escape_s --> string
  string -->|\\| escape_s
  string -->|...| string
  string -->|"| string_complete
  string_complete --> delim
  delim -->|...| field
  delim -->|,| delim
  delim -->|"| string
  delim -->|\\| escape_f
  style delim fill:lightgreen
  field -->|...| field
  field -->|,| delim
  field -->|\\| escape_f
  escape_f --> field
')
```
Column collectors collect corresponding fields across multiple records, parsing strings and storing them in the appropriate R vector.
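As an illustration of the collector idea (names hypothetical, mirroring `CollectorDouble`): each collector parses a token's text and appends the result to its typed column.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Toy double collector: parse each field's text and store it in a
// growing numeric column. readr's collectors also track parse
// failures and write into preallocated R vectors, not std::vector.
class CollectorDouble {
  std::vector<double> column_;
public:
  void setValue(const std::string& field) {
    column_.push_back(std::strtod(field.c_str(), nullptr));
  }
  const std::vector<double>& vector() const { return column_; }
};
```

One such collector exists per column, so the parsing loop simply routes field *i* of each record to collector *i*.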
Four collectors correspond to existing behaviour in `read.csv()` etc.:
Three others support the most important S3 vectors:
There are two others that don't represent existing S3 vectors, but might be useful to add: