Assumptions

Main components

There are four main components: sources, tokens, tokenizers and column collectors.

Each component is described in more detail below.

Sources

There are three main sources (Source.h): files, raw vectors and strings.

Sources abstract away the underlying data storage to provide an iterator-based interface (.begin() and .end()).

Currently, connections are supported by first saving their contents to a file. Eventually, we'll need to fully support streaming connections by implementing a stream-based parsing interface.
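
From the R side, the three source types correspond to the kinds of input the user-facing readers accept. A small sketch (the file path is hypothetical; I() marks literal data in recent readr releases):

# A file on disk (hypothetical path)
readr::read_csv("data.csv")

# Literal data given as a string
readr::read_csv(I("x,y\n1,2\n"))

# Literal data given as a raw vector
readr::read_csv(charToRaw("x,y\n1,2\n"))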

Tokens

A token (Token.h) is either a string, a missing value, an empty field, or an end-of-file marker.

Tokens also store their position (row and col of the field) for informative error messages.
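
These positions are what surfaces in the parsing problems reported at the R level; a small sketch using readr's exported parse_integer() and problems():

x <- readr::parse_integer(c("1", "2", "abc"))
readr::problems(x)   # one row per failing field, including its position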

Tokenizer

The tokenizer (Tokenizer.h) turns a source (a stream of characters) into a stream of tokens. To use a tokenizer:

// Create the C++ object from the R spec
TokenizerPtr tokenizer = Tokenizer::create(tokenizerSpec);

// Initialise it with a source
tokenizer->tokenize(source->begin(), source->end());

// Call nextToken() until there are no tokens left
for (Token t = tokenizer->nextToken(); t.type() != TOKEN_EOF; t = tokenizer->nextToken());
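
The same machinery is also exposed directly at the R level; a small sketch (I() marks literal data in recent readr releases):

# Tokenize literal data with the csv tokenizer
readr::tokenize(I("1,2\n3,4\n"))

# Or just count the number of fields per record
readr::count_fields(I("1,2\n3,4\n"), readr::tokenizer_csv())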

The most important tokenizers are TokenizerDelim (for delimited files such as csv and tsv) and TokenizerFwf (for fixed-width files).

Tokenizers also identify missing values and manage encoding (mediated through the token).
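
At the R level, for example, the missing-value strings and the input encoding are declared on the reader and applied while tokenizing; a small sketch with made-up data:

# "." is treated as a missing value
readr::read_csv(I("x\n.\n42\n"), na = ".")

# Latin-1 input is re-encoded to UTF-8 as fields are parsed
readr::parse_character("\xe9l\xe8ve", locale = readr::locale(encoding = "latin1"))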

TokenizerDelim with doubled escapes

DiagrammeR::mermaid('graph LR
field -->|...| field
field -->|,| delim

delim -->|...| field
delim -->|,| delim
delim -->|"| string
style delim fill:lightgreen

string -->|"| quote
string -->|...| string

quote -->|"| string
quote -->|,| delim
')

(Here , is shorthand for any delimiter, newline or EOF.)

It is designed to support the most common style of csv file, where a quote inside a string is escaped by doubling it. In other words, to create a string containing a single double quote, you write """". In the future, we'll add other tokenizers to support more esoteric formats.
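
For example, at the R level (a small sketch with made-up data):

# The doubled quote inside the quoted field becomes a single literal quote
readr::read_csv(I('x\n"she said ""hi"""\n'))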

TokenizerDelim with backslash escapes

DiagrammeR::mermaid('graph LR
escape_s --> string

string -->|\\| escape_s
string -->|...| string
string -->|"| string_complete

string_complete --> delim

delim -->|...| field
delim -->|,| delim
delim -->|"| string
delim -->|\\| escape_f
style delim fill:lightgreen

field -->|...| field
field -->|,| delim
field -->|\\| escape_f

escape_f --> field
')
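
At the R level this variant is reached through read_delim()'s escape arguments; a small sketch with made-up data:

# \" inside a quoted field becomes a literal quote
readr::read_delim(I('x\n"she said \\"hi\\""\n'), delim = ",",
                  escape_backslash = TRUE, escape_double = FALSE)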

Column collectors

Column collectors collect corresponding fields across multiple records, parsing the strings and storing them in the appropriate type of R vector.

Four collectors correspond to the existing behaviour of read.csv() etc: logical, integer, double and character.

Three others support the most important S3 vectors: factors, dates and date-times.

There are two others that don't represent existing S3 vectors, but might be useful to add:
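
As an illustration of how collectors surface in the R API, a small sketch using readr's exported col_*() and parse_*() functions:

# Collectors are chosen per column through col_types
readr::read_csv(
  I("a,b,c,d\nTRUE,1,2.5,x\n"),
  col_types = readr::cols(
    a = readr::col_logical(),
    b = readr::col_integer(),
    c = readr::col_double(),
    d = readr::col_character()
  )
)

# Collectors can also be driven directly from character vectors
readr::parse_date("2015-10-10")
readr::parse_factor(c("a", "b", "a"), levels = c("a", "b"))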


