A table is broken into records (normally by newlines), and records are broken into fields (normally by a delimiter). But you can't first parse into lines and then into fields, because to find the end of a record you need to understand escaping and quoting. This makes the readr C++ interface fundamentally field based.
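A minimal sketch (not readr code) of why records can't be found by splitting on newlines first: a newline inside a quoted field is data, not a record boundary, so a quote-unaware scan miscounts records.

```cpp
#include <cassert>
#include <string>

// Count record boundaries naively vs. with quote awareness.
// Illustrative sketch only; not readr's implementation.
int countRecordsNaive(const std::string& csv) {
  int n = 0;
  for (char c : csv) {
    if (c == '\n') ++n;  // treats every newline as a record end
  }
  return n;
}

int countRecordsQuoteAware(const std::string& csv) {
  int n = 0;
  bool inQuotes = false;
  for (char c : csv) {
    if (c == '"') inQuotes = !inQuotes;        // toggle quoted state
    else if (c == '\n' && !inQuotes) ++n;      // only unquoted newlines end records
  }
  return n;
}
```

Given the single record `a,"x⏎y"⏎`, the naive count reports two records while the quote-aware count correctly reports one.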
Data is fundamentally rectangular, and usually stored in record-major format (the input may be ragged, but the output will always be a rectangular dataset).
We always know the number (and type) of fields; we don't always know the number of rows. Field types are known in advance (either guessed from the first 100 rows, or supplied by the user). For large files, the cost of determining types is negligible compared to parsing; small files are so quick anyway that a little additional overhead doesn't matter.
The key to good performance is to avoid copying and allocation. The C++ API is designed so that we never need to copy data, except at the last possible moment, to unescape strings.
There are four main components:
- A **source** provides an iterator-based interface to a file, string, or raw vector. In C++, these are `SourceFile`, `SourceString`, `SourceRaw`, etc. Each source has a corresponding R representation, usually generated by `datasource()`. `Source::create()` instantiates the appropriate class from the R representation.
- A **token** is an iterator that points to a single value in a source. A token also contains metadata about the location of the value (e.g. the row and column, needed for informative error messages) and, optionally, an unescape method. The unescape method is used if the tokenizer detects escapes (and hence the memory allocated by the source can't be used directly).
- A **tokenizer** converts a stream of characters from a source into a stream of tokens. Tokenizers are typically written in DFA style. This is a bit more verbose than ad hoc parsing, but it makes it much easier to verify correctness.
- **Field collectors** take a stream of tokens, parsing each token and storing it in an R vector. There is one collector for each column type: `CollectorLogical`, `CollectorInteger`, `CollectorDouble`, etc. On the R side, these are represented by `col_logical()`, `col_integer()`, `col_double()`, etc. `Collector::create()` dynamically creates a Collector subclass from an R list.
Each component is described in more detail below.
There are three main sources (`Source.h`):

- `SourceFile.h`
- `SourceString.h`
- `SourceRaw.h`

Sources abstract away the underlying data storage to provide an iterator-based interface (`.begin()` and `.end()`).
Currently, connections are supported by saving to a file. Eventually, we'll need to fully support streaming connections by implementing a stream-based parsing interface.
A token (`Token.h`) is one of:

- Empty
- Missing
- A string, represented by two iterators into the underlying source. If the string is escaped, the token also contains a pointer to an unescaping function.
- EOF, used to indicate that parsing is complete.

Tokens also store their position (the row and column of the field) for informative error messages.
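The string case is the key to the zero-copy design. A sketch with hypothetical names: a string token is just a pair of pointers into the source buffer plus its position, so no characters are copied until the value is actually materialised:

```cpp
#include <cassert>
#include <string>

// Illustrative token: two iterators into the source plus row/column
// metadata. The characters are only copied when str() is called
// (readr additionally defers unescaping to this point).
struct Token {
  const char* begin;
  const char* end;
  int row;
  int col;
  std::string str() const { return std::string(begin, end); }
};
```

A token for the second field of `"a,bc,def"` would hold pointers to offsets 2 and 4 of the source buffer, allocating nothing until `str()` is called.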
The tokenizer (`Tokenizer.h`) turns a source (a stream of characters) into a stream of tokens. To use a tokenizer:
```cpp
// Create the C++ object from the R spec
TokenizerPtr tokenizer = Tokenizer::create(tokenizerSpec);

// Initialise it with a source
tokenizer->tokenize(source->begin(), source->end());

// Call nextToken until there are no tokens left
for (Token t = tokenizer->nextToken(); t.type() != TOKEN_EOF; t = tokenizer->nextToken());
```
The most important tokenizers are:

- `TokenizerDelim` for parsing general delimited files.
- `TokenizerFixedWidth` for parsing fixed-width files.
Tokenizers also identify missing values and manage encoding (mediated through the token).
```r
DiagrammeR::mermaid('
graph LR
  field -->|...| field
  field -->|,| delim
  delim -->|...| field
  delim -->|,| delim
  delim -->|"| string
  style delim fill:lightgreen
  string -->|"| quote
  string -->|...| string
  quote -->|"| string
  quote -->|,| delim
')
```
(`,` is shorthand for any delimiter, newline, or EOF.)
It is designed to support the most common style of csv file, where quotes in strings are escaped by doubling. In other words, to create a field containing a single double quote, you use `""""`.
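To make the DFA style concrete, here is a toy tokenizer (not readr's `TokenizerDelim`) that handles commas, quoted fields, and double-escaped quotes, loosely following the states in the diagram above:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Minimal DFA-style tokenizer for comma-delimited input with
// double-quote quoting. Illustrative only; the real TokenizerDelim
// also handles escapes, encodings, missing values, and newlines.
enum class State { Field, String, QuoteSeen };

std::vector<std::string> tokenize(const std::string& in) {
  std::vector<std::string> tokens;
  std::string cur;
  State state = State::Field;
  for (char c : in) {
    switch (state) {
      case State::Field:                       // unquoted field body
        if (c == ',') { tokens.push_back(cur); cur.clear(); }
        else if (c == '"') state = State::String;
        else cur += c;
        break;
      case State::String:                      // inside a quoted field
        if (c == '"') state = State::QuoteSeen;
        else cur += c;
        break;
      case State::QuoteSeen:                   // "" is an escaped quote
        if (c == '"') { cur += '"'; state = State::String; }
        else if (c == ',') { tokens.push_back(cur); cur.clear(); state = State::Field; }
        break;
    }
  }
  tokens.push_back(cur);                       // final field at EOF
  return tokens;
}
```

Spelling the transitions out state by state is more verbose than an ad hoc parser, but each `(state, character)` pair can be checked against the diagram directly.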
. In the future, we'll add other tokenizers to support more esoteric formats.
```r
DiagrammeR::mermaid('
graph LR
  escape_s --> string
  string -->|\\| escape_s
  string -->|...| string
  string -->|"| string_complete
  string_complete --> delim
  delim -->|...| field
  delim -->|,| delim
  delim -->|"| string
  delim -->|\\| escape_f
  style delim fill:lightgreen
  field -->|...| field
  field -->|,| delim
  field -->|\\| escape_f
  escape_f --> field
')
```
Column collectors collect corresponding fields across multiple records, parsing strings and storing them in the appropriate R vector.
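As an illustration of the collector idea (names hypothetical, mirroring `CollectorDouble`): each collector parses a token's text and appends the result to its typed column.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Toy double collector: parse each field's text and store it in a
// growing numeric column. readr's collectors also track parse
// failures and write into preallocated R vectors, not std::vector.
class CollectorDouble {
  std::vector<double> column_;
public:
  void setValue(const std::string& field) {
    column_.push_back(std::strtod(field.c_str(), nullptr));
  }
  const std::vector<double>& vector() const { return column_; }
};
```

One such collector exists per column, so the parsing loop simply routes field *i* of each record to collector *i*.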
Four collectors correspond to existing behaviour in `read.csv()` etc.:
Three others support the most important S3 vectors:
There are two others that don't represent existing S3 vectors, but might be useful to add: