Corpus includes support for parsing data in JavaScript Object Notation (JSON) format. Beyond that, Corpus can type-check data values to ensure that their types are compatible. We use the latter facility to determine column types for tabular data stored in newline-delimited JSON (NDJSON) format.
Here, we describe the type system Corpus uses for JSON values, and we document the differences between JSON as understood by Corpus and JSON as formally specified. Finally, we describe the JSON parser, which follows a very different model from most other JSON parsers.
Every JSON value has a type. Types can be arranged in a lattice. The special
type Null
is the "bottom" type; all other types are super-types of Null
.
The special type Any
is the "top" type; all other types are sub-types
of Any
. Besides Null
and Any
, every type has a sub-type and a
super-type.
Any
|
|
+--------+-------+-----+------+-----------------+
| | | | |
| Real | +-------------+ +--------------+
Boolean | Text | Array Types | | Record Types |
| Integer | +-------------+ +--------------+
| | | | |
+--------+-------+-----+------+-----------------+
|
|
Null
In addition to Any
and Null
, the other atomic types are Boolean
,
Integer
, Real
, and Text
. Array types and record types are compound
types, defined in terms of other types.
The lattice denotes the sub-type and super-type relations. For example, every
value of type Integer
is also of type Real
and of type Any
.
Also, every value of type Null
is also of type Text
.
Array types are parameterized by the element type and the length. We require
arrays to have homogeneous element types, but, since we allow elements to have
type Any
, any valid JSON array has a valid type in our system. The length of
an array can either be a non-negative number, or it can be a special value
(-1
) interpreted as "variable".
Let Array(t, n)
denote an array with element type t
and length n
. We can
define the lattice of array types in terms of the smallest super-type of a
pair of array types (their parent):
The parent of Array(t, n)
and Array(u, n)
is Array(parent(t, u), n)
,
where parent(e, f)
is the parent type of e
and f
.
The parent of Array(t, n)
and Array(u, m)
for unequal m
and n
is Array(parent(t, u), -1)
.
Conceptually, a record type is a map that associates a type to every field
name. That is, a record type has an infinite set of fields. In practice, most
fields have type Null
, and we can represent this map efficiently as a set
of (name, type) pairs for the fields with non-Null
types.
Suppose that R
and S
are two record types. Their parent type is another
record type, P
. To define P
, we just need to define the types for the
fields of P
. To do this, let f
be any field name, and suppose that R[f]
is the type of the field in record R
, and S[f]
is the type of the field in
record S
. Then, we set the parent record type for this field to P[f] =
parent(R[f], S[f])
.
Here are some example JSON values along with their types:
Value Type
-------------------------------
null Null
true Boolean
-1 Integer
3.14 Real
"hello" Text
[] Array(Null, 0)
[1,2] Array(Integer, 2)
[1,false,3] Array(Any, 3)
[1,null,2.2] Array(Real, 3)
Here are some examples of parents of record types:
parent({"a": Boolean}, {"b": Text}) = {"a": Boolean, "b": Text}
parent({"a": Integer}, {"a": Real}) = {"a": Real}
parent({"a": Integer, "b": Real},
{"c": Text, "b": Integer}) = {"a": Integer, "b": Real, "c": Text}
In our type system, the value null
can have any type. In practice, we
interpret null
as a missing value (NA
in R).
JSON as understood by Corpus mostly agrees with the formal JSON specification, but there are a few differences, which we outline below.
Beyond requiring Unicode, formal JSON is agnostic to the encoding. Corpus requires that the data be encoded in valid UTF-8.
Corpus has a broader definition of a "number" than is given in the formal JSON specification:
+
) before positive numbers. Formal
JSON does not.Examples: +1
, +3.14
Examples: 1.
, .2
, .2e-10
Infinity
and NaN
as infinity and not-a-number,
respectively, and allows an optional sign (+
or -
) before these
sequences. This allows Corpus to parse the "JSON" generated by Python
(json.dumps(float('nan'))
or json.dumps(float('inf'))
). Formal JSON
does not allow these values.Examples: Infinity
, NaN
, -NaN
Corpus has a definition of a "text" that is more restrictive than a formal JSON string:
Formal JSON allows arbitrary escape sequences of the form \uXXXX
where
XXXX
is a sequence of four hexadecimal digits. Unlike formal JSON,
when XXXX
is a UTF-16 high surrogate, Corpus requires that the
next sequence of characters in the string be \uYYYY
, where YYYY
is
a UTF-16 low surrogate.
Unlike formal JSON, Corpus does allow escape sequences of the form \uYYYY
where YYYY
is a UTF-16 low surrogate unless the sequence is preceded by
\uXXXX
, where XXXX
is a UTF-16 high surrogate.
Corpus does not allow arrays longer than INT_MAX
elements (2147483647 on
most systems). Formal JSON does not place a limit on the maximum array
length, but JavaScript allows arrays up to 4294967295 elements.
Corpus has a definition of a "record" that is more restrictive than a formal JSON object:
Corpus does not allow records to have more than INT_MAX
fields.
Corpus requires that all field names be unique. Formal JSON does not. Moreover, Corpus requires that, when decoded to Unicode Normalized Composed Form (NFC) the normalized field names are all unique.
Examples:
{" ": 1, "\u0020": 2}
,
{"\u00e8": 3, "e\u0300": 4}
(valid JSON but not accepted by Corpus)
Most JSON libraries decode JSON-encoded values in a single pass; whenever the parser encounters a new value it calls a client-supplied callback function. Corpus takes a different approach, which requires two passes over the input data. In the first pass, Corpus validates the input data and determines its type. In the second pass, the client decodes the typed value to a native type.
The first pass over the value (a call to corpus_data_assign
) scans the input
and determines its type. If, in the process of scanning, Corpus encounters a
new data type, it adds this type to the passed-in schema object, assigning a
new integer ID for the type. After scanning, Corpus initializes a
struct corpus_data
value containing a pointer to the encoded value, its size
(in bytes), and the integer ID of the value's type. The first pass over the
data does not allocate any memory, except to add new types to the data schema
if necessary.
Once the value has been typed, the client can use corpus_data_bool
,
corpus_data_int
, corpus_data_double
, or corpus_data_text
to decode the
value to a native type. If the value is an array, the client can iterate over
its values using the corpus_data_items
function. If the value is a record,
the client can iterate over its fields using corpus_data_fields
, or she can
access specific fields by name using the corpus_data_field
function.
The relevant interfaces are data.h and datatype.h.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.