| nanoparquet-types | R Documentation |
How nanoparquet maps R types to Parquet types.
When writing out a data frame, nanoparquet maps R's data types to Parquet logical types. The following table is a summary of the mapping. For the details see below.
| R type | Parquet type | Default | Notes |
| bit64::integer64 | INT64 | x | NA_integer64_ marks missing values. |
| blob::blob | BYTE_ARRAY | x | Missing values are NULL. |
| " | FIXED_LEN_BYTE_ARRAY | All entries must have the same length. Missing values are NULL. |
|
| character | STRING(BYTE_ARRAY) | x | I.e. STRSXP. Converted to UTF-8. |
| " | BYTE_ARRAY | ||
| " | FIXED_LEN_BYTE_ARRAY | ||
| " | ENUM | ||
| " | UUID | ||
| Date | DATE | x | |
| difftime | INT64 | x | If not hms::hms. Arrow metadata marks it as Duration(NS). |
| factor | STRING | x | Arrow metadata marks it as a factor. |
| " | ENUM | ||
| hms::hms | TIME(true, MILLIS) | x | Sub-milliseconds precision is lost. |
| integer | INT(32, true) | x | I.e. INTSXP. |
| " | INT64 | ||
| " | INT96 | ||
| " | DECIMAL(INT32) | ||
| " | DECIMAL(INT64) | ||
| " | INT(8, *) | ||
| " | INT(16, *) | ||
| " | INT(32, signed) | ||
| list | LIST(INT32 elements) | x | List of integer vectors. NULL entries and NA elements are supported. |
| " | LIST(DOUBLE elements) | x | List of double vectors. NULL entries and NA elements are supported. |
| " | LIST(STRING elements) | x | List of character vectors. NULL entries and NA elements are supported. |
| " | BYTE_ARRAY | Must be a list of raw vectors. Missing values are NULL. |
|
| " | FIXED_LEN_BYTE_ARRAY | Must be a list of raw vectors of the same length. Missing values are NULL. |
|
| logical | BOOLEAN | x | I.e. LGLSXP. |
| numeric | DOUBLE | x | I.e. REALSXP. |
| " | INT96 | ||
| " | FLOAT | ||
| " | DECIMAL(INT32) | ||
| " | DECIMAL(INT64) | ||
| " | INT(*, *) | ||
| " | FLOAT16 | ||
| POSIXct | TIMESTAMP(true, MICROS) | x | Sub-microsecond precision is lost. |
The non-default mappings can be selected via the schema argument. E.g.
to write out a factor column called 'name' as ENUM, use
write_parquet(..., schema = parquet_schema(name = "ENUM"))
The detailed mapping rules are listed below, in order of preference. These rules will likely change until nanoparquet reaches version 1.0.0.
bit64::integer64 objects (from the bit64 package) are written as
INT64. nanoparquet handles any object that inherits the integer64
class this way. NA_integer64_ (i.e. INT64_MIN) marks missing values.
blob::blob objects (from the blob package) are written as BYTE_ARRAY.
blob::blob is a list of raw vectors, and nanoparquet handles any object
that inherits the blob class this way, even if the blob package is not
installed. Missing values (i.e. NULL list entries) are supported.
Factors (i.e. vectors that inherit the factor class) are converted
to character vectors using as.character(), then written as a
STRSXP (character vector) type. The fact that a column is a factor
is stored in the Arrow metadata (see below), unless the
nanoparquet.write_arrow_metadata option is set to FALSE.
Dates (i.e. the Date class) is written as DATE logical type, which
is an INT32 type internally.
hms objects (from the hms package) are written as TIME(true, MILLIS).
logical type, which is internally the INT32 Parquet type.
Sub-milliseconds precision is lost.
POSIXct objects are written as TIMESTAMP(true, MICROS) logical type,
which is internally the INT64 Parquet type.
Sub-microsecond precision is lost.
difftime objects (that are not hms objects, see above), are
written as an INT64 Parquet type, and noting in the Arrow metadata
(see below) that this column has type Duration with NANOSECONDS
unit.
Integer vectors (INTSXP) are written as INT(32, true) logical type,
which corresponds to the INT32 type.
Real vectors (REALSXP) are written as the DOUBLE type.
Character vectors (STRSXP) are written as the STRING logical type,
which has the BYTE_ARRAY type. They are always converted to UTF-8
before writing.
Logical vectors (LGLSXP) are written as the BOOLEAN type.
Other vectors error currently.
You can use infer_parquet_schema() on a data frame to map R data types
to Parquet data types.
To change the default R to Parquet mapping, use parquet_schema() and
the schema argument of write_parquet(). Currently supported
non-default mappings are:
integer to INT64,
integer to INT96,
double to INT96,
double to FLOAT,
character to BYTE_ARRAY,
character to FIXED_LEN_BYTE_ARRAY,
character to ENUM,
factor to ENUM,
integer to DECIAML & INT32,
integer to DECIAML & INT64,
double to DECIAML & INT32,
double to DECIAML & INT64,
integer to INT(8, *), INT(16, *), INT(32, signed),
double to INT(*, *),
character to UUID,
double to FLOAT16,
list of integer vectors to LIST with INT32 elements,
list of double vectors to LIST with DOUBLE elements,
list of character vectors to LIST with STRING elements,
list of raw vectors to BYTE_ARRAY,
list of raw vectors to FIXED_LEN_BYTE_ARRAY,
blob::blob to FIXED_LEN_BYTE_ARRAY.
When reading a Parquet file nanoparquet also relies on logical types and the Arrow metadata (if present, see below) in addition to the low level data types. The following table summarizes the mappings. See more details below.
| Parquet type | R type | Notes |
| Logical types | ||
| BSON | character | |
| DATE | Date | |
| DECIMAL | numeric | REALSXP, potentially losing precision. |
| ENUM | character | |
| FLOAT16 | numeric | REALSXP |
| INT(8, *) | integer | |
| INT(16, *) | integer | |
| INT(32, *) | integer | Large unsigned values may overflow! |
| INT(64, *) | numeric | REALSXP, or integer64 if read_int64_type option is set. |
| INTERVAL | list(raw) | Missing values are NULL. |
| JSON | character | |
| LIST | list | Elements are read as their corresponding R type. |
| MAP | Not supported. | |
| STRING | factor | If Arrow metadata says it is a factor. Also UTF8. |
| " | character | Otherwise. Also UTF8. |
| TIME | hms::hms | Also TIME_MILLIS and TIME_MICROS. |
| TIMESTAMP | POSIXct | Also TIMESTAMP_MILLIS and TIMESTAMP_MICROS. |
| UUID | character | In 00112233-4455-6677-8899-aabbccddeeff form. |
| UNKNOWN | Not supported. | |
| Primitive types | ||
| BOOLEAN | logical | |
| BYTE_ARRAY | factor | If Arrow metadata says it is a factor. |
| " | blob::blob | Otherwise. Missing values are NULL. |
| DOUBLE | numeric | REALSXP |
| FIXED_LEN_BYTE_ARRAY | blob::blob | Missing values are NULL. |
| FLOAT | numeric | REALSXP |
| INT32 | integer | |
| INT64 | numeric | REALSXP, or integer64 if read_int64_type option is set. |
| INT96 | POSIXct | |
The exact rules are below. These rules will likely change until nanoparquet reaches version 1.0.0.
The BOOLEAN type is read as a logical vector (LGLSXP).
The STRING logical type and the UTF8 converted type is read as
a character vector with UTF-8 encoding.
The DATE logical type and the DATE converted type are read as a
Date R object.
The TIME logical type and the TIME_MILLIS and TIME_MICROS
converted types are read as an hms object, see the hms package.
The TIMESTAMP logical type and the TIMESTAMP_MILLIS and
TIMESTAMP_MICROS converted types are read as POSIXct objects.
If the logical type has the UTC flag set, then the time zone of the
POSIXct object is set to UTC.
INT32 is read as an integer vector (INTSXP).
INT64 is read as a real vector (REALSXP) by default. If the
read_int64_type option in parquet_options() is set to "integer64"
or "bit64::integer64", it is read as a bit64::integer64 vector
instead. NA_integer64_ (i.e. INT64_MIN) marks missing values.
DOUBLE and FLOAT are read as real vectors (REALSXP).
INT96 is read as a POSIXct read vector with the tzone attribute
set to "UTC". It was an old convention to store time stamps as
INT96 objects.
The DECIMAL converted type (FIXED_LEN_BYTE_ARRAY or BYTE_ARRAY
type) is read as a real vector (REALSXP), potentially losing
precision.
The ENUM logical type is read as a character vector.
The UUID logical type is read as a character vector that uses the
00112233-4455-6677-8899-aabbccddeeff form.
The FLOAT16 logical type is read as a real vector (REALSXP).
BYTE_ARRAY is read as a factor object if the file was written
by Arrow and the original data type of the column was a factor.
(See 'The Arrow metadata below.)
Otherwise BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY are read as a
blob::blob object. blob::blob is a list of raw vectors with class
blob (and related vctrs classes). The blob package is not required to
use this object; it is simply a list of raw vectors. Missing values are
denoted by NULL.
Other logical and converted types are read as their annotated low level types:
INT(8, true), INT(16, true) and INT(32, true) are read as
integer vectors because they are INT32 internally in Parquet.
INT(64, true) is read as a real vector (REALSXP), unless the
read_int64_type option is set (see above).
Unsigned integer types INT(8, false), INT(16, false) and
INT(32, false) are read as integer vectors (INTSXP). Large
positive values may overflow into negative values, this is a known
issue that we will fix.
INT(64, false) is read as a real vector (REALSXP), unless the
read_int64_type option is set (see above). Large positive values may
overflow into negative values, this is a known issue that we will fix.
INTERVAL is a fixed length byte array, and nanoparquet reads it as
a list of raw vectors. Missing values are denoted by NULL.
JSON columns are read as character vectors (STRSXP).
BSON columns are read as raw vectors (RAWSXP).
These types are not yet supported:
Nested LIST types (lists of lists) are not supported.
The MAP logical type is not supported.
The UNKNOWN logical type is not supported.
You can use the read_parquet_schema() function to see how R would read
the columns of a Parquet file. Look at the r_type column.
Apache Arrow (i.e. the arrow R package) adds additional metadata to
Parquet files when writing them in arrow::write_parquet(). Then,
when reading the file in arrow::read_parquet(), it uses this metadata
to recreate the same Arrow and R data types as before writing.
nanoparquet::write_parquet() also adds the Arrow metadata to Parquet
files, unless the nanoparquet.write_arrow_metadata option is set to
FALSE.
Similarly, nanoparquet::read_parquet() uses the Arrow metadata in the
Parquet file (if present), unless the nanoparquet.use_arrow_metadata
option is set to FALSE.
The Arrow metadata is stored in the file level key-value metadata, with
key ARROW:schema.
Currently nanoparquet uses the Arrow metadata for two things:
It uses it to detect factors. Without the Arrow metadata factors are read as string vectors.
It uses it to detect difftime objects. Without the arrow metadata
these are read as INT64 columns, containing the time difference in
nanoseconds.
nanoparquet-package for options that modify the type mappings.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.