Working with Arrow and Parquet

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

The uint64 problem

A5 cell IDs are 64-bit unsigned integers. R has no native uint64 type, and its double can only represent integers exactly up to 2^53. Nearly half of all A5 cell IDs exceed this threshold, so converting them to double silently corrupts the data.
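The 2^53 threshold is easy to demonstrate in base R, without any A5 cells involved:

```r
# Doubles carry 53 bits of mantissa, so consecutive integers stop being
# distinguishable at 2^53.
below <- 2^53 - 1
above <- 2^53

below + 1 == below  # FALSE: 2^53 - 1 and 2^53 are both exactly representable
above + 1 == above  # TRUE: 2^53 + 1 rounds back to 2^53 -- silent collision
```

Any uint64 value above this line that is forced through a double suffers the same collision.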

This is a problem when reading Parquet files that store A5 cell IDs as uint64 columns — the standard format used by DuckDB, Python, and geoparquet.io. By default, arrow::read_parquet() converts uint64 to R's double, losing precision:

library(arrow)
library(tibble)
library(a5R)

# A real A5 cell — Edinburgh at resolution 20
cell <- a5_lonlat_to_cell(-3.19, 55.95, resolution = 20)
a5_u64_to_hex(cell)

# Write to Parquet as uint64 (the standard interchange format)
tf <- tempfile(fileext = ".parquet")
arrow::write_parquet(
  arrow::arrow_table(cell_id = a5_cell_to_arrow(cell)),
  tf
)

# Read it back naively — arrow silently converts uint64 to double
(naive <- as_tibble(arrow::read_parquet(tf)))

cell_as_dbl <- naive$cell_id

# The double can't distinguish this cell from nearby IDs
cell_as_dbl == cell_as_dbl + 1   # TRUE — silent corruption
cell_as_dbl == cell_as_dbl + 100 # still TRUE
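You can catch this before it happens by inspecting the Parquet schema rather than materialising straight to a data frame. A minimal, self-contained sketch (it builds its own uint64 column with arrow alone, so no a5R objects are assumed):

```r
library(arrow)

# Build a small table with a genuine uint64 column and round-trip it.
tbl <- arrow_table(id = Array$create(1:3)$cast(uint64()))
tf2 <- tempfile(fileext = ".parquet")
write_parquet(tbl, tf2)

# Reading with as_data_frame = FALSE keeps the data on the Arrow side,
# so the schema reveals the uint64 column before any lossy conversion.
read_parquet(tf2, as_data_frame = FALSE)$schema
```

Whenever the schema shows a uint64 column holding cell IDs, route it through a5_cell_from_arrow() instead of letting it decay to double.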

The solution: a5_cell_from_arrow() and a5_cell_to_arrow()

a5R provides two functions that bypass the lossy double conversion entirely, using Arrow's zero-copy View() to reinterpret the raw bytes:

library(a5R)
library(tibble)

# Six cities across the globe — some will have bit 63 set (origin >= 6)
cities <- tibble(
  name = c("Edinburgh", "Tokyo", "São Paulo", "Nairobi", "Anchorage", "Sydney"),
  lon  = c(   -3.19,     139.69,     -46.63,     36.82,    -149.90,    151.21),
  lat  = c(   55.95,      35.69,     -23.55,     -1.29,      61.22,    -33.87)
)

cities$cell <- a5_lonlat_to_cell(cities$lon, cities$lat, resolution = 10)
cities

These cells work seamlessly in tibbles. Now let's enrich the data with some A5 operations — cell resolution and distance from Edinburgh:

edinburgh <- cities$cell[1]

cities$resolution <- a5_get_resolution(cities$cell)
cities$dist_from_edinburgh_km <- as.numeric(
  a5_cell_distance(cities$cell, rep(edinburgh, nrow(cities)), units = "km")
)

cities

Writing and reading Parquet

Convert to an Arrow table and write to Parquet. The cell column is stored as native uint64 — the same binary format used by DuckDB, Python, and geoparquet.io:

tf <- tempfile(fileext = ".parquet")

arrow_tbl <- arrow::arrow_table(
  name = cities$name,
  cell_id = a5_cell_to_arrow(cities$cell),
  cell_res = cities$resolution,
  dist_from_edinburgh_km = cities$dist_from_edinburgh_km
)
arrow_tbl$schema
arrow::write_parquet(arrow_tbl, tf)

Read it back — a5_cell_from_arrow() recovers the exact cell IDs without any precision loss:

pq <- arrow::read_parquet(tf, as_data_frame = FALSE)

# Recover cells from the uint64 column (Arrow columns are 0-indexed,
# so column 1 is cell_id), then bind with the rest of the data
recovered_cells <- a5_cell_from_arrow(pq$column(1))
result <- as.data.frame(pq)
result$cell <- recovered_cells
result <- tibble::as_tibble(result[c("name", "cell", "cell_res", "dist_from_edinburgh_km")])
result

Verify the round-trip is lossless:

identical(format(cities$cell), format(result$cell))

How it works under the hood

  1. a5_cell_to_arrow(): packs the eight raw-byte fields into 8-byte little-endian blobs (one per cell), creates an Arrow fixed_size_binary(8) array, then uses View(uint64) to reinterpret the bytes as unsigned 64-bit integers — zero-copy.

  2. a5_cell_from_arrow(): does the reverse — View(fixed_size_binary(8)) on the uint64 array to get the raw bytes, then unpacks each 8-byte blob into the eight raw-byte fields used by a5_cell.

The raw bytes never pass through double, so there is no precision loss at any step. See vignette("internal-cell-representation") for details on the raw-byte representation.
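The little-endian packing in step 1 can be sketched in base R. Real A5 cell IDs use all 64 bits, which is exactly why a5R keeps them as raw bytes end to end; this illustration uses a small value so base R's 32-bit integers suffice:

```r
# Sketch only: pack a small ID into an 8-byte little-endian blob and back.
# (a5R does this for full 64-bit values; base R integers stop at 2^31 - 1.)
id <- 123456L
blob <- c(writeBin(id, raw(), size = 4, endian = "little"), raw(4))  # pad to 8 bytes
blob

# Unpack the low 4 bytes to recover the value
readBin(blob[1:4], "integer", size = 4, endian = "little")  # recovers 123456
```

An 8-byte blob like this is what backs each element of the fixed_size_binary(8) array, and View(uint64()) simply relabels those same bytes without copying them.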




a5R documentation built on March 26, 2026, 5:10 p.m.