```r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
A5 cell IDs are 64-bit unsigned integers. R has no native uint64 type,
and its double can only represent integers exactly up to 2^53. Nearly
half of all A5 cell IDs exceed this threshold, so converting them to
double silently corrupts the data.
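The 2^53 boundary is easy to demonstrate in base R, with no packages involved:

```r
# Integers up to 2^53 are exactly representable in a double; beyond that,
# adjacent integers become indistinguishable
2^53 - 1 == 2^53  # FALSE: both values are still exact
2^53 == 2^53 + 1  # TRUE: 2^53 + 1 rounds back down to 2^53
```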
This is a problem when reading Parquet files that store A5 cell IDs as
uint64 columns — the standard format used by DuckDB, Python, and
geoparquet.io. By default, `arrow::read_parquet()`
converts uint64 to R's double, losing precision:
```r
library(arrow)
library(tibble)
library(a5R)

# A real A5 cell — Edinburgh at resolution 20
cell <- a5_lonlat_to_cell(-3.19, 55.95, resolution = 20)
a5_u64_to_hex(cell)

# Write to Parquet as uint64 (the standard interchange format)
tf <- tempfile(fileext = ".parquet")
arrow::write_parquet(
  arrow::arrow_table(cell_id = a5_cell_to_arrow(cell)),
  tf
)

# Read it back naively — arrow silently converts uint64 to double
(naive <- tibble(arrow::read_parquet(tf)))
cell_as_dbl <- naive$cell_id

# The double can't distinguish this cell from nearby IDs
cell_as_dbl == cell_as_dbl + 1    # TRUE — silent corruption
cell_as_dbl == cell_as_dbl + 100  # still TRUE
```
## `a5_cell_from_arrow()` and `a5_cell_to_arrow()`

a5R provides two functions that bypass the lossy double conversion
entirely, using Arrow's zero-copy `View()` to reinterpret the raw bytes:
```r
library(a5R)
library(tibble)

# Six cities across the globe — some will have bit 63 set (origin >= 6)
cities <- tibble(
  name = c("Edinburgh", "Tokyo", "São Paulo", "Nairobi", "Anchorage", "Sydney"),
  lon  = c(-3.19, 139.69, -46.63, 36.82, -149.90, 151.21),
  lat  = c(55.95, 35.69, -23.55, -1.29, 61.22, -33.87)
)
cities$cell <- a5_lonlat_to_cell(cities$lon, cities$lat, resolution = 10)
cities
```
These cells work seamlessly in tibbles. Now let's enrich the data with some A5 operations — cell resolution and distance from Edinburgh:
```r
edinburgh <- cities$cell[1]
cities$resolution <- a5_get_resolution(cities$cell)
cities$dist_from_edinburgh_km <- as.numeric(
  a5_cell_distance(cities$cell, rep(edinburgh, nrow(cities)), units = "km")
)
cities
```
Convert to an Arrow table and write to Parquet. The cell column is stored
as native uint64 — the same binary format used by DuckDB, Python, and
geoparquet.io:
```r
tf <- tempfile(fileext = ".parquet")
arrow_tbl <- arrow::arrow_table(
  name = cities$name,
  cell_id = a5_cell_to_arrow(cities$cell),
  cell_res = cities$resolution,
  dist_from_edinburgh_km = cities$dist_from_edinburgh_km
)
arrow_tbl$schema
arrow::write_parquet(arrow_tbl, tf)
```
Read it back — `a5_cell_from_arrow()` recovers the exact cell IDs
without any precision loss:
```r
pq <- arrow::read_parquet(tf, as_data_frame = FALSE)

# Recover cells from the uint64 column, bind with the rest of the data
recovered_cells <- a5_cell_from_arrow(pq$column(1))
result <- as.data.frame(pq)
result$cell <- recovered_cells
result <- tibble::as_tibble(
  result[c("name", "cell", "cell_res", "dist_from_edinburgh_km")]
)
result
```
Verify the round-trip is lossless:
```r
identical(format(cities$cell), format(result$cell))
```
- `a5_cell_to_arrow()`: packs the eight raw-byte fields into 8-byte
  little-endian blobs (one per cell), creates an Arrow
  `fixed_size_binary(8)` array, then uses `View(uint64)` to
  reinterpret the bytes as unsigned 64-bit integers — zero-copy.
- `a5_cell_from_arrow()`: does the reverse — `View(fixed_size_binary(8))`
  on the uint64 array to get the raw bytes, then unpacks each 8-byte
  blob into the eight raw-byte fields used by `a5_cell`.
The raw bytes never pass through double, so there is no precision loss
at any step. See vignette("internal-cell-representation") for details
on the raw-byte representation.
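To make the byte layout concrete, here is a base-R sketch (an illustration only, not the package's actual code path) that packs an integer into 8 little-endian bytes and unpacks it again. Unlike a5R's raw-byte path, this toy version uses double arithmetic and is therefore limited to values below 2^53:

```r
# Pack a non-negative integer value into 8 little-endian bytes:
# byte 1 holds the least-significant 8 bits, byte 8 the most-significant
pack_le8 <- function(x) {
  bytes <- raw(8)
  for (i in 1:8) {
    bytes[i] <- as.raw(x %% 256)
    x <- x %/% 256
  }
  bytes
}

# Reverse: weight each byte by its power of 256 and sum
unpack_le8 <- function(bytes) {
  sum(as.numeric(bytes) * 256^(0:7))
}

v <- 2^52 + 12345          # still within the double-exact range
b <- pack_le8(v)
identical(unpack_le8(b), v)  # TRUE — lossless round-trip
```

The real functions avoid the 2^53 ceiling entirely because the bytes are reinterpreted in place and never converted to a numeric value.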