sfarrow is designed to help read/write spatial vector data in "simple feature" format from/to Parquet files while maintaining coordinate reference system information. Essentially, this tool attempts to connect R objects in sf and in arrow, and it relies on these packages for its internal work.
A key goal is to support interoperability of spatial data in Parquet files. R objects (including sf) can be written to files with arrow; however, these files do not necessarily maintain the spatial information, nor can they necessarily be read by Python. sfarrow implements a metadata format also used by Python's GeoPandas, described here: https://github.com/geopandas/geo-arrow-spec. Note that these metadata are not yet stable, and sfarrow will warn you that they may change.
# install from CRAN with
# install.packages('sfarrow')
# or install the development version with
# devtools::install_github("wcjochem/sfarrow@main")

# load the libraries
library(sfarrow)
library(dplyr, warn.conflicts = FALSE)
A Parquet file (with a .parquet extension) can be read using st_read_parquet() and supplying a path to the file. This creates an sf spatial data object in memory, which can then be used as normal with functions from sf.
# read an example dataset created from Python using geopandas
world <- st_read_parquet(system.file("extdata", "world.parquet", package = "sfarrow"))

class(world)
world
plot(sf::st_geometry(world))
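Because the coordinate reference system is carried in the file's metadata, it is restored on the imported object and can be checked in the usual way. This is a small illustrative check, not part of the original example:

# the CRS stored in the Parquet metadata is available on the sf object
sf::st_crs(world)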
Similarly, a Parquet file can be written from an sf object using st_write_parquet() and specifying a path to the new file. Non-spatial objects cannot be written with sfarrow; users should instead use arrow.
# output the file to a new location
# note the warning about possible future changes in metadata
st_write_parquet(world, dsn = file.path(tempdir(), "new_world.parquet"))
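If you are curious about what is stored, the spatial information travels in the Parquet file's key-value metadata under a "geo" key, following the geo-arrow-spec format mentioned above. A minimal sketch for inspecting it with arrow, assuming the file written above exists:

# read the file back as an Arrow Table and look at its key-value metadata;
# the "geo" entry is a JSON string describing geometry columns and the CRS
tbl <- arrow::read_parquet(file.path(tempdir(), "new_world.parquet"),
                           as_data_frame = FALSE)
tbl$metadata$geo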
While reading/writing a single Parquet file is nice, the real power of arrow comes from splitting big datasets into multiple files, or partitions, based on criteria that make them faster to query. There is currently basic support in sfarrow for multi-file spatial datasets. For additional dataset querying options, see the arrow documentation. sfarrow uses arrow's dplyr interface to explore partitioned Arrow Datasets.
For this example we will use a dataset which was created by randomly splitting the nc.shp file first into three groups and then further partitioning into two more random groups. This creates a nested set of files.
list.files(system.file("extdata", "ds", package = "sfarrow"), recursive = TRUE)
The file tree shows that the data were partitioned by the variables "split1" and "split2". Those are the column names that were used for the random splits. This partitioning is in "Hive style", where the partitioning variables appear in the folder paths.
The first step is to open the Dataset using arrow.
ds <- arrow::open_dataset(system.file("extdata", "ds", package = "sfarrow"))
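Printing the Dataset object is a quick way to confirm what was found; it reports the backing files and their schema without reading any rows into memory:

# printing the Dataset shows its schema and file count without loading rows
ds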
For small datasets (as in this example) we can read the entire set of files into an sf object.
nc_ds <- read_sf_dataset(ds)

nc_ds
With large datasets, more often we will want to query them and return a reduced set of the partitioned records. The easiest way to create a query is to use dplyr::filter() on the partitioning (and/or other) variables to subset the rows and dplyr::select() to subset the columns. read_sf_dataset() will then take the arrow_dplyr_query, call dplyr::collect() to extract the records, and process the resulting Arrow Table into sf.
nc_d12 <- ds %>%
  filter(split1 == 1, split2 == 2) %>%
  read_sf_dataset()

nc_d12
plot(sf::st_geometry(nc_d12), col = "grey")
When using select() to read only a subset of columns, if the geometry column is not returned, the default behaviour of sfarrow is to throw an error from read_sf_dataset(). If you do not need the geometry column for your analyses, then using arrow and not sfarrow should be sufficient. However, setting find_geom = TRUE in read_sf_dataset() will read in any geometry columns in the metadata, in addition to the selected columns.
# this command will throw an error:
# no geometry column selected for read_sf_dataset
# nc_sub <- ds %>%
#   select('FIPS') %>% # subset of columns
#   read_sf_dataset()

# set find_geom
nc_sub <- ds %>%
  select('FIPS') %>% # subset of columns
  read_sf_dataset(find_geom = TRUE)

nc_sub
To write an sf object into multiple files, we can again construct a query, using dplyr::group_by() to define the partitioning variables. The result is then passed to write_sf_dataset() from sfarrow.
world %>%
  group_by(continent) %>%
  write_sf_dataset(file.path(tempdir(), "world_ds"),
                   format = "parquet",
                   hive_style = FALSE)
In this example we are not using Hive style, so the partitioning variable's name does not appear in the folder paths.
list.files(file.path(tempdir(), "world_ds"))
To read this style of Dataset, we must specify the partitioning variables when the Dataset is opened.
arrow::open_dataset(file.path(tempdir(), "world_ds"),
                    partitioning = "continent") %>%
  filter(continent == "Africa") %>%
  read_sf_dataset()
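For comparison, the same partitions can be written with Hive-style folder paths by setting hive_style = TRUE; the folder names then include the variable (e.g. "continent=Africa"), and arrow can infer the partitioning when the Dataset is opened. This is a small sketch using a hypothetical world_ds_hive folder:

# write the same partitions with Hive-style folder names ("continent=Africa")
world %>%
  group_by(continent) %>%
  write_sf_dataset(file.path(tempdir(), "world_ds_hive"),
                   format = "parquet",
                   hive_style = TRUE)

list.files(file.path(tempdir(), "world_ds_hive"))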