sfarrow is designed to help read/write spatial vector data in "simple feature" format from/to Parquet files while maintaining coordinate reference system information. Essentially, this tool attempts to connect R objects in sf and in arrow, and it relies on these packages for its internal work.
A key goal is to support interoperability of spatial data in Parquet files. R objects (including sf) can be written to files with arrow; however, these files do not necessarily maintain the spatial information, nor can they necessarily be read by Python. sfarrow implements a metadata format also used by Python's GeoPandas, described here: https://github.com/geopandas/geo-arrow-spec. Note that these metadata are not yet stable, and sfarrow will warn you that they may change.
# install from CRAN with
# install.packages('sfarrow')
# or install the development version with
# devtools::install_github("wcjochem/sfarrow@main")

# load the libraries
library(sfarrow)
library(dplyr, warn.conflicts = FALSE)
A Parquet file (with a .parquet extension) can be read using st_read_parquet() and supplying a path to the file. This creates an sf spatial data object in memory, which can then be used as normal with functions from sf.
# read an example dataset created from Python using geopandas
world <- st_read_parquet(system.file("extdata", "world.parquet", package = "sfarrow"))

class(world)
world
plot(sf::st_geometry(world))
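Because the coordinate reference system is carried in the file's metadata, it is restored on the imported object and can be checked in the usual way. This is a small illustrative check, not part of the original example:

# the CRS stored in the Parquet metadata is available on the sf object
sf::st_crs(world)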
Similarly, a Parquet file can be written from an sf object using st_write_parquet() and specifying a path to the new file. Non-spatial objects cannot be written with sfarrow; users should instead use arrow.
# output the file to a new location
# note the warning about possible future changes in metadata
st_write_parquet(world, dsn = file.path(tempdir(), "new_world.parquet"))
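If you are curious about what is stored, the spatial information travels in the Parquet file's key-value metadata under a "geo" key, following the geo-arrow-spec format mentioned above. A minimal sketch for inspecting it with arrow, assuming the file written above exists:

# read the file back as an Arrow Table and look at its key-value metadata;
# the "geo" entry is a JSON string describing geometry columns and the CRS
tbl <- arrow::read_parquet(file.path(tempdir(), "new_world.parquet"),
                           as_data_frame = FALSE)
tbl$metadata$geo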
While reading/writing a single Parquet file is nice, the real power of arrow comes from splitting big datasets into multiple files, or partitions, based on criteria that make them faster to query. There is currently basic support in sfarrow for multi-file spatial datasets. For additional dataset querying options, see the arrow documentation. sfarrow uses arrow's dplyr interface to explore partitioned Arrow Datasets.
For this example we will use a dataset which was created by randomly splitting the nc.shp file first into three groups and then further partitioning into two more random groups. This creates a nested set of files.
list.files(system.file("extdata", "ds", package = "sfarrow"), recursive = TRUE)
The file tree shows that the data were partitioned by the variables "split1" and "split2". Those are the column names that were used for the random splits. This partitioning is in "Hive style", where the partitioning variables appear in the folder paths.
The first step is to open the Dataset using arrow.
ds <- arrow::open_dataset(system.file("extdata", "ds", package = "sfarrow"))
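Printing the Dataset object is a quick way to confirm what was found; it reports the backing files and their schema without reading any rows into memory:

# printing the Dataset shows its schema and file count without loading rows
ds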
For small datasets (as in this example) we can read the entire set of files into an sf object.
nc_ds <- read_sf_dataset(ds)

nc_ds
With large datasets, more often we will want to query them and return a reduced set of the partitioned records. The easiest way to create a query is to use dplyr::filter() on the partitioning (and/or other) variables to subset the rows and dplyr::select() to subset the columns. read_sf_dataset() will then take the arrow_dplyr_query, call dplyr::collect() to extract the records, and process the resulting Arrow Table into sf.
nc_d12 <- ds %>%
  filter(split1 == 1, split2 == 2) %>%
  read_sf_dataset()

nc_d12
plot(sf::st_geometry(nc_d12), col = "grey")
When using select() to read only a subset of columns, if the geometry column is not returned, the default behaviour of sfarrow is to throw an error from read_sf_dataset(). If you do not need the geometry column for your analyses, then using arrow and not sfarrow should be sufficient. However, setting find_geom = TRUE in read_sf_dataset() will read in any geometry columns in the metadata, in addition to the selected columns.
# this command will throw an error:
# no geometry column selected for read_sf_dataset
# nc_sub <- ds %>%
#   select('FIPS') %>% # subset of columns
#   read_sf_dataset()

# set find_geom
nc_sub <- ds %>%
  select('FIPS') %>% # subset of columns
  read_sf_dataset(find_geom = TRUE)

nc_sub
To write an sf object into multiple files, we can again construct a query, using dplyr::group_by() to define the partitioning variables. The result is then passed to write_sf_dataset() from sfarrow.
world %>%
  group_by(continent) %>%
  write_sf_dataset(file.path(tempdir(), "world_ds"),
                   format = "parquet",
                   hive_style = FALSE)
In this example we are not using Hive style, so the partitioning variable's name does not appear in the folder paths.
list.files(file.path(tempdir(), "world_ds"))
To read this style of Dataset, we must specify the partitioning variables when the Dataset is opened.
arrow::open_dataset(file.path(tempdir(), "world_ds"),
                    partitioning = "continent") %>%
  filter(continent == "Africa") %>%
  read_sf_dataset()
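For comparison, the same partitions can be written with Hive-style folder paths by setting hive_style = TRUE; the folder names then include the variable (e.g. "continent=Africa"), and arrow can infer the partitioning when the Dataset is opened. This is a small sketch using a hypothetical world_ds_hive folder:

# write the same partitions with Hive-style folder names ("continent=Africa")
world %>%
  group_by(continent) %>%
  write_sf_dataset(file.path(tempdir(), "world_ds_hive"),
                   format = "parquet",
                   hive_style = TRUE)

list.files(file.path(tempdir(), "world_ds_hive"))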