Geometry in tables: the gris model

Abstract

This package is a response to some missing use-cases in GIS (geographic information systems) using R. It is an attempt to provide a unified framework for doing the things that GIS traditionally cannot do without awkward workarounds.

With some reservation, I call this the gris model, and it's being used in the rangl and rbgm packages.

The model provides:

The model can be used to

The model relies on

Gris does not use nested tables, though see worked discussions on how to do this (in the vignettes).

The approach is not a drop-in replacement for existing tools for traditional GIS is R. If there's a better tool for the job, the recommended approach is to convert to the structures needed by that tool and do it, then return the result to the approach used here as needed.

Implementations

This is an ongoing development. The spbabel package is on CRAN and will be a base package for exploration in the area. Using spbabel I have created the package rangl and that keeps things simple by only providing plotting methods and the right data structures. There is limited support for manipulation and subsetting objects, but most of this uses generic table indexing techniques. It's not yet clear how much this will be formalized.

The model was originally implemented in the gris package, but that made a number of false starts and over-complicated some apsects. A version of the model is also used in the AustralianAntarcticDivision/rbgm package, but that does not yet have the full rangl lessons. While I was figuring out the next steps from gris I explored nested data frames (from tidyr) and created a package to provide proper holes for ggplot2 polygons in ggpolypath.

Introduction

The sp and ggplot2 packages are both widely used in R for plotting maps. The visualization tools in ggplot2 are very general and can be used for plotting a very wide variety of data visualizations as well as maps, but the data structures used are not aligned with how map data are usually stored. The tools and data structure for sp are aligned with traditional GIS approaches and can be used for a wide variety of map-data manipulations and visualizations, but they are not inherently extensible.

This work aims to bridge these seemingly disparate domains, and bring the general, data frame table-based idioms of ggplot2 into the specialized domain of spatial data. I don't have a particular problem with either 'sp' or 'ggplot2' in terms of traditional GIS, but I do have a problem with traditional GIS. It is surprising to me that the limitations, baked-in in the 1990s for understandable reasons(need link to arc-node topology description, complexity of constrained triangulations), are now generally accepted with very little discussion. There are murmurings out there (cf. list elsewhere but include qgis geolife tracks, postgis topology, eonfusion, cgal, whuber, and other simplicial complex examples, )

First we need to explain the key differences between ggplot2 and sp.

Two worlds and two tables

The 'sp' package starts with an abstract Spatial class and all other objects extend this base. A Spatial object is a complex S4 classed structure composed of lists of coordinates, values, and metadata. It has

The geometry is in the complex list, and this is defined just-so by S4 classes with rigourous constraints to conform to spatial standards. This includes determined details such as the "winding" order for rings defining polygon islands or holes, and the details of which hole belongs to which island.

So 'sp' objects include three kinds of thing: a data frame, a complex geometry object, and everything else.

A ggplot2 'object' is a data frame, and, ok: it's not really an object. It has columns:

One thing is still missing, since the ggplot2 case is really two data frames. The description above neglects the same data frame of attributes mentioned first for 'sp'. This table is rarely mentioned explicitly, since in all ggplot2 examples it only gets used in a merge step, to copy its attribute metadata on id to all the coordinates.

The other parts missing from ggplot2 is the details about which hole belongs to which island, the projection metadata, and the bounding box. These things are either computed on the fly (the bounding box) or passed around in an ad hoc way. This is sometimes fine for map data since many systems use "only" longitude-latitude, or an assumed local map projection.

So, 'ggplot2' is three things as well: a data frame, the geometry in another data frame, and everything else.

I said that sp objects are "just-so" with "determined details", but ggplot2 needs exactly the same rigour, it's just less strict with how you go about it, and is less focussed on stopping you shooting yourself in the foot, or making fish soup.

Specific example

To illustrate the difference between sp and ggplot2, read in a polygon layer from the maps package and convert, and plot.

library(maps)
## there's a long side-story here, that we ignore for now
amap <- map("world", plot = FALSE, resolution = 0, fill = TRUE)
map(amap, fill = TRUE)
str(amap)

This is a simple list structure, with 'x' and 'y' values separate by 'NA', and a 'names' vector.

There is a name element for every "NA-separated" section.

sum(is.na(amap$x))

To plot this with gpplot2 we need a single table.

fortify.map <- function(x) {
  bad <- is.na(x$x)
  seq_ids <- cumsum(bad) + 1L
  data.frame(x = x$x, y = x$y, group = seq_ids, id = seq_ids)[!bad, ]
}
library(ggplot2)
amaptab <- fortify(amap) 

ggplot(amaptab) + aes(x = x, y = y, group = group, fill = group) + geom_polygon()

Problem 1

If you look carefully there are two big problems, there's an awkward polygon catastrophe at the "southern boundary" of Antarctica - it becomes clearer if we zoom in.

## for now, it's easier to subset by re-reading
ant <- map("world", regions = "Antarctica", plot = FALSE, fill = TRUE)
ggplot(fortify(ant)) + aes(x = x, y = y, group = group, fill = group) + geom_polygon() + geom_path() 

What is happening in the western corner at 85S? The problem is that in this coordinate system, there's no extension of the continent to the "south pole", and so the polygon wraps around awkwardly from the westernmost coordinate to the easternmost, and since the westernmost is not also the southernmost coordinate it is leaving a dangle incorrectly joining the west to the east. The issue is more apparent by including the boundary, with 'geom_path', but also consider that this is really a problem created by our choice of map projection. The easternmost and westernmost coordinates in this map are exactly the same place, so it is no surprise that things are a bit funny.

We can "fix" this by putting in dummy coordinates at [-180, -90] and [180, -90] - but this simply patches a symptom rather than solving the problem correctly. We would then get the funny boundary dangle when plotted in a reasonable projection.

Here's a plot without the extra coordinates at the poles.

library(ggalt)
prj <- "+proj=laea +lat_0=-90 +ellps=WGS84 +lon_0=140"
ggplot(fortify(ant)) + aes(x = x, y = y, group = group, fill = group) + geom_polygon() + geom_path() + coord_proj(prj)

Here's one where the extra coordinates are present.

library(maptools)
data(wrld_simpl)
ant2 <- subset(wrld_simpl, NAME == "Antarctica")
rownames(ant2) <- NULL
## note it's now called "long"/"lat" not "x"/"y"
ggplot(fortify(ant2)) + aes(x = long, y = lat, group = group, fill = group) + geom_polygon() + geom_path() + coord_proj(prj) + guides(fill = FALSE)

The right answer is to accept that our map extending from 180W to 180E is really bizarre, and we should make a better choice. The 'maps' package does not include these dummy coordinates. Many other data sets just put them in. There's actually a long history about the birth and long dusk of the 'shapefile' specification and what was intended that is relevant here.

Problem 2

The second problem is that we are not drawing holes correctly. To see this zoom in on southern Africa, where the Kingdom of Lesotho punches out a hole in South Africa.

## for now, it's easier to subset by re-reading
afr <- map("world", regions = c("South Africa", "Lesotho"), plot = FALSE, fill = TRUE)
ggplot(fortify(afr)) + aes(x = x, y = y, group = group, fill = group) + geom_polygon() + geom_path()

That looks good, but see what happens when we plot just South Africa.

## for now, it's easier to subset by re-reading
library(ggpolypath)
safr <- map("world", regions = "South Africa", plot = FALSE, fill = TRUE)
ggplot(fortify(safr)) + aes(x = x, y = y, group =group, col = factor(group)) + geom_polypath() + geom_path() 

The data for South Africa is literally stored with the hole it needs to be plotted on its own, as well as the same "island" needed for Lesotho. Is this good? Is it what we want? (Think about it, I challenge you to not answer with "it depends").

We can get the hole with polypath, but this reminds us that we are double-storing a lot of coordinates - this is the same for any neighbouring polygons, all those shared boundaries are not "shared", they are stored explicitly.

Finally, show that sp plots the hole just fine without any effort.

library(dplyr)
sp_afr <- fortify(afr) %>% 
  transmute(x_ = x, y_ = y, object_ = unlist(lapply(strsplit(afr$names[id], ":"), "[", 1L)), branch_ = id, 
         island_ = !grepl("hole", afr$names[id]), order_ = row_number()) %>% 
  spbabel::sp()

plot(sp_afr, col = rainbow(nrow(sp_afr)))
plot(sp_afr[2, ], col = "grey")

spbabel

What else is missing?

There are three functions in spbabel to transfer between 'sp' and 'ggplot2'.

  1. sptable - decomposes the geometry to a single table (like fortify but with different column names)
  2. sp - a driver function to produce Spatial objects, in this case a 'sp.data.frame' method
  3. sptable<- - a replacement version of 'sptable' to do the changes "in place"

In simplest form, the journey goes like this.

library(spbabel)
wrld_geom <- sptable(wrld_simpl)
wrld_data <- as.data.frame(wrld_simpl) ## make clear that we need two tables

(sp_rtrip <- sp(wrld_geom, attr_tab = wrld_data))

The table 'wrld_geom' carried its map projection string with it as an 'attribute', but this is not reliable without creating methods for a special class. We can make sure it's carried using a manual step.

wrld_geom <- sptable(wrld_simpl)
wrld_data <- as.data.frame(wrld_simpl) ## make clear that we need two tables
wrld_prj <- proj4string(wrld_simpl)
(sp_rtrip <- sp(wrld_geom, attr_tab = wrld_data, crs = wrld_prj))

What is the point of doing this? We can modify the data using standard table tools.

wrld_geom <- sptable(wrld_simpl[9, ])
wrld_prj <- proj4string(wrld_simpl)
sp(wrld_geom, crs = wrld_prj)
wrld_geom$object_ <- wrld_geom$branch_
sp(wrld_geom, crs = wrld_prj)

See how we decomposed all the parts into their own object, and once we return to Spatial that is reflected in the number of objects.

The replacement method allows this to be done without breaking anything apart, so this is the more "user-level" function, where 'sp' and 'sptable' are for developers. With this version we can very easily pipe our geometry modifications in convenient syntax.

wrld <- wrld_simpl[9:10, ] ## pick two objects this time
sptable(wrld) <- sptable(wrld) %>% mutate(object_ = branch_)
wrld

There is a gotcha here in that if we had any "holes" they would now be "non-hole" objects, but with that caveat in mind it's a powerful way to work with these data.

Note that the attribute data were dropped as it is now ambiguous as to what values corresponde to which objects, and that information is external to the replacement methods. (The 'sp::disaggregate' function is really a better way to do that particular task.)

This is not done with a single object with a special class, but it could be. See document on an S3 class for fortified, single-nested, and double-nested forms.

Tables are systematic by default

Perhaps the best thing about having everything stored tables is that we can push it out to a database, pass that to someone else and everything is transferred correctly.

Let's send a fortified copy of the maps "county" data set to someone in a 'SQLite' database.

library(dplyr)
library(tibble)
library(maps)
library(ggplot2)
db <- src_sqlite("data-raw/maps_county.sqlite3", create = TRUE)
mp <- map("county", plot = FALSE, fill = TRUE)
## todo make sure the object ids are right 
## and that these are sp()-able - add interpreters for the fortify form (?)
geom <- copy_to(db, fortify(mp), name = "geom", temporary = FALSE)
## I can be a bit old-fashioned
statename <- substr(mp$name, 1, unlist(gregexpr(",", mp$name)) - 1L)
countyname <- substr(mp$name, unlist(gregexpr(",", mp$name)) + 1, nchar(mp$name))
dat <- copy_to(db, tibble(county = countyname, 
                          state = statename, group = seq_along(statename)), name = "metadata", temporary = FALSE)
geom
dat
rm(geom, dat, db)

Our friend in the other vignette just called an made some interesting suggestions, one is to store only the unique long lat pairs, and the other to properly store counties as parts of a state.

Being skilled in database normal form we apply these principles to get four separate tables.

object <- distinct(metadata, state)
object <- object %>% mutate(object_ = spbabel:::id_n(nrow(object)))

branch <- metadata %>% select(-state) %>% mutate(object_ = object$object_[group], branch_ = spbabel:::id_n(nrow(metadata)))
branch$group <- NULL



branchXvertex <- geom %>% inner_join(branch[, c("county", "branch_")], c("subregion" = "county")) %>% select(branch_, order)
branchXvertex <- branchXvertex %>% rename(order_ = order)

vertex <- geom %>% mutate(vertex_ = as.integer(factor(paste(geom$long, geom$lat, sep = "_"))))
vertex$vertex_ <- spbabel:::id_n(max(vertex$vertex_))[vertex$vertex_]
branchXvertex$vertex_ <- vertex$vertex_
vertex <- vertex %>%  transmute(x_ = long, y_ = lat, vertex_ = vertex_) %>% distinct(vertex_)

branchXvertex$vertex_ 

Another vignette that receives this database and explores nesting, demand-read is in progress "Geometry tables read on-demand".

Tidy tables

We have tidied up in one sense, we've removed a complicated object with specialist requirements and created two tables that can be plotted and manipulated with generic tools.

There is redundancy in the geometry table. 'Tidy' means '3NF' or 'third-normal-form', and 3NF means that we remove the repeated values that could be stored just once.

Notice what we actually have in the "fortified" version of an sp object.

allnumbers <- function(x) {x[] <- lapply(x, function(a) if(is.character(a)) as.integer(factor(a)) else as.numeric(a)); x}

oz <- subset(wrld_simpl, NAME == "Australia")
plot(allnumbers(fortify(oz)), pch = 19, cex = 0.2)

Clearly there's really not much information in the fortified table after you have dealt with the x and y coordinates, there's a lot of wasted storage here. It gets worse if we copy on attributes from the "id" attribute table, we simply copy out more redundant data.

'Spatial' classes avoid this wastage by not exploring the encoding explicitly on every coordinate, but in one of two ways, either

The implicit examples are 'order', the right order for the coordinates is just the order they come in, the 'piece' value is either 1, or 2+ depending on a list of length 1 or more, and id and group are the same. Group and piece are really aliases, just one is absolute (to the entire set) the other relative (to position within a geometry) .

Tags such whether it is a hole or not is an "attribute" tagged onto the object that holds the matrix of coordinates. We don't see the "comments" attribute here, which in sp (if present) tells us which piece 'belongs' to which other piece (i.e. when it's a hole, what island is it a hole of?).

Again, this is discussed elsewhere but the crux is the need for comments is a legacy of the "shapefile" era, no longer necessary in this the "simple features" era. More on that later.

(There are subtle rules and conventions about ordering and holes, which is discussed elsewhere).

How do we remove this redundancy?

Normal form by stages

Storing data in tables without unnessary redundancy is called "normal form". This goes a step further from the basics of tidy data being rows of observations with all columns as variables.

To convert spatial vector data to normal form we need the to first identify the entities that exist. These are

If we create a "Branches" table by normalizing from the geometry table, we will have three tables, one with all the x/y coordinates and their part ID and order within part, the parts table with attributes part ID, island status, and the object metadata table.

A benefit here is that the size of data may be reduced, though it depends on how much data we have versus the number of IDs we need to link the tables.

Four tables

Further normalization inserts a link table between branches and vertices, and puts the order and vertex id on this. This table should be thought of as the "instances of vertices", since this is where we explicitly store the use of an indvidual vertex by a branch. We don't need to store the same vertex again and again, just a link to it. This makes it easier to identify which coordinates are shared.

Other benefits include

Note that the vertex table by default will be de-duplicated on X and Y, so we can store another pair of these transformed e.g. Longitude and Latitude, but the de-duplication process would be different if performed on three coordinates, or four. By default for now we only discuss doing this on X and Y (which may be Longitude and Latitude), because this is what gives us topology for the most common 2D case, and is absolutely required by the triangulation process in RTriangle.

Let's send a normal form copy of the maps "county" data set in a 'SQLite' database.

library(dplyr)
library(tibble)
library(maps)
library(ggplot2)
db <- src_sqlite("data-raw/maps_county_nf.sqlite3", create = TRUE)
mp <- map("county", plot = FALSE, fill = TRUE)


Try the spbabel package in your browser

Any scripts or data that you put into this service are public.

spbabel documentation built on March 31, 2023, 11:55 p.m.