arealDB - Harmonize and integrate heterogeneous areal data

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(arealDB)
library(tidyverse)

DBDir <- tempdir()
gazDir <- "directory/to/gazetteer.rds"
ontoDir <- "directory/to/ontology.rds"

Rationale

Areal data are any data that summarise a certain aspect related to a particular region. Those data are relevant in many applications in the environmental and socioeconomic sciences, such as biodiversity checklists, agricultural statistics, or socioeconomic surveys. For applications that surpass the spatial, temporal or thematic scope of any single data source, data must be integrated from several heterogeneous sources. Inconsistent concepts, definitions, or messy data tables make this a tedious and error-prone process.

arealDB has been developed for the purpose of providing an easy-to-use way to integrate areal data together with associated geometries into a standardised database. In the current, revised version, it makes use of the ontologics R-package to harmonise the names of territories (the geometries) and the target variables (the tables).

A previous version of this tutorial can be found in the replication script appendix/source of the pre-print. This version is not an exact replication script, as it is built mostly with dummy-code where paths on the harddisc would be needed. However, it should still serve to exemplify how to use arealDB in the current version.

The basics

To ensure that databases that are built with arealDB are properly standardized, the R-package makes use of a particular data-structure in which concepts and and their hierarchical relationships are recorded. The idea for this structure is based on a broader effort that shall make the internet more interoperable and machine-readable, the \"Semantic Web\". This can be useful for both, the names of territories (which are typically different across languages and across statistical agencies) and also the values of the variables that shall be recorded. For both registers, arealDB makes use of the R-package ontologics, that allows recording of various concepts in such a data-structure.

Ontology

[...] More simply, an ontology is a way of showing the properties of a subject area and how they are related, by defining a set of concepts and categories that represent the subject. [wikipedia]

Any target variable can (and should be) recorded in a standardized fashion. It may be extremely tricky to identify a domain-specific standard because this usually involves social processes (people need to agree upon it), however this is not part of this tutorial. For example, concepts such as biological species are part of a complex taxonomy of terms that identify kingdoms, families, species, varieties and many more. Species are nested within families and other hierarchical and synonymous relationships exist in such taxonomies. All these issues are addressed in an ontology.

Gazetteer

A gazetteer is a geographical index or directory used in conjunction with a map or atlas... [wikipedia]

Territory names are recorded in a gazetteer, which can be regarded as an ontology of territorial names. Territorial names are typically nested within regions at several levels, such as counties that are nested in a nation. To showcase how a gazetteer (and also an ontology by extension) should be configured, we are building one for the United Nations geoscheme here.

# The regions and subregions we want to include into the gazetteer
un_region <- c("AFRICA", "AMERICAS", "ANTARCTICA", "ASIA", "EUROPE", "OCEANIA")

un_subregion <- tibble(concept = c(
  "Eastern Africa", "Middle Africa", "Northern Africa", "Southern Africa", "Western Africa",
  "Caribbean", "Central America", "Northern America", "Southern America",
  "Antarctica",
  "Central Asia", "Eastern Asia", "Southeastern Asia", "Southern Asia", "Western Asia",
  "Eastern Europe", "Northern Europe", "Southern Europe", "Western Europe",
  "Australia and New Zealand", "Melanesia", "Micronesia", "Polynesia"),
  broader = c(rep(un_region[1], 5), rep(un_region[2], 4), rep(un_region[3], 1),
             rep(un_region[4], 5), rep(un_region[5], 4), rep(un_region[6], 4)))

# To start building the gazetteer, we first need to record some meta-data ...
gazetteer <- 
  start_ontology(name = "gazetteer", 
                 path = paste0(DBDir, "/tables/"),
                 version = "1.0.0",
                 code = ".xxx",
                 description = "a UN geomscheme gazetteer",
                 homepage = "https://en.wikipedia.org/wiki/United_Nations_geoscheme",
                 license = "CC-BY-4.0",
                 notes = "This gazetteer nests each nation into the United Nations geoscheme.")

# ... then we can define some new source of information from which external concepts would be taken.
gazetteer <- 
  new_source(name = "gadm",
             date = Sys.Date(),
             description = "GADM wants to map the administrative areas of all countries, at all levels 
                            of sub-division. We provide data at high spatial resolutions that includes 
                            an extensive set of attributes.",
             homepage = "https://gadm.org/index.html",
             license = "CC-BY",
             ontology = gazetteer)

# All items stored in an ontology must have some class that is ontology-specific and thus needs to be
# defined manually. Here we define the three classes 'un_region', 'un_subregion' and 'nation'.
gazetteer <- 
  new_class(new = "un_region", 
            target = NA,
            description = "region according to the UN geoscheme", 
            ontology = gazetteer) %>%
  new_class(new = "un_subregion", 
            target = "un_region",                               # un_subregion is nested into un_region
            description = "sub-region according to the UN geoscheme", 
            ontology = .) %>%
  new_class(new = "al1", 
            target = "un_subregion",
            description = "the first administrative level of the GADM gazetteer", 
            ontology = .) %>%
  new_class(new = "al2", 
            target = "al1",
            description = "the second administrative level of the GADM gazetteer", 
            ontology = .) %>%
  new_class(new = "al3", 
            target = "al2",
            description = "the third administrative level of the GADM gazetteer", 
            ontology = .) %>%
  new_class(new = "nation", 
            target = "un_region",
            description = "groups of geometries that together form a nation (for example 'France', 
                           which is a combination of many territorial concepts across the whole 
                           globe), but which is different than 'al1' of GADM", 
            ontology = .)                                       # for convenience, one can pipe the 
                                                                # output of a previous function into 
                                                                # the next. 

# Finally, the new concepts can be added, they need to associated to one of the classes in the ontology
gazetteer <- 
  new_concept(new = un_region,
              class = "un_region",
              ontology =  gazetteer)

# when adding concepts that are nested in other concepts, we need to provide the full concept in 
# 'broader = '. We can get this with the 'get_concept()' function.
gazetteer <- 
  new_concept(new = un_subregion$concept,
              broader = get_concept(table = un_subregion %>% select(label = broader), 
                                    ontology = gazetteer),
              class = "un_subregion",
              ontology =  gazetteer)

write_rds(x = gazetteer, file = gazDir)

Thematic data

Essentially all data can be the thematic data of a database. For an areal database these are, however, typically data that are some attribute of the region or territory they are associated to or metrics that describe a population, inventory or other set of entities within an area. Thematic data come in all forms and shapes and standardization is often rather poor. This is largely due to the fact that there are no clear rules on how to set up tables, and most tools that are typically used for recording and storing data provide them in vastly different data-structures. To reshape data into a common format, arealDB makes use of schema-descriptions as they are built with the tabshiftr R-package (more on this below).

Spatial data

Any areal database connects the thematic data to some set of spatial data (they are called geometries here). Spatial data are typically already quite well standardized so that we only need to record how exactly the data are structured. Spatial data are primarily the geometries themselves, but most of the time they also come with an ID that identifies each geometry and a set of attributes associated to each geometry. For spatial data that are supposed to be useful for more than being visualized there should be a name of the territorial units or meta-data that allow identification of those names.

Working with arealDB

Start the database

To build a new database with arealDB, one has to start with the start_arealDB() function. Also when resuming to work with that database, it is required to run this function again, as it pastes the current root directory into the options of R, based on which paths for files are derived. The argument gazetteer = is mandatory because without the information recorded in a gazetteer it would not be possible to determine the spatial/topological relation between geometries. The argument top = takes that class in the ontology that shall be regarded as the top-most class at which territorial units are distinguished (this might be different to the top-most level in the ontology). It is, however, optional to provide an ontology. In case the thematic data do already contain well harmonized value terms, this can be omitted.

start_arealDB(root = DBDir,
              gazetteer = gazDir, top = "al1",
              ontology = list("targetVar" = ontoDir))

As there is sometimes more than one target variable recorded in an areal database, it is possible to provide more than one ontology. When providing the link between target variable and an ontology, it is important that the name of the target variable is a column in the (harmonized) thematic data. In the above example, targetVar would have to be the column in any of the thematic data tables that are stored in the database.

Register the data

In this example we are working with three dataseries, GADM (the Database of global administrative areas), IBGE (the Brazilian Statistical and Geographic agency) and USDA (the United States Department of Agriculture). The first step in any areal database is in registering the dataseries that are relevant. Thematic and spatial data are primarily described by the meta-data that are represented by the dataseries. It shall be noted here that any function in arealDB that would modify the database contains the argument update =, to indicate that the modifications that would follow from calling this function are intended to be stored in the database. If this is left at FALSE (the default), the function would merely return the entries that would be stored in the database, which can be used for debugging purposes

regDataseries(name = "gadm",
              description = "Database of Global Administrative Areas",
              homepage = "https://gadm.org/index.html",
              licence_link = "https://gadm.org/license.html",
              update = TRUE)

regDataseries(name = "ibge",
              description = "Instituto Brasileiro de Geografia e Estatistica",
              homepage = "https://sidra.ibge.gov.br",
              update = TRUE)

regDataseries(name = "usda",
              description = "US Dept. of Agriculture",
              homepage = "https://www.nass.usda.gov/Quick_Stats/Lite",
              update = TRUE)

# to debug whether all meta-data have been provided properly and/or the function works as expected
debug <- regDataseries(..., update = FALSE)
View(debug)

Next, the geometries are registered. Here, we need to provide the dataseries for geometries in gSeries = and various other meta-data. If the geometries are valid for only a particular territorial unit, this can be specified with a dynamic name = value combination. name must, in this case, be a class in the gazetteer (such as nation in our gazetteer). In case the geometries contain several territorial units that shall be handled separately, the territory names are taken from what has been defined as top = in start_arealDB(). Relations between the columns/attributes in the geometry and the gezatteer are defined in the argument label =, where a list that links them is provided.

# Often it makes sense to start with a global set of geometries, especially if databases encompas
# several nations
regGeometry(gSeries = "gadm",
            label = list(al1 = "NAME_0"),
            archive = "gadm36_levels_gpkg.zip|gadm36_levels.gpkg",
            archiveLink = "https://biogeo.ucdavis.edu/data/gadm3.6/gadm36_levels_gpkg.zip",
            updateFrequency = "unknown",
            update = TRUE)

regGeometry(gSeries = "gadm",
            label = list(al1 = "NAME_0", al2 = "NAME_1"),
            archive = "gadm36_levels_gpkg.zip|gadm36_levels.gpkg",
            archiveLink = "https://biogeo.ucdavis.edu/data/gadm3.6/gadm36_levels_gpkg.zip",
            updateFrequency = "unknown",
            update = TRUE)

regGeometry(gSeries = "gadm",
            label = list(al1 = "NAME_0", al2 = "NAME_1", al3 = "NAME_2"),
            archive = "gadm36_levels_gpkg.zip|gadm36_levels.gpkg",
            archiveLink = "https://biogeo.ucdavis.edu/data/gadm3.6/gadm36_levels_gpkg.zip",
            updateFrequency = "unknown",
            update = TRUE)

# For Brazil, there are dedicated geometries available, not so for the US
regGeometry(nation = "Brazil",
            gSeries = "ibge",
            label = list(al2 = "NM_ESTADO"),
            archive = "BR.zip|BRUFE250GC_SIR.shp",
            archiveLink = "https://mapas.ibge.gov.br/bases-e-referenciais/bases-cartograficas/malhas-digitais",
            updateFrequency = "notPlanned",
            update = TRUE)

regGeometry(nation = "Brazil",
            gSeries = "ibge",
            label = list(al3 = "NM_MUNICIP"),
            archive = "BR.zip|BRMUE250GC_SIR.shp",
            archiveLink = "https://mapas.ibge.gov.br/bases-e-referenciais/bases-cartograficas/malhas-digitais",
            updateFrequency = "notPlanned",
            update = TRUE)

Finally, the thematic data are registered. Here, we first need to define the schema description. To learn more about this, consult the documentation of the tabshiftr R-package. In addition to the gSeries =, thematic data also have a dSeries =, which documents the thematic dataseries that can deviate from the geometry dataseries. In case several tables are available for one dataseries, the argument subset = can be used to specify dynamically what is reported in the respective table. As no two tables can have the same name at stage2 (the result of registering the data), this argument also serves the purpose to give the tables different names.

schema_ibge <- 
  setCluster(id = "year", left = 1, top = 3, height = 400536) %>%
  setIDVar(name = "al2", columns = 1, split = "(?<=\\().*(?=\\))") %>%
  setIDVar(name = "al3", columns = 1, split = "^.*?(?=\\s\\()") %>%
  setIDVar(name = "year", columns = 3) %>%
  setIDVar(name = "commodity", columns = 2) %>%
  setObsVar(name = "planted", unit = "ha", columns = 4) %>%
  setObsVar(name = "harvested", unit = "ha", columns = 5) %>%
  setObsVar(name = "production", unit = "t", columns = 6) %>%
  setObsVar(name = "yield", unit = "t/ha", factor = 0.001, columns = 7)

regTable(nation = "Brazil",
         subset = "crops",
         label = "al3",
         dSeries = "ibge",
         gSeries = "ibge",
         schema = schema_ibge,
         begin = 2000,
         end = 2018,
         archive = "ibge.7z|tabela5457_1990.csv",
         archiveLink = "https://sidra.ibge.gov.br/tabela/5457",
         nextUpdate = "unknown",
         updateFrequency = "annually",
         metadataLink = "https://metadados.ibge.gov.br/consulta/estatisticos/operacoes-estatisticas/PA",
         metadataPath = "unavailable",
         update = TRUE)

schema_usda <- 
  setFormat(na_values = "(D)", thousand = ",") %>%
  setIDVar(name = "al2", columns = 18) %>%
  setIDVar(name = "al3", columns = 23) %>%
  setIDVar(name = "year", columns = 32) %>%
  setIDVar(name = "commodities", columns = 5) %>%
  setObsVar(name = "planted", unit = "ha", factor = 0.4046856422, columns = 39,
            key = 9, value = "AREA PLANTED") %>%
  setObsVar(name = "harvested", unit = "ha", factor = 0.4046856422, columns = 39,
            key = 9, value = "AREA HARVESTED") %>%
  setObsVar(name = "production", unit = "t", factor = 0.907185, columns = 39,
            key = 9, value = "PRODUCTION") %>%
  setObsVar(name = "yield", unit = "kg/ha", factor = 2241.7, columns = 39,
            key = 9, value = "YIELD")

regTable(nation = "UnitedStatesOfAmerica",
         level = 3,
         subset = "surveyFieldCrops",
         dSeries = "usda",
         gSeries = "gadm",                                   # here, dSeries and gSeries deviate
         schema = schema_usda,
         begin = 1918,
         end = 2018,
         archive = "qs.crops_20220129.txt.gz|usda_tons_crops_l3.csv",
         archiveLink = "https://quickstats.nass.usda.gov/",
         updateFrequency = "annually",
         nextUpdate = "unknown",
         metadataLink = "https://www.nass.usda.gov/Quick_Stats/index.php",
         metadataPath = "unknown",
         update = TRUE)

Normalise the input

Finally, we need to align all data so that they follow the same structure. This is called "normalization", in arealDB, we normalize the registered data. This happens when calling the functions nromGeometry() and normTable(), for the respective input data. Geometries are normalized first, because the thematic data are associated to the standardized versions of them.

This process of normalization has been designed so that a user has to interact with the function as little as possible, but some interactions are still required. Especially when working with the names of territories, it is not always completely clear how one standard translates into another. The gazetteer is the place where this is documented, but it can only infer the information only to a certain degree. Essentially, when needs to be known for a consistent gazetteer is which concept translates into which other concept. However, with the occasional changes of administrative units, sometimes concepts change to a degree that they can't simply be translated into one-another. This is the case for example when on previous administrative unit is split up into several new units. None of the new units are identical to the previous units, but some of them may still have the same name. In this sense, "translation" isn't equivalent to the literal meaning most people know, but must be more understood as "what maps how to what else". When normalizing geometries with the argument priority = "ontology", the function will take names as literal interpretations, match those that are "the same" and ask the user to translate the ones that don't match manually. With priority = "spatial", the function instead determine spatial overlap to determine which concepts are nested into which other concepts to update the ontology accordingly. This requires less human interaction, but more computation time.

normGeometry(pattern = "brazil",
             outType = "gpkg", 
             update = TRUE)

normGeometry(pattern = "UnitedStatesOfAmerica",
             outType = "gpkg", 
             priority = "spatial",
             update = TRUE)

When finally normalizing the thematic tables, interaction is required when the ontoMatch = argument has been provided, otherwise no matching of terms of the target variables with the ontology is carried out. Also here, hierarchical relationships can be included by providing the target variable terms in the respective columns in the translation table (that opens automatically when required), or simple 1:1 translations or terms from other taxonomies or languages can be specified.

normTable(ontoMatch = "targetVar",
          outType = "rds",
          update = TRUE)


Try the arealDB package in your browser

Any scripts or data that you put into this service are public.

arealDB documentation built on July 9, 2023, 6:09 p.m.