knitr::opts_chunk$set(echo = TRUE)
The globaltoolbox
R package was developed to help with large-scale analysis projects that include numerous, often unstandarized location names. This package is particularly targeted for use with standardizing location names at various administrative levels from unstandardized sources like free-form surveys or large-scale data aggregation projects that involve multiple sources.
This tutorial is meant to demonstrate some of the applications and workflows that globaltoolbox
for which is well-suited.
Install the globaltoolbox package from github:
devtools::install_github("HopkinsIDD/globaltoolboxlite") # library(tidyverse) library(globaltoolboxlite)
There are three main types of use of the globaltoolbox package. These include:
For General
Before loading the database, please create it with the create_database
function.
globaltoolbox provides a load_gadm
function, which takes the ISO 3166-1 alpha 3 code of a country. This will download all locations from GADM and add them to the database.
globaltoolbox::load_gadm(countries = c('MWI','TZA'))
globaltoolbox provides a telescoping_standardize
function, which takes your best guess at the name of a location, and returns the standardized name (if it can find it).
# Baltimore, with know country globaltoolbox::telescoping_standardize(c("unitedstates::baltimore")) # Just Baltimore, no other information globaltoolbox::telescoping_standardize(c("baltimore")) # Misspelled globaltoolbox::telescoping_standardize(c("baltmore")) globaltoolbox::telescoping_standardize(c("afganistan::Kuranwamunjan")) globaltoolbox::telescoping_standardize(c("AFG::Kuranwamunjan")) globaltoolbox::telescoping_standardize(c("AFG::Kurawamujan")) globaltoolbox::telescoping_standardize(c("Kuranwamunjan")) globaltoolbox::standardize_name(c("Kuranwamunjan")) globaltoolbox::telescoping_standardize("togo::kara") globaltoolbox::telescoping_standardize("togo") globaltoolbox::standardize_name(c("togo")) globaltoolbox::standardize_name(c("togo", "baltimore")) globaltoolbox::telescoping_standardize(c("togo", "baltimore")) location_name <- c("usa::baltimore", "togo") # try multiple globaltoolbox::telescoping_standardize(c("togo", "baltimore", "kansas", "paris", "Cairo")) # try multiple, with some scope info globaltoolbox::telescoping_standardize(c("togo", "unitedstates::baltimore", "unitedstates::kansas", "france::paris", "Cairo")) # try multiple, with some scope info, and misspellings globaltoolbox::telescoping_standardize(c("togo", "baltmore", "unitedstates::kanss", "france::paris", "egytp::Kairo")) # \ globaltoolbox::standardize_name(c("togo", "baltmore", "unitedstates::kanss", "france::paris", "egytp::Kairo")) location_name <- c("togo", "baltmore", "unitedstates::kanss", "france::paris", "egytp::Kairo")
A demonstration dataset called testingdata_denguetycho
is included in the package for demonstration and testing.
data(testingdata_bgdairportsurvey) head(testingdata_bgdairportsurvey) airports_ <- data.frame(airports = testingdata_bgdairportsurvey$airports) %>% tidyr::separate(airports, into=paste0("airport_", 1:10), fill="right", sep=",") %>% dplyr::select(which(colMeans(is.na(.))<1)) standardize_name(airports_[1:50,2]) standardize_name("amman", db="GAUL") standardize_name("BAGDAD") telescoping_standardize("AMMAN", db="GAUL") telescoping_standardize("AMMAN", db="GADM") telescoping_standardize("BAGDAD") airports2_matches <- telescoping_standardize(airports_[1:50,2]) View(data.frame(airports2=names(airports2_matches), airports2_matches)) telescoping_standardize(airports_[,2])
Standardization of location names/identities is done through integration of multiple global spatial databases (GADM, GAUL), in addition to other datasets of cities and aliases. Globaltoolbox maintains a database representing locations including: locations (numeric id, standard name, readable name, metadata) Locations represent a named place that is somewhat consistent through time They are the most important part of the database, and everything else references one or more of them They are colon separated to show hierarchical relationships e.g., ::tanzania aliases (location id, alias) Aliases are alternate names for locations They are used primarily to help fix non-standard names e.g., tza is an alias for ::tanzania Include alternative standardized codes (ISO, UN, etc.) geometries (location id, time left, time right, geometry) Geometries are polygon or point representations of the location Since the polygon may change in time due to border changes, we store each polygon with a time range representing when it is valid e.g., https://gadm.org/maps/TZA.html is a geometry for ::tanzania hierarchy (ancestor location id, last location id, depth) Hierarchies represent containment relationships We use hierarchies internally to determine scope (all locations contained in a particular location), and to help correct names e.g., ::tanzania::pembanorth is contained in ::tanzania
Data Entry Functions database_add_location_alias database_add_descendent load_hierarchical_sf load_gadm Data Lookup Functions get_location_metadata get_all_aliases get_location_geometry (more coming soon) Name standardization functions standardize_name telescoping_standardize
Name standardization Encoding and transliteration (
Fuzzy matching (multiple used in conjunction) Match probability -- to be implemented, probability given location level and available locations within database Typos : telescoping_standardize("tanzani") -> “::tanzania” Misspelling : telescoping_standardize("tanzinia") -> “::tanzania” Nonstandard : telescoping_standardize("tza::kusiniunguja") -> "::tanzania::zabzibarsouthandcentral" depth first search
telescoping_standardize("tanzania::chake") -> “::tanzania::pembasouth::chake"
Functions for adding other publically available metadata, for example: WHO incidence data (measles) Others?
Name matching not perfect (and probably never will be) “::spane” -> “sapone” (“::burkinafaso::bazega::sapone”) without depth argument Should identify “Spain” ideally. Current R fuzzy matching agorithms are based on string length, number of character matches, etc. Phonetic matching relies on consonants. Similarly “Usa” -> “japan::oita::usa” Database not perfect “::tanzania” and “::unitedrepublicoftanzania” are different locations “Usa” is not an alias for “unitedstatesofamerica” Database large (34 GB for database with shapefiles up to 2nd subcountry level; only 34Mb without shapefiles) Location merging (to solve the two tanzanias problem). The function exists, but it’s not passing tests yet. Performance issues (on the full database, matching takes 5 seconds or so) If possible, use user input to collect additional aliases/alternate names/spellings
Is there specific functionality that others would find valuable? What use do people have for this sort of package? Does anyone have good testing data for this? This can be: Testing data: data with many, non-standardized location names Training data: data with non-standardized location names AND the fixed/standardized version Bug reports: a time where you tried to look up something, but got NA or something else instead (include what you looked up, what you got, and what you expected) Does anyone have data they think would be cool for us to incorporate into the database (lists of alternative names for places, lists of new places, lists of shapefiles for places etc)? Integration with other data? DHS, worldpop, others?
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.