Project Status: WIP – Initial development is in progress, but there has not yet been a stable, usable release suitable for the public. Build Status


minimal R version CRAN_Status_Badge packageversion


Last-changedate

knitr::opts_chunk$set(echo = TRUE, cache = TRUE)
library(tidyverse)

Introduction

The places package provides access to the daily data files from the geonames data dump. The aim of the package is to make the geonames dump consisting of over 11 million place names with geographic coordinates available for large scale mapping or text mining projects involving multiple countries or world regions.

places is intended to complement the rOpenSci geonames() package by Barry Rowlingson. The geonames package provides access to the geonames API and is recommended for smaller scale projects. The places package is better suited for larger scale projects involving text mining and mapping place names from literature or other projects involving federating different kinds of data with geoinformation at scale. The package also adds additional tables that make it easier to select subsets of data for geographic regions or subregions or economic status using United Nations and World Bank datasets.

The places package was developed with support from the Research Council of Norway (RCN project number 257631/E10) as part of the Biospolar Project.

The purpose of this walk through is to take you through how to access the data, what is available, and decisions on issues such as tidying the data.

Installing

places is not on CRAN but can be installed from Github with devtools:

devtools::install_github("poldham/places")

The geonames data dump

The geonames dump is a set of daily dump files available from the export home page http://download.geonames.org/export/dump/. This page provides .zip files for individual country data along with other data files (such as alternate names) in text files and links to other directories. The mix of files makes it messy to work with.

knitr::include_graphics("https://github.com/poldham/places/blob/master/inst/images/genames_part1.png")
knitr::include_graphics("https://github.com/poldham/places/blob/master/inst/images/geonames_part2.png")

The places_table() function downloads and parses the page into a tibble to make it easier to work with.

library(places)
places_table()

The original table contained the following columns:

To make the table easier to work with places_table() separates out the files into iso (for country files) and other for other files. File_name, file_type and url for file paths are added.

Looking Up Countries

The geonames data mainly works on two letter country codes (variously called iso and iso2c). Country names can be expressed in all kinds of different ways. If you don't know the two letter country code you can look it up with places_lookup().

places::places_lookup("Kenya")

We can also look up ambiguous names:

places::places_lookup(c("Peoples republic of china", "Viet Nam", "Vietnam", "Lao", "Laos"))

Behind the scenes the name matching is handled by the countrycode package by Vincent Arul Bundock and collaborators. Country names can be expressed in all kinds of different ways and the countrycode package does a good job of recognising them. At present the places package only implements country names in English.

We use the countrycode package for lookup because of its flexibility. However, Geonames produces its own countryinfo table that can be imported as follows.

countryinfo <- places::places_countryinfo()
countryinfo

The advantage of this table is that it includes information such as the capital city, the area in square kilometres, the population of the capital city, continent and so on. Some of this data is included in the regions table (below). However, for general lookup of country names places_lookup() will be more flexible and forgiving.

Individual Country data

You can download the latest raw data for an individual country using places_download() and import it as a data frame with tidy column names using places_import(). A download_date column is added by default to assist with keeping track of the file history.

Download and import and presently handled in two steps to avoid assigning to the global environment. The pipe %>% is built into places.

KE <- places_download(code = "KE") %>% 
  places_import()
KE

places_download uses places_lookup internally meaning that you can simply enter a country name using the country = argument.

anguilla <- places_download(country = "anguilla") %>% 
  places_import()
anguilla

places_download can handle different cases and will fail fast on common errors.

places_download(code = "Kenya")

or

places_download(country = "KE")

If you try and download more than one file at a time things will quickly go wrong.

places_download(code = c("AI", "GB"))

places_download() is not vectorised and only handles one country file at a time. For multiple countries or regions it is easier to work with the allcountries table.

All Countries

Geonames produces a daily file containing the data for all countries. The allcountries daily file contains over 11 million place names in a +330MB compressed file that is 1.4Gb when uncompressed.

For many purposes you may be happy with a highly compressed archive of the allcountries file. The archive was created on the 1st of January 2018 as a .rda file and is a 257MB compressed .rda file that can be called with:

places_archive()
load("allcountries.rda")

This will take a few minutes to download and then to load... so maybe take a break for a cup of tea.

If you would like to retrieve the latest data file, use:

allcountries <- places_download(country = "allcountries") %>% places_import()

You may want to have another cup of tea while waiting for this.

Place names

The geonames tables contains three name fields. The asciiname is the most useful for text mining and you may want to take a look at the alternatenames field for known variants. The built in Kenya dataset (KE) can be useful for getting to grips with the data.

library(tidyverse)
places::KE %>% 
  separate_rows(alternatenames, sep = ",")

Feature Codes

Geonames uses feature codes to describe the georeferenced data. Feature codes are divided between classes and codes. The classes are as follows:

An example of a corresponding code is:

Note that the original featurecode table concatenated the class and feature code as in the examples above, but in the actual country and allcountries files they are separated into feature_class and feature_code. This makes joining awkward. To solve this a new field called feature_full is created at import while dropping the feature_class and feature code fields to prevent duplicates when joining. That is simpler than it sounds.

To view the featurecodes use:

featurecodes
load("data/KE.rda")
load("data/featurecodes.rda")

To join the featurecodes table onto a dataset you could use:

KE <- dplyr::left_join(KE, places::featurecodes, by = "feature_full")

KE %>% dplyr::select(feature_full, feature_name, name, longitude, latitude)

The featurecodes table, includes the feature code table divided into four columns that were parsed from the original file.

You may wish to join this table to the main table or to investigate the codes to use as filters. The feature code for a mountain is "MT" but to get all mountainous features you may need to look up other codes (e.g. MTS and MTU)

KE %>% filter(feature_code == "MT" | feature_code == "MTS")

Regions, Sub-Regions, intermediate regions and continents

To aid with multicountry analysis the package includes two add on tables.

  1. The United Nations regions (M49) from the United Nations Statistical Division
  2. World Bank regional divisions (through the World Development Indicators (WDI) package).

You can call these table directly:

unregions

For the World Bank WDI indicators from the WDI package:

wdicountry

Additional regional information is available from the countrycode package and may be incorporated into places in future.

The two regional tables are combined with selected fields from the geonames countryinfo table.

The regions file can be called as follows:

regions

To join the regions table with a set of results try:

df <- dplyr::left_join(df, regions, by = "iso")


poldham/places documentation built on Aug. 19, 2019, 8:33 p.m.