title: 'Osmenrich: An R package to enrich geocoded data' tags: - R - geospatial data - goespatial - sf - openstreetmap - osmdata - osrm authors: - name: Leonardo Jaya Vida orcid: 0000-0002-0461-016X affiliation: "1, 2" - name: Jonathan de Bruin orcid: 0000-0002-4297-0502 affiliation: "1, 2" - name: Erik-Jan van Kesteren orcid: 0000-0003-1548-1663 affiliation: "1, 2" affiliations: - name: Utrecht University index: 1 - name: ODISSEI index: 2 date: 25 April 2021 bibliography: paper.bib
The osmenrich
package provides a user-friendly way to enrich geographic datasets in R
, for example observations on a map, with geographic features around those observations. Additionally, it can weigh these features by their (walking / driving / cycling) distance or duration from the observations. This package builds on existing infrastructure to interface with OpenStreetMap (osmdata
, Padgham et al., 2017), and works well with the existing ecosystem of packages for (geospatial) data analysis, namely sf
[@sf:2018] and the tidyverse
[@tidyverse:2019]. This package streamlines and standardizes the process going from a raw dataset with observations and locations to a tidy
[@tidy-data:2014], rich dataset with multiple relevant geographical features about those locations. \autoref{fig:workflow} shows graphically the basic workflow of osmenrich
.
The R package osmenrich
is available on GitHub.
Geographic data is valuable in research where the environment influences the process under investigation. For example, the osmenrich
package is useful in the analysis of data retrieved from citizen science projects such as plastic spotter or the great backyard bird count. At the same time, within the R ecosystem multiple software solutions exist for extracting data from geographic information systems [@osmdata:2017; @googleway:2020; @mapbox:2020]. However, to include these geographic data in further analysis (e.g. carrying out kriging in gstat
) [@gstat:2004] the data often need further processing and, crucially, aggregation. Within this problem space, osmenrich
contributes by:
Using our package, researchers can focus on investigating research questions, rather than spending time figuring out how to aggregate geographic data. The osmenrich
package is especially suited for questions surrounding interactions between a process and its close physical environment, such as gathering real-world feature objects within a determined distance from observations to improve a prediction process. Throughout this paper, objects with geocoded data are called "reference objects", while objects of interest are called "feature objects".
To enrich data, the osmenrich
package uses the main function enrich_osm()
. This function takes a dataset containing geocoded reference objects in sf
format, retrieves specified feature objects from a local or remote OpenStreetMap server (see Routing section), computes the enrichment using specified parameters and outputs an enriched tidy sf
dataset.
enrich_osm(
dataset = sf_dataset,
name = "waste",
key = "amenity", # Syntax borrowed from OpenStreetMap
value = "waste_basket", # Syntax borrowed from OpenStreetMap
r = 100
)
The code listing above shows an example of a basic enrichment of reference points with the number of waste baskets in the surrounding 100 meters. Specifically, the function uses the bounding box created by the reference objects from the input dataset and searches for the specified feature objects in OpenStreetMap with parameters key
and value
within the radius r
(in meters) around each of the reference objects. The key
and value
parameters are also used as tags in OpenStreetMap to describe physical features of map elements. The user is able to search for them using official OpenStreetMap documentation. Finally the enrich_osm
function creates a new column named after the parameter name
containing the enriched data. See Section Full usage example for an example usage of this function.
To convert the retrieved features to a single number per reference object, an aggregation step is performed by osmenrich
. In the enrich_osm
function, there are three parameters that control aggregation: kernel
, reduce_fun
, and measure
.
kernel
determines the weighing function applied on the distances (or durations) between the objects retrieved and the reference points. The osmenrich
package provides three different kernels to be used out-of-the-box (uniform
, gaussian
and parabola
) and defaults to kernel = uniform
. However, the package allows users to also specify custom-made kernels.
reduce_fun
, is used to reduce the weighted vectors of distances (or durations) into single numbers. This parameter defaults to reduce_fun = sum
, however it accepts any standard R
function, such as mean
or median
.Specifying these variables in the enrich_osm()
function allows the user to choose specific types of weights and aggregation to be applied on the feature objects retrieved from OpenStreetMap.
enrich_osm(
[...],
r = 100, # Radius for feature objects retrieval
kernel = "gaussian", # Weighing function
reduce_fun = "mean" # Aggregation function
)
To retrieve feature objects around the reference objects and the distances (or durations) between them, the osmenrich
package uses an OpenStreetMap server and, optionally, one or more instances of the Open Source Routing Machine (OSRM).
The package leverages publicly available servers to enable basic data enrichment without the need of setting up any local instance of these servers. However, for large data enrichment tasks and for tasks involving the computation of distances (or durations) between objects using specific profiles, the setup of one or more of these servers is required.
We created a GitHub repository hosting the instructions and the docker_compose.yml
files needed to set up these servers. To facilitate the routing of users to the right setup for their needs, we provide three use cases and their respective recommended setup.
Once the desired server is set up, the user can set the parameter measure
to specify which profile to use to compute distances (or durations) between the objects. Depending on the routing servers available, the osmenrich
package can retrieve metrics computed on three different types of profile (car
, bike
or foot
).
# If available, specify the address of local OSRM instance
# options(osrm.server = "http://localhost:<port>/")
# Specify the address of the public or local Overpass (OSM) instance
# osmdata::set_overpass_url("http://localhost:<port>/api/interpreter")
enrich_osm(
[...],
measure = "distance_by_foot" # Choose a specific metric that leverages the OSRM server
)
osmenrich
is available on GitHub and can be installed and loaded into the R session with the remotes
package.
library(osmenrich)
We have included an example dataset of common swift's nests provided openly by the city of Utrecht, the Netherlands. A researcher may want to investigate the effect of natural material availability on nesting behaviour. In this example, we use "trees" as a proxy for natural material availability.
head(common_swift)
# Simple feature collection with 6 features and 1 field
# Geometry type: POINT
# Dimension: XY
# Bounding box: xmin: 5.086195 ymin: 52.0879 xmax: 5.08662 ymax: 52.08866
# CRS: EPSG:4326
# # A tibble: 6 x 2
# nest_count geometry
# <dbl> <POINT [°]>
# 1 1 (5.086195 52.0884)
# 2 2 (5.086602 52.08866)
# 3 1 (5.086619 52.0884)
# 4 1 (5.086604 52.08837)
# 5 1 (5.08633 52.0879)
# 6 1 (5.08662 52.08826)
# Enrichment step
enriched_common_swift <-
common_swift %>%
enrich_osm(
name = "tree_1km",
key = "natural",
value = "tree",
kernel = "gaussian",
r = 1000
)
head(enriched_common_swift)
# Simple feature collection with 6 features and 2 fields
# Geometry type: POINT
# Dimension: XY
# Bounding box: xmin: 5.086195 ymin: 52.0879 xmax: 5.08662 ymax: 52.08866
# CRS: EPSG:4326
# # A tibble: 6 x 3
# nest_count geometry tree_1km
# <dbl> <POINT [°]> <dbl>
# 1 1 (5.086195 52.0884) 0.00799
# 2 2 (5.086602 52.08866) 0.00692
# 3 1 (5.086619 52.0884) 0.00689
# 4 1 (5.086604 52.08837) 0.00693
# 5 1 (5.08633 52.0879) 0.00753
# 6 1 (5.08662 52.08826) 0.00688
As shown in the output of enriched_common_swift
, following the enrichment step the dataset gains an additional column named tree_1km
. This column contains the sum of the numbers of trees within 1km around each nest location, weighed by the gaussian
kernel.
This work was supported by the Open Data Infrastructure for Social Science and Economic Innovations (ODISSEI) in the Netherlands (https://ror.org/03m8v6t10).
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.