knitr::opts_chunk$set( collapse = TRUE, comment = "#>", echo = TRUE, message=FALSE, warning=FALSE )
library(glitter) library(WikidataR) library(dplyr) library(tidyr) library(sf) library(leaflet)
This first vignette shows how to use glitter
to extract data from the Wikidata SPARQL endpoint. We imagine here a case study in which one is interested in the Wikidata items available regarding the Lyon metro network.
To find the identifiers of items and properties of interest for a particular case study, you can:
WikidataR
(functions WikidataR::find_item()
, WikidataR::find_property()
). Here, we will explore that second optionLet's try and find the Wikidata identifier for the Lyon metro network:
WikidataR::find_item("Metro Lyon")
So you'd be interested, for instance, in all the subway stations that are part of this network. Let's try and find the property identifier that corresponds to this notion:
WikidataR::find_property("part of")
So you're looking for all the stations that are part of ("wdt:P16") the Lyon metro network ("wd:Q1552").
The glitter
functions might now be used to start exploring data.
We're looking for items (the "unknown" in our query below, hence the use of a "?") which are part of the Lyon metro network:
stations = spq_init() %>% spq_add("?items wdt:P16 wd:Q1552") %>% spq_perform() head(stations)
To also get the labels for stations, we can use spq_label()
:
stations = spq_init() %>% spq_add("?items wdt:P16 wd:Q1552") %>% spq_label(items) %>% spq_perform() head(stations)
The query above, with spq_label(items)
, will return a table comprising both items
(with the Wikidata identifiers) and items_label
(with the human-readable label corresponding to these items).
If the Wikidata unique identifier is not particularly useful, one can use the argument .overwrite = TRUE
so that only labels will be returned, under the shorter name items
:
stations=spq_init() %>% spq_add("?items wdt:P16 wd:Q1552") %>% spq_label(items, .overwrite = TRUE) %>% spq_perform() head(stations)
As it turns out, for now we get r nrow(stations)
items, which actually correspond not only to stations but also to other types of items such as metro lines.
Let's have a look at the item "Place Guichard - Bourse du Travail" ("wd:Q599865") which we know correspond to a station.
We can do that e.g. through the Wikidata url associated to this item{target="_blank"}.
Hence, the property called "wdt:P31" ("is an instance of") should enable us to collect specifically stations ("wd:Q928830") instead of any part of the Lyon metro network.
stations = spq_init() %>% spq_add("?station wdt:P16 wd:Q1552") %>% spq_add("?station wdt:P31 wd:Q928830") %>% # added instruction spq_label(station, .overwrite = TRUE) %>% spq_perform() dim(stations) head(stations)
If we want to get the geographical coordinate of these stations (property "wdt:P625") we can proceed this way:
stations_coords = spq_init() %>% spq_add("?station wdt:P16 wd:Q1552") %>% spq_add("?station wdt:P31 wd:Q928830") %>% spq_add("?station wdt:P625 ?coords") %>% # added instruction spq_label(station, .overwrite = TRUE) %>% spq_perform() dim(stations_coords) head(stations_coords)
This tibble can be transformed into a Simple feature collection (sfc) object using package sf
:
stations_sf = st_as_sf(stations_coords, wkt = "coords") head(stations_sf)
The resulting object may then be used easily with (for instance) package leaflet
:
leaflet(stations_sf) %>% addTiles() %>% addCircles(popup = ~station)
Now, we would like not only to view the stations but also the connecting lines. One property is of particular interest in this prospect: P197, which indicates which other stations one station is connected to. To form connecting lines, this information about the connection to other stations need to be complemented by the involved line and direction of that connection. Hence, we are not only interested in the values of the property P197, but also in the property qualifiers corresponding to the connecting line (P81) and direction (P5051)
We can thus complete our query this way:
stations_adjacency=spq_init() %>% spq_add("?station wdt:P16 wd:Q1552") %>% spq_add("?station wdt:P31 wd:Q928830") %>% spq_add("?station wdt:P625 ?coords") %>% spq_add("?station p:P197 ?statement") %>% # added instruction spq_add("?statement ps:P197 ?adjacent") %>% # added instruction spq_add("?statement pq:P81 ?line") %>% # added instruction spq_add("?statement pq:P5051 ?direction") %>% # added instruction spq_label("station", "adjacent", "line", "direction",.overwrite = TRUE) %>% spq_select(-statement) %>% spq_perform() %>% na.omit() %>% select(coords,station,adjacent,line,direction) head(stations_adjacency)
Now, we would like to put the stations in the right order so that we will be able to form the connecting lines.
This data-wrangling part is a bit tricky though not directly due to any glitter-related operation.
We define a function form_line()
which will put the rows in the table of stations in the correct order.
form_line = function(adjacencies, direction) { N = nrow(adjacencies) num = rep(NA,N) ind = which(adjacencies$adjacent == direction) i = N num[ind] = i while (i>1) { indnew = which(adjacencies$adjacent == adjacencies$station[ind]) ind = indnew i = i-1 num[ind] = i } adjacencies = adjacencies %>% mutate(num = num) %>% arrange(num) adjacencies = c(adjacencies$station, direction) return(adjacencies) }
Now let's apply this function to all lines and directions possible. Making full use of the tidyverse, we can use iteratively this function while not dropping the table-like structure of our data using a combination of tidyr::nest() and purrr::map().
stations_lines = stations_adjacency %>% sf::st_drop_geometry() %>% # make this a regular tibble, not sf group_by(direction,line) %>% na.omit() %>% tidyr::nest(.key = "adj") %>% # have nested "adj" table for each direction-line mutate(station = purrr::map(.x = adj, .y = direction, ~form_line(.x,.y))) %>% tidyr::unnest(cols = "station") %>% ungroup()
We use left_join() to complete the table ordering the stations into lines with the coordinates of stations:
stations_lines=stations_lines %>% left_join(unique(stations_coords), # get corresponding coordinates by=c("station")) %>% na.omit() head(stations_lines)
stations_lines
is now an sf points object which is properly formatted to be transformed into an sf lines object (the stations are in the right order for each line-direction, and the associated coordinates are provided in the table):
stations_lines_sf=stations_lines %>% sf::st_as_sf(wkt="coords") %>% group_by(direction,line) %>% summarise(do_union = FALSE) %>% # for each group, and keeping order of points, sf::st_cast("LINESTRING") # form a linestring geometry stations_lines_sf
We can now use this new object to display the Lyon metro lines on a leaflet map:
factpal <- colorFactor(topo.colors(8), unique(stations_lines$line)) leaflet(data=stations_sf) %>% addTiles() %>% addCircles(popup=~station) %>% addPolylines(data=stations_lines_sf, color=~factpal(line), popup=~line)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.