5. Matching different data sources

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>", 
  warning = FALSE,
  message = FALSE,
  out.width = "100%"
)
library(cubble)
library(dplyr)
library(ggplot2)
library(patchwork)

One common task when working with spatio-temporal data is to match nearby sites. For example, we may want to verify the location of an old list of stations with current stations, or we may want to match the data from different data sources. In this vignette, we will introduce the spatial and temporal matching in cubble using an example on matching river level data with precipitation in Victoria, Australia.

In cubble, spatial and temporal matching are performed using the functions match_spatial() and match_temporal(). The match_spatial() function calculates the spatial distance between observations in two cubble objects. Various distance measures are available (check sf::st_distance). Analysts can specify the number of matched groups to output using the spatial_n_group argument (default to 4 groups) and the number of matches per group using the spatial_n_group argument (default to 1, one-to-one matching). The syntax to use match_spatial() is:

match_spatial(<cubble_obj1>, <cubble_obj2>, ...)

The function match_temporal() calculates the similarity between time series within spatially matched groups. Two identifiers are required: one for separating each matched group (match_id) and one for separating the two data sources (data_id). The argument temporal_by uses the by syntax from dplyr's *_join to specify the temporal variables to match.

The similarity score between two time series is calculated using a matching function, which can be customised by the analysts. The matching function takes two time series as a list and returns a single numerical score. This allows for flexibility in using existing time series distance calculation implementation. By default, cubble implements a simple peak matching algorithm (match_peak) that counts the number of peaks in two time series that fall within a specified temporal window. The syntax to use match_temporal() is

match_temporal(
  <cubble_obj_from_match_spatial>, 
  data_id = , match_id = , 
  temporal_by = c("..." = "...")
)

Spatial matching

Now let's consider an example of matching water data from river gauges with precipitation. The [water level data]((http://www.bom.gov.au/metadata/catalogue/19115/ANZCW0503900528?template=full), collected by the Bureau of Meteorology, can be compared with the precipitation since rainfall can directly impact water level in river. Here is the location of available weather stations and water gauges in Victoria, Australia:

river <- cubble::river %>%  mutate(type = "river") %>% rename(id = station)

climate_vic <- climate_aus %>%  
  # subset for Victoria stations
  filter(between(as.numeric(stringr::str_sub(id, 7, 8)), 76, 90)) %>%  
  mutate(type = "climate")

vic_map <- ozmaps::abs_ste %>%  filter(NAME == "Victoria")

ggplot() + 
  geom_sf(data = vic_map) + 
  geom_point(data = dplyr::bind_rows(river, climate_vic), 
             aes(x = long, y = lat, color = type)) + 
  theme_bw() +
  theme(legend.position = "bottom") +
  labs(x = "Longitude", y = "Latitude") + 
  scale_color_brewer(palette = "Dark2")

Both climate_vic and river are cubble objects, and we can obtain a summary of the 10 closest pairs between them:

(res_sp <- match_spatial(climate_vic, river, spatial_n_group = 10))

The result can also be returned as cubble objects by setting the argument return_cubble = TRUE. The output is be a list where each element is a paired cubble object. To combine all the results into a single cubble, you can use bind_rows(). In the case when a site in the second cubble (the river data here) is matched to two stations in the first cubble (climate_vic here), the binding may not be successful since cubble requires unique rows in the nested form. In the summary table above, the river station 226027 is matched to more than one weather station: ASN00085072 (group 3) and ASN00085298 (group 5). Similarly, the river station 230200 is matched in group 7 and 8). In such cases, you can either deselect one pair before combining, or work with the list output with the purrr::map syntax:

res_sp <- match_spatial(climate_vic, river, spatial_n_group = 10, return_cubble = TRUE)
str(res_sp, max.level = 0)
res_sp[[1]]
(res_sp <- res_sp[-c(5, 8)] %>% bind_rows())

Temporal matching

For temporal matching, we match the variable Water_course_level from the river data to prcp in the weather station data. The variable group and types identify the matching group and the two datasets:

(res_tm <- res_sp %>% 
  match_temporal(
    data_id = type, match_id = group,
    temporal_by = c("prcp" = "Water_course_level")))

Similarly, the cubble output can be returned using the argument return_cubble = TRUE. Here we select the four pairs with the highest number of matching peaks:

res_tm <- res_sp %>% 
  match_temporal(
    data_id = type, match_id = group,
    temporal_by = c("prcp" = "Water_course_level"),
    return_cubble = TRUE)
(res_tm <- res_tm %>% bind_rows() %>% filter(group %in% c(1, 7, 6, 9)))

And then we can visualise them in space or across time:

res_tm_long <- res_tm %>%  
  face_temporal() %>%  
  unfold(group, type) %>%  
  group_by(group, type) %>%
  mutate(matched = (matched - min(matched, na.rm = TRUE))/ 
           (max(matched, na.rm = TRUE) - min(matched, na.rm = TRUE))) 

vic_map <- ozmaps::abs_ste |> 
  filter(NAME == "Victoria")

p1 <-ggplot() + 
  geom_sf(data = vic_map, fill = "grey95", color = "white") + 
  geom_point(data = dplyr::bind_rows(river, climate_vic), 
             aes(x = long, y = lat, color = type), 
             alpha = 0.2, fill = 0.2) +
  geom_point(data = res_tm %>% as_tibble(), 
             aes(x = long, y = lat, color = type)) +
  ggrepel::geom_label_repel(
    data = res_tm %>%  filter(type == "climate") %>% as_tibble(), 
    aes(x = long, y = lat, label = group)) +
  scale_color_brewer(palette = "Dark2")  + 
  theme_void() + 
  ggplot2::theme(legend.position = "bottom") +
  ggplot2::labs(x = "Longitude", y = "Latitude") 

p2 <- res_tm_long %>%  
  ggplot(aes(x = date, y = matched, group = type,color = type)) + 
  geom_line() + 
  facet_wrap(vars(group)) + 
  scale_color_brewer(palette = "Dark2", guide = "none") + 
  theme_bw() + 
  labs(x=  "date") + 
  scale_x_date(date_labels = "%b") + 
  labs(x = "Week", y = "Precipitation/ water level")

(p1 | p2) + 
  patchwork::plot_layout(guides = "collect") + 
  plot_annotation(tag_levels = 'a')&
  theme(legend.position = "bottom") 


Try the cubble package in your browser

Any scripts or data that you put into this service are public.

cubble documentation built on July 9, 2023, 6:19 p.m.