Introducing conjecture"
In sift: Intelligently Peruse Data

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  message = FALSE,
  warning = FALSE
)

conjecture() is the black swan of the sift family. If you encounter the type of eccentric datasets conjecture is designed to tackle, you'll be glad you read this vignette.

At its heart, conjecture is a reshaping operation similar to tidyr::pivot_wider(). However, the intended application for conjecture is more idiosyncratic than that of pivot_wider. This vignette illustrates the basic aspects of such an application.

Example 1: Radio Transmissions

The comms dataset contains a time-series of radio transmissions.

library(sift)
library(dplyr)
library(tidyr)

comms

A few notes:

We are not interested in the interaction between the stations. Each station (A, B, C, D) can be regarded as 4 independent time-series.
The subject of each transmission is denoted by msg_code.
Each msg_code can be repeated multiple times (see below).
There is no guarantee that a sent transmission will be met with a response sharing the same msg_code.

comms %>% 
  filter(station == "C",
         msg_code == 3060)

Suppose we wish to restructure comms so that the "natural" pairing of send + receive transmissions is more apparent. Since there is no explicit information linking these rows together, we "conjecture" that, for a given send transmission (anterior), the corresponding receive transmission (posterior) is the closest observation measured by timestamp.

conjecture() always takes 4 arguments.

Dataset to reshape (comms).
Column, as a symbol, used to measure distance between observations (timestamp).
Column, as a symbol, that demarks observations as anterior or posterior (type).
Scalar quantity signifying anterior observation ("send").

comms_conjecture <- conjecture(comms,     # dataset to reshape.
                               timestamp, # <dttm> friendly. must be coercible to numeric.
                               type,      # any type of atomic vector is fine.
                               "send")    # we could flip our logic and supply "receive" instead.

comms_conjecture

We can partially achieve the same result with pivot_wider.

comms_pivot <- comms %>% 
  pivot_wider(names_from = type,
              values_from = timestamp,
              values_fn = first) %>% 
  filter(receive > send)

comms_pivot

Notice that pivot_wider produces r nrow(comms_pivot) rows compared to r nrow(comms_conjecture) in comms_conjecture. What pairs are found in comms_conjecture that aren't captured in comms_pivot?

First, there a quite a few transmissions that do not elicit a response. conjecture doesn't sweep these under the rug.

comms_pivot %>% 
  filter(is.na(receive))

comms_conjecture %>% 
  filter(is.na(receive))

Second, our call to pivot_wider only returned the "first viable pairs" within each combination of station + msg_code. On the other hand, comms_conjecture contains 3 (4 including missing value) viable pairs for the below combination.

comms_pivot %>% 
  filter(station == "A",
         msg_code == 221)

comms_conjecture %>% 
  filter(station == "A",
         msg_code == 221)

The inclusion of multiple pairs for a given station + msg_code combination is the touchstone of conjecture.

Underlying Logic

We'll use a small fragment from comms to illustrate how conjecture works.

comms_small <- comms %>% 
  filter(station == "A",
         msg_code == 221)

comms_small

We can readily identify the send/receive pairs from the above observations. But how does conjecture accomplish this programmatically?

timestamps (specified by sort_by = timestamps) are separated into two vectors (specified by names_from = type).

send <- comms_small %>% filter(type == "send") %>% pull(timestamp) %>% sort()
send

receive <- comms_small %>% filter(type == "receive") %>% pull(timestamp) %>% sort()
receive

Iterate through each element in send, with a nested loop for each element in receive. We can invert this hierarchy by setting names_first = "receive" instead.

output <- integer(length = length(send))

for (i in seq_along(send)) {
  output[i] <- NA_integer_

  for (j in seq_along(receive)) {
    if (is.na(receive[j])) {
      next
    } else if (receive[j] > send[i]) {
      output[i] <- j
      break
    } else {
      next
    }
  }
}

tibble(send, receive = receive[output])

Conceptually, the above process flow is an accurate depiction of conjecture - though the underlying structure is more robust:

conjecture reconciles the presence of additional columns, similar to conventional reshaping operations.
conjecture relies on C++ to execute the looping structure.

Duplicate Posterior Values

There is an important consequence associated with the above logic. We'll demonstrate by removing all but one of the receive elements from comms_small.

# from comms small
receive <- receive[3]

# rerun the algorithm
for (i in seq_along(send)) {
  output[i] <- NA_integer_

  for (j in seq_along(receive)) {
    if (is.na(receive[j])) {
      next
    } else if (receive[j] > send[i]) {
      output[i] <- j
      break
    } else {
      next
    }
  }
}

tibble(send, receive = receive[output])

Why does 1999-02-21 12:29:59 appear 3 times? Recall:

"for a given send transmission (anterior), the corresponding receive transmission (posterior) is the closest observation measured by timestamp."

The above result is in accordance with this statement. However, at some point in the future, I may add the ability to drop repeat occurrences of posterior timestamps, which would produce the following result instead.

tibble(send, receive = receive[c(1, NA, NA, NA)])

Example 2: Toll Lane Records

The express dataset contains toll records for northbound and southbound vehicles over the course of one business day.

library(readr)
library(mopac)

mopac::express

Suppose we are interested in vehicles using the express lane both North and South (i.e. commuting to work). It's up to us to designate an anterior direction. If we are only interested in vehicles commuting downtown, we set names_first = "South".

conjecture(express, time, direction, "South") %>% 
  drop_na() # We can't assume incomplete pairs are commuting to downtown

library(ggplot2)

conjecture(express, time, direction, "South") %>% 
  drop_na() %>% 
  mutate(trip_length = difftime(North, South, units = "hours")) %>% 
  ggplot(aes(trip_length)) +
  geom_histogram()

library(ggplot2)

conjecture(express, time, direction, "South") %>% 
  drop_na() %>% 
  mutate(trip_length = difftime(North, South, units = "hours")) %>% 
  ggplot(aes(trip_length)) +
  geom_histogram() +
  scale_y_continuous(expand = c(0, 0)) +
  theme_minimal() +
  theme(panel.grid.minor = element_blank(),
        plot.title.position = "plot") +
  labs(title = "Trip length distribution",
       subtitle = "Vehicles commuting downtown",
       x = "Round trip length [hours]",
       y = NULL)