knitr::opts_chunk$set( collapse = TRUE, comment = "#>", message = FALSE, warning = FALSE )
conjecture()
is the black swan of the sift
family. If you encounter the type of eccentric datasets conjecture is designed to tackle, you'll be glad you read this vignette.
At its heart, conjecture is a reshaping operation similar to tidyr::pivot_wider()
. However, the intended application for conjecture is more idiosyncratic than that of pivot_wider. This vignette illustrates the basic aspects of such an application.
The comms
dataset contains a time-series of radio transmissions.
library(sift) library(dplyr) library(tidyr) comms
A few notes:
msg_code
.msg_code
can be repeated multiple times (see below).msg_code
.comms %>% filter(station == "C", msg_code == 3060)
Suppose we wish to restructure comms
so that the "natural" pairing of send
+ receive
transmissions is more apparent. Since there is no explicit information linking these rows together, we "conjecture" that, for a given send
transmission (anterior), the corresponding receive
transmission (posterior) is the closest observation measured by timestamp
.
conjecture()
always takes 4 arguments.
comms
).timestamp
).type
)."send"
).comms_conjecture <- conjecture(comms, # dataset to reshape. timestamp, # <dttm> friendly. must be coercible to numeric. type, # any type of atomic vector is fine. "send") # we could flip our logic and supply "receive" instead. comms_conjecture
We can partially achieve the same result with pivot_wider.
comms_pivot <- comms %>% pivot_wider(names_from = type, values_from = timestamp, values_fn = first) %>% filter(receive > send) comms_pivot
Notice that pivot_wider produces r nrow(comms_pivot)
rows compared to r nrow(comms_conjecture)
in comms_conjecture
. What pairs are found in comms_conjecture
that aren't captured in comms_pivot
?
First, there a quite a few transmissions that do not elicit a response. conjecture doesn't sweep these under the rug.
comms_pivot %>% filter(is.na(receive)) comms_conjecture %>% filter(is.na(receive))
Second, our call to pivot_wider only returned the "first viable pairs" within each combination of station
+ msg_code
. On the other hand, comms_conjecture
contains 3 (4 including missing value) viable pairs for the below combination.
comms_pivot %>% filter(station == "A", msg_code == 221) comms_conjecture %>% filter(station == "A", msg_code == 221)
The inclusion of multiple pairs for a given station
+ msg_code
combination is the touchstone of conjecture.
We'll use a small fragment from comms
to illustrate how conjecture works.
comms_small <- comms %>% filter(station == "A", msg_code == 221) comms_small
We can readily identify the send/receive pairs from the above observations. But how does conjecture accomplish this programmatically?
sort_by = timestamps
) are separated into two vectors (specified by names_from = type
).send <- comms_small %>% filter(type == "send") %>% pull(timestamp) %>% sort() send receive <- comms_small %>% filter(type == "receive") %>% pull(timestamp) %>% sort() receive
send
, with a nested loop for each element in receive
. We can invert this hierarchy by setting names_first = "receive"
instead.output <- integer(length = length(send)) for (i in seq_along(send)) { output[i] <- NA_integer_ for (j in seq_along(receive)) { if (is.na(receive[j])) { next } else if (receive[j] > send[i]) { output[i] <- j break } else { next } } } tibble(send, receive = receive[output])
Conceptually, the above process flow is an accurate depiction of conjecture - though the underlying structure is more robust:
There is an important consequence associated with the above logic. We'll demonstrate by removing all but one of the receive
elements from comms_small
.
# from comms small receive <- receive[3] # rerun the algorithm for (i in seq_along(send)) { output[i] <- NA_integer_ for (j in seq_along(receive)) { if (is.na(receive[j])) { next } else if (receive[j] > send[i]) { output[i] <- j break } else { next } } } tibble(send, receive = receive[output])
Why does 1999-02-21 12:29:59
appear 3 times? Recall:
"for a given send
transmission (anterior), the corresponding receive
transmission (posterior) is the closest observation measured by timestamp
."
The above result is in accordance with this statement. However, at some point in the future, I may add the ability to drop repeat occurrences of posterior timestamps, which would produce the following result instead.
tibble(send, receive = receive[c(1, NA, NA, NA)])
The express
dataset contains toll records for northbound and southbound vehicles over the course of one business day.
library(readr) library(mopac) mopac::express
Suppose we are interested in vehicles using the express lane both North
and South
(i.e. commuting to work). It's up to us to designate an anterior direction
. If we are only interested in vehicles commuting downtown, we set names_first = "South"
.
conjecture(express, time, direction, "South") %>% drop_na() # We can't assume incomplete pairs are commuting to downtown
library(ggplot2) conjecture(express, time, direction, "South") %>% drop_na() %>% mutate(trip_length = difftime(North, South, units = "hours")) %>% ggplot(aes(trip_length)) + geom_histogram()
library(ggplot2) conjecture(express, time, direction, "South") %>% drop_na() %>% mutate(trip_length = difftime(North, South, units = "hours")) %>% ggplot(aes(trip_length)) + geom_histogram() + scale_y_continuous(expand = c(0, 0)) + theme_minimal() + theme(panel.grid.minor = element_blank(), plot.title.position = "plot") + labs(title = "Trip length distribution", subtitle = "Vehicles commuting downtown", x = "Round trip length [hours]", y = NULL)
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.