In sccmckenzie/sift: Intelligently Peruse Data

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%",
  warning = FALSE,
  dev = "svglite"
)

sift

sift facilitates intelligent & efficient exploration of datasets.

# install.packages("devtools")
devtools::install_github("sccmckenzie/sift")

sift is designed to work seamlessly with tidyverse.

library(tidyverse) # needed for below examples
library(sift)

1. `sift::sift()`

Imagine `dplyr::filter()` that includes neighboring observations.

Perhaps you remember the Utah monolith. The buzz surrounding its discovery (and disappearance) served as a welcome diversion from the otherwise upsetting twists and turns of 2020.

Suppose we are asked: what else was happening in the world around this time?

Let's peruse the nyt2020 dataset to refresh our memory.

nyt2020 %>% 
  filter(str_detect(headline, "Monolith")) %>% 
  glimpse()

The monolith story broke on 2020-11-24. Prior to writing this documentation, I certainly would not have remembered this happening in November specifically.

Let's take a peek at other headlines from ±2 days.

nyt2020 %>% 
  filter(pub_date > "2020-11-22",
         pub_date < "2020-11-26") %>% 
  select(headline, pub_date)

Notice that it took two steps to achieve the above result. We first had to find the date of the monolith story then perform a subsequent call to filter(). This procedure would quickly become a nuisance after a few iterations.

sift() provides an interface to perform this exact process in one step.

nyt2020 %>% 
  sift(pub_date, scope = 2, str_detect(headline, "Monolith")) %>% 
  select(headline, pub_date)

Under the hood, sift() passes str_detect(headline, "Monolith") to dplyr::filter(), then augments the filtered observations to include any rows falling in ±2 day window (specified by pub_date and scope = 2).

2. `sift::break_join()`

Harness combined power of `dplyr::left_join()` & `findInterval()`.

Take a look at the structure of us_uk_pop and us_uk_leaders below. How would you join these two datasets together? Specifically, we want each row in us_uk_pop to contain information (name, party) for the leader at that time.

us_uk_pop %>% 
  group_by(country) %>% 
  slice_head(n = 3)

us_uk_leaders

If you look closely at the dates in us_uk_pop, they typically fall around January 20th (US inauguration day). Joining by country & year(date/start) would sweep this inconvenient detail under the rug.

For one country alone, we could use findInterval.

us_uk_pop %>% 
  filter(country == "USA") %>% 
  mutate(name = filter(us_uk_leaders, country == "USA")$name[findInterval(date, filter(us_uk_leaders, country == "USA")$start)])

The above code is somewhat unintelligible. Additionally, there is no straightforward way to accommodate UK & USA rows.

break_join() provides a simple interface leveraging functionality of dplyr::left_join() and findInterval().

break_join(us_uk_pop, us_uk_leaders, brk = c("date" = "start"))

Notice that country was detected as a common variable (courtesy of dplyr::left_join()).

Alternatively, we could have supplied by explicitly.

# effectively the same as above call
break_join(us_uk_pop, us_uk_leaders, brk = c("date" = "start"), by = "country")

Additional arguments supplied to ... will be automatically directed to dplyr::left_join() and findInterval().

set.seed(1)
a <- tibble(x = 1:5, y = runif(5, 4, 6))
b <- tibble(y = c(4, 5), z = c("A", "B"))

break_join(a, b, brk = "y")

break_join(a, b, brk = "y", all.inside = TRUE)

3. `sift::kluster()`

Imagine 1D K-means, except K is chosen automatically.

Consider the faithful dataset.

Density plot below clearly demonstrates there are 2 clusters of eruptions.

ggplot(faithful, aes(eruptions)) +
  geom_density() +
  geom_rug() +
  theme_minimal()

Currently, these clusters are implicit, meaning we do not have a categorical variable associating each observation with a cluster. We could manually assign clusters by drawing a line at, say, 3.0.

kluster() does this automatically - no extra inputs needed.

k <- kluster(faithful$eruptions)

faithful$k <- k

ggplot(faithful, aes(eruptions)) +
  geom_density() +
  geom_rug(aes(color = factor(k))) +
  theme_minimal() +
  scale_color_discrete(name = "k")