knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.path = "man/figures/README-", out.width = "100%", warning = FALSE, dev = "svglite" )
sift facilitates intelligent & efficient exploration of datasets.
# install.packages("devtools")
devtools::install_github("sccmckenzie/sift")
sift is designed to work seamlessly with tidyverse.
library(tidyverse) # needed for below examples
library(sift)
sift::sift()
A dplyr::filter() that includes neighboring observations.
Perhaps you remember the Utah monolith. The buzz surrounding its discovery (and disappearance) served as a welcome diversion from the otherwise upsetting twists and turns of 2020.
Suppose we are asked: what else was happening in the world around this time?
Let's peruse the nyt2020 dataset to refresh our memory.
nyt2020 %>% filter(str_detect(headline, "Monolith")) %>% glimpse()
The monolith story broke on 2020-11-24. Prior to writing this documentation, I certainly would not have remembered this happening in November specifically.
Let's take a peek at other headlines from ±2 days.
nyt2020 %>% filter(pub_date >= "2020-11-22", pub_date <= "2020-11-26") %>% select(headline, pub_date)
Notice that it took two steps to achieve the above result. We first had to find the date of the monolith story, then perform a subsequent call to filter(). This procedure would quickly become a nuisance after a few iterations.
sift() provides an interface to perform this exact process in one step.
nyt2020 %>% sift(pub_date, scope = 2, str_detect(headline, "Monolith")) %>% select(headline, pub_date)
Under the hood, sift() passes str_detect(headline, "Monolith") to dplyr::filter(), then augments the filtered observations to include any rows falling within a ±2-day window (specified by pub_date and scope = 2).
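The filter-then-widen idea described above can be sketched in base R. This is only an illustration of the concept, not sift's actual implementation: `sift_sketch()` and the `news` table are hypothetical stand-ins invented for this example.

```r
# Hypothetical toy data standing in for nyt2020 (invented headlines and dates)
news <- data.frame(
  headline = c("Story one", "Story two", "Monolith found in desert",
               "Story four", "Story five", "Story six"),
  pub_date = as.Date(c("2020-11-03", "2020-11-23", "2020-11-24",
                       "2020-11-26", "2020-11-30", "2020-12-28")),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of sift): keep every row whose date lies
# within `scope` days of any row flagged by the logical vector `keep`.
sift_sketch <- function(data, dates, scope, keep) {
  anchors <- as.numeric(dates[keep])   # dates of the matching rows
  days <- as.numeric(dates)
  near <- vapply(
    days,
    function(d) any(abs(d - anchors) <= scope),
    logical(1)
  )
  data[near, , drop = FALSE]
}

result <- sift_sketch(
  news,
  dates = news$pub_date,
  scope = 2,
  keep = grepl("Monolith", news$headline)
)
result
```

Only the rows dated 2020-11-23, 2020-11-24, and 2020-11-26 survive, because each is within two days of the matching headline.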
sift::break_join()
Combines dplyr::left_join() and findInterval().
Take a look at the structure of us_uk_pop and us_uk_leaders below. How would you join these two datasets together? Specifically, we want each row in us_uk_pop to contain information (name, party) for the leader at that time.
us_uk_pop %>% group_by(country) %>% slice_head(n = 3)
us_uk_leaders
If you look closely at the dates in us_uk_pop, they typically fall around January 20th (US inauguration day). Joining by country & year(date/start) would sweep this inconvenient detail under the rug.
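A small base R illustration of why a year-based match can mislead. The dates and leader labels here are hypothetical and are not taken from us_uk_leaders.

```r
# Hypothetical leaders table: Leader B takes office on 20 January 2017
start  <- as.Date(c("2009-01-20", "2017-01-20"))
leader <- c("Leader A", "Leader B")

# An observation dated ten days before the handover
obs <- as.Date("2017-01-10")

# Matching on calendar year picks the leader whose term starts in 2017
by_year <- leader[match(format(obs, "%Y"), format(start, "%Y"))]

# Comparing against the actual start dates picks the leader in office on obs
by_date <- leader[max(which(start <= obs))]

by_year
by_date
```

The year-based match returns "Leader B", even though "Leader A" was still in office on the observation date.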
For one country alone, we could use findInterval().
us_uk_pop %>% filter(country == "USA") %>% mutate(name = filter(us_uk_leaders, country == "USA")$name[findInterval(date, filter(us_uk_leaders, country == "USA")$start)])
The above code is somewhat unintelligible. Additionally, there is no straightforward way to accommodate both UK and USA rows.
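To make the difficulty concrete, here is a rough base R sketch of what handling several countries by hand might look like: loop over each country and apply findInterval() within it. The `pop` and `leaders` tables are hypothetical toy data, and this is not how break_join() is implemented.

```r
# Hypothetical toy tables (not the package's us_uk_pop / us_uk_leaders)
pop <- data.frame(
  country = c("USA", "USA", "UK", "UK"),
  date = as.Date(c("2010-06-01", "2018-06-01", "2010-01-01", "2018-06-01")),
  stringsAsFactors = FALSE
)
leaders <- data.frame(
  country = c("USA", "USA", "UK", "UK", "UK"),
  name = c("US Leader A", "US Leader B", "UK Leader A", "UK Leader B", "UK Leader C"),
  start = as.Date(c("2009-01-20", "2017-01-20", "2007-06-27", "2010-05-11", "2016-07-13")),
  stringsAsFactors = FALSE
)

pop$name <- NA_character_
for (ctry in unique(pop$country)) {
  rows <- pop$country == ctry
  l <- leaders[leaders$country == ctry, ]
  l <- l[order(l$start), ]               # findInterval() needs sorted breaks
  idx <- findInterval(as.numeric(pop$date[rows]), as.numeric(l$start))
  idx[idx == 0] <- NA                    # date precedes the first start date
  pop$name[rows] <- l$name[idx]
}
pop
```

Each row picks up the leader whose start date most recently precedes it within the same country, which is the bookkeeping a single-country findInterval() call cannot do on its own.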
break_join() provides a simple interface leveraging the functionality of dplyr::left_join() and findInterval().
break_join(us_uk_pop, us_uk_leaders, brk = c("date" = "start"))
Notice that country was detected as a common variable (courtesy of dplyr::left_join()).
Alternatively, we could have supplied by explicitly.
# effectively the same as above call
break_join(us_uk_pop, us_uk_leaders, brk = c("date" = "start"), by = "country")
Additional arguments supplied to ... will be automatically directed to dplyr::left_join() and findInterval().
set.seed(1)
a <- tibble(x = 1:5, y = runif(5, 4, 6))
b <- tibble(y = c(4, 5), z = c("A", "B"))
break_join(a, b, brk = "y")
break_join(a, b, brk = "y", all.inside = TRUE)
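The effect of all.inside is easiest to see by calling findInterval() directly on the same break points. This sketch uses hand-picked values rather than the random ones above.

```r
breaks <- c(4, 5)
y <- c(3.5, 4.2, 5.7)

# Default: 0 means "below the first break", 2 means "at or above the last break"
default_idx <- findInterval(y, breaks)

# all.inside = TRUE squeezes every index into 1..(length(breaks) - 1)
inside_idx <- findInterval(y, breaks, all.inside = TRUE)

default_idx
inside_idx
```

With only two break points, all.inside = TRUE maps every value to interval 1, so in a join of this kind every row would be matched to the first break.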
sift::kluster()
Consider the faithful dataset.
The density plot below clearly shows two clusters of eruptions.
ggplot(faithful, aes(eruptions)) + geom_density() + geom_rug() + theme_minimal()
Currently, these clusters are implicit, meaning we do not have a categorical variable associating each observation with a cluster. We could manually assign clusters by drawing a line at, say, 3.0.
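For comparison, the manual approach might look like the following sketch, where the cutoff of 3.0 is simply the eyeballed value mentioned above and `manual_k` is a name chosen for this example.

```r
# Manually label each eruption by which side of the 3.0 cutoff it falls on
manual_k <- ifelse(faithful$eruptions < 3, 1L, 2L)

# Count of observations in each hand-assigned cluster
table(manual_k)
```

This works, but the cutoff has to be chosen by inspecting the plot and revisited for every new variable.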
kluster() does this automatically, with no extra inputs needed.
k <- kluster(faithful$eruptions)
faithful$k <- k
ggplot(faithful, aes(eruptions)) + geom_density() + geom_rug(aes(color = factor(k))) + theme_minimal() + scale_color_discrete(name = "k")