knitr::opts_chunk$set(echo = TRUE) library(dplyr) library(tidyr) library(dtrackr)
Joining dtrackr
tracked data is supported and allows us to combine linked data
sets. In this toy example the data sets are characters from a popular film from
my youth.
# here we create a set of linked data from the starwars data # in a real example these data sets would have come from different places people = starwars %>% select(-films, -vehicles, -starships) vehicles = starwars %>% select(name,vehicles) %>% unnest(cols = c(vehicles)) starships = starwars %>% select(name,starships) %>% unnest(cols = c(starships)) films = starwars %>% select(name,films) %>% unnest(cols = c(films)) # these 4 data frames are linked together by the name attribute # we track both input data sets: tmp1 = people %>% track() %>% comment("People df {.total}") tmp2 = films %>% track() %>% comment("Films df {.total}") %>% comment("a test comment") # and here we (re)join the two data sets: tmp1 %>% inner_join(tmp2, by="name") %>% comment("joined {.total}") %>% flowchart() # The join message used by inner_join here is configurable but defaults to # {.count.lhs} on LHS # {.count.rhs} on RHS # {.count.out} in linked set
All dplyr
join types are supported by dtrackr
which allows us to report on the
numbers on either side of the join and on the resulting total. This can help
detect if any data items are lost during the join. However we do not yet capture
data that becomes excluded during joins, as the interpretation depends on the
type of join employed.
Another type of binary operator is a union. This is a simpler problem and works
as expected. In this example the early part of the pipeline is detected to be
the same on both branches of the data flow. This therefore results in a flow
that splits then subsequently joins again during the union (bind_rows
)
operator.
tmp = people %>% comment("start") tmp1 = tmp %>% include_any( species == "Human" ~ "{.included} humans", species == "Droid" ~ "{.included} droids" ) tmp2 = tmp %>% include_any( species == "Gungan" ~ "{.included} gungans" ) %>% comment("{.count} gungans") tmp3 = bind_rows(tmp1,tmp2) %>% comment("{.count} human,droids and gungans") tmp3 %>% flowchart()
Other dplyr
set operations are supported such as setdiff()
, union()
,
union_all()
and intersect()
which are included in the function
documentation.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.