This vignette deals with situations where you want to transform tables
of your dm
and then update an existing table or add a new table to the
dm
. There are two straightforward approaches:
dm
(or
replace an existing table) and establish the key relations.dm
object by zooming to a table and
manipulating it while maintaining the key relations whenever
possible.The first approach is rather self-explanatory, so let us have a closer look at the second way.
dm
“Zooming” to a table of a dm
means:
dm
is kept, including the
originally zoomed tabledm_zoomed
is produced, presenting a view of the
table for transformationsselect()
,
mutate()
and other table manipulation functions{dm} provides methods for many of the {dplyr}-verbs for a dm_zoomed
which behave the way you are used to, affecting only the zoomed table
and leaving the rest of the dm
untouched. When you are finished with
transforming the table, there are three options to proceed:
dm_update_zoomed()
if you want to replace the originally
zoomed table with the new tabledm_insert_zoomed()
if you are creating a new table for your
dm
dm_discard_zoomed()
if you do not need the result and want to
discard itWhen employing one of the first two options, the resulting table in the
dm
will have all the primary and foreign keys available that could be
tracked from the originally zoomed table.
So much for the theory, but how does it look and feel? To explore this, we once more make use of our trusted {nycflights13} data.
Imagine you want to have a column in flights
, specifying if a flight
left before noon or after. Just like with {dplyr}, we can tackle this
with mutate()
. Let us do this step by step:
library(dm)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following object is masked from 'package:testthat':
#>
#> matches
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
flights_dm <- dm_nycflights13()
flights_dm
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
#> Columns: 53
#> Primary keys: 4
#> Foreign keys: 4
flights_zoomed <-
flights_dm %>%
dm_zoom_to(flights)
# The print output for a `dm_zoomed` looks very much like that from a normal `tibble`.
flights_zoomed
#> # Zoomed table: flights
#> # A tibble: 11,227 x 19
#> year month day dep_time sched_dep_time dep_delay arr_time
#> <int> <int> <int> <int> <int> <dbl> <int>
#> 1 2013 1 10 3 2359 4 426
#> 2 2013 1 10 16 2359 17 447
#> 3 2013 1 10 450 500 -10 634
#> 4 2013 1 10 520 525 -5 813
#> 5 2013 1 10 530 530 0 824
#> 6 2013 1 10 531 540 -9 832
#> 7 2013 1 10 535 540 -5 1015
#> 8 2013 1 10 546 600 -14 645
#> 9 2013 1 10 549 600 -11 652
#> 10 2013 1 10 550 600 -10 649
#> # … with 11,217 more rows, and 12 more variables: sched_arr_time <int>,
#> # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> # minute <dbl>, time_hour <dttm>
flights_zoomed_mutate <-
flights_zoomed %>%
mutate(am_pm_dep = if_else(dep_time < 1200, "am", "pm")) %>%
# in order to see our changes in the output we use `select()` for reordering the columns
select(year:dep_time, am_pm_dep, everything())
flights_zoomed_mutate
#> # Zoomed table: flights
#> # A tibble: 11,227 x 20
#> year month day dep_time am_pm_dep sched_dep_time dep_delay arr_time
#> <int> <int> <int> <int> <chr> <int> <dbl> <int>
#> 1 2013 1 10 3 am 2359 4 426
#> 2 2013 1 10 16 am 2359 17 447
#> 3 2013 1 10 450 am 500 -10 634
#> 4 2013 1 10 520 am 525 -5 813
#> 5 2013 1 10 530 am 530 0 824
#> 6 2013 1 10 531 am 540 -9 832
#> 7 2013 1 10 535 am 540 -5 1015
#> 8 2013 1 10 546 am 600 -14 645
#> 9 2013 1 10 549 am 600 -11 652
#> 10 2013 1 10 550 am 600 -10 649
#> # … with 11,217 more rows, and 12 more variables: sched_arr_time <int>,
#> # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> # minute <dbl>, time_hour <dttm>
# To update the original `dm` with a new `flights` table we use `dm_update_zoomed()`:
updated_flights_dm <-
flights_zoomed_mutate %>%
dm_update_zoomed()
# The only difference in the `dm` print output is the increased number of columns
updated_flights_dm
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
#> Columns: 54
#> Primary keys: 4
#> Foreign keys: 4
# The schematic view of the data model remains unchanged
dm_draw(updated_flights_dm)
The same course of action could, for example, be employed to create a
surrogate key for a table. We can do this for the weather
table.
library(tidyr)
#>
#> Attaching package: 'tidyr'
#> The following object is masked from 'package:dm':
#>
#> extract
#> The following object is masked from 'package:testthat':
#>
#> matches
weather_zoomed <-
flights_dm %>%
dm_zoom_to(weather)
weather_zoomed
#> # Zoomed table: weather
#> # A tibble: 861 x 15
#> origin year month day hour temp dewp humid wind_dir wind_speed
#> <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 EWR 2013 1 10 0 41 32 70.1 230 8.06
#> 2 EWR 2013 1 10 1 39.0 30.0 69.9 210 9.21
#> 3 EWR 2013 1 10 2 39.0 28.9 66.8 230 6.90
#> 4 EWR 2013 1 10 3 39.9 27.0 59.5 270 5.75
#> 5 EWR 2013 1 10 4 41 26.1 55.0 320 6.90
#> 6 EWR 2013 1 10 5 41 26.1 55.0 300 12.7
#> 7 EWR 2013 1 10 6 39.9 25.0 54.8 280 6.90
#> 8 EWR 2013 1 10 7 41 25.0 52.6 330 6.90
#> 9 EWR 2013 1 10 8 43.0 25.0 48.7 330 8.06
#> 10 EWR 2013 1 10 9 45.0 23 41.6 320 17.3
#> # … with 851 more rows, and 5 more variables: wind_gust <dbl>,
#> # precip <dbl>, pressure <dbl>, visib <dbl>, time_hour <dttm>
# Maybe there is some hidden candidate for a primary key that we overlooked
enum_pk_candidates(weather_zoomed)
#> # A tibble: 15 x 3
#> columns candidate why
#> <keys> <lgl> <chr>
#> 1 origin FALSE has duplicate values: EWR (287), JFK (287), LGA (2…
#> 2 year FALSE has duplicate values: 2013 (861)
#> 3 month FALSE has duplicate values: 1 (72), 2 (72), 4 (72), 5 (7…
#> 4 day FALSE has duplicate values: 10 (861)
#> 5 hour FALSE has duplicate values: 0 (36), 1 (36), 3 (36), 4 (3…
#> 6 temp FALSE has duplicate values: 62.06 (24), 75.92 (23), 33.0…
#> 7 dewp FALSE has duplicate values: 32.00 (53), 53.96 (47), 53.0…
#> 8 humid FALSE has duplicate values: 72.33 (8), 89.86 (8), 70.08 …
#> 9 wind_dir FALSE has duplicate values: 180 (39), 200 (39), 320 (39)…
#> 10 wind_speed FALSE has duplicate values: 6.90468 (82), 8.05546 (79), …
#> 11 wind_gust FALSE has 718 missing values, and duplicate values: 23.0…
#> 12 precip FALSE has duplicate values: 0.00 (791), 0.01 (16), 0.02 …
#> 13 pressure FALSE has 92 missing values, and duplicate values: 1015.…
#> 14 visib FALSE has duplicate values: 10 (698), 9 (45), 8 (38), 7 …
#> 15 time_hour FALSE has duplicate values: 2013-01-10 00:00:00 (3), 201…
# Seems we have to construct a column with unique values
# This can be done by combining column `origin` with `time_hour`, if the latter
# is converted to a single time zone first; all within the `dm`:
weather_zoomed_mutate <-
weather_zoomed %>%
# first convert all times to the same time zone:
mutate(time_hour_fmt = format(time_hour, tz = "UTC")) %>%
# paste together as character the airport code and the time
unite("origin_slot_id", origin, time_hour_fmt) %>%
select(origin_slot_id, everything())
# check if we the result is as expected:
enum_pk_candidates(weather_zoomed_mutate) %>% filter(candidate)
#> # A tibble: 1 x 3
#> columns candidate why
#> <keys> <lgl> <chr>
#> 1 origin_slot_id TRUE ""
flights_upd_weather_dm <-
weather_zoomed_mutate %>%
dm_update_zoomed() %>%
dm_add_pk(weather, origin_slot_id)
flights_upd_weather_dm
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
#> Columns: 53
#> Primary keys: 4
#> Foreign keys: 3
# creating the coveted FK relation between `flights` and `weather`
extended_flights_dm <-
flights_upd_weather_dm %>%
dm_zoom_to(flights) %>%
mutate(time_hour_fmt = format(time_hour, tz = "UTC")) %>%
# need to keep `origin` as FK to airports, so `remove = FALSE`
unite("origin_slot_id", origin, time_hour_fmt, remove = FALSE) %>%
dm_update_zoomed() %>%
dm_add_fk(flights, origin_slot_id, weather)
extended_flights_dm %>% dm_draw()
dm
If you look at the dm
created by dm_nycflights13(cycle = TRUE)
, you
see that two columns of flights
relate to one and the same table,
airports
. One column stands for the departure airport and the other
for the arrival airport.
dm_draw(dm_nycflights13(cycle = TRUE))
In such cases it can be beneficial, to “disentangle” the
dm
by
duplicating the referred table. One way to do this in the {dm}-framework
is as follows:
disentangled_flights_dm <-
dm_nycflights13(cycle = TRUE) %>%
# zooming and immediately inserting essentially creates a copy of the original table
dm_zoom_to(airports) %>%
# reinserting the `airports` table under the name `destination`
dm_insert_zoomed("destination") %>%
# renaming the originally zoomed table
dm_rename_tbl(origin = airports) %>%
# Key relations are also duplicated, so the wrong ones need to be removed
dm_rm_fk(flights, dest, origin) %>%
dm_rm_fk(flights, origin, destination)
dm_draw(disentangled_flights_dm)
In a future update we will provide a more convenient way to
“disentangle” dm
objects, so that the individual steps will be done
automatically.
dm
Here is an example for adding a summary of a table as a new table to a
dm
(FK-relations are taken care of automatically):
dm_with_summary <-
flights_dm %>%
dm_zoom_to(flights) %>%
count(origin, carrier) %>%
dm_insert_zoomed("dep_carrier_count")
dm_draw(dm_with_summary)
If you would like to join some or all of the columns of one table to
another, you can make use of one of the join
-methods for a
dm_zoomed
. In addition to the usual arguments for the {dplyr}-joins,
by supplying the select
argument you can specify which columns of the
RHS-table you want to be included in the join. For the syntax please see
the example below. The LHS-table of a join is always the zoomed table.
joined_flights_dm <-
flights_dm %>%
dm_zoom_to(flights) %>%
# let's first reduce the number of columns of flights
select(-dep_delay:-arr_delay, -air_time:-time_hour) %>%
# in the {dm}-method for the joins you can specify which columns you want to add to the zoomed table
left_join(planes, select = c(tailnum, plane_type = type)) %>%
dm_insert_zoomed("flights_plane_type")
# this is how the table looks now
joined_flights_dm$flights_plane_type
#> # A tibble: 11,227 x 11
#> year month day dep_time sched_dep_time carrier flight tailnum origin
#> <int> <int> <int> <int> <int> <chr> <int> <chr> <chr>
#> 1 2013 1 10 3 2359 B6 727 N571JB JFK
#> 2 2013 1 10 16 2359 B6 739 N564JB JFK
#> 3 2013 1 10 450 500 US 1117 N171US EWR
#> 4 2013 1 10 520 525 UA 1018 N35204 EWR
#> 5 2013 1 10 530 530 UA 404 N815UA LGA
#> 6 2013 1 10 531 540 AA 1141 N5EAAA JFK
#> 7 2013 1 10 535 540 B6 725 N784JB JFK
#> 8 2013 1 10 546 600 B6 380 N337JB EWR
#> 9 2013 1 10 549 600 EV 6055 N19554 LGA
#> 10 2013 1 10 550 600 US 2114 N740UW LGA
#> # … with 11,217 more rows, and 2 more variables: dest <chr>,
#> # plane_type <chr>
# also here, the FK-relations are transferred to the new table
dm_draw(joined_flights_dm)
At each point you can retrieve the zoomed table by calling pull_tbl()
on a dm_zoomed
. To use our last example once more:
flights_dm %>%
dm_zoom_to(flights) %>%
select(-dep_delay:-arr_delay, -air_time:-time_hour) %>%
left_join(planes, select = c(tailnum, plane_type = type)) %>%
pull_tbl()
#> # A tibble: 11,227 x 11
#> year month day dep_time sched_dep_time carrier flight tailnum origin
#> <int> <int> <int> <int> <int> <chr> <int> <chr> <chr>
#> 1 2013 1 10 3 2359 B6 727 N571JB JFK
#> 2 2013 1 10 16 2359 B6 739 N564JB JFK
#> 3 2013 1 10 450 500 US 1117 N171US EWR
#> 4 2013 1 10 520 525 UA 1018 N35204 EWR
#> 5 2013 1 10 530 530 UA 404 N815UA LGA
#> 6 2013 1 10 531 540 AA 1141 N5EAAA JFK
#> 7 2013 1 10 535 540 B6 725 N784JB JFK
#> 8 2013 1 10 546 600 B6 380 N337JB EWR
#> 9 2013 1 10 549 600 EV 6055 N19554 LGA
#> 10 2013 1 10 550 600 US 2114 N740UW LGA
#> # … with 11,217 more rows, and 2 more variables: dest <chr>,
#> # plane_type <chr>
Currently not all of the {dplyr}-verbs have their own method for a
dm_zoomed
, so be aware that in some cases it will still be
necessary to resort to extracting one or more tables from a dm
and
reinserting a transformed version of theirs into the dm
eventually. The supported functions are: group_by()
, ungroup()
,
summarise()
, mutate()
, transmute()
, filter()
, select()
,
rename()
, distinct()
, arrange()
, slice()
, left_join()
,
inner_join()
, full_join()
, right_join()
, semi_join()
and
anti_join()
.
The same is true for {tidyr}-functions. Methods are provided for:
unite()
and separate()
.
There might be situations when you would like the key relations to
remain intact, but they are dropped nevertheless. This is because a
rigid logic is implemented, that does drop a key when its associated
column is acted upon with e.g. a mutate()
call. In these cases the
key relations will need to be established once more after finishing
with the manipulations.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.