knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
fansi::set_knit_hooks(knitr::knit_hooks)
options(crayon.enabled = TRUE, width = 75, cli.width = 75)

knit_print.grViz <- function(x, ...) {
  x %>%
    DiagrammeRsvg::export_svg() %>%
    c("`````{=html}\n", ., "\n`````\n") %>%
    knitr::asis_output()
}

The goal of both the package dm and the class dm - which comes along with it - is to make your life easier when you have to deal with data models the data therein.

Let's first have a look at the new class:

Class dm

The dm class contains a set of tables as well as information about their primary keys and the relations between the tables. You can have your dm object on different data sources, such as a local environment or a database. In the latter use case you need to have the package dbplyr installed.

dm is a wrapper around two classes, which are independent from each other:

  1. part of the object is a src-object from dplyr
  2. the other part is a class data_model object from the package datamodelr^[Use: devtools::install_github("https://github.com/bergant/datamodelr") to install it]

The src-object contains the information, where the tables of your data model live. And in the data_model object the meta-information is stored:

Last but not least, a third part of the object is a named list containing the tables on the src.

So much to the theory, let's see how to construct such an object and how it looks in R:

Examples of dm objects {#ex_dm}

Relevant functions for creating dm objects are:

  1. dm()
  2. as_dm()

We can use the tables from the famous package {nycflights13}:

library(dm)
library(dplyr)
flights_dm <- dm_from_src(src_df(pkg = "nycflights13"))
flights_dm

Here we make use of the fact, that the function dm(src, data_model = NULL) includes all available tables on a source in the dm object, should the data_model-argument be left NULL. Thus you can use this for example on a postgres database that you access via src_postgres() (with the appropriate arguments dbname, host, port, ...), to produce a dm object with all the tables on the database.

Another way of creating a dm object is calling as_dm() on a list of tbls:

trees_dm <- as_dm(list("trees1" = trees, "trees2" = trees))
trees_dm

Methods for dm objects

tbl(dm, "table_name") etc. (copy_to...) FIXME: complete

But how can we use dm-functions to manage primary keys of the tables in a dm object?

Primary keys of dm objects {#pk}

Useful functions for managing primary key settings are:

  1. dm_add_pk()
  2. dm_has_pk()
  3. dm_get_pk()
  4. dm_rm_pk()
  5. dm_enum_pk_candidates()

If you created a dm object according to the examples in "Examples of dm objects", your object does not yet have any primary keys set. So let's add one. By default dm_add_pk() checks if the column of the table given by the user is a unique key. Since the trees dataset does not have any unique one-column keys, we will use the nycflights13 tables, i.e. flights_dm.

dm_has_pk(flights_dm, airports)
flights_dm_with_key <- dm_add_pk(flights_dm, airports, faa)
flights_dm_with_key

Still looks exactly the same as before... let's check:

dm_has_pk(flights_dm_with_key, airports)

Get the name of the column that is marked as primary key of the table:

dm_get_pk(flights_dm_with_key, airports)

Remove a primary key by:

dm_rm_pk(flights_dm_with_key, airports) %>%
  dm_has_pk(airports)

If you still need to get to know your data better, and it is already available in the form of a dm object, you can use the function dm_enum_pk_candidates() in order to get information, which columns of a table are unique keys:

dm_enum_pk_candidates(flights_dm_with_key, airports)

Table flights does not have any one-column primary key candidates:

dm_enum_pk_candidates(flights_dm_with_key, flights) %>% count(candidate)

Foreign keys

Useful functions for managing foreign key relations are:

  1. dm_add_fk()
  2. dm_has_fk()
  3. dm_get_fk()
  4. dm_rm_fk()
  5. dm_enum_fk_candidates()

Now it gets (even more) interesting: we want to define relations between different tables. With the function dm_add_fk() you can define, which column of which table points to another table's column.

When dealing with data models, a foreign key from one table has to always point to a primary key of another table. The primary key of the referred table must be set with dm_add_pk(). dm_add_fk() will find the primary key column of the referenced table by itself and make the indicated column of the child table point to it.

flights_dm_with_key %>% dm_add_fk(flights, origin, airports)

This will throw an error:

flights_dm %>% dm_add_fk(flights, origin, airports)

Let's create a dm object with a foreign key relation to work with later on:

flights_dm_with_fk <- dm_add_fk(flights_dm_with_key, flights, origin, airports)

What if we try to add another foreign key relation from flights to airports to the object? Column dest might work, since it also contains airport codes:

flights_dm_with_fk %>% dm_add_fk(flights, dest, airports)

As you can see, behind the scenes, checks are executed automatically by the functions of dm to prevent steps that do not comply with the logic of data models.

Use dm_has_fk() for checking if a foreign key exists, which is pointing from one table to another:

flights_dm_with_fk %>% dm_has_fk(flights, planes)
flights_dm_with_fk %>% dm_has_fk(flights, airports)

If you want to access the name of the column, which acts as a foreign key of one table to another table's column, use dm_get_fk():

flights_dm_with_fk %>% dm_get_fk(flights, planes)
flights_dm_with_fk %>% dm_get_fk(flights, airports)

Remove foreign key relations with dm_rm_fk() (parameter column = NULL means removal of all relations):

flights_dm_with_fk %>%
  dm_rm_fk(table = flights, column = dest, ref_table = airports) %>%
  dm_get_fk(flights, airports)
flights_dm_with_fk %>%
  dm_rm_fk(flights, origin, airports) %>%
  dm_get_fk(flights, airports)
flights_dm_with_fk %>%
  dm_rm_fk(flights, NULL, airports) %>%
  dm_get_fk(flights, airports)

Another function for getting to know your data better (cf. dm_enum_pk_candidates() in "Primary keys of dm objects") is dm_enum_fk_candidates(). Use it to get an overview over foreign key candidates pointing from one table to another:

dm_enum_fk_candidates(flights_dm_with_key, weather, airports)

Visualising the data model

Relevant functions for visualising the data model:^[The functions in this section rely heavily on related functions of {datamodelr}.]

  1. dm_set_colors()
  2. dm_get_colors()
  3. dm_get_available_colors()
  4. dm_draw()

Once you have all primary keys set and all foreign key relations defined, it might be interesting to look at a graphical representation of your data model.

Let's first create a slightly more interconnected dm object from the {nycflights13} package tables:

flights_dm_w_many_keys <-
  flights_dm %>%
  dm_add_pk(planes, tailnum) %>%
  dm_add_pk(airports, faa) %>%
  dm_add_pk(airlines, carrier) %>%
  dm_add_fk(flights, origin, airports) %>%
  dm_add_fk(flights, carrier, airlines) %>%
  dm_add_fk(flights, tailnum, planes, check = FALSE) %>%
  dm_add_fk(flights, dest, airports, check = FALSE)
flights_dm_w_many_keys

You probably realized the option check = FALSE in the last two function calls when adding foreign keys. The check normally ensures, that the set of values of the column of the first table is a subset of the set of values of the primary key column of the referenced table. Since in our case this isn't the case, but we want complexity, we turned the check off. It is recommended not to do this, since you might run into errors or other unwanted effects, mainly when working on databases.

You can use colors to visually group your tables into families (which could reflect their logical grouping). Currently, there are only eight different colors available (plus the option of having borders displayed around the tables or not). To find out which colors exist, use dm_get_available_colors():

dm_get_available_colors()

The assignment is carried out with dm_set_colors(), using a syntax similar to switch(). Any table you don't specifically assign will be displayed in the default color.

Use dm_set_colors() to assign the respective colors. The result of dm_set_colors() is a dm object. The information about the color is stored in its data_model-part.

flights_dm_w_many_keys_and_colors <-
  flights_dm_w_many_keys %>%
  dm_set_colors(
    flights = "blue",
    airlines = ,
    airports = "orange",
    planes = "green_nb"
  )

Draw the schema with dm_draw().

dm_draw(flights_dm_w_many_keys_and_colors)

The colors can be queried with dm_get_colors().

dm_get_colors(flights_dm_w_many_keys_and_colors)

There are quite a few options for customization available. Please read the documentation to find out more. One argument of the function we'd like to draw your attention to, though, is table_names. This argument accepts a character vector with the table names which are to be printed, in case you would like to only see a subset of the whole data model.

dm_draw(flights_dm_w_many_keys_and_colors, table_names = c("flights", "airports", "planes"))

Joining tables

Filtering a dm object



krlmlr/dm documentation built on April 19, 2024, 5:23 p.m.