dm allows you to create your own relational data models from local data frames. Once your data model is complete, you can deploy it to a range of DBMSs using dm.
The example data set that we will be using is available through the
nycflights13
package. The
five tables that we are working with contain information about all
flights that departed from the airports of New York to other
destinations in the United States in 2013:
flights
represents the trips taken by planesairlines
includesname
)carrier
)airports
indicates the ports of departure (origin
) and of
destination (dest
)weather
contains meteorological information at each hourplanes
describes characteristics of the aircraftOnce we’ve loaded {nycflights13}, the aforementioned tables are all in our work environment, ready to be accessed.
library(nycflights13)
airports
#> # A tibble: 1,458 x 8
#> faa name lat lon alt tz dst tzone
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
#> 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/New_…
#> 2 06A Moton Field Municipa… 32.5 -85.7 264 -6 A America/Chic…
#> 3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/Chic…
#> 4 06N Randall Airport 41.4 -74.4 523 -5 A America/New_…
#> 5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America/New_…
#> 6 0A9 Elizabethton Municip… 36.4 -82.2 1593 -5 A America/New_…
#> 7 0G6 Williams County Airp… 41.5 -84.5 730 -5 A America/New_…
#> 8 0G7 Finger Lakes Regiona… 42.9 -76.8 492 -5 A America/New_…
#> 9 0P2 Shoestring Aviation … 39.8 -76.6 1000 -5 U America/New_…
#> 10 0S9 Jefferson County Intl 48.1 -123. 108 -8 A America/Los_…
#> # … with 1,448 more rows
Your own data will probably not be available as an R package. Whatever
format it is in, you will need to be able to load it as data frames into
your R session. If the data is too large, consider using dm to connect
to the database instead. See vignette("howto-dm-db")
for details on
using dm with databases.
Our first step will be to tell dm
which tables we want to work with
and how they are connected. For that we can use dm()
, passing in the
table names as arguments.
library(dm)
flights_dm_no_keys <- dm(airlines, airports, flights, planes, weather)
flights_dm_no_keys
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
#> Columns: 53
#> Primary keys: 0
#> Foreign keys: 0
The as_dm()
function is an alternative that works if you already have
a list of tables.
dm objects behave like lists with a user- and console-friendly print format. In fact, using a dm as a nicer list for organizing your data frames in your environment is an easy first step towards using dm and its data modeling functionality.
Member referencing, by subscript and by name, and list and slice manipulation functions work just as you would expect on a dm object.
names(flights_dm_no_keys)
#> [1] "airlines" "airports" "flights" "planes" "weather"
flights_dm_no_keys$airports
#> # A tibble: 1,458 x 8
#> faa name lat lon alt tz dst tzone
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
#> 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/New_…
#> 2 06A Moton Field Municipa… 32.5 -85.7 264 -6 A America/Chic…
#> 3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/Chic…
#> 4 06N Randall Airport 41.4 -74.4 523 -5 A America/New_…
#> 5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America/New_…
#> 6 0A9 Elizabethton Municip… 36.4 -82.2 1593 -5 A America/New_…
#> 7 0G6 Williams County Airp… 41.5 -84.5 730 -5 A America/New_…
#> 8 0G7 Finger Lakes Regiona… 42.9 -76.8 492 -5 A America/New_…
#> 9 0P2 Shoestring Aviation … 39.8 -76.6 1000 -5 U America/New_…
#> 10 0S9 Jefferson County Intl 48.1 -123. 108 -8 A America/Los_…
#> # … with 1,448 more rows
flights_dm_no_keys[c("airports", "flights")]
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airports`, `flights`
#> Columns: 27
#> Primary keys: 0
#> Foreign keys: 0
Even though we now have a dm
object that contains all our data, we
have not specified how our five tables are connected. To do this we need
to define primary keys and foreign keys on the tables.
Primary keys and foreign keys are how relational database tables are
linked with each other. A primary key is a column or column tuple that
has a unique value for each row within a table. A foreign key is a
column or column tuple containing the primary key for a row in another
table. Foreign keys act as cross references between tables. They specify
the relationships that gives us the relational database. For more
information on keys and a crash course on databases, see
vignette("howto-dm-theory")
.
dm
offers dm_enum_pk_candidates()
to identify viable primary keys,
and dm_add_pk()
to add them.
dm_enum_pk_candidates(
dm = flights_dm_no_keys,
table = planes
)
#> # A tibble: 9 x 3
#> columns candidate why
#> <keys> <lgl> <chr>
#> 1 tailnum TRUE ""
#> 2 year FALSE "has duplicate values: 2001 (284), 2000 (244), 200…
#> 3 type FALSE "has duplicate values: Fixed wing multi engine (32…
#> 4 manufactur… FALSE "has duplicate values: BOEING (1630), AIRBUS INDUS…
#> 5 model FALSE "has duplicate values: 737-7H4 (361), A320-232 (25…
#> 6 engines FALSE "has duplicate values: 2 (3288), 1 (27), 4 (4), 3 …
#> 7 seats FALSE "has duplicate values: 149 (452), 140 (411), 55 (3…
#> 8 speed FALSE "has 3299 missing values, and duplicate values: 43…
#> 9 engine FALSE "has duplicate values: Turbo-fan (2750), Turbo-jet…
Now, we add the primary keys that we have identified:
flights_dm_only_pks <-
flights_dm_no_keys %>%
dm_add_pk(table = airlines, columns = carrier) %>%
dm_add_pk(airports, faa) %>%
dm_add_pk(planes, tailnum)
flights_dm_only_pks
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
#> Columns: 53
#> Primary keys: 3
#> Foreign keys: 0
Note that we demonstrate both named and positional arguments above.
To define how our tables are related, we use dm_add_fk()
to add
foreign keys. In calling dm_add_fk()
the table
argument is the table
that needs a foreign key to link it to a second table. ref_table
is
the table to be linked to and it needs a primary key already defined for
it.
dm_enum_fk_candidates(
dm = flights_dm_only_pks,
table = flights,
ref_table = airlines
)
#> # A tibble: 19 x 3
#> columns candidate why
#> <keys> <lgl> <chr>
#> 1 carrier TRUE ""
#> 2 year FALSE "Can't combine `value1` <integer> and `value1` <…
#> 3 month FALSE "Can't combine `value1` <integer> and `value1` <…
#> 4 day FALSE "Can't combine `value1` <integer> and `value1` <…
#> 5 dep_time FALSE "Can't combine `value1` <integer> and `value1` <…
#> 6 sched_dep_t… FALSE "Can't combine `value1` <integer> and `value1` <…
#> 7 dep_delay FALSE "Can't combine `value1` <double> and `value1` <c…
#> 8 arr_time FALSE "Can't combine `value1` <integer> and `value1` <…
#> 9 sched_arr_t… FALSE "Can't combine `value1` <integer> and `value1` <…
#> 10 arr_delay FALSE "Can't combine `value1` <double> and `value1` <c…
#> 11 flight FALSE "Can't combine `value1` <integer> and `value1` <…
#> 12 tailnum FALSE "values of `flights$tailnum` not in `airlines$ca…
#> 13 origin FALSE "values of `flights$origin` not in `airlines$car…
#> 14 dest FALSE "values of `flights$dest` not in `airlines$carri…
#> 15 air_time FALSE "Can't combine `value1` <double> and `value1` <c…
#> 16 distance FALSE "Can't combine `value1` <double> and `value1` <c…
#> 17 hour FALSE "Can't combine `value1` <double> and `value1` <c…
#> 18 minute FALSE "Can't combine `value1` <double> and `value1` <c…
#> 19 time_hour FALSE "Can't combine `value1` <datetime<America/New_Yo…
Having chosen a column from the successful candidates provided by
dm_enum_fk_candidates()
, we use the dm_add_fk()
function to
establish the foreign key linking the tables. In the second call to
dm_add_fk()
we complete the process for the flights
and airlines
tables that we started above. The carrier
column in the airlines
table will be added as the foreign key in flights
.
flights_dm_all_keys <-
flights_dm_only_pks %>%
dm_add_fk(table = flights, columns = tailnum, ref_table = planes) %>%
dm_add_fk(flights, carrier, airlines) %>%
dm_add_fk(flights, origin, airports)
flights_dm_all_keys
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
#> Columns: 53
#> Primary keys: 3
#> Foreign keys: 3
Having created the required primary and foreign keys to link all the tables together, we now have a relational data model we can work with.
Visualizing a data model is a quick and easy way to verify that we have
successfully created the model we were aiming for. We can use
dm_draw()
at any stage of the process to generate a visual
representation of the tables and the links between them:
flights_dm_no_keys %>%
dm_draw(rankdir = "TB", view_type = "all")
flights_dm_no_keys %>%
dm_add_pk(airlines, carrier) %>%
dm_draw()
flights_dm_only_pks %>%
dm_add_fk(flights, tailnum, planes) %>%
dm_draw()
flights_dm_all_keys %>%
dm_draw()
As well as checking our data model visually, dm can examine the constraints we have created by the addition of keys and verify that they are sensible.
flights_dm_no_keys %>%
dm_examine_constraints()
#> ℹ No constraints defined.
flights_dm_only_pks %>%
dm_examine_constraints()
#> ℹ All constraints satisfied.
flights_dm_all_keys %>%
dm_examine_constraints()
#> ! Unsatisfied constraints:
#> ● Table `flights`: foreign key tailnum into table `planes`: values of `flights$tailnum` not in `planes$tailnum`: N725MQ (575), N722MQ (513), N723MQ (507), N713MQ (483), N735MQ (396), …
The results are presented in a human-readable form, and available as a tibble for programmatic inspection.
Helper functions are available to access details on keys and check results.
dm_get_all_pks()
returns a data frame with our primary keys:
flights_dm_only_pks %>%
dm_get_all_pks()
#> # A tibble: 3 x 2
#> table pk_col
#> <chr> <keys>
#> 1 airlines carrier
#> 2 airports faa
#> 3 planes tailnum
A data frame of foreign keys is retrieved with dm_get_all_fks()
:
flights_dm_all_keys %>%
dm_get_all_pks()
#> # A tibble: 3 x 2
#> table pk_col
#> <chr> <keys>
#> 1 airlines carrier
#> 2 airports faa
#> 3 planes tailnum
We can use tibble::as_tibble()
on the result of
dm_examine_constraints()
to programmatically inspect which constraints
are not satisfied:
flights_dm_all_keys %>%
dm_examine_constraints() %>%
tibble::as_tibble()
#> # A tibble: 6 x 6
#> table kind columns ref_table is_key problem
#> <chr> <chr> <keys> <chr> <lgl> <chr>
#> 1 flights FK tailnum planes FALSE "values of `flights$tailnum` not …
#> 2 airlin… PK carrier <NA> TRUE ""
#> 3 airpor… PK faa <NA> TRUE ""
#> 4 planes PK tailnum <NA> TRUE ""
#> 5 flights FK carrier airlines TRUE ""
#> 6 flights FK origin airports TRUE ""
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.