The goal of the {dm} package and the dm
class that comes with it, is
to make your life easier when you are dealing with data from several
different tables.
Let’s take a look at the dm
class.
dm
The dm
class consists of a collection of tables and metadata about the
tables, such as
All tables in a dm
must be obtained from the same data source; csv
files and spreadsheets would need to be imported to data frames in R.
dm
objectsThere are currently three options available for creating a dm
object.
The relevant functions for creating dm
objects are:
dm()
as_dm()
new_dm()
dm_from_con()
To illustrate these options, we will now create the same dm
in several
different ways. We can use the tables from the well-known {nycflights13}
package.
Create a dm
object directly by providing data frames to dm()
:
library(nycflights13)
library(dm)
dm(airlines, airports, flights, planes, weather)
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
#> Columns: 53
#> Primary keys: 0
#> Foreign keys: 0
dm
Start with an empty dm
object that has been created with dm()
or
new_dm()
, and add tables to that object:
library(nycflights13)
library(dm)
empty_dm <- dm()
empty_dm
#> dm()
dm(empty_dm, airlines, airports, flights, planes, weather)
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
#> Columns: 53
#> Primary keys: 0
#> Foreign keys: 0
Turn a named list of tables into a dm
with as_dm()
:
as_dm(list(airlines = airlines,
airports = airports,
flights = flights,
planes = planes,
weather = weather))
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
#> Columns: 53
#> Primary keys: 0
#> Foreign keys: 0
src
into a dm
Squeeze all (or a subset of) tables belonging to a src
object into a
dm
using dm_from_con()
:
sqlite_con <- dbplyr::remote_con(dbplyr::nycflights13_sqlite())
flights_dm <- dm_from_con(sqlite_con)
flights_dm
#> ── Table source ───────────────────────────────────────────────────────────
#> src: sqlite 3.35.5 [/tmp/RtmpGalley/nycflights13.sqlite]
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `sqlite_stat1`, … (7 total)
#> Columns: 62
#> Primary keys: 0
#> Foreign keys: 0
The function dm_from_con(con, table_names = NULL)
includes all
available tables on a source in the dm
object. This means that you can
use this, for example, on a Postgres database that you access via
DBI::dbConnect(RPostgres::Postgres())
(with the appropriate arguments dbname
, host
,
port
, …), to produce a dm
object with all the tables on the
database.
Another way of creating a dm
object is calling new_dm()
on a list of
tbl
objects:
base_dm <- new_dm(list(trees = trees, mtcars = mtcars))
base_dm
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `trees`, `mtcars`
#> Columns: 14
#> Primary keys: 0
#> Foreign keys: 0
This constructor is optimized for speed and does not perform integrity
checks. Use with caution, validate using dm_validate()
if necessary.
dm_validate(base_dm)
We can get the list of tables with dm_get_tables()
and the src
object with dm_get_src()
.
In order to pull a specific table from a dm
, use:
tbl(flights_dm, "airports")
#> # Source: table<`airports`> [?? x 8]
#> # Database: sqlite 3.35.5 [/tmp/RtmpGalley/nycflights13.sqlite]
#> faa name lat lon alt tz dst tzone
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
#> 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/New_…
#> 2 06A Moton Field Municipa… 32.5 -85.7 264 -6 A America/Chic…
#> 3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/Chic…
#> 4 06N Randall Airport 41.4 -74.4 523 -5 A America/New_…
#> 5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America/New_…
#> 6 0A9 Elizabethton Municip… 36.4 -82.2 1593 -5 A America/New_…
#> 7 0G6 Williams County Airp… 41.5 -84.5 730 -5 A America/New_…
#> 8 0G7 Finger Lakes Regiona… 42.9 -76.8 492 -5 A America/New_…
#> 9 0P2 Shoestring Aviation … 39.8 -76.6 1000 -5 U America/New_…
#> 10 0S9 Jefferson County Intl 48.1 -123. 108 -8 A America/Los_…
#> # … with more rows
But how can we use {dm}-functions to manage the primary keys of the
tables in a dm
object?
dm
objectsSome useful functions for managing primary key settings are:
dm_add_pk()
dm_has_pk()
dm_get_pk()
dm_rm_pk()
dm_enum_pk_candidates()
dm_get_all_pks()
If you created a dm
object according to the examples in “Examples of
dm
objects”, your object does not yet have any primary keys
set. So let’s add one.
dm_add_pk()
has an option to check if the column of the table given by
the user is a unique key; for performance reasons, the check will not be
executed unless requested. We use the nycflights13
tables,
i.e. flights_dm
from above.
dm_has_pk(flights_dm, airports)
#> [1] FALSE
flights_dm_with_key <- dm_add_pk(flights_dm, airports, faa)
flights_dm_with_key
#> ── Table source ───────────────────────────────────────────────────────────
#> src: sqlite 3.35.5 [/tmp/RtmpGalley/nycflights13.sqlite]
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `sqlite_stat1`, … (7 total)
#> Columns: 62
#> Primary keys: 1
#> Foreign keys: 0
The dm
now has a primary key. Let’s check:
dm_has_pk(flights_dm_with_key, airports)
#> [1] TRUE
Get the name of the column that is marked as primary key of the table:
dm_get_pk(flights_dm_with_key, airports)
#> <list_of<character>[1]>
#> [[1]]
#> [1] "faa"
Remove a primary key:
dm_rm_pk(flights_dm_with_key, airports) %>%
dm_has_pk(airports)
#> [1] FALSE
If you still need to get to know your data better, and it is already
available in the form of a dm
object, you can use the
dm_enum_pk_candidates()
function in order to get information about
which columns of the table are unique keys:
dm_enum_pk_candidates(flights_dm_with_key, airports)
#> # A tibble: 8 x 3
#> columns candidate why
#> <keys> <lgl> <chr>
#> 1 faa TRUE ""
#> 2 lon TRUE ""
#> 3 name FALSE "has duplicate values: Municipal Airport (5), All Airp…
#> 4 lat FALSE "has duplicate values: 38.88944 (2), 40.63975 (2)"
#> 5 alt FALSE "has duplicate values: 0 (51), 13 (13), 14 (12), 15 (1…
#> 6 tz FALSE "has duplicate values: -5 (521), -6 (342), -9 (240), -…
#> 7 dst FALSE "has duplicate values: A (1388), U (47), N (23)"
#> 8 tzone FALSE "has duplicate values: America/New_York (519), America…
The flights
table does not have any one-column primary key candidates:
dm_enum_pk_candidates(flights_dm_with_key, flights) %>% dplyr::count(candidate)
#> # A tibble: 1 x 2
#> candidate n
#> <lgl> <int>
#> 1 FALSE 19
To get an overview over all tables with primary keys, use
dm_get_all_pks()
:
dm_get_all_pks(dm_nycflights13(cycle = TRUE))
#> # A tibble: 4 x 2
#> table pk_col
#> <chr> <keys>
#> 1 airlines carrier
#> 2 airports faa
#> 3 planes tailnum
#> 4 weather origin, time_hour
Here we used the prepared dm
object dm_nycflights13(cycle = TRUE)
as
an example. This object already has all keys pre-set.
Useful functions for managing foreign key relations include:
dm_add_fk()
dm_has_fk()
dm_get_fk()
dm_rm_fk()
dm_enum_fk_candidates()
dm_get_all_fks()
Now it gets (even more) interesting: we want to define relations between
different tables. With the dm_add_fk()
function you can define which
column of which table points to another table’s column.
This is done by choosing a foreign key from one table that will point to
a primary key of another table. The primary key of the referred table
must be set with dm_add_pk()
. dm_add_fk()
will find the primary key
column of the referenced table by itself and make the indicated column
of the child table point to it.
flights_dm_with_key %>% dm_add_fk(flights, origin, airports)
#> ── Table source ───────────────────────────────────────────────────────────
#> src: sqlite 3.35.5 [/tmp/RtmpGalley/nycflights13.sqlite]
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `sqlite_stat1`, … (7 total)
#> Columns: 62
#> Primary keys: 1
#> Foreign keys: 1
This will throw an error:
flights_dm %>% dm_add_fk(flights, origin, airports)
#> Error: ref_table `airports` needs a primary key first. Use `dm_enum_pk_candidates()` to find appropriate columns and `dm_add_pk()` to define a primary key.
Let’s create a dm
object with a foreign key relation to work with
later on:
flights_dm_with_fk <- dm_add_fk(flights_dm_with_key, flights, origin, airports)
What if we tried to add another foreign key relation from flights
to
airports
to the object? Column dest
might work, since it also
contains airport codes:
flights_dm_with_fk %>% dm_add_fk(flights, dest, airports, check = TRUE)
#> Error: Column `dest` of table `flights` contains values (see examples above) that are not present in column `faa` of table `airports`.
As you can see, behind the scenes, checks are executed automatically
(unless check = FALSE
) by the functions of dm
to prevent steps that
would result in inconsistent representations.
Use dm_has_fk()
for checking if a foreign key exists that is pointing
from one table to another:
flights_dm_with_fk %>% dm_has_fk(flights, planes)
#> [1] FALSE
flights_dm_with_fk %>% dm_has_fk(flights, airports)
#> [1] TRUE
If you want to access the name of the column which acts as a foreign key
of one table to another table’s column, use dm_get_fk()
:
flights_dm_with_fk %>% dm_get_fk(flights, planes)
#> <list_of<character>[0]>
flights_dm_with_fk %>% dm_get_fk(flights, airports)
#> <list_of<character>[1]>
#> [[1]]
#> [1] "origin"
Remove foreign key relations with dm_rm_fk()
(parameter
column = NULL
means that all relations will be removed):
flights_dm_with_fk %>%
dm_rm_fk(table = flights, column = dest, ref_table = airports) %>%
dm_get_fk(flights, airports)
#> Error: (`dest`) is not a foreign key of table `flights` into table `airports`.
flights_dm_with_fk %>%
dm_rm_fk(flights, origin, airports) %>%
dm_get_fk(flights, airports)
#> <list_of<character>[0]>
flights_dm_with_fk %>%
dm_rm_fk(flights, NULL, airports) %>%
dm_get_fk(flights, airports)
#> <list_of<character>[0]>
Since the primary keys are defined in the dm
object, you do not need
to provide the referenced column name of ref_table
. This is always the
primary key column of the table.
Another function for getting to know your data better
(cf. dm_enum_pk_candidates()
in “Primary keys of dm
objects”)
is dm_enum_fk_candidates()
. Use it to get an overview over foreign key
candidates that point from one table to another:
dm_enum_fk_candidates(flights_dm_with_key, weather, airports)
#> # A tibble: 15 x 3
#> columns candidate why
#> <keys> <lgl> <chr>
#> 1 origin TRUE ""
#> 2 year FALSE "values of `weather$year` not in `airports$faa`: 2…
#> 3 month FALSE "values of `weather$month` not in `airports$faa`: …
#> 4 day FALSE "values of `weather$day` not in `airports$faa`: 3 …
#> 5 hour FALSE "values of `weather$hour` not in `airports$faa`: 1…
#> 6 temp FALSE "values of `weather$temp` not in `airports$faa`: 3…
#> 7 dewp FALSE "values of `weather$dewp` not in `airports$faa`: 2…
#> 8 humid FALSE "values of `weather$humid` not in `airports$faa`: …
#> 9 wind_dir FALSE "values of `weather$wind_dir` not in `airports$faa…
#> 10 wind_speed FALSE "values of `weather$wind_speed` not in `airports$f…
#> 11 wind_gust FALSE "values of `weather$wind_gust` not in `airports$fa…
#> 12 precip FALSE "values of `weather$precip` not in `airports$faa`:…
#> 13 pressure FALSE "values of `weather$pressure` not in `airports$faa…
#> 14 visib FALSE "values of `weather$visib` not in `airports$faa`: …
#> 15 time_hour FALSE "values of `weather$time_hour` not in `airports$fa…
Get an overview of all foreign key relations withdm_get_all_fks()
:
dm_get_all_fks(dm_nycflights13(cycle = TRUE))
#> # A tibble: 5 x 4
#> child_table child_fk_cols parent_table parent_key_cols
#> <chr> <keys> <chr> <keys>
#> 1 flights carrier airlines carrier
#> 2 flights origin airports faa
#> 3 flights dest airports faa
#> 4 flights tailnum planes tailnum
#> 5 flights origin, time_hour weather origin, time_hour
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.