In this article we’re going show you how easy it is to move from
connecting to the database holding your data to producing the results
you need. It’s meant to be a quick and friendly introduction to {dm}, so
it is low on details and caveats. Links to detailed documentation are
provided at the end. (If your data is in data frames instead of a
database and you’re in a hurry, jump over to vignette("howto-dm-df")
.)
dm objects can be created from individual tables or loaded directly from a relational data model on an RDBMS (relational database management system).
For this demonstration we’re going work with a model hosted on a public server. The first thing we need is a connection to the RDBMS hosting the data.
library(RMariaDB)
fin_db <- dbConnect(
MariaDB(),
username = 'guest',
password = 'relational',
dbname = 'Financial_ijs',
host = 'relational.fel.cvut.cz'
)
We create a dm object from an RDBMS using dm_from_con()
, passing in
the connection object we just created as the first argument.
library(dm)
fin_dm <- dm_from_con(fin_db)
fin_dm
#> ── Table source ───────────────────────────────────────────────────────────
#> src: mysql [guest@relational.fel.cvut.cz:NA/Financial_ijs]
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `accounts`, `cards`, `clients`, `disps`, `districts`, … (9 total)
#> Columns: 57
#> Primary keys: 0
#> Foreign keys: 0
The dm object interrogates the RDBMS for table and column information and, where implemented, primary and foreign keys. Currently, primary and foreign keys are only available from Postgres and SQL Server.
The dm object can be accessed like a named list of tables:
names(fin_dm)
#> [1] "accounts" "cards" "clients" "disps" "districts" "loans"
#> [7] "orders" "tkeys" "trans"
fin_dm$loans
#> # Source: table<`Financial_ijs`.`loans`> [?? x 7]
#> # Database: mysql [guest@relational.fel.cvut.cz:NA/Financial_ijs]
#> id account_id date amount duration payments status
#> <int> <int> <date> <dbl> <int> <dbl> <chr>
#> 1 4959 2 1994-01-05 80952 24 3373 A
#> 2 4961 19 1996-04-29 30276 12 2523 B
#> 3 4962 25 1997-12-08 30276 12 2523 A
#> 4 4967 37 1998-10-14 318480 60 5308 D
#> 5 4968 38 1998-04-19 110736 48 2307 C
#> 6 4973 67 1996-05-02 165960 24 6915 A
#> 7 4986 97 1997-08-10 102876 12 8573 A
#> 8 4988 103 1997-12-06 265320 36 7370 D
#> 9 4989 105 1998-12-05 352704 48 7348 C
#> 10 4990 110 1997-09-08 162576 36 4516 C
#> # … with more rows
dplyr::count(fin_dm$trans)
#> # Source: lazy query [?? x 1]
#> # Database: mysql [guest@relational.fel.cvut.cz:NA/Financial_ijs]
#> n
#> <int64>
#> 1 1056320
At the same time, most dm
functions are pipe-friendly and support tidy
evaluation. We can use [
or the dm_select_tbl()
verb to derive a
smaller dm with the loans
, accounts
, districts
and trans
tables:
fin_dm_small <- fin_dm[c("loans", "accounts", "districts", "trans")]
fin_dm_small <-
fin_dm %>%
dm_select_tbl(loans, accounts, districts, trans)
In many cases, dm_from_con()
already returns a dm with all keys set.
If not, dm allows us to define primary and foreign keys ourselves.
In our data model, id
columns uniquely identify records in the
accounts
and loans
table, a primary key is added with dm_add_pk()
.
Each loan is linked to one account via the account_id
column in the
loans
table, the relationship is established with dm_add_fk()
.
fin_dm_keys <-
fin_dm_small %>%
dm_add_pk(accounts, id) %>%
dm_add_pk(loans, id) %>%
dm_add_fk(loans, account_id, accounts) %>%
dm_add_pk(trans, id) %>%
dm_add_fk(trans, account_id, accounts) %>%
dm_add_pk(districts, id) %>%
dm_add_fk(accounts, district_id, districts)
Having a diagram of the data model is the quickest way to verify we’re on the right track. We can display a visual summary of the dm at any time. The default is to display the table name, any defined keys and their links to other tables.
Visualizing the dm in its current state we see the keys we have created and how they link the tables together. Color guides the eye.
fin_dm_keys %>%
dm_set_colors(green = c(loans, accounts), darkblue = trans, grey = districts) %>%
dm_draw()
If we want to perform modeling or analysis on this relational model we
need to transform it into a tabular format that R functions can work
with. dm_squash_to_tbl()
will automatically follow foreign keys across
tables to gather all the available columns into a single table.
fin_dm_keys %>%
dm_squash_to_tbl(loans)
#> Renamed columns:
#> * date -> loans.date, accounts.date
#> # Source: lazy query [?? x 25]
#> # Database: mysql [guest@relational.fel.cvut.cz:NA/Financial_ijs]
#> id account_id loans.date amount duration payments status district_id
#> <int> <int> <date> <dbl> <int> <dbl> <chr> <int>
#> 1 4959 2 1994-01-05 80952 24 3373 A 1
#> 2 4961 19 1996-04-29 30276 12 2523 B 21
#> 3 4962 25 1997-12-08 30276 12 2523 A 68
#> 4 4967 37 1998-10-14 318480 60 5308 D 20
#> 5 4968 38 1998-04-19 110736 48 2307 C 19
#> 6 4973 67 1996-05-02 165960 24 6915 A 16
#> 7 4986 97 1997-08-10 102876 12 8573 A 74
#> 8 4988 103 1997-12-06 265320 36 7370 D 44
#> 9 4989 105 1998-12-05 352704 48 7348 C 21
#> 10 4990 110 1997-09-08 162576 36 4516 C 36
#> # … with more rows, and 17 more variables: frequency <chr>,
#> # accounts.date <date>, A2 <chr>, A3 <chr>, A4 <int>, A5 <int>,
#> # A6 <int>, A7 <int>, A8 <int>, A9 <int>, A10 <dbl>, A11 <int>,
#> # A12 <dbl>, A13 <dbl>, A14 <int>, A15 <int>, A16 <int>
Apart from the rows printed above, no data has been fetched from the
database. Use select()
to reduce the number of columns fetched, and
collect()
to retrieve the entire result for local processing.
loans_df <-
fin_dm_keys %>%
dm_squash_to_tbl(loans) %>%
select(id, amount, duration, A3) %>%
collect()
#> Renamed columns:
#> * date -> loans.date, accounts.date
model <- lm(amount ~ duration + A3, data = loans_df)
model
#>
#> Call:
#> lm(formula = amount ~ duration + A3, data = loans_df)
#>
#> Coefficients:
#> (Intercept) duration A3east Bohemia A3north Bohemia
#> 10196 4109 -16204 -28933
#> A3north Moravia A3Prague A3south Bohemia A3south Moravia
#> 1467 4044 -1896 -12463
#> A3west Bohemia
#> -28572
We don’t need to take the extra step of exporting the data to work with it. Through the dm object we have complete access to dplyr’s data manipulation verbs. These operate on the data within individual tables.
To work with a particular table we use dm_zoom_to()
to set the context
to our chosen table. Then we can perform any of the operations we want.
fin_dm_total <-
fin_dm_keys %>%
dm_zoom_to(loans) %>%
group_by(account_id) %>%
summarize(total_amount = sum(amount, na.rm = TRUE)) %>%
ungroup() %>%
dm_insert_zoomed("total_loans")
fin_dm_total$total_loans
#> # Source: lazy query [?? x 2]
#> # Database: mysql [guest@relational.fel.cvut.cz:NA/Financial_ijs]
#> account_id total_amount
#> <int> <dbl>
#> 1 2 80952
#> 2 19 30276
#> 3 25 30276
#> 4 37 318480
#> 5 38 110736
#> 6 67 165960
#> 7 97 102876
#> 8 103 265320
#> 9 105 352704
#> 10 110 162576
#> # … with more rows
Note that in the above example we use dm_insert_zoomed()
to add the
results as a new table to our data model. This table is temporary and
will be deleted when our session ends. If you want to make permanent
changes to your data model on an RDBMS please see the “Persisting
results” section in vignette("howto-dm-db")
.
It’s always smart to check that your data model follows its specifications. When building our own model or changing existing models by adding tables or keys, it is even more important that the new model is validated.
dm_examine_constrains()
checks all primary and foreign keys and
reports if they violate their expected constraints.
fin_dm_total %>%
dm_examine_constraints()
#> ℹ All constraints satisfied.
For more on constraint checking, including cardinality, finding
candidate columns for keys, and normalization, see
vignette("tech-dm-low-level")
.
Now that you have been introduced to the basic operation of dm, the next step is to learn more about the dm methods that your particular use case requires.
Is your data in an RDBMS? Then move on to vignette("howto-dm-db")
for
a more detailed looking at working with an existing relational data
model.
If your data is in data frames, then you want to read
vignette("howto-dm-df")
next.
If you would like to know more about relational data models in order
to get the most out of dm, check out vignette("howto-dm-theory")
.
If you’re familiar with relational data models but want to know how to
work with them in dm, then any of vignette("tech-dm-join")
,
vignette("tech-dm-filter")
, or vignette("tech-dm-zoom")
is a good
next step.
vignette("howto-dm-db")
-- This article covers accessing and working
with RDBMSs within your R session, including manipulating data, filling
in missing relationships between tables, getting data out of the RDBMS
and into your model, and deploying your data model to an RDBMS.
vignette("howto-dm-df")
-- Is your data in local data frames? This
article covers creating a data model from your local data frames,
including building the relationships in your data model, verifying your
model, and leveraging the power of dplyr to operate on your data model.
vignette("howto-dm-theory")
-- Do you know all about data frames but
very little about relational data models? This quick introduction will
walk you through the key similarities and differences, and show you how
to move from individual data frames to a relational data model.
vignette("tech-dm-join")
-- Joining is how separate, linked tables are
virtually combined in order to perform operations, generate summaries,
or prepare data for extraction. This article covers dm’s methods for
creating the different kind of joins along with usage examples.
vignette("tech-dm-zoom")
-- Learn how to manipulate the data in the
tables in your dm using dplyr verbs.
vignette("tech-dm-filter")
-- Need a subset of your data across tables?
This introduction to dm’s filtering methods will show you how to apply
them to data that is separated into multiple tables.
vignette("tech-dm-draw")
-- A quick guide to dm’s methods for drawing
your data model for exploration, verification or documentation purposes.
Learn how to customize the diagrams.
vignette("tech-dm-low-level")
-- If your data model needs to be rock
solid, this article covers the dm methods for checking every aspect of
the model. It covers key constraints, cardinalities between tables, and
methods for normalization as you construct your data model.
vignette("tech-dm-class")
-- A detailed overview of the "dm"
class.
vignette("tech-dm-naming")
-- Making sense of dm’s API.
vignette("tech-dm-cdm")
-- If you’re an early adopter, used dm version
0.0.5 or lower and want to migrate your code, this is for you.
The {dm} package follows the tidyverse principles:
dm
objects are immutable (your data will never be overwritten in
place)dm
objects are pipeable (i.e., return new
dm
or table objects)The {dm} package builds heavily upon the {datamodelr} package, and upon the tidyverse. We’re looking forward to a good collaboration!
The {polyply} package has a similar intent with a slightly different interface.
The {data.cube} package has
quite the same intent using array
-like interface.
Articles in the {rquery} package discuss join controllers and join dependency sorting, with the intent to move the declaration of table relationships from code to data.
The {tidygraph} package stores
a network as two related tables of nodes
and edges
, compatible with
{dplyr} workflows.
In object-oriented programming languages, object-relational mapping is a similar concept that attempts to map a set of related tables to a class hierarchy.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.