In this tutorial we introduce {dm} methods and techniques for copying
individual tables and entire relational data models into an RDBMS. This
is an integral part of the {dm} workflow. Copying tables to an RDBMS is
often a step in the process of building a relational data model from
locally hosted data. If your data model is complete, copying it to an
RDBMS in a single operation allows you to leverage the power of the
database and make it accessible to others. For modifying and persisting
changes to your data at the row level, see vignette("howto-dm-rows").
Using {dm} you can persist an entire relational data model with a single
function call: copy_dm_to() will move your entire model into a
destination RDBMS. This may be all you need to deploy a new model. You
may also want to add new tables to an existing model on an RDBMS. These
requirements can be handled using the compute() and copy_to() methods.
Calling compute() or copy_to() requires write permission on the RDBMS;
otherwise an error is returned. Therefore, for the following examples we
will instantiate a test dm and move it into a local SQLite database with
full permissions. {dm} and {dbplyr} are designed so that there is no
difference between the code used to manipulate a local SQLite database
and the code used for a remote RDBMS. The steps for this were already
introduced in vignette("howto-dm-db") and will be discussed in more
detail in the Copying a relational model section.
library(dm)
library(tidyverse)
#> ── Attaching packages ────────────────────────────────── tidyverse 1.3.1 ──
#> ✔ ggplot2 3.3.3 ✔ purrr 0.3.4
#> ✔ tibble 3.1.1 ✔ dplyr 1.0.5
#> ✔ tidyr 1.1.3 ✔ stringr 1.4.0
#> ✔ readr 1.4.0 ✔ forcats 0.5.1
#> ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
#> ✖ purrr::%@%() masks dm::%@%()
#> ✖ purrr::as_function() masks dm::as_function()
#> ✖ tidyr::extract() masks dm::extract()
#> ✖ dplyr::filter() masks dm::filter(), stats::filter()
#> ✖ purrr::flatten() masks dm::flatten()
#> ✖ purrr::flatten_chr() masks dm::flatten_chr()
#> ✖ purrr::flatten_dbl() masks dm::flatten_dbl()
#> ✖ purrr::flatten_int() masks dm::flatten_int()
#> ✖ purrr::flatten_lgl() masks dm::flatten_lgl()
#> ✖ purrr::flatten_raw() masks dm::flatten_raw()
#> ✖ purrr::invoke() masks dm::invoke()
#> ✖ purrr::is_null() masks dm::is_null(), testthat::is_null()
#> ✖ dplyr::lag() masks dm::lag(), stats::lag()
#> ✖ purrr::list_along() masks dm::list_along()
#> ✖ dplyr::matches() masks tidyr::matches(), dm::matches(), testthat::matches()
#> ✖ purrr::modify() masks dm::modify()
#> ✖ purrr::prepend() masks dm::prepend()
#> ✖ purrr::splice() masks dm::splice()
library(dbplyr)
#>
#> Attaching package: 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#>
#> ident, sql
#> The following objects are masked from 'package:dm':
#>
#> ident, sql
fin_dm <-
dm_financial() %>%
dm_select_tbl(-trans) %>%
collect()
local_db <- DBI::dbConnect(RSQLite::SQLite())
deployed_dm <- copy_dm_to(local_db, fin_dm, temporary = FALSE)
As part of your data analysis you may combine tables from multiple
sources and create links to existing tables via foreign keys, or create
new tables holding data summaries. The example below, already discussed
in vignette("howto-dm-db"), computes the total amount of all loans for
each account.
my_dm_total <-
deployed_dm %>%
dm_zoom_to(loans) %>%
group_by(account_id) %>%
summarize(total_amount = sum(amount, na.rm = TRUE)) %>%
ungroup() %>%
dm_insert_zoomed("total_loans")
The derived table total_loans is a lazy table powered by the {dbplyr}
package: the results are not materialized; instead, an SQL query is
built and executed each time the data is requested.
my_dm_total$total_loans %>%
sql_render()
#> <SQL> SELECT `account_id`, SUM(`amount`) AS `total_amount`
#> FROM `loans`
#> GROUP BY `account_id`
To avoid recomputing the query every time you use total_loans, call
compute() right before inserting the derived table with
dm_insert_zoomed(). compute() forces the computation of a query and
stores the full results in a table on the RDBMS.
my_dm_total_computed <-
deployed_dm %>%
dm_zoom_to(loans) %>%
group_by(account_id) %>%
summarize(total_amount = sum(amount, na.rm = TRUE)) %>%
ungroup() %>%
compute() %>%
dm_insert_zoomed("total_loans")
my_dm_total_computed$total_loans %>%
sql_render()
#> <SQL> SELECT *
#> FROM `dbplyr_001`
Note the differences in the queries returned by sql_render().
my_dm_total$total_loans is still lazily evaluated: the full query
constructed from the chain of operations that generated it is still in
place and must run each time the table is accessed. Contrast that with
my_dm_total_computed$total_loans, where the query has been realized and
accessing its rows requires a simple SELECT * statement. The table name,
dbplyr_001, was automatically generated because the name argument was
not supplied.
The default is to create a temporary table. If you want results to
persist across sessions in permanent tables, compute() must be called
with the argument temporary = FALSE and a table name for the name
argument. See ?compute for more details.
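A persistent variant of the earlier pipeline could look like the
following sketch. The table name "total_loans_persisted" is an
illustrative choice, not one used elsewhere in this tutorial; it assumes
the name and temporary arguments are passed through to the underlying
compute() method as documented in ?compute.

```r
# Sketch: persist the derived table under a chosen name so it
# survives the current session. "total_loans_persisted" is an
# illustrative table name.
my_dm_total_persisted <-
  deployed_dm %>%
  dm_zoom_to(loans) %>%
  group_by(account_id) %>%
  summarize(total_amount = sum(amount, na.rm = TRUE)) %>%
  ungroup() %>%
  compute(name = "total_loans_persisted", temporary = FALSE) %>%
  dm_insert_zoomed("total_loans")
```

With this, sql_render() on the inserted table should show a plain
SELECT against the named permanent table rather than the full derived
query.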
When called on a whole dm object (without zoom), compute()
materializes all tables into new (temporary or persistent) tables by
executing the associated SQL queries and storing the full results.
Depending on the size of your data, this may take considerable time or
be infeasible. It can occasionally be useful for creating snapshots of
data that is subject to change.
my_dm_total_snapshot <-
my_dm_total %>%
compute()
If you need to add local data frames to an existing dm object, use the
copy_to() method. It takes the same arguments as copy_dm_to(), except
that the second argument takes a data frame rather than a dm. The
result is a derived dm object that contains the new table.
To demonstrate the use of copy_to(), the example below will use {dm} to
pull consolidated data from several tables out of an RDBMS, estimate a
linear model from the data, then insert the residuals back into the
RDBMS and link them to the existing tables. This is all done with a
local SQLite database, but the process would work unchanged on any
supported RDBMS.
loans_df <-
deployed_dm %>%
dm_squash_to_tbl(loans) %>%
select(id, amount, duration, A3) %>%
collect()
#> Renamed columns:
#> * date -> loans.date, accounts.date
Please note the use of dm_squash_to_tbl(). This method gathers all
linked information into a single wide table. It follows foreign key
relations starting from the table supplied as its argument and gathers
all the columns from related tables, disambiguating column names as it
goes.
In the above code, the select() statement isolates the columns we need
for our model. collect() works similarly to compute() by forcing the
execution of the underlying SQL query, but it returns the results as a
local tibble.
Below, the local tibble loans_df is used to estimate the linear model,
and the residuals are stored along with the original associated id in a
new tibble, loans_residuals. The id column is necessary to link the new
tibble to the tables in the dm it was collected from.
model <- lm(amount ~ duration + A3, data = loans_df)
loans_residuals <- tibble::tibble(
id = loans_df$id,
resid = unname(residuals(model))
)
loans_residuals
#> # A tibble: 682 x 2
#> id resid
#> <int> <dbl>
#> 1 4959 -31912.
#> 2 4961 -27336.
#> 3 4962 -30699.
#> 4 4967 63621.
#> 5 4968 -94811.
#> 6 4973 59036.
#> 7 4986 41901.
#> 8 4988 123392.
#> 9 4989 147157.
#> 10 4990 33377.
#> # … with 672 more rows
Adding loans_residuals to the dm is done using copy_to(). The call to
the method includes the argument temporary = FALSE because we want this
table to persist beyond our current session. In the same pipeline we
create the necessary primary and foreign keys to integrate the table
with the rest of our relational model. For more information on key
creation, see vignette("howto-dm-db") and vignette("howto-dm-theory").
my_dm_sqlite_resid <-
copy_to(deployed_dm, loans_residuals, temporary = FALSE) %>%
dm_add_pk(loans_residuals, id) %>%
dm_add_fk(loans_residuals, id, loans)
my_dm_sqlite_resid %>%
dm_set_colors(violet = loans_residuals) %>%
dm_draw()
my_dm_sqlite_resid %>%
dm_examine_constraints()
#> ℹ All constraints satisfied.
my_dm_sqlite_resid$loans_residuals
#> # Source: table<loans_residuals_2020_08_28_07_13_03_12345_1> [?? x 2]
#> # Database: sqlite 3.35.5 []
#> id resid
#> <int> <dbl>
#> 1 4959 -31912.
#> 2 4961 -27336.
#> 3 4962 -30699.
#> 4 4967 63621.
#> 5 4968 -94811.
#> 6 4973 59036.
#> 7 4986 41901.
#> 8 4988 123392.
#> 9 4989 147157.
#> 10 4990 33377.
#> # … with more rows
copy_dm_to()
Persistence, because it is intended to make permanent changes, requires
write access to the target RDBMS. The code below is a repeat of the
code that opened the Copying and persisting individual tables section
at the beginning of the tutorial. It uses the {dm} convenience function
dm_financial() to create a dm object corresponding to a data model from
a public dataset repository. The dm object is downloaded locally first,
before being deployed to a local SQLite database.
dm_select_tbl() is used to exclude the transaction table trans due to
its size; the collect() method then retrieves the remaining tables and
returns them as a local dm object.
dm_financial() %>%
dm_nrow()
#> accounts cards clients disps districts loans orders
#> 4500 892 5369 5369 77 682 6471
#> tkeys trans
#> 234 1056320
fin_dm <-
dm_financial() %>%
dm_select_tbl(-trans) %>%
collect()
fin_dm
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `accounts`, `cards`, `clients`, `disps`, `districts`, … (8 total)
#> Columns: 47
#> Primary keys: 7
#> Foreign keys: 6
It is just as simple to move a local relational model into an RDBMS.
destination_db <- DBI::dbConnect(RSQLite::SQLite())
deployed_dm <-
copy_dm_to(destination_db, fin_dm, temporary = FALSE)
deployed_dm
#> ── Table source ───────────────────────────────────────────────────────────
#> src: sqlite 3.35.5 []
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `accounts`, `cards`, `clients`, `disps`, `districts`, … (8 total)
#> Columns: 47
#> Primary keys: 7
#> Foreign keys: 6
Note that in the call to copy_dm_to() the argument temporary = FALSE is
supplied. Without this argument the model would still be copied into
the database, but the argument would default to temporary = TRUE and
the data would be deleted once your session ends.
In the output you can observe that the src for deployed_dm is SQLite,
while for fin_dm no source is indicated because it is a local data
model.
Copying a relational model into an empty database is the simplest use
case for copy_dm_to(). If you want to copy a model into an RDBMS that
is already populated, be aware that copy_dm_to() will not overwrite
pre-existing tables. In this case you will need to use the table_names
argument to give the tables unique names.
table_names can be a named character vector, with the names matching
the table names in the dm object and the values containing the desired
names in the RDBMS, or a function or one-sided formula. In the example
below, paste0() is used to add a prefix to the table names to provide
uniqueness.
dup_dm <-
copy_dm_to(destination_db, fin_dm, temporary = FALSE, table_names = ~ paste0("dup_", .x))
dup_dm
#> ── Table source ───────────────────────────────────────────────────────────
#> src: sqlite 3.35.5 []
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `accounts`, `cards`, `clients`, `disps`, `districts`, … (8 total)
#> Columns: 47
#> Primary keys: 7
#> Foreign keys: 6
remote_name(dup_dm$accounts)
#> <IDENT> `dup_accounts`
remote_name(deployed_dm$accounts)
#> <IDENT> `accounts`
Note the different table names for dup_dm$accounts and
deployed_dm$accounts. For both, the table name is accounts in the dm,
but they link to different tables on the database. In dup_dm the table
is backed by the table dup_accounts in the RDBMS. deployed_dm$accounts
shows us that this table is still backed by the accounts table from the
copy_dm_to() operation we performed in the preceding example.
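The named-character-vector form of table_names spells the same renaming
out explicitly. The sketch below assumes the eight tables of fin_dm
shown earlier; the "dup2_" prefix is an arbitrary illustrative choice:

```r
# Sketch: the named-character-vector form of `table_names`.
# Names are the table names in the dm; values are the desired
# table names on the RDBMS.
copy_dm_to(
  destination_db, fin_dm,
  temporary = FALSE,
  table_names = c(
    accounts  = "dup2_accounts",  cards  = "dup2_cards",
    clients   = "dup2_clients",   disps  = "dup2_disps",
    districts = "dup2_districts", loans  = "dup2_loans",
    orders    = "dup2_orders",    tkeys  = "dup2_tkeys"
  )
)
```

The vector form is more verbose than the formula, but it lets you
rename each table independently rather than applying one rule to all.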
Managing tables in the RDBMS is outside the scope of dm. If you find
you need to remove tables or perform operations directly on the RDBMS,
see the {DBI} package.
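For example, the standard {DBI} generics dbListTables() and
dbRemoveTable() can inspect and drop tables on the connection directly,
bypassing dm entirely:

```r
# List all tables currently on the connection,
# then drop one of the duplicated tables by name.
DBI::dbListTables(destination_db)
DBI::dbRemoveTable(destination_db, "dup_accounts")
```

Bear in mind that dropping a table out from under a dm object leaves
any dm that references it pointing at a table that no longer exists.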
dm makes it straightforward to deploy your complete relational model to
an RDBMS using the copy_dm_to() function. For tables that are created
from a relational model during analysis or development, compute() and
copy_to() can be used to persist them between sessions or to copy local
tables to a database dm. The collect() method downloads an entire dm
object from the database, provided it fits into memory.
If you need finer-grained control over modifications to your relational
model, see vignette("howto-dm-rows") for an introduction to row-level
operations, including updates, insertions, deletions and patching.
If you would like to know more about relational data models in order to
get the most out of dm, check out vignette("howto-dm-theory").
If you’re familiar with relational data models but want to know how to
work with them in dm, then any of vignette("tech-dm-join"),
vignette("tech-dm-filter"), or vignette("tech-dm-zoom") is a good next
step.