```r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(DBmaps)
library(data.table)
```
This vignette provides an introduction to the DBmaps package and its approach to simplifying the analysis of data from relational databases within R. A common and often repetitive challenge in data analysis involves aggregating detailed "fact" tables (like transactions or event logs) before they can be meaningfully joined with descriptive "dimension" tables (like customers or products). DBmaps is designed to streamline and automate this entire workflow.
The core principle of DBmaps is a metadata-driven approach. Instead of writing complex and manual data.table join and aggregation code, the user first describes the analytical potential of their tables in a structured format called metadata. The package then uses this metadata to automatically discover join paths, create executable plans, and produce the final merged dataset. This makes the entire process more efficient, scalable, and less prone to error.
This vignette walks through a complete, end-to-end example based on a typical e-commerce dataset. We will perform the following steps:
1. Describe each table's analytical potential with table_info().
2. Collect the table metadata in a registry with create_metadata_registry() and add_table().
3. Automatically discover join paths with map_join_paths().
4. Translate an analytical request into an executable plan with create_join_plan().
5. Visualize the plan with plot_join_plan().
6. Execute the plan with execute_join_plan().

The example uses four CSV files that represent common tables in an e-commerce database. You can download them directly from the project's GitHub repository.
First, let's load these tables into our R session.
```r
# Read the four raw tables; the file names match the source identifiers
# used in the metadata below
customers    <- fread("customers.csv")
products     <- fread("products.csv")
transactions <- fread("transactions.csv")
views        <- fread("views.csv")

# Parse the ISO-8601 timestamp columns into POSIXct
transactions[, time := as.POSIXct(time, format = "%Y-%m-%dT%H:%M:%OSZ", tz = "UTC")]
views[, time := as.POSIXct(time, format = "%Y-%m-%dT%H:%M:%OSZ", tz = "UTC")]

cat("--- All 4 Raw Data Tables Loaded Successfully ---\n")
cat("---Customers Data---\n")
print(head(customers, 5))
cat("---Products Data---\n")
print(head(products, 5))
cat("---Transactions Data---\n")
print(head(transactions, 5))
cat("---Views Data---\n")
print(head(views, 5))
```
table_info()
The table_info() function is the primary tool for creating metadata. It captures descriptive information about a table and its analytical potential.

The key argument is key_outcome_specs, a list that defines how a table can be aggregated. A crucial rule in DBmaps is that every aggregation method must include one or more GroupingVariables, as this is what makes the resulting aggregated data mergeable.
```r
customers_info <- table_info(
  table_name = "customers",
  source_identifier = "customers.csv",
  identifier_columns = "customer_id",
  key_outcome_specs = list(
    list(OutcomeName = "CustomerCount",
         ValueExpression = 1,
         AggregationMethods = list(
           list(AggregatedName = "CountByRegion",
                AggregationFunction = "sum",
                GroupingVariables = "region")
         ))
  )
)

products_info <- table_info(
  table_name = "products",
  source_identifier = "products.csv",
  identifier_columns = "product_id",
  key_outcome_specs = list(
    list(OutcomeName = "ProductCount",
         ValueExpression = 1,
         AggregationMethods = list(
           list(AggregatedName = "ProductsPerCategory",
                AggregationFunction = "sum",
                GroupingVariables = "category")
         ))
  )
)

transactions_info <- table_info(
  table_name = "transactions",
  source_identifier = "transactions.csv",
  identifier_columns = c("customer_id", "product_id", "time"),
  key_outcome_specs = list(
    list(OutcomeName = "Revenue",
         ValueExpression = quote(price * quantity),
         AggregationMethods = list(
           list(AggregatedName = "RevenueByCustomer",
                AggregationFunction = "sum",
                GroupingVariables = "customer_id"),
           list(AggregatedName = "RevenueByProduct",
                AggregationFunction = "sum",
                GroupingVariables = "product_id"),
           list(AggregatedName = "DailyRevenueByCustomerProduct",
                AggregationFunction = "sum",
                GroupingVariables = c("customer_id", "product_id", "time"))
         )),
    list(OutcomeName = "UnitsSold",
         ValueExpression = quote(quantity),
         AggregationMethods = list(
           list(AggregatedName = "TotalUnitsByCustomer",
                AggregationFunction = "sum",
                GroupingVariables = "customer_id")
         ))
  )
)

views_info <- table_info(
  table_name = "views",
  source_identifier = "views.csv",
  identifier_columns = c("customer_id", "product_id", "time"),
  key_outcome_specs = list(
    list(OutcomeName = "ViewCount",
         ValueExpression = 1,
         AggregationMethods = list(
           list(AggregatedName = "ViewsByProduct",
                AggregationFunction = "count",
                GroupingVariables = "product_id"),
           list(AggregatedName = "ViewsByCustomer",
                AggregationFunction = "count",
                GroupingVariables = "customer_id")
         ))
  )
)

cat("---Metadata for transactions---\n")
print(transactions_info)
```
The output of table_info() has several key columns that drive the entire DBmaps system. Here is a brief explanation of what each one does:
- table_name: The conceptual name of the table (e.g., "transactions").
- identifier_columns: This list-column contains the name(s) of the column(s) that form the primary key of the raw table. This is what other tables can join TO.
- outcome_name: A high-level description of the metric being calculated (e.g., "Revenue").
- value_expression: The R code (as a string) that defines how to calculate the raw value for the outcome from the source table's columns (e.g., "price * quantity").
- aggregated_name: The final name for the new, aggregated column in the summarized table (e.g., "RevenueByCustomer"). This is the name you will use in your selections when creating a plan.
- aggregation_function: The function to apply to the value_expression during grouping (e.g., "sum").
- grouping_variable: This is the most critical column for joins. It is a list-column containing the name(s) of the column(s) to group by. The values in this column define the "join key" that can be used to join this aggregated data FROM.
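To make the mapping concrete, here is how one aggregation spec from the transactions example above lines up with these columns. The values are illustrative only; the exact print format of table_info()'s output may differ:

```r
# One conceptual metadata row for the "RevenueByCustomer" aggregation
# (illustrative values; not the verbatim output of table_info())
data.table::data.table(
  table_name           = "transactions",
  identifier_columns   = list(c("customer_id", "product_id", "time")),
  outcome_name         = "Revenue",
  value_expression     = "price * quantity",
  aggregated_name      = "RevenueByCustomer",
  aggregation_function = "sum",
  grouping_variable    = list("customer_id")
)
```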
To manage metadata for multiple tables, DBmaps provides a simple registry system. create_metadata_registry() initializes a registry object, and add_table() adds the metadata for each table to it. This avoids the need to manually combine data.table objects.
```r
meta <- create_metadata_registry()
meta <- add_table(meta, customers_info)
meta <- add_table(meta, products_info)
meta <- add_table(meta, views_info)
meta <- add_table(meta, transactions_info)
print(meta)
```
map_join_paths()
In any data analysis project involving multiple tables, a primary task is to identify how these tables can be joined together. This typically involves matching foreign keys in one table to primary keys in another. The map_join_paths() function automates this discovery process, making it faster and less error-prone.
This function operates in two powerful modes:
1. Metadata-Driven Discovery: It can find join paths based solely on a metadata definition, matching tables where a defined grouping_variable identically matches another table's identifier_columns (primary key). This is extremely fast and useful for well-documented schemas.
2. Data-Driven Discovery: Optionally, by providing the actual data, the function can scan column values to find "inferred" joins where key names do not match. This is invaluable for exploring new or messy datasets.

The function is at its most powerful when it combines both methods, providing a complete and reliable map of all possible connections in your data ecosystem.
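To give a feel for the metadata-driven mode, the matching rule can be sketched in a few lines of plain R. This is a simplified illustration of the idea, not the package's actual implementation; find_paths_sketch() is a hypothetical helper:

```r
# Simplified sketch: a join path exists when one table's grouping variable(s)
# exactly match another table's primary key (identifier columns).
find_paths_sketch <- function(grouping_vars, identifier_cols) {
  paths <- list()
  for (from in names(grouping_vars)) {
    for (to in names(identifier_cols)) {
      if (from != to &&
          identical(sort(grouping_vars[[from]]),
                    sort(identifier_cols[[to]]))) {
        paths[[length(paths) + 1]] <- data.frame(from = from, to = to)
      }
    }
  }
  do.call(rbind, paths)
}

# Transactions aggregated by product_id can join to products (keyed by
# product_id), so this returns a one-row data.frame: transactions -> products
find_paths_sketch(
  grouping_vars   = list(transactions = "product_id"),
  identifier_cols = list(products  = "product_id",
                         customers = "customer_id")
)
```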
```r
# Create a named list of the actual data tables for the functions to use
all_tables <- list(
  customers    = customers,
  products     = products,
  transactions = transactions,
  views        = views
)

# Generate the join map
paths <- map_join_paths(meta, all_tables)
print(paths)
```
The resulting "Join Map" shows every valid, directional, many-to-one join that can be performed after aggregation.
create_join_plan()
Now we define our analytical goal. Let's say we want to create a product-level summary table that includes the product's category, its total revenue, and the total number of times it was viewed.
The create_join_plan() function translates this high-level request into a concrete, step-by-step recipe of data.table code.
```r
# Define our desired output
selections <- list(
  products     = c("product_id", "category"),  # Base columns from the products table
  transactions = "RevenueByProduct",           # The aggregated revenue by product
  views        = "ViewsByProduct"              # The aggregated view count by product
)

# Generate the plan
plan <- create_join_plan(
  base_table  = "products",
  selections  = selections,
  metadata_dt = meta,
  join_map    = paths
)
print(plan)
```
The output of the planner is a data.table that acts as an executable recipe. Each row represents one step, and the final column, code, contains the precise, executable R code for that step of the data manipulation process. Let's break down how the plan achieves our goal:
1. Steps 1 & 2 (AGGREGATE): The planner analyzed our request for RevenueByProduct and ViewsByProduct. It correctly determined from the metadata that these require aggregating the transactions and views tables, respectively. It then generated the exact data.table code to create two new, summarized tables: agg_transactions and agg_views.
2. Steps 3 & 4 (MERGE): The planner then constructed a sequence of join operations. It first merges the aggregated transactions (agg_transactions) onto our products base table. Then, it merges the aggregated views (agg_views) onto the result of the first merge. This sequential chaining correctly combines all the necessary data.
3. Step 5 (SELECT): Finally, the planner generates the code to select only the columns we originally requested from the final, fully merged table.
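Taken together, the generated code for these steps is ordinary data.table code. Here is a minimal sketch of what such steps typically amount to, using the intermediate names mentioned above (agg_transactions, agg_views); this is an illustration of the logic, not the planner's verbatim output:

```r
# Steps 1 & 2 (AGGREGATE): summarize the fact tables at the product level
agg_transactions <- transactions[, .(RevenueByProduct = sum(price * quantity)),
                                 by = product_id]
agg_views <- views[, .(ViewsByProduct = .N), by = product_id]

# Steps 3 & 4 (MERGE): chain the aggregates onto the products base table
merged <- merge(products, agg_transactions, by = "product_id", all.x = TRUE)
merged <- merge(merged, agg_views, by = "product_id", all.x = TRUE)

# Step 5 (SELECT): keep only the originally requested columns
result <- merged[, .(product_id, category, RevenueByProduct, ViewsByProduct)]
```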
This demonstrates the core value of the planner: it automates the tedious and error-prone task of determining the correct sequence of aggregations and joins, producing a concrete set of steps that joins the data exactly as requested.
Handling Invalid Requests
A key feature of the planner is its ability to validate user requests. What happens if we ask for an aggregation that cannot logically be joined to our base table?
Let's ask for RevenueByProduct (grouped by product_id) to be joined to the customers table (keyed by customer_id). This is not a valid join.
```r
# Add product metadata for this example
products_meta <- table_info(
  "products", "p.csv", "product_id",
  list(list(OutcomeName = "x",
            ValueExpression = 1,
            AggregationMethods = list(
              list(AggregatedName = "y",
                   AggregationFunction = "z",
                   GroupingVariables = "category")
            )))
)

transactions_meta_v2 <- table_info(
  "transactions", "t.csv", "trans_id",
  list(
    list(OutcomeName = "Revenue",
         ValueExpression = quote(price * qty),
         AggregationMethods = list(
           # This aggregation is by product_id, not customer_id
           list(AggregatedName = "RevenueByProduct",
                AggregationFunction = "sum",
                GroupingVariables = "product_id")
         ))
  )
)

invalid_metadata <- create_metadata_registry()
invalid_metadata <- add_table(invalid_metadata, products_meta)
invalid_metadata <- add_table(invalid_metadata, transactions_meta_v2)

# The invalid request
invalid_selections <- list(
  customers    = "customer_id",
  transactions = "RevenueByProduct"
)
```
Instead of producing a faulty plan or a cryptic error, create_join_plan() stops with a clear, informative message.
```r
# This call stops with an informative error describing the missing join path
create_join_plan(
  base_table  = "customers",
  selections  = invalid_selections,
  metadata_dt = invalid_metadata
)
```
The reason this is invalid is that the join key of the selected aggregation does not match the join key of the base table:

- Our base table is customers, whose primary join key is customer_id.
- The requested aggregation, RevenueByProduct, comes from the transactions table.
- Its metadata declares that it is grouped by product_id.
- The planner, create_join_plan(), correctly sees that there is no direct path to join a table keyed by product_id to a table keyed on customer_id.

This strict validation ensures that only logical and correct data manipulation plans are generated, preventing common data analysis errors.
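If you need to handle such validation failures programmatically, for example when testing several candidate plans in a pipeline, the error can be caught with base R's tryCatch(). A minimal sketch:

```r
# Catch the planner's validation error instead of halting the script
plan_or_null <- tryCatch(
  create_join_plan(
    base_table  = "customers",
    selections  = invalid_selections,
    metadata_dt = invalid_metadata
  ),
  error = function(e) {
    message("Plan rejected: ", conditionMessage(e))
    NULL  # fall back to NULL so downstream code can test for it
  }
)
```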
To make the plan even clearer, we can visualize it as a flowchart using plot_join_plan().
```r
# This requires the DiagrammeR package
if (requireNamespace("DiagrammeR", quietly = TRUE)) {
  # Generate the plot object
  visualize <- plot_join_plan(plan)
  visualize
} else {
  cat("Install the 'DiagrammeR' package to visualize the join plan.")
}
```
The graph shows the flow of data from the source tables (blue boxes) through intermediate aggregations (gray ellipses) to the final merged result and selection (yellow diamond).
The final step is to execute the plan using execute_join_plan(). This function takes the plan and the list of source data tables, runs the generated code, and returns the final dataset.
```r
# The executor runs the plan in a clean environment
final_dt <- execute_join_plan(plan, all_tables)

# Show the first few rows of the final, merged data.table
print(head(final_dt))
```
The DBmaps workflow successfully automated the entire process. By investing a small amount of time to define metadata, we were able to automatically discover relationships, generate a complex join plan, visualize it, and execute it with just a few high-level function calls. This demonstrates a powerful, scalable, and reproducible approach to data preparation and analysis in R.