```r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(DBmaps)
library(data.table)
```
This vignette provides an introduction to the DBmaps package and its approach to simplifying the analysis of data from relational databases within R. A common and often repetitive challenge in data analysis involves aggregating detailed "fact" tables (like transactions or event logs) before they can be meaningfully joined with descriptive "dimension" tables (like customers or products). DBmaps is designed to streamline and automate this entire workflow.
The core principle of DBmaps is a metadata-driven approach. Instead of writing complex and manual data.table join and aggregation code, the user first describes the analytical potential of their tables in a structured format called metadata. The package then uses this metadata to automatically discover join paths, create executable plans, and produce the final merged dataset. This makes the entire process more efficient, scalable, and less prone to error.
This vignette walks through a complete, end-to-end example based on a typical e-commerce dataset. We will perform the following steps:
1. Describe each table's analytical potential with table_info().
2. Collect the table metadata in a registry with create_metadata_registry() and add_table().
3. Automatically discover join paths with map_join_paths().
4. Translate an analytical request into an executable plan with create_join_plan().
5. Visualize the plan with plot_join_plan().
6. Execute the plan with execute_join_plan().

The example uses four CSV files that represent common tables in an e-commerce database. You can download them directly from the project's GitHub repository.
First, let's load these tables into our R session.
```r
# Read the four raw tables; the file names match the source identifiers
# used in the metadata below
customers    <- fread("customers.csv")
products     <- fread("products.csv")
transactions <- fread("transactions.csv")
views        <- fread("views.csv")

# Parse the ISO-8601 timestamp columns into POSIXct
transactions[, time := as.POSIXct(time, format = "%Y-%m-%dT%H:%M:%OSZ", tz = "UTC")]
views[, time := as.POSIXct(time, format = "%Y-%m-%dT%H:%M:%OSZ", tz = "UTC")]

cat("--- All 4 Raw Data Tables Loaded Successfully ---\n")
cat("---Customers Data---\n")
print(head(customers, 5))
cat("---Products Data---\n")
print(head(products, 5))
cat("---Transactions Data---\n")
print(head(transactions, 5))
cat("---Views Data---\n")
print(head(views, 5))
```
table_info()
The table_info() function is the primary tool for creating metadata. It captures descriptive information about a table and its analytical potential.

The key argument is key_outcome_specs, a list that defines how a table can be aggregated. A crucial rule in DBmaps is that every aggregation method must include one or more GroupingVariables, as this is what makes the resulting aggregated data mergeable.
```r
customers_info <- table_info(
  table_name = "customers",
  source_identifier = "customers.csv",
  identifier_columns = "customer_id",
  key_outcome_specs = list(
    list(OutcomeName = "CustomerCount",
         ValueExpression = 1,
         AggregationMethods = list(
           list(AggregatedName = "CountByRegion",
                AggregationFunction = "sum",
                GroupingVariables = "region")
         ))
  )
)

products_info <- table_info(
  table_name = "products",
  source_identifier = "products.csv",
  identifier_columns = "product_id",
  key_outcome_specs = list(
    list(OutcomeName = "ProductCount",
         ValueExpression = 1,
         AggregationMethods = list(
           list(AggregatedName = "ProductsPerCategory",
                AggregationFunction = "sum",
                GroupingVariables = "category")
         ))
  )
)

transactions_info <- table_info(
  table_name = "transactions",
  source_identifier = "transactions.csv",
  identifier_columns = c("customer_id", "product_id", "time"),
  key_outcome_specs = list(
    list(OutcomeName = "Revenue",
         ValueExpression = quote(price * quantity),
         AggregationMethods = list(
           list(AggregatedName = "RevenueByCustomer",
                AggregationFunction = "sum",
                GroupingVariables = "customer_id"),
           list(AggregatedName = "RevenueByProduct",
                AggregationFunction = "sum",
                GroupingVariables = "product_id"),
           list(AggregatedName = "DailyRevenueByCustomerProduct",
                AggregationFunction = "sum",
                GroupingVariables = c("customer_id", "product_id", "time"))
         )),
    list(OutcomeName = "UnitsSold",
         ValueExpression = quote(quantity),
         AggregationMethods = list(
           list(AggregatedName = "TotalUnitsByCustomer",
                AggregationFunction = "sum",
                GroupingVariables = "customer_id")
         ))
  )
)

views_info <- table_info(
  table_name = "views",
  source_identifier = "views.csv",
  identifier_columns = c("customer_id", "product_id", "time"),
  key_outcome_specs = list(
    list(OutcomeName = "ViewCount",
         ValueExpression = 1,
         AggregationMethods = list(
           list(AggregatedName = "ViewsByProduct",
                AggregationFunction = "count",
                GroupingVariables = "product_id"),
           list(AggregatedName = "ViewsByCustomer",
                AggregationFunction = "count",
                GroupingVariables = "customer_id")
         ))
  )
)

cat("---Metadata for transactions---\n")
print(transactions_info)
```
The output of table_info() has several key columns that drive the entire DBmaps system. Here is a brief explanation of what each one does:
- table_name: The conceptual name of the table (e.g., "transactions").
- identifier_columns: This list-column contains the name(s) of the column(s) that form the primary key of the raw table. This is what other tables can join TO.
- outcome_name: A high-level description of the metric being calculated (e.g., "Revenue").
- value_expression: The R code (as a string) that defines how to calculate the raw value for the outcome from the source table's columns (e.g., "price * quantity").
- aggregated_name: The final name for the new, aggregated column in the summarized table (e.g., "RevenueByCustomer"). This is the name you will use in your selections when creating a plan.
- aggregation_function: The function to apply to the value_expression during grouping (e.g., "sum").
- grouping_variable: This is the most critical column for joins. It is a list-column containing the name(s) of the column(s) to group by. The values in this column define the "join key" that can be used to join this aggregated data FROM.
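To make the mapping concrete, here is how one aggregation spec from the transactions example above lines up with these columns. The values are illustrative only; the exact print format of table_info()'s output may differ:

```r
# One conceptual metadata row for the "RevenueByCustomer" aggregation
# (illustrative values; not the verbatim output of table_info())
data.table::data.table(
  table_name           = "transactions",
  identifier_columns   = list(c("customer_id", "product_id", "time")),
  outcome_name         = "Revenue",
  value_expression     = "price * quantity",
  aggregated_name      = "RevenueByCustomer",
  aggregation_function = "sum",
  grouping_variable    = list("customer_id")
)
```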
To manage metadata for multiple tables, DBmaps provides a simple registry system. create_metadata_registry() initializes a registry object, and add_table() adds the metadata for each table to it. This avoids the need to manually combine data.table objects.
```r
meta <- create_metadata_registry()
meta <- add_table(meta, customers_info)
meta <- add_table(meta, products_info)
meta <- add_table(meta, views_info)
meta <- add_table(meta, transactions_info)
print(meta)
```
map_join_paths()
In any data analysis project involving multiple tables, a primary task is to identify how these tables can be joined together. This typically involves matching foreign keys in one table to primary keys in another. The map_join_paths() function automates this discovery process, making it faster and less error-prone.
This function operates in two powerful modes:
1. Metadata-Driven Discovery: It can find join paths based solely on a metadata definition, matching tables where a defined grouping_variable identically matches another table's identifier_columns (primary key). This is extremely fast and useful for well-documented schemas.
2. Data-Driven Discovery: Optionally, by providing the actual data, the function can scan column values to find "inferred" joins where key names do not match. This is invaluable for exploring new or messy datasets.

The function is at its most powerful when it combines both methods, providing a complete and reliable map of all possible connections in your data ecosystem.
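To give a feel for the metadata-driven mode, the matching rule can be sketched in a few lines of plain R. This is a simplified illustration of the idea, not the package's actual implementation; find_paths_sketch() is a hypothetical helper:

```r
# Simplified sketch: a join path exists when one table's grouping variable(s)
# exactly match another table's primary key (identifier columns).
find_paths_sketch <- function(grouping_vars, identifier_cols) {
  paths <- list()
  for (from in names(grouping_vars)) {
    for (to in names(identifier_cols)) {
      if (from != to &&
          identical(sort(grouping_vars[[from]]),
                    sort(identifier_cols[[to]]))) {
        paths[[length(paths) + 1]] <- data.frame(from = from, to = to)
      }
    }
  }
  do.call(rbind, paths)
}

# Transactions aggregated by product_id can join to products (keyed by
# product_id), so this returns a one-row data.frame: transactions -> products
find_paths_sketch(
  grouping_vars   = list(transactions = "product_id"),
  identifier_cols = list(products  = "product_id",
                         customers = "customer_id")
)
```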
```r
# Create a named list of the actual data tables for the functions to use
all_tables <- list(
  customers    = customers,
  products     = products,
  transactions = transactions,
  views        = views
)

# Generate the join map
paths <- map_join_paths(meta, all_tables)
print(paths)
```
The resulting "Join Map" shows every valid, directional, many-to-one join that can be performed after aggregation.
create_join_plan()
Now we define our analytical goal. Let's say we want to create a product-level summary table that includes the product's category, its total revenue, and the total number of times it was viewed.
The create_join_plan() function translates this high-level request into a concrete, step-by-step recipe of data.table code.
```r
# Define our desired output
selections <- list(
  products     = c("product_id", "category"),  # Base columns from the products table
  transactions = "RevenueByProduct",           # The aggregated revenue by product
  views        = "ViewsByProduct"              # The aggregated view count by product
)

# Generate the plan
plan <- create_join_plan(
  base_table  = "products",
  selections  = selections,
  metadata_dt = meta,
  join_map    = paths
)
print(plan)
```
The output of the planner is a data.table that acts as an executable recipe. Each row represents one step, and the final column, code, contains the precise, executable R code for that step of the data manipulation process. Let's break down how the plan achieves our goal:
1. Steps 1 & 2 (AGGREGATE): The planner analyzed our request for RevenueByProduct and ViewsByProduct. It correctly determined from the metadata that these require aggregating the transactions and views tables, respectively. It then generated the exact data.table code to create two new, summarized tables: agg_transactions and agg_views.
2. Steps 3 & 4 (MERGE): The planner then constructed a sequence of join operations. It first merges the aggregated transactions (agg_transactions) onto our products base table. Then, it merges the aggregated views (agg_views) onto the result of the first merge. This sequential chaining correctly combines all the necessary data.
3. Step 5 (SELECT): Finally, the planner generates the code to select only the columns we originally requested from the final, fully merged table.
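Taken together, the generated code for these steps is ordinary data.table code. Here is a minimal sketch of what such steps typically amount to, using the intermediate names mentioned above (agg_transactions, agg_views); this is an illustration of the logic, not the planner's verbatim output:

```r
# Steps 1 & 2 (AGGREGATE): summarize the fact tables at the product level
agg_transactions <- transactions[, .(RevenueByProduct = sum(price * quantity)),
                                 by = product_id]
agg_views <- views[, .(ViewsByProduct = .N), by = product_id]

# Steps 3 & 4 (MERGE): chain the aggregates onto the products base table
merged <- merge(products, agg_transactions, by = "product_id", all.x = TRUE)
merged <- merge(merged, agg_views, by = "product_id", all.x = TRUE)

# Step 5 (SELECT): keep only the originally requested columns
result <- merged[, .(product_id, category, RevenueByProduct, ViewsByProduct)]
```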
This demonstrates the core value of the planner: it automates the tedious and error-prone task of determining the correct sequence of aggregations and joins, producing a concrete set of steps that joins the data exactly as requested.
Handling Invalid Requests
A key feature of the planner is its ability to validate user requests. What happens if we ask for an aggregation that cannot logically be joined to our base table?
Let's ask for RevenueByProduct (grouped by product_id) to be joined to the customers table (keyed by customer_id). This is not a valid join.
```r
# Add product metadata for this example
products_meta <- table_info(
  "products", "p.csv", "product_id",
  list(list(OutcomeName = "x",
            ValueExpression = 1,
            AggregationMethods = list(
              list(AggregatedName = "y",
                   AggregationFunction = "z",
                   GroupingVariables = "category")
            )))
)

transactions_meta_v2 <- table_info(
  "transactions", "t.csv", "trans_id",
  list(
    list(OutcomeName = "Revenue",
         ValueExpression = quote(price * qty),
         AggregationMethods = list(
           # This aggregation is by product_id, not customer_id
           list(AggregatedName = "RevenueByProduct",
                AggregationFunction = "sum",
                GroupingVariables = "product_id")
         ))
  )
)

invalid_metadata <- create_metadata_registry()
invalid_metadata <- add_table(invalid_metadata, products_meta)
invalid_metadata <- add_table(invalid_metadata, transactions_meta_v2)

# The invalid request
invalid_selections <- list(
  customers    = "customer_id",
  transactions = "RevenueByProduct"
)
```
Instead of producing a faulty plan or a cryptic error, create_join_plan() stops with a clear, informative message.
```r
# This call stops with an informative error describing the missing join path
create_join_plan(
  base_table  = "customers",
  selections  = invalid_selections,
  metadata_dt = invalid_metadata
)
```
The reason this is invalid is that the join key of the selected aggregation does not match the join key of the base table:

- Our base table is customers, whose primary join key is customer_id.
- The requested aggregation, RevenueByProduct, comes from the transactions table.
- Its metadata declares that it is grouped by product_id.
- The planner, create_join_plan(), correctly sees that there is no direct path to join a table keyed by product_id to a table keyed on customer_id.

This strict validation ensures that only logical and correct data manipulation plans are generated, preventing common data analysis errors.
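If you need to handle such validation failures programmatically, for example when testing several candidate plans in a pipeline, the error can be caught with base R's tryCatch(). A minimal sketch:

```r
# Catch the planner's validation error instead of halting the script
plan_or_null <- tryCatch(
  create_join_plan(
    base_table  = "customers",
    selections  = invalid_selections,
    metadata_dt = invalid_metadata
  ),
  error = function(e) {
    message("Plan rejected: ", conditionMessage(e))
    NULL  # fall back to NULL so downstream code can test for it
  }
)
```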
To make the plan even clearer, we can visualize it as a flowchart using plot_join_plan().
```r
# This requires the DiagrammeR package
if (requireNamespace("DiagrammeR", quietly = TRUE)) {
  # Generate the plot object
  visualize <- plot_join_plan(plan)
  visualize
} else {
  cat("Install the 'DiagrammeR' package to visualize the join plan.")
}
```
The graph shows the flow of data from the source tables (blue boxes) through intermediate aggregations (gray ellipses) to the final merged result and selection (yellow diamond).
The final step is to execute the plan using execute_join_plan(). This function takes the plan and the list of source data tables, runs the generated code, and returns the final dataset.
```r
# The executor runs the plan in a clean environment
final_dt <- execute_join_plan(plan, all_tables)

# Show the first few rows of the final, merged data.table
print(head(final_dt))
```
The DBmaps workflow successfully automated the entire process. By investing a small amount of time to define metadata, we were able to automatically discover relationships, generate a complex join plan, visualize it, and execute it with just a few high-level function calls. This demonstrates a powerful, scalable, and reproducible approach to data preparation and analysis in R.