knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) library(data.table) library(DBmaps)
In any data analysis project involving multiple tables, a primary task is to identify how these tables can be joined together. This typically involves matching foreign keys in one table to primary keys in another. The map_join_paths()
function automates this discovery process, making it faster and less error-prone.
This function operates in two powerful modes:
Metadata-Driven Discovery: It can find join paths based solely on a metadata definition, matching tables where a defined grouping_variable
identically matches another table's identifier_columns
(primary key). This is extremely fast and useful for well-documented schemas.
Data-Driven Discovery: Optionally, by providing the actual data, the function can scan column values to find "inferred" joins where key names do not match. This is invaluable for exploring new or messy datasets.
First, let's define metadata for a standard retail scenario with customers
, products
, and transactions
tables.
# Define metadata for each table customers_meta <- table_info("customers", "c.csv", "customer_id", list(list(OutcomeName="x",ValueExpression=1,AggregationMethods=list(list(AggregatedName="y",AggregationFunction="z",GroupingVariables="region"))))) products_meta <- table_info("products", "p.csv", "product_id", list(list(OutcomeName="x",ValueExpression=1,AggregationMethods=list(list(AggregatedName="y",AggregationFunction="z",GroupingVariables="category"))))) transactions_meta <- table_info("transactions", "t.csv", "trans_id", list( list(OutcomeName="rev", ValueExpression=1, AggregationMethods=list( # This grouping variable will match the primary key of 'customers' list(AggregatedName="a", AggregationFunction="sum", GroupingVariables="customer_id"), # This one will match the primary key of 'products' list(AggregatedName="b", AggregationFunction="sum", GroupingVariables="product_id") )) )) # Combine into a master metadata object master_meta <- rbindlist(list(customers_meta, products_meta, transactions_meta))
Now, we can find the join paths using only this metadata.
# Find paths without looking at the data metadata_paths <- map_join_paths(master_meta) print(metadata_paths)
The function correctly identifies two METADATA
paths: transactions
can be joined to customers
on customer_id
, and to products
on product_id
.
The function also handles composite keys. Let's imagine a table daily_promos
whose primary key is the combination of product_id
and region
.
daily_promos_meta <- table_info("daily_promos", "d.csv", c("product_id", "region"), list(list(OutcomeName="x",ValueExpression=1,AggregationMethods=list(list(AggregatedName="y",AggregationFunction="z",GroupingVariables="region"))))) # Add a grouping variable to transactions that matches this composite key transactions_multi_meta <- table_info("transactions", "t.csv", "trans_id", list( list(OutcomeName="rev", ValueExpression=1, AggregationMethods=list( list(AggregatedName="promo_rev", AggregationFunction="sum", GroupingVariables=c("product_id", "region")) )) )) multi_key_meta <- rbindlist(list(daily_promos_meta, transactions_multi_meta))
multi_key_paths <- map_join_paths(multi_key_meta) print(multi_key_paths)
The function correctly identifies the single, multi-variable join path.
What if our data is messy and key names don't align? Consider an inventory
table where the product key is called sku
, and an orders
table that refers to it as product_code
.
# Define the data inventory_data <- data.table(sku = c("s1", "s2", "s3"), stock = c(10, 20, 5)) orders_data <- data.table(order_id = 1:2, customer_ref = "c1", product_code = c("s1", "s2")) data_list <- list( inventory = inventory_data, orders = orders_data ) # Define the metadata. Note the mismatched names. inventory_meta <- table_info("inventory", "i.csv", "sku", list(list(OutcomeName="x",ValueExpression=1,AggregationMethods=list(list(AggregatedName="y",AggregationFunction="z",GroupingVariables="stock"))))) orders_meta <- table_info("orders", "o.csv", "order_id", list(list(OutcomeName="x",ValueExpression=1,AggregationMethods=list(list(AggregatedName="y",AggregationFunction="z",GroupingVariables="product_code"))))) inferred_meta <- rbindlist(list(inventory_meta, orders_meta))
Running map_join_paths
with only metadata would fail to find a path here. But by providing the data_list
, we enable the data-driven search.
inferred_paths <- map_join_paths(inferred_meta, data_list = data_list) print(inferred_paths)
Success! The function scanned the values and found that orders$product_code
can be joined to inventory$sku
. It correctly marks this path as INFERRED
.
The true power of the function is when it combines both methods, automatically de-duplicating and prioritizing explicit metadata joins over inferred ones. This provides a complete and reliable map of all possible connections in your data ecosystem.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.