```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
This vignette describes how to benchmark `CDMConnector::generateCohortSet()` (old, CIRCE-based) against `atlasCohortGenerator::generateCohortSet2()` (new, DAG-optimized batch) across multiple database platforms. Supported platforms include PostgreSQL, Redshift, Snowflake, Spark, and SQL Server. You provide a named list of CDM reference objects; the benchmarking script handles timing, comparison, and CSV output.
The new approach (`generateCohortSet2()`) uses a DAG-based batch optimizer that shares vocabulary and domain work across cohorts. As a result, wall-clock time typically decreases as the number of cohorts and the overlap in their concept sets increase. The ratio (new time / old time) is often below 1.0, with larger batches showing greater speedups. The exact improvement depends on your cohort definitions, data, and platform.
The benchmarking script records `time_old_sec`, `time_new_sec`, and
`ratio_new_over_old` per database so you can measure the speedup on your own
data and platforms.
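The relationship between these columns can be sketched directly (a toy illustration using the postgres timings from the example table later in this vignette; nothing here is part of the script itself):

```r
# ratio_new_over_old is simply time_new_sec / time_old_sec, so values
# below 1.0 mean the new batch generator was faster.
time_old_sec <- 120.5
time_new_sec <- 45.2

ratio_new_over_old <- time_new_sec / time_old_sec
round(ratio_new_over_old, 2)  # 0.38

# The equivalent "times faster" figure is just the inverse.
round(time_old_sec / time_new_sec, 1)  # 2.7
```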
First load the package from a development checkout:

```{r, eval = FALSE}
devtools::load_all()
```

Read your cohort definitions with `CDMConnector::readCohortSet("path/to/cohorts")`. For one CDM, use the single-database benchmark and optional equivalence check:
```{r, eval = FALSE}
source("extras/benchmark_cohort_generation.R")

cohort_set <- CDMConnector::readCohortSet("path/to/cohorts")

result <- benchmark_cohort_generation(
  cdm,
  cohort_set,
  cohort_path = "path/to/cohorts"
)

# Compare old vs new cohort tables (identical rows, order ignored)
cmp <- compare_cohort_tables(
  result$cdm,
  name_old = "cohort_bench_old",
  name_new = "cohort_bench_new"
)
cmp$identical  # TRUE if same set of rows
cmp$per_cohort # Per-cohort row counts and match status
```
Pass a named list of CDM objects; names are used as the database identifier
in the output CSVs (e.g. `postgres`, `redshift`, `snowflake`, `spark`, `sql_server`):
```{r, eval = FALSE}
source("extras/benchmark_cohort_generation.R")
source("extras/benchmark_multi_database.R")

cohort_set <- CDMConnector::readCohortSet("path/to/cohorts")

cdms <- list(
  postgres = cdm_postgres,
  redshift = cdm_redshift,
  snowflake = cdm_snowflake,
  spark = cdm_spark,
  sql_server = cdm_sqlserver
)

run_benchmark_multi_database(
  cdms = cdms,
  cohort_set = cohort_set,
  cohort_path = "path/to/cohorts",
  results_csv = "benchmark_results.csv",
  equivalence_csv = "benchmark_equivalence.csv"
)
```
Two CSVs are written:

- `results_csv`: columns `database`, `time_old_sec`, `time_new_sec`, `ratio_new_over_old`, `n_cohorts`, `files_included`, and `status`.
- `equivalence_csv`: columns including `n_old`, `n_new`, `rows_identical`, and `status`.

The timing CSV has one row per database. Example structure:
```{r}
results_example <- data.frame(
  database = c("postgres", "redshift", "snowflake", "sql_server"),
  time_old_sec = c(120.5, 95.2, 88.1, 110.3),
  time_new_sec = c(45.2, 38.0, 32.5, 42.1),
  ratio_new_over_old = c(0.38, 0.40, 0.37, 0.38),
  n_cohorts = 4L,
  files_included = "cohort_a.json; cohort_b.json; cohort_c.json; cohort_d.json",
  status = "ok",
  stringsAsFactors = FALSE
)
knitr::kable(results_example, digits = 2)
```
The equivalence CSV confirms that the old and new cohort tables contain the same rows (order ignored). Each database has:
- An overall row (`cohort_definition_id` is `NA`): total row counts and whether the full tables match.
- One row per cohort: row counts from the old table (`n_old`), from the new table (`n_new`), and whether the set of rows for that cohort is identical (`rows_identical`).

Example:
```{r}
equiv_example <- data.frame(
  database = c("postgres", "postgres", "postgres", "postgres", "redshift", "redshift"),
  cohort_definition_id = c(NA, 1L, 2L, 3L, NA, 1L),
  n_old = c(15000L, 5000L, 6000L, 4000L, 15000L, 5000L),
  n_new = c(15000L, 5000L, 6000L, 4000L, 15000L, 5000L),
  rows_identical = c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE),
  status = "ok",
  stringsAsFactors = FALSE
)
knitr::kable(equiv_example)
```
When `rows_identical` is `TRUE` for all cohorts (and the overall row), the new approach produces exactly the same cohort membership and dates as the old CIRCE-based method; only the execution strategy and performance differ.
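If anything diverges, the equivalence CSV makes it easy to locate. A minimal sketch for scanning it (assumes the file name passed as `equivalence_csv` earlier; `mismatches` is just a local variable):

```r
# Flag any database/cohort combination where the old and new cohort
# tables diverge; an empty result means full agreement everywhere.
equiv <- read.csv("benchmark_equivalence.csv")

mismatches <- subset(equiv, !rows_identical)
if (nrow(mismatches) == 0) {
  message("Old and new cohort tables match on every database.")
} else {
  print(mismatches[, c("database", "cohort_definition_id", "n_old", "n_new")])
}
```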
| Aspect | Description |
|--------|-------------|
| Performance | The new batch optimizer typically reduces wall-clock time (ratio < 1) by sharing vocabulary and domain work across cohorts. |
| Correctness | The benchmarking pipeline compares old and new cohort tables row-by-row (order ignored) and writes equivalence results to CSV. |
| Platforms | Run the same cohort set on Postgres, Redshift, Snowflake, Spark, and SQL Server by passing a named list of CDMs to run_benchmark_multi_database(). |
Use the generated CSVs to document speedups and to confirm identical results across databases and between the two cohort generation methods.
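For the write-up itself, a small sketch that reads the timing CSV back and derives a speedup factor (the `speedup` column is computed here and is not part of the benchmark output):

```r
# Read the timing results back; file name matches the results_csv
# argument used earlier in this vignette.
results <- read.csv("benchmark_results.csv")

# Human-friendly "times faster" figure, the inverse of the ratio.
results$speedup <- results$time_old_sec / results$time_new_sec

# Order databases from largest to smallest improvement.
results[order(results$ratio_new_over_old),
        c("database", "time_old_sec", "time_new_sec", "speedup")]
```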