clinCompare is an R package for comparing datasets at the dataset, variable, and observation level. For clinical trial data, an optional CDISC validation layer checks SDTM and ADaM conformance automatically. The package is designed for statistical programmers, data managers, and regulatory professionals who need to ensure data quality and compliance with industry standards.
library(clinCompare)
The compare_datasets() function gives a comprehensive overview: dimension
checks, variable comparison, type mismatches, and row-level value differences.
baseline <- data.frame(
  USUBJID = c("SUBJ01", "SUBJ02", "SUBJ03"),
  AGE = c(45, 52, 38),
  SEX = c("M", "F", "M"),
  RACE = c("WHITE", "WHITE", "ASIAN"),
  stringsAsFactors = FALSE
)

updated <- data.frame(
  USUBJID = c("SUBJ01", "SUBJ02", "SUBJ03"),
  AGE = c(45, 53, 38),
  SEX = c("M", "F", "F"),
  RACE = c("WHITE", "WHITE", "ASIAN"),
  stringsAsFactors = FALSE
)

result <- compare_datasets(baseline, updated)
result
The result is a structured list you can drill into programmatically:
# Per-column difference counts
result$observation_comparison$discrepancies

# Row-level details for a specific variable
result$observation_comparison$details$SEX
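To make the idea of per-column difference counts concrete, here is a minimal base-R sketch that counts differing values per shared column for two data frames whose rows are already aligned. It is an illustration only, not clinCompare's implementation; the helper name `count_diffs` is hypothetical.

```r
# Hypothetical helper (not part of clinCompare): count positions where two
# row-aligned data frames differ, per shared column.
count_diffs <- function(a, b) {
  common <- intersect(names(a), names(b))
  vapply(common, function(col) {
    x <- a[[col]]
    y <- b[[col]]
    # NA vs. non-NA counts as a difference; NA vs. NA counts as equal
    differs <- (is.na(x) != is.na(y)) | (!is.na(x) & !is.na(y) & x != y)
    sum(differs)
  }, integer(1))
}

baseline <- data.frame(
  USUBJID = c("SUBJ01", "SUBJ02", "SUBJ03"),
  AGE = c(45, 52, 38),
  SEX = c("M", "F", "M"),
  stringsAsFactors = FALSE
)
updated <- data.frame(
  USUBJID = c("SUBJ01", "SUBJ02", "SUBJ03"),
  AGE = c(45, 53, 38),
  SEX = c("M", "F", "F"),
  stringsAsFactors = FALSE
)

diff_counts <- count_diffs(baseline, updated)
print(diff_counts)
```

With this data, AGE and SEX each differ in one row and USUBJID in none.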
Use compare_variables() to focus on structural differences between two
datasets -- column names, data types, and variable ordering.
df_a <- data.frame(
  USUBJID = c("SUBJ01", "SUBJ02"),
  AGE = c(45, 52),
  SEX = c("M", "F"),
  stringsAsFactors = FALSE
)

df_b <- data.frame(
  USUBJID = c("SUBJ01", "SUBJ02"),
  AGE = c(45L, 52L),
  WEIGHT = c(75.5, 80.2),
  stringsAsFactors = FALSE
)

compare_variables(df_a, df_b)
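For readers who want to see what a structural comparison involves, the following base-R sketch derives the same kinds of facts by hand: columns present in only one dataset and shared columns whose classes differ. It is a sketch under the assumption that "type" means the first element of `class()`; the output format of compare_variables() itself may differ.

```r
df_a <- data.frame(
  USUBJID = c("SUBJ01", "SUBJ02"),
  AGE = c(45, 52),
  SEX = c("M", "F"),
  stringsAsFactors = FALSE
)
df_b <- data.frame(
  USUBJID = c("SUBJ01", "SUBJ02"),
  AGE = c(45L, 52L),
  WEIGHT = c(75.5, 80.2),
  stringsAsFactors = FALSE
)

# Columns that exist in only one of the two datasets
only_in_a <- setdiff(names(df_a), names(df_b))
only_in_b <- setdiff(names(df_b), names(df_a))

# Shared columns whose primary class differs (numeric vs. integer here)
common <- intersect(names(df_a), names(df_b))
class_a <- vapply(df_a[common], function(x) class(x)[1], character(1))
class_b <- vapply(df_b[common], function(x) class(x)[1], character(1))
type_mismatch <- common[class_a != class_b]

print(list(only_in_a = only_in_a, only_in_b = only_in_b,
           type_mismatch = type_mismatch))
```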
Use compare_observations() for row-by-row value comparison on common columns:
df1 <- data.frame(
  ID = c(1, 2, 3),
  SCORE = c(80, 90, 70),
  stringsAsFactors = FALSE
)

df2 <- data.frame(
  ID = c(1, 2, 3),
  SCORE = c(80, 95, 70),
  stringsAsFactors = FALSE
)

compare_observations(df1, df2)
Remove duplicates and standardize text case before comparing:
messy <- data.frame(
  NAME = c("Alice", "alice", "Bob", "Bob"),
  SCORE = c(100, 100, 85, 85),
  stringsAsFactors = FALSE
)

clean_dataset(messy, remove_duplicates = TRUE, convert_to_case = "upper")
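The order of the two cleaning steps matters: "Alice" and "alice" only become duplicates once case is standardized. The base-R sketch below applies upper-casing first and de-duplication second. Whether clean_dataset() uses this order internally is an assumption here, not something this guide states.

```r
messy <- data.frame(
  NAME = c("Alice", "alice", "Bob", "Bob"),
  SCORE = c(100, 100, 85, 85),
  stringsAsFactors = FALSE
)

# Step 1: upper-case every character column
is_text <- vapply(messy, is.character, logical(1))
cleaned <- messy
cleaned[is_text] <- lapply(cleaned[is_text], toupper)

# Step 2: drop rows that are now exact duplicates
cleaned <- unique(cleaned)
rownames(cleaned) <- NULL

print(cleaned)
```

Running the steps in the opposite order would leave three rows, because "Alice" and "alice" would still be distinct when duplicates are removed.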
Prepare two datasets identically before comparison:
df_unsorted1 <- data.frame(
  REGION = c("West", "East", "North"),
  SALES = c(150, 200, 180)
)

df_unsorted2 <- data.frame(
  REGION = c("East", "North", "West"),
  SALES = c(210, 185, 160)
)

prepped <- prepare_datasets(df_unsorted1, df_unsorted2, sort_columns = "REGION")
prepped$df1
prepped$df2
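Sorting both datasets by the same columns is what makes a positional, row-by-row comparison meaningful. A minimal base-R sketch of that step follows; the helper `order_rows` is hypothetical and only illustrates the sorting part, not everything prepare_datasets() may do.

```r
# Hypothetical helper: sort a data frame by one or more columns
order_rows <- function(df, cols) {
  # unname() so column names are not mistaken for arguments of order()
  ord <- do.call(order, unname(as.list(df[cols])))
  out <- df[ord, , drop = FALSE]
  rownames(out) <- NULL
  out
}

df_unsorted1 <- data.frame(
  REGION = c("West", "East", "North"),
  SALES = c(150, 200, 180),
  stringsAsFactors = FALSE
)
df_unsorted2 <- data.frame(
  REGION = c("East", "North", "West"),
  SALES = c(210, 185, 160),
  stringsAsFactors = FALSE
)

sorted1 <- order_rows(df_unsorted1, "REGION")
sorted2 <- order_rows(df_unsorted2, "REGION")

# After sorting, row i of each data frame refers to the same region
print(sorted1)
print(sorted2)
```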
Compare datasets within specific subgroups. Useful for multi-site or multi-arm studies:
site_data_v1 <- data.frame(
  SITEID = c("SITE01", "SITE01", "SITE02", "SITE02"),
  SUBJID = c("S01", "S02", "S03", "S04"),
  AGE = c(45, 52, 38, 61)
)

site_data_v2 <- data.frame(
  SITEID = c("SITE01", "SITE01", "SITE02", "SITE02"),
  SUBJID = c("S01", "S02", "S03", "S04"),
  AGE = c(45, 53, 38, 62)
)

by_site <- compare_by_group(site_data_v1, site_data_v2, group_vars = "SITEID")
names(by_site)
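The idea behind a grouped comparison can be sketched in base R: subset both datasets to one group at a time, align rows within the group, and compare. The snippet below counts AGE differences per site. It assumes SUBJID identifies a subject within a site, and it is not a description of what compare_by_group() returns.

```r
site_data_v1 <- data.frame(
  SITEID = c("SITE01", "SITE01", "SITE02", "SITE02"),
  SUBJID = c("S01", "S02", "S03", "S04"),
  AGE = c(45, 52, 38, 61),
  stringsAsFactors = FALSE
)
site_data_v2 <- data.frame(
  SITEID = c("SITE01", "SITE01", "SITE02", "SITE02"),
  SUBJID = c("S01", "S02", "S03", "S04"),
  AGE = c(45, 53, 38, 62),
  stringsAsFactors = FALSE
)

# Only sites present in both versions can be compared
sites <- intersect(unique(site_data_v1$SITEID), unique(site_data_v2$SITEID))

age_diffs_by_site <- vapply(sites, function(site) {
  a <- site_data_v1[site_data_v1$SITEID == site, ]
  b <- site_data_v2[site_data_v2$SITEID == site, ]
  # Align the second subset to the first by subject ID
  b <- b[match(a$SUBJID, b$SUBJID), ]
  sum(a$AGE != b$AGE)
}, integer(1))

print(age_diffs_by_site)
```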
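CDISC (Clinical Data Interchange Standards Consortium) provides standardized formats for regulatory submissions, including SDTM (Study Data Tabulation Model) for collected study data and ADaM (Analysis Data Model) for analysis-ready datasets.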
CDISC validation ensures that datasets meet industry standards and regulatory requirements. For official CDISC standards documentation, see https://www.cdisc.org/standards.
clinCompare auto-detects the CDISC domain of a dataset using column matching, ADaM indicator columns, and filename hints:
dm_data <- data.frame(
  STUDYID = rep("STUDY01", 3),
  USUBJID = c("SUBJ01", "SUBJ02", "SUBJ03"),
  AGE = c(45, 62, 51),
  SEX = c("M", "F", "M"),
  RACE = c("WHITE", "BLACK", "ASIAN"),
  ARMCD = c("TRT", "PBO", "TRT"),
  ARM = c("Treatment", "Placebo", "Treatment"),
  stringsAsFactors = FALSE
)

detect_cdisc_domain(dm_data)
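Column matching can be pictured as scoring a dataset against a list of expected variables per domain and picking the best score. The sketch below is a deliberately simplified, hypothetical heuristic: the `signatures` list is a small illustrative subset, `guess_domain` is not a clinCompare function, and the package's actual detection logic (including ADaM indicators and filename hints) is not shown here.

```r
# Illustrative subset of expected variables per SDTM domain (assumption)
signatures <- list(
  DM = c("STUDYID", "USUBJID", "AGE", "SEX", "RACE", "ARMCD", "ARM"),
  AE = c("STUDYID", "USUBJID", "AESEQ", "AETERM", "AEDECOD"),
  VS = c("STUDYID", "USUBJID", "VSSEQ", "VSTESTCD", "VSORRES")
)

# Hypothetical helper: score = share of a domain's expected columns present
guess_domain <- function(df, signatures) {
  scores <- vapply(
    signatures,
    function(cols) mean(cols %in% names(df)),
    numeric(1)
  )
  names(scores)[which.max(scores)]
}

dm_data <- data.frame(
  STUDYID = rep("STUDY01", 3),
  USUBJID = c("SUBJ01", "SUBJ02", "SUBJ03"),
  AGE = c(45, 62, 51),
  SEX = c("M", "F", "M"),
  RACE = c("WHITE", "BLACK", "ASIAN"),
  ARMCD = c("TRT", "PBO", "TRT"),
  ARM = c("Treatment", "Placebo", "Treatment"),
  stringsAsFactors = FALSE
)

guessed <- guess_domain(dm_data, signatures)
print(guessed)
```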
cdisc_compare() is the flagship function. It compares two datasets,
auto-detects the CDISC domain and key variables, performs key-based row
matching, and validates against CDISC standards -- all in one call.
dm_v1 <- data.frame(
  STUDYID = rep("STUDY01", 3),
  USUBJID = c("SUBJ01", "SUBJ02", "SUBJ03"),
  AGE = c(45, 62, 51),
  SEX = c("M", "F", "M"),
  RACE = c("WHITE", "BLACK", "ASIAN"),
  ARMCD = c("TRT", "PBO", "TRT"),
  ARM = c("Treatment", "Placebo", "Treatment"),
  RFSTDTC = c("2024-01-15", "2024-01-16", "2024-01-17"),
  stringsAsFactors = FALSE
)

dm_v2 <- data.frame(
  STUDYID = rep("STUDY01", 3),
  USUBJID = c("SUBJ01", "SUBJ02", "SUBJ03"),
  AGE = c(45, 62, 52),
  SEX = c("M", "F", "M"),
  RACE = c("WHITE", "BLACK", "ASIAN"),
  ARMCD = c("TRT", "PBO", "TRT"),
  ARM = c("Treatment", "Placebo", "Treatment"),
  RFSTDTC = c("2024-01-15", "2024-01-16", "2024-01-17"),
  stringsAsFactors = FALSE
)

cdisc_result <- cdisc_compare(dm_v1, dm_v2, domain = "DM", standard = "SDTM")
cdisc_result
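Key-based row matching means rows are paired by their identifying variables rather than by position, so a reordered dataset does not produce spurious differences. The base-R sketch below shows the principle with `merge()` on USUBJID. It assumes USUBJID alone is the key and is not how cdisc_compare() is necessarily implemented.

```r
dm_base <- data.frame(
  USUBJID = c("SUBJ01", "SUBJ02", "SUBJ03"),
  AGE = c(45, 62, 51),
  stringsAsFactors = FALSE
)
# Same subjects in a different row order, with one changed AGE
dm_new <- data.frame(
  USUBJID = c("SUBJ03", "SUBJ01", "SUBJ02"),
  AGE = c(52, 45, 62),
  stringsAsFactors = FALSE
)

# Pair rows by key; non-key columns get a suffix per source dataset
matched <- merge(dm_base, dm_new, by = "USUBJID",
                 suffixes = c(".base", ".compare"))

# Keep only subjects whose AGE differs between the two versions
changed <- matched[matched$AGE.base != matched$AGE.compare, ]
print(changed)
```

A purely positional comparison of these two data frames would flag all three rows; matching on the key flags only SUBJ03.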
Use validate_cdisc() to check a dataset against CDISC standards without
comparing it to another dataset:
validation <- validate_cdisc(dm_v1, domain = "DM", standard = "SDTM")
get_all_differences() returns every value-level difference as a single
long-format data frame, making it easy to filter, count, or export:
diffs <- get_all_differences(cdisc_result)
diffs
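A long-format difference table has one row per changed value, which is what makes filtering and counting straightforward. The sketch below builds such a table by hand in base R. The helper `long_diffs` and its column names (`key`, `variable`, `base`, `compare`) are hypothetical and may not match the columns returned by get_all_differences(); missing values are ignored for brevity.

```r
# Hypothetical helper: one output row per differing value
long_diffs <- function(a, b, key) {
  vars <- setdiff(intersect(names(a), names(b)), key)
  # Align b to a by key (assumes every key in a also exists in b)
  b <- b[match(a[[key]], b[[key]]), , drop = FALSE]
  pieces <- lapply(vars, function(v) {
    # Compare as text so columns of different types can share one table
    x <- as.character(a[[v]])
    y <- as.character(b[[v]])
    idx <- which(x != y)
    data.frame(
      key = a[[key]][idx],
      variable = rep(v, length(idx)),
      base = x[idx],
      compare = y[idx],
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, pieces)
}

v1 <- data.frame(
  USUBJID = c("SUBJ01", "SUBJ02", "SUBJ03"),
  AGE = c(45, 62, 51),
  SEX = c("M", "F", "M"),
  stringsAsFactors = FALSE
)
v2 <- data.frame(
  USUBJID = c("SUBJ01", "SUBJ02", "SUBJ03"),
  AGE = c(45, 62, 52),
  SEX = c("M", "F", "M"),
  stringsAsFactors = FALSE
)

sketch_diffs <- long_diffs(v1, v2, key = "USUBJID")
print(sketch_diffs)

# Counting differences per variable is then a one-liner
print(table(sketch_diffs$variable))
```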
export_report() auto-detects the output format from the file extension:
# HTML report
export_report(cdisc_result, file.path(tempdir(), "dm_report.html"))

# Text report
export_report(cdisc_result, file.path(tempdir(), "dm_report.txt"))
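Format detection from a file extension is a small dispatch on the text after the last dot. The sketch below shows one way to do that with `tools::file_ext()` and `switch()`. The helper `detect_format` and the exact set of accepted extensions are assumptions for illustration, not export_report()'s internals.

```r
# Hypothetical helper: map a file path to a report format label
detect_format <- function(path) {
  ext <- tolower(tools::file_ext(path))
  switch(ext,
    html = "html",
    txt = "text",
    xlsx = "excel",
    stop("Unsupported report extension: ", ext)
  )
}

fmt_html <- detect_format(file.path(tempdir(), "dm_report.html"))
fmt_text <- detect_format(file.path(tempdir(), "dm_report.TXT"))
fmt_xlsx <- detect_format(file.path(tempdir(), "dm_report.xlsx"))

print(c(fmt_html, fmt_text, fmt_xlsx))
```

Lower-casing the extension makes the dispatch case-insensitive, and the final unnamed `switch()` argument acts as the fallback for anything unrecognized.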
Excel export requires the openxlsx package:
# Excel workbook with Summary, Variable Diffs, Value Diffs, and CDISC tabs
export_report(cdisc_result, file.path(tempdir(), "dm_report.xlsx"))
compare_submission() scans two directories, matches files by name, and runs
cdisc_compare() on every matched pair. Domain, standard, and key variables
are all auto-detected per file.
results <- compare_submission(
  base_dir = "submission_v1/",
  compare_dir = "submission_v2/",
  output_file = "submission_diff.xlsx"
)
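Matching files by name across two directories comes down to set operations on the two file listings. The base-R sketch below creates two throwaway directories under a temporary path and reports which files can be paired and which exist on one side only. The CSV files are placeholders chosen for the demo; the file formats that compare_submission() actually reads are not stated in this guide.

```r
# Build two small demo directories in a fresh temporary location
root <- tempfile("submission_demo_")
base_dir <- file.path(root, "submission_v1")
compare_dir <- file.path(root, "submission_v2")
dir.create(base_dir, recursive = TRUE)
dir.create(compare_dir, recursive = TRUE)

demo <- data.frame(USUBJID = "SUBJ01", stringsAsFactors = FALSE)
write.csv(demo, file.path(base_dir, "dm.csv"), row.names = FALSE)
write.csv(demo, file.path(base_dir, "ae.csv"), row.names = FALSE)
write.csv(demo, file.path(compare_dir, "dm.csv"), row.names = FALSE)
write.csv(demo, file.path(compare_dir, "vs.csv"), row.names = FALSE)

base_files <- list.files(base_dir)
compare_files <- list.files(compare_dir)

# Files present in both directories can be compared pairwise
matched <- intersect(base_files, compare_files)
# Files on one side only are worth reporting rather than silently skipping
only_in_base <- setdiff(base_files, compare_files)
only_in_compare <- setdiff(compare_files, base_files)

print(list(matched = matched, only_in_base = only_in_base,
           only_in_compare = only_in_compare))
```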
clinCompare ships with hand-curated metadata for 51 SDTM domains (IG 3.4, with 3.3 support) and 14 ADaM datasets (IG 1.3, with 1.2/1.1 provenance tracking).
SDTM domains: AE, AG, BE, BS, CE, CM, CO, CP, DA, DD, DM, DS, DV, EC, EG, EX, FA, GF, HO, IE, IS, LB, MB, MH, MI, ML, MS, PC, PE, PP, PR, QS, RELREC, RS, SC, SE, SM, SS, SU, SUPPQUAL, SV, TA, TD, TE, TI, TM, TR, TS, TU, TV, VS.
ADaM datasets: ADAE, ADCM, ADEG, ADEFF, ADEX, ADLB, ADMH, ADPC, ADPP, ADRS, ADSL, ADTR, ADTTE, ADVS.
Disclaimer: clinCompare is a quality-assurance and exploratory analysis tool. It is not a substitute for official CDISC compliance validation software (e.g., Pinnacle 21). For regulatory submissions, always cross-reference with your organization's validated tools.
clinCompare provides a complete workflow for dataset comparison in clinical
trials: compare any two data frames with compare_datasets(), add CDISC
validation with cdisc_compare(), batch process entire submissions with
compare_submission(), and export results to HTML, text, or Excel with
export_report().
For more information and additional examples, visit the GitHub repository.