compare_datasets: Compare Two Datasets
In clinCompare: Dataset Comparison with 'CDISC' Validation for Clinical Trial Data

Description

Compares two datasets at three levels in a single call:

Dataset level – dimensions, column overlap, missing-value totals.
Variable level – column name discrepancies and data-type mismatches (delegates to compare_variables()).
Observation level – row-by-row value differences on common columns. Uses positional matching by default, or key-based matching when id_vars is provided.

The return value is a list with class "dataset_comparison", which has a tidy print() method. The same object is accepted by generate_summary_report(), generate_detailed_report(), and compare_by_group().
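
As a rough illustration of what the three levels cover, the following base R sketch computes comparable quantities by hand. It is not the clinCompare implementation; the helper `differs()` and all object names are hypothetical, and the package's own output is structured differently (see Value).

```r
# Illustrative base R only; not how clinCompare computes its results.
df1 <- data.frame(id = 1:3, val = c(10, NA, 30), site = c("A", "B", "C"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(id = c("1", "2", "3"), val = c(10, 25, 30),
                  arm = c("X", "Y", "Z"), stringsAsFactors = FALSE)

# Dataset level: dimensions, column overlap, missing-value totals
dims      <- rbind(df1 = dim(df1), df2 = dim(df2))
common    <- intersect(names(df1), names(df2))
only_df1  <- setdiff(names(df1), names(df2))
only_df2  <- setdiff(names(df2), names(df1))
na_totals <- c(df1 = sum(is.na(df1)), df2 = sum(is.na(df2)))

# Variable level: first class of each common column, then the mismatches
cls1 <- vapply(df1[common], function(x) class(x)[1], character(1))
cls2 <- vapply(df2[common], function(x) class(x)[1], character(1))
type_mismatch <- common[cls1 != cls2]

# Observation level (positional): hypothetical helper that treats
# NA vs. non-NA as a difference and NA vs. NA as no difference
differs <- function(a, b) {
  both_na <- is.na(a) & is.na(b)
  one_na  <- xor(is.na(a), is.na(b))
  one_na | (!both_na & !one_na & a != b)
}
changed_rows <- which(differs(df1$val, df2$val))
```

Here `type_mismatch` picks up `id` (integer vs. character) and `changed_rows` flags row 2 of `val`. How the package itself treats `NA` values at the observation level is not stated on this page, so the NA rule in `differs()` is an assumption.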

Usage

compare_datasets(df1, df2, tolerance = 0, vars = NULL, id_vars = NULL)

Arguments

`df1`	A data frame (the base dataset).
`df2`	A data frame (the compare dataset).
`tolerance`	Numeric tolerance value for floating-point comparisons (default 0). When tolerance > 0, numeric values are considered equal if their absolute difference is within the tolerance threshold. Character and factor columns always use exact matching regardless of tolerance.
`vars`	Optional character vector of variable names to compare. When provided, only these columns are included in the observation-level comparison. Structural comparison (extra columns, type mismatches) still covers all columns. Default is NULL (compare all common columns).
`id_vars`	Optional character vector of column names to use as matching keys. When provided, rows are matched by these key columns instead of by position. This allows comparison of datasets with different row counts or different row orders. Rows that exist in only one dataset are reported in `unmatched_rows`. Default is NULL (positional matching).
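
The `tolerance` rule can be pictured with a small base R sketch. The function `values_equal()` is hypothetical and not part of clinCompare, and the inclusive boundary (`<=`) is an assumption, since the text above only says "within the tolerance threshold".

```r
# Hypothetical helper illustrating the documented tolerance rule;
# not the package's internal comparison code.
values_equal <- function(x, y, tolerance = 0) {
  if (is.numeric(x) && is.numeric(y)) {
    # Numeric columns: equal when the absolute difference is within tolerance
    # (assumption: the boundary itself counts as equal)
    abs(x - y) <= tolerance
  } else {
    # Character and factor columns: exact matching, tolerance is ignored
    as.character(x) == as.character(y)
  }
}

num_loose  <- values_equal(c(1.00, 2.00, 3.00), c(1.004, 2.05, 3.00), tolerance = 0.01)
num_strict <- values_equal(c(1.00, 2.00, 3.00), c(1.004, 2.05, 3.00))
chr_check  <- values_equal(c("A", "B"), c("A", "b"), tolerance = 0.01)
```

With `tolerance = 0.01` the first pair (difference of about 0.004) counts as equal while the second (0.05) does not; with the default of 0 only the identical third pair matches. Missing values are not handled in this sketch.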

Value

A dataset_comparison list containing:

`nrow_df1`, `ncol_df1`	Dimensions of df1.
`nrow_df2`, `ncol_df2`	Dimensions of df2.
`common_columns`	Character vector of columns present in both.
`extra_in_df1`	Columns only in df1.
`extra_in_df2`	Columns only in df2.
`type_mismatches`	Data frame of columns whose class differs (columns: `column`, `type_df1`, `type_df2`), or `NULL` if none.
`missing_values`	Data frame summarising NA counts per column per dataset (columns: `column`, `na_df1`, `na_df2`), or `NULL` if no missingness.
`variable_comparison`	Output of `compare_variables()`.
`observation_comparison`	Output of `compare_observations()`, or a list with a `message` element when row counts differ under positional matching.
`id_vars`	Character vector of key columns used for matching, or `NULL` if positional matching was used.
`unmatched_rows`	List with `df1_only` and `df2_only` data frames of rows with no match in the other dataset (key-based matching only), or `NULL`.
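
To make the idea behind key-based matching and `unmatched_rows` concrete, here is a minimal base R sketch built on `merge()`. It only mirrors the documented behaviour (match on key columns, report rows without a partner); the object names and the use of `merge()` are assumptions, not the package's actual method.

```r
# Illustrative base R only; clinCompare may match rows differently.
df1 <- data.frame(id = c(1, 2, 3), val = c(10, 20, 30))
df2 <- data.frame(id = c(3, 2, 4), val = c(35, 20, 40))  # different order and keys
id_vars <- "id"

# Inner join on the key columns; suffixes label each dataset's value column
matched <- merge(df1, df2, by = id_vars, suffixes = c("_df1", "_df2"))
diffs   <- matched[matched$val_df1 != matched$val_df2, ]

# Build one string key per row so that several key columns also work
key1 <- do.call(paste, c(df1[id_vars], sep = "\r"))
key2 <- do.call(paste, c(df2[id_vars], sep = "\r"))

# Rows whose key has no partner in the other dataset
unmatched <- list(
  df1_only = df1[!key1 %in% key2, , drop = FALSE],
  df2_only = df2[!key2 %in% key1, , drop = FALSE]
)
```

Row order no longer matters here: ids 2 and 3 are paired even though they appear in a different order, `diffs` holds the one changed row (id 3), and ids 1 and 4 end up in `df1_only` and `df2_only` respectively.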

Examples


library(clinCompare)

# Positional matching (default): row i of df1 is compared with row i of df2
df1 <- data.frame(id = 1:3, val = c(10, 20, 30))
df2 <- data.frame(id = 1:3, val = c(10, 25, 30))
result <- compare_datasets(df1, df2)
result  # printed via the "dataset_comparison" print() method

# Key-based matching (for different row counts or row orders)
df1 <- data.frame(id = c(1, 2, 3), val = c(10, 20, 30))
df2 <- data.frame(id = c(2, 3, 4), val = c(20, 35, 40))
result <- compare_datasets(df1, df2, id_vars = "id")
result
result$unmatched_rows  # rows whose id appears in only one dataset (ids 1 and 4 here)

clinCompare documentation built on Feb. 19, 2026, 1:07 a.m.