compare_datasets: Compare Two Datasets

View source: R/compare_datasets.R

compare_datasets    R Documentation

Compare Two Datasets

Description

Compares two datasets at three levels in a single call:

  1. Dataset level – dimensions, column overlap, missing-value totals.

  2. Variable level – column name discrepancies and data-type mismatches (delegates to compare_variables()).

  3. Observation level – row-by-row value differences on common columns. Uses positional matching by default, or key-based matching when id_vars is provided.

The return value is a list with class "dataset_comparison", which has a tidy print() method. The same object is accepted by generate_summary_report(), generate_detailed_report(), and compare_by_group().

Usage

compare_datasets(df1, df2, tolerance = 0, vars = NULL, id_vars = NULL)

Arguments

df1

A data frame (the base dataset).

df2

A data frame (the compare dataset).

tolerance

Numeric tolerance value for floating-point comparisons (default 0). When tolerance > 0, numeric values are considered equal if their absolute difference is within the tolerance threshold. Character and factor columns always use exact matching regardless of tolerance.

vars

Optional character vector of variable names to compare. When provided, only these columns are included in the observation-level comparison. Structural comparison (extra columns, type mismatches) still covers all columns. Default is NULL (compare all common columns).

id_vars

Optional character vector of column names to use as matching keys. When provided, rows are matched by these key columns instead of by position. This allows comparison of datasets with different row counts or different row orders. Rows that exist in only one dataset are reported in unmatched_rows. Default is NULL (positional matching).
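
The tolerance rule and key-based matching described above can be sketched in base R. This is only an illustration of the documented behaviour, not clinCompare's internal code: values_equal() is a hypothetical helper, and treating a difference exactly equal to the tolerance as a match, and two NAs as equal, are assumptions.

```r
# Illustrative sketch only; not the package's implementation.

# Hypothetical helper: TRUE where two vectors are considered equal.
# Numeric pairs match when |x - y| <= tolerance (boundary handling assumed);
# everything else is compared exactly, so tolerance is ignored for text.
values_equal <- function(x, y, tolerance = 0) {
  both_na <- is.na(x) & is.na(y)          # assumption: NA matches NA
  if (is.numeric(x) && is.numeric(y)) {
    eq <- abs(x - y) <= tolerance
  } else {
    eq <- as.character(x) == as.character(y)
  }
  eq[is.na(eq)] <- FALSE                  # NA on one side only is a difference
  eq | both_na
}

df1 <- data.frame(id = c(1, 2, 3), val = c(10, 20, 30))
df2 <- data.frame(id = c(3, 2, 4), val = c(30.004, 20.5, 40))
id_vars <- "id"

# Key-based matching: an inner join keeps only rows whose key is in both.
matched <- merge(df1, df2, by = id_vars, suffixes = c("_df1", "_df2"))
matched$val_equal <- values_equal(matched$val_df1, matched$val_df2,
                                  tolerance = 0.01)

# Rows whose key appears in only one dataset.
make_key <- function(df) do.call(paste, c(unname(as.list(df[id_vars])), sep = "\r"))
key1 <- make_key(df1)
key2 <- make_key(df2)
df1_only <- df1[!key1 %in% key2, , drop = FALSE]
df2_only <- df2[!key2 %in% key1, , drop = FALSE]
```

Here the rows with id 2 and 3 are matched despite the different row order, and the rows with id 1 and id 4 end up in df1_only and df2_only respectively, mirroring the role of unmatched_rows.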

Value

A dataset_comparison list containing:

nrow_df1, ncol_df1

Dimensions of df1.

nrow_df2, ncol_df2

Dimensions of df2.

common_columns

Character vector of columns present in both.

extra_in_df1

Columns only in df1.

extra_in_df2

Columns only in df2.

type_mismatches

Data frame of columns whose class differs (columns: column, type_df1, type_df2), or NULL if none.

missing_values

Data frame summarising NA counts per column per dataset (columns: column, na_df1, na_df2), or NULL if no missingness.

variable_comparison

Output of compare_variables().

observation_comparison

Output of compare_observations(), or a list with a message element when row counts differ under positional matching (id_vars = NULL).

id_vars

Character vector of key columns used for matching, or NULL if positional matching was used.

unmatched_rows

List with two data frames, df1_only and df2_only, holding the rows that have no match in the other dataset. Populated only for key-based matching; NULL otherwise.
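
As a rough guide to the shape of the structural elements, the following base-R sketch builds objects with the same names and columns as those listed above. It is an assumption-laden illustration, not the package's code: in particular, using the first class() element as the type and restricting the NA summary to common columns are choices made here for brevity.

```r
# Illustrative sketch only; element names follow the Value list above.
df1 <- data.frame(id = 1:3, val = c(10, NA, 30), site = c("A", "B", "C"))
df2 <- data.frame(id = c("1", "2", "3"), val = c(10, 25, NA),
                  arm = c("x", "y", "z"))

common_columns <- intersect(names(df1), names(df2))
extra_in_df1   <- setdiff(names(df1), names(df2))
extra_in_df2   <- setdiff(names(df2), names(df1))

# Type per common column (assumption: first class() element is the "type").
first_class <- function(x) class(x)[1]
type_df1 <- unname(vapply(df1[common_columns], first_class, character(1)))
type_df2 <- unname(vapply(df2[common_columns], first_class, character(1)))
differs  <- type_df1 != type_df2

type_mismatches <- if (any(differs)) {
  data.frame(column   = common_columns[differs],
             type_df1 = type_df1[differs],
             type_df2 = type_df2[differs])
} else {
  NULL
}

# NA counts per common column in each dataset.
count_na <- function(x) sum(is.na(x))
na_df1 <- unname(vapply(df1[common_columns], count_na, integer(1)))
na_df2 <- unname(vapply(df2[common_columns], count_na, integer(1)))

missing_values <- if (any(na_df1 > 0L | na_df2 > 0L)) {
  data.frame(column = common_columns, na_df1 = na_df1, na_df2 = na_df2)
} else {
  NULL
}
```

In this toy input, id is stored as an integer in df1 but as text in df2, so it is the only row of type_mismatches, while val contributes one NA in each dataset.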

Examples


library(clinCompare)

# Positional matching (default): row i of df1 is compared with row i of df2
df1 <- data.frame(id = 1:3, val = c(10, 20, 30))
df2 <- data.frame(id = 1:3, val = c(10, 25, 30))  # val differs in row 2
result <- compare_datasets(df1, df2)
result  # uses the print() method for "dataset_comparison"

# Key-based matching (for different row counts or row orders)
df1 <- data.frame(id = c(1, 2, 3), val = c(10, 20, 30))
df2 <- data.frame(id = c(2, 3, 4), val = c(20, 35, 40))  # id 1 only in df1, id 4 only in df2
result <- compare_datasets(df1, df2, id_vars = "id")
result
result$unmatched_rows  # list with df1_only and df2_only


clinCompare documentation built on Feb. 19, 2026, 1:07 a.m.